[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=361256&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-361256 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 18/Dec/19 01:51
Worklog Time Spent: 10m

Work Description: tvalentyn commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-566830300

Postcommit tests are failing with this change: https://issues.apache.org/jira/browse/BEAM-8988

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Worklog Id: 361256
Time Spent: 19.5h (was: 19h 20m)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> ------------------------------------------------------------------------------
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
> Issue Type: New Feature
> Components: sdk-py-core
> Reporter: Chamikara Madhusanka Jayalath
> Assignee: Kamil Wasilewski
> Priority: Major
> Time Spent: 19.5h
> Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for the Python SDK [1].
> It can only be used by the Dataflow runner.
> We should implement a Beam BigQuery source that implements the
> iobase.BoundedSource [2] interface so that other runners that use the
> Python SDK can read from BigQuery as well. The Java SDK already has a Beam
> BigQuery source [3].
> [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189

--
This message was sent by Atlassian Jira (v8.3.4#803005)
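The issue description above asks for a source implementing the iobase.BoundedSource interface [2], which requires estimate_size, split, get_range_tracker, and read. As a rough illustration of that shape only, here is a minimal in-memory sketch using just the standard library: the names OffsetRangeTracker and InMemoryBoundedSource are simplified stand-ins, not Beam's actual classes, and a real Beam source returns SourceBundle objects from split() rather than plain tuples.

```python
class OffsetRangeTracker:
    """Simplified stand-in for Beam's range tracker over [start, stop)."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self.position = start

    def try_claim(self, position):
        # Claim a record position; fails once we reach the end of the range.
        if position < self.stop:
            self.position = position
            return True
        return False


class InMemoryBoundedSource:
    """Illustrates the four methods a BoundedSource-style class provides."""
    def __init__(self, rows):
        self._rows = rows

    def estimate_size(self):
        # In the proposed BigQuery source this would come from table.numBytes
        # or a dry-run query's totalBytesProcessed instead.
        return sum(len(str(r)) for r in self._rows)

    def split(self, desired_bundle_size, start=0, stop=None):
        # Yield (start, stop) index ranges; Beam yields SourceBundle objects.
        stop = len(self._rows) if stop is None else stop
        for i in range(start, stop, desired_bundle_size):
            yield (i, min(i + desired_bundle_size, stop))

    def get_range_tracker(self, start, stop):
        stop = len(self._rows) if stop is None else stop
        return OffsetRangeTracker(start, stop)

    def read(self, range_tracker):
        pos = range_tracker.start
        while range_tracker.try_claim(pos):
            yield self._rows[pos]
            pos += 1


source = InMemoryBoundedSource(['a', 'b', 'c', 'd'])
bundles = list(source.split(desired_bundle_size=2))
rows = [r for start, stop in bundles
        for r in source.read(source.get_range_tracker(start, stop))]
```

A runner would call split() at pipeline construction time, then read() each bundle in parallel via its own range tracker.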
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=361126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-361126 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 17/Dec/19 21:30
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772

Worklog Id: 361126
Time Spent: 19h 20m (was: 19h 10m)
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360873&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360873 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 17/Dec/19 12:48
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358770537

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1274,3 +1463,139 @@ def display_data(self):
         tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+class _PassThroughThenCleanup(PTransform):
+  """A PTransform that invokes a DoFn after the input PCollection has been
+  processed.
+  """
+  def __init__(self, cleanup_dofn):
+    self.cleanup_dofn = cleanup_dofn
+
+  def expand(self, input):
+    class PassThrough(beam.DoFn):
+      def process(self, element):
+        yield element
+
+    output = input | beam.ParDo(PassThrough()).with_outputs(
+        'cleanup_signal', main='main')
+    main_output = output['main']
+    cleanup_signal = output['cleanup_signal']
+
+    _ = (input.pipeline
+         | beam.Create([None])
+         | beam.ParDo(self.cleanup_dofn,
+                      beam.pvalue.AsSingleton(cleanup_signal)))
+
+    return main_output
+
+
+@experimental()
+class _ReadFromBigQuery(PTransform):
+  """Read data from BigQuery.
+
+  This PTransform uses a BigQuery export job to take a snapshot of the table
+  on GCS, and then reads from each produced JSON file.
+
+  Note that currently this source does not work with the DirectRunner.
+
+  Args:
+    table (str, callable, ValueProvider): The ID of the table, or a callable
+      that returns it. The ID must contain only letters ``a-z``, ``A-Z``,
+      numbers ``0-9``, or underscores ``_``. If the dataset argument is
+      :data:`None` then the table argument must contain the entire table
+      reference specified as: ``'DATASET.TABLE'`` or
+      ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one
+      argument representing an element to be written to BigQuery, and return
+      a TableReference, or a string table name as specified above.
+    dataset (str): The ID of the dataset containing this table, or
+      :data:`None` if the table reference is specified entirely by the table
+      argument.
+    project (str): The ID of the project containing this table.
+    query (str): A query to be used instead of the table, dataset, and
+      project arguments.
+    validate (bool): If :data:`True`, various checks will be done when the
+      source gets initialized (e.g., is the table present?). This should be
+      :data:`True` for most scenarios in order to catch errors as early as
+      possible (pipeline construction instead of pipeline execution). It
+      should be :data:`False` if the table is created during pipeline
+      execution by a previous step.
+    coder (~apache_beam.coders.coders.Coder): The coder for the table
+      rows. If :data:`None`, then the default coder is _JsonToDictCoder,
+      which will interpret every row as a JSON-serialized dictionary.
+    use_standard_sql (bool): Specifies whether to use BigQuery's standard
+      SQL dialect for this query. The default value is :data:`False`.
+      If set to :data:`True`, the query will use BigQuery's updated SQL
+      dialect with improved standards compliance.
+      This parameter is ignored for table inputs.
+    flatten_results (bool): Flattens all nested and repeated fields in the
+      query results. The default value is :data:`True`.
+    kms_key (str): Experimental. Optional Cloud KMS key name for use when
+      creating new temporary tables.
+    gcs_location (str): The name of the Google Cloud Storage bucket where
+      the extracted table should be written, as a string or
+      a :class:`~apache_beam.options.value_provider.ValueProvider`. If
+      :data:`None`, then the temp_location parameter is used.
+  """
+  def __init__(self, gcs_location=None, validate=False, *args, **kwargs):
+    if gcs_location:
+      if not isinstance(gcs_location, (str, unicode, ValueProvider)):
+        raise TypeError('%s: gcs_location must be of type string'
+                        ' or ValueProvider; got %r instead'
+                        % (self.__class__.__name__, type(gcs_location)))
+
+      if isinstance(gcs_location, (str, unicode)):
+        gcs_location = StaticValueProvider(str, gcs_location)
+    self.gcs_location = gcs_location
+    self.validate = validate
+
+    self._args = args
+    self._kwargs = kwargs
+
+  def _get_destination_uri(self, temp_location):
+    """Returns
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360641 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 17/Dec/19 01:42
Worklog Time Spent: 10m

Work Description: chamikaramj commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358558194

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1274,3 +1463,139 @@ def display_data(self):
[quotes the same _PassThroughThenCleanup/_ReadFromBigQuery hunk as above]
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360536&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360536 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 16/Dec/19 22:32
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358501366

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
+
+
+FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _JsonToDictCoder(coders.Coder):
+  """A coder for a JSON string to a Python dict."""
+
+  def __init__(self, table_schema):
+    self.fields = self._convert_to_tuple(table_schema.fields)
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  @classmethod
+  def _convert_to_tuple(cls, table_field_schemas):
+    """Recursively converts the list of TableFieldSchema instances to a
+    list of tuples, to prevent errors when pickling and unpickling
+    TableFieldSchema instances.
+    """
+    if not table_field_schemas:
+      return []
+
+    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
+                        x.type)
+            for x in table_field_schemas]
+
+  def decode(self, value):
+    value = json.loads(value)
+    return self._decode_with_schema(value, self.fields)
+
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract-to-JSON job
+        # doesn't preserve null fields.
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:
+          # No need to do any conversion
+          pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _CustomBigQuerySource(BoundedSource):
+  def __init__(self, gcs_location=None, table=None, dataset=None,
+               project=None, query=None, validate=False, coder=None,
+               use_standard_sql=False, flatten_results=True, kms_key=None):
+    if table is not None and query is not None:
+      raise ValueError('Both a BigQuery table and a query were specified.'
+                       ' Please specify only one of these.')
+    elif table is None and query is None:
+      raise ValueError('A BigQuery table or a query must be specified')
+    elif table is not None:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+      self.query = None
+      self.use_legacy_sql = True
+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql
+      self.use_legacy_sql = not use_standard_sql
+      self.table_reference = None
+
+    self.gcs_location = gcs_location
+    self.project = project
+    self.validate = validate
+    self.flatten_results = flatten_results
+    self.coder = coder or _JsonToDictCoder
+    self.kms_key = kms_key
+    self.split_result = None
+
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)

Review comment: It seems the reported estimated size should be the output bytes, but if that's not easy
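The decoding logic in the _JsonToDictCoder hunk above can be illustrated stand-alone. BigQuery's JSON exports render scalars as strings and drop null fields, so each value is converted according to the schema's declared type and missing fields are restored as None, recursing into RECORD fields. This hedged stdlib sketch mirrors that logic; the (name, type, subfields) tuple shape for the schema is an assumption made here for brevity, not the TableFieldSchema shape used in Beam.

```python
import json
from decimal import Decimal

# Per-type converters, as in _JsonToDictCoder above; types without an entry
# (e.g. STRING, TIMESTAMP) are left unchanged.
CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': Decimal,
    'BYTES': lambda v: v.encode('utf-8'),
}


def decode_row(json_line, schema):
    """schema: list of (name, type, subfields) tuples (hypothetical shape)."""
    return _apply_schema(json.loads(json_line), schema)


def _apply_schema(row, schema):
    for name, ftype, subfields in schema:
        if name not in row:
            # Null fields are dropped by the export job; restore them as None.
            row[name] = None
        elif ftype == 'RECORD':
            # Nested records are decoded recursively with their subschema.
            row[name] = _apply_schema(row[name], subfields)
        else:
            converter = CONVERTERS.get(ftype)
            if converter is not None:
                row[name] = converter(row[name])
    return row


schema = [('id', 'INTEGER', None),
          ('active', 'BOOLEAN', None),
          ('address', 'RECORD', [('zip', 'STRING', None)])]
row = decode_row('{"id": "7", "address": {"zip": "02139"}}', schema)
```

Here the exported string "7" becomes the int 7, the missing 'active' field becomes None, and the nested 'zip' string passes through untouched.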
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360532&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360532 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 16/Dec/19 22:30
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358500733

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
[quotes the same _JsonToDictCoder/_CustomBigQuerySource hunk as above, ending:]
+      size = int(job.statistics.totalBytesProcessed)
+
+      bq.clean_up_temporary_dataset(self.project)
+
+      return size
+
+  def split(self,
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=360529&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-360529 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 16/Dec/19 22:26
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r358499063

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1274,3 +1463,139 @@ def display_data(self):
[quotes the same _PassThroughThenCleanup/_ReadFromBigQuery hunk as above]
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=358350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-358350 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 12/Dec/19 06:20
Worklog Time Spent: 10m

Work Description: chamikaramj commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r356974540

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1274,3 +1463,139 @@ def display_data(self):
[quotes the same _PassThroughThenCleanup/_ReadFromBigQuery hunk as above]
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357857&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357857 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Dec/19 14:16
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-564562180

Run Python PreCommit

Worklog Id: 357857
Time Spent: 18h 10m (was: 18h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357807=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357807 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Dec/19 12:10
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-564512207):
Run Python 2 PostCommit
Issue Time Tracking: Worklog Id: (was: 357807) Time Spent: 18h (was: 17h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357776=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357776 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Dec/19 11:08
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-564492437):
Run Python 2 PostCommit
Issue Time Tracking: Worklog Id: (was: 357776) Time Spent: 17h 50m (was: 17h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357537=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357537 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Dec/19 00:11
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-564317779):
Run Python PreCommit
Issue Time Tracking: Worklog Id: (was: 357537) Time Spent: 17h 40m (was: 17.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357069=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357069 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 10/Dec/19 13:20
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-564028930):
Run Python PreCommit
Issue Time Tracking: Worklog Id: (was: 357069) Time Spent: 17.5h (was: 17h 20m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=357004=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-357004 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 10/Dec/19 11:06
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-563983582):
Run Python PreCommit
Issue Time Tracking: Worklog Id: (was: 357004) Time Spent: 17h 20m (was: 17h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=356905=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-356905 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 10/Dec/19 09:01
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-563934421):
Run Python PreCommit
Issue Time Tracking: Worklog Id: (was: 356905) Time Spent: 17h 10m (was: 17h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355311=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355311 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 16:34
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-562644116):
Run Python 2 PostCommit
Issue Time Tracking: Worklog Id: (was: 355311) Time Spent: 17h (was: 16h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355257=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355257 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 15:15
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-562611582):
Jira issue to switch to avro: https://issues.apache.org/jira/browse/BEAM-8910
Issue Time Tracking: Worklog Id: (was: 355257) Time Spent: 16h 50m (was: 16h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355247=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355247 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 15:01
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-562605478):

Thanks @robertwb for your comments!

> Why does this not work on the direct runners. Is it an issue of needing to be split first?

Yes. I've already created a Jira for this: https://issues.apache.org/jira/browse/BEAM-8528

> would it make sense to implement this as an SDF instead?

My first attempt was a regular (non-splittable) DoFn that triggers an export job, followed by `MatchAll` and `ReadMatches` transforms. This worked, but I had trouble implementing the rest: waiting for the query job, waiting for the export job, and removing the JSON files after reading. Using the Source API turned out to be simpler.

Issue Time Tracking: Worklog Id: (was: 355247) Time Spent: 16.5h (was: 16h 20m)
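The difficulty kamilwu describes above (waiting for the query job and then the export job before reading) boils down to polling job state until it reaches DONE. A minimal sketch of such a poller, with a caller-supplied status function standing in for a real jobs.get call; `wait_for_job` and its parameters are illustrative, not Beam or BigQuery APIs:

```python
import time


def wait_for_job(poll_status, timeout_s=300.0, interval_s=0.5):
    """Polls a job until it reports 'DONE', or raises TimeoutError.

    poll_status is a caller-supplied zero-argument function returning the
    current job state string, e.g. a closure around a BigQuery jobs.get
    request (all names here are illustrative).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if poll_status() == 'DONE':
            return
        time.sleep(interval_s)
    raise TimeoutError('job did not finish within %.0fs' % timeout_s)


# Simulate a job that finishes on the third poll.
states = iter(['PENDING', 'RUNNING', 'DONE'])
wait_for_job(lambda: next(states), interval_s=0)
```

In a real pipeline the export step would only start after the query poller returns, which is the sequencing the Source API handles for free and a plain DoFn has to orchestrate by hand.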
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355248=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355248 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 15:01
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772 (URL: https://github.com/apache/beam/pull/9772#issuecomment-562605478)
Issue Time Tracking: Worklog Id: (was: 355248) Time Spent: 16h 40m (was: 16.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355207=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355207 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 14:18
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772 (URL: https://github.com/apache/beam/pull/9772#discussion_r354853510)

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)

+FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _JsonToDictCoder(coders.Coder):
+  """A coder for a JSON string to a Python dict."""
+
+  def __init__(self, table_schema):
+    self.fields = self._convert_to_tuple(table_schema.fields)
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  @classmethod
+  def _convert_to_tuple(cls, table_field_schemas):
+    """Recursively converts the list of TableFieldSchema instances to the
+    list of tuples to prevent errors when pickling and unpickling
+    TableFieldSchema instances.
+    """
+    if not table_field_schemas:
+      return []
+
+    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
+                        x.type)
+            for x in table_field_schemas]
+
+  def decode(self, value):
+    value = json.loads(value)
+    return self._decode_with_schema(value, self.fields)
+
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:
+          # No need to do any conversion
+          pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _CustomBigQuerySource(BoundedSource):
+  def __init__(self, gcs_location=None, table=None, dataset=None,
+               project=None, query=None, validate=False, coder=None,
+               use_standard_sql=False, flatten_results=True, kms_key=None):
+    if table is not None and query is not None:
+      raise ValueError('Both a BigQuery table and a query were specified.'
+                       ' Please specify only one of these.')
+    elif table is None and query is None:
+      raise ValueError('A BigQuery table or a query must be specified')
+    elif table is not None:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+      self.query = None
+      self.use_legacy_sql = True
+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql
+      self.use_legacy_sql = not use_standard_sql
+      self.table_reference = None
+
+    self.gcs_location = gcs_location
+    self.project = project
+    self.validate = validate
+    self.flatten_results = flatten_results
+    self.coder = coder or _JsonToDictCoder
+    self.kms_key = kms_key
+    self.split_result = None
+
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)

Review comment:
Input. And to be more precise - that's the number of bytes which must be read from
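The `_JsonToDictCoder` quoted above fills schema fields that are missing from a row with None and applies a per-type converter to everything else. A simplified, standalone re-implementation of that decoding rule for illustration (this is not the Beam class itself; `decode_row` and the flat `[(name, type)]` schema shape are assumptions made for the sketch, and nested RECORD handling is omitted):

```python
import decimal
import json

# Converters keyed by BigQuery type name, mirroring the quoted coder.
CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, schema):
    """Decodes one exported-JSON row against a [(name, type)] schema.

    Fields present in the schema but absent from the row are filled with
    None, because the extract-to-JSON job does not preserve null fields.
    Types without a converter (e.g. STRING) pass through unchanged.
    """
    row = json.loads(json_line)
    for name, bq_type in schema:
        if name not in row:
            row[name] = None
        elif bq_type in CONVERTERS:
            row[name] = CONVERTERS[bq_type](row[name])
    return row


schema = [('user_id', 'INTEGER'), ('score', 'FLOAT'),
          ('active', 'BOOLEAN'), ('nickname', 'STRING')]
row = decode_row('{"user_id": "42", "score": "1.5", "active": "true"}', schema)
# row == {'user_id': 42, 'score': 1.5, 'active': True, 'nickname': None}
```

The null-fill step is the part the comment in the diff calls out: the schema, not the row, is the source of truth for which fields exist.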
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355200=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355200 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Dec/19 14:07
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772 (URL: https://github.com/apache/beam/pull/9772#discussion_r354848637)

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@

+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql

Review comment:
It was copy-pasted from the native BigQuery source (`BigQuerySource` class), because their interfaces are mostly the same. I'll check if this can be improved.

Issue Time Tracking: Worklog Id: (was: 355200) Time Spent: 16h 10m (was: 16h)
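The `_convert_to_tuple` helper quoted in this thread exists so the coder holds only plain namedtuples, which pickle cleanly, instead of the generated client-library schema objects. A minimal standalone sketch of the same idea; `SchemaField` here is a hypothetical stand-in for `TableFieldSchema`, and `convert_to_tuple` is a simplified version of the quoted classmethod:

```python
import collections
import pickle

# Mirrors the FieldSchema namedtuple in the patch.
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


class SchemaField(object):
    """Hypothetical stand-in for a client-library TableFieldSchema."""
    def __init__(self, name, type, mode='NULLABLE', fields=()):
        self.name, self.type, self.mode, self.fields = name, type, mode, fields


def convert_to_tuple(schema_fields):
    """Recursively converts schema objects into plain namedtuples."""
    return [FieldSchema(convert_to_tuple(f.fields), f.mode, f.name, f.type)
            for f in schema_fields]


nested = [SchemaField('address', 'RECORD',
                      fields=[SchemaField('city', 'STRING')])]
tuples = convert_to_tuple(nested)
# Namedtuples survive a pickle round trip unchanged, which is what the
# coder needs when it is shipped to workers.
restored = pickle.loads(pickle.dumps(tuples))
```

The recursion bottoms out on the empty `fields` tuple of leaf fields, so nested RECORD schemas convert to nested namedtuples of arbitrary depth.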
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355196&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355196 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 06/Dec/19 14:05
Start Date: 06/Dec/19 14:05
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354848017

File path: sdks/python/apache_beam/io/gcp/bigquery.py

@@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
              kms_key=self.kms_key)

FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


def _to_bool(value):
  return value == 'true'


def _to_decimal(value):
  return decimal.Decimal(value)


def _to_bytes(value):
  """Converts value from str to bytes on Python 3.x. Does nothing on
  Python 2.7."""
  return value.encode('utf-8')


class _JsonToDictCoder(coders.Coder):
  """A coder for a JSON string to a Python dict."""

  def __init__(self, table_schema):
    self.fields = self._convert_to_tuple(table_schema.fields)
    self._converters = {
        'INTEGER': int,
        'INT64': int,
        'FLOAT': float,
        'BOOLEAN': _to_bool,
        'NUMERIC': _to_decimal,
        'BYTES': _to_bytes,
    }

  @classmethod
  def _convert_to_tuple(cls, table_field_schemas):
    """Recursively converts the list of TableFieldSchema instances to the
    list of tuples to prevent errors when pickling and unpickling
    TableFieldSchema instances.
    """
    if not table_field_schemas:
      return []

    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
                        x.type)
            for x in table_field_schemas]

  def decode(self, value):
    value = json.loads(value)
    return self._decode_with_schema(value, self.fields)

  def _decode_with_schema(self, value, schema_fields):
    for field in schema_fields:
      if field.name not in value:
        # The field exists in the schema, but it doesn't exist in this row.
        # It probably means its value was null, as the extract to JSON job
        # doesn't preserve null fields
        value[field.name] = None
        continue

      if field.type == 'RECORD':
        value[field.name] = self._decode_with_schema(value[field.name],
                                                     field.fields)
      else:
        try:
          converter = self._converters[field.type]
          value[field.name] = converter(value[field.name])
        except KeyError:
          # No need to do any conversion
          pass
    return value

  def is_deterministic(self):
    return True

  def to_type_hint(self):
    return dict


class _CustomBigQuerySource(BoundedSource):
  def __init__(self, gcs_location=None, table=None, dataset=None,
               project=None, query=None, validate=False, coder=None,
               use_standard_sql=False, flatten_results=True, kms_key=None):
    if table is not None and query is not None:
      raise ValueError('Both a BigQuery table and a query were specified.'
                       ' Please specify only one of these.')
    elif table is None and query is None:
      raise ValueError('A BigQuery table or a query must be specified')
    elif table is not None:
      self.table_reference = bigquery_tools.parse_table_reference(
          table, dataset, project)
      self.query = None
      self.use_legacy_sql = True
    else:
      self.query = query
      # TODO(BEAM-1082): Change the internal flag to be standard_sql
      self.use_legacy_sql = not use_standard_sql
      self.table_reference = None

    self.gcs_location = gcs_location
    self.project = project
    self.validate = validate
    self.flatten_results = flatten_results
    self.coder = coder or _JsonToDictCoder
    self.kms_key = kms_key
    self.split_result = None

  def estimate_size(self):
    bq = bigquery_tools.BigQueryWrapper()
    if self.table_reference is not None:
      table = bq.get_table(self.table_reference.projectId,
                           self.table_reference.datasetId,
                           self.table_reference.tableId)
      return int(table.numBytes)
    else:
      self._setup_temporary_dataset(bq)
      job = bq._start_query_job(self.project, self.query,
                                self.use_legacy_sql, self.flatten_results,
                                job_id=uuid.uuid4().hex, dry_run=True,
                                kms_key=self.kms_key)
      size = int(job.statistics.totalBytesProcessed)

      bq.clean_up_temporary_dataset(self.project)

      return size

  def split(self,
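To make the decoding behavior in the hunk above concrete, here is a minimal self-contained sketch of the same idea: fields missing from an exported JSON row are backfilled with None (the extract job drops null fields), and typed converters are applied where one exists. This is not the Beam implementation; `decode_row`, the sample schema, and the field names are invented for illustration.

```python
import collections
import json

# Simplified stand-in for the FieldSchema namedtuple in the hunk above.
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')

# Hypothetical two-field schema: an INTEGER column and a BOOLEAN column.
SCHEMA = [
    FieldSchema([], 'NULLABLE', 'user_id', 'INTEGER'),
    FieldSchema([], 'NULLABLE', 'active', 'BOOLEAN'),
]

# BigQuery JSON exports carry integers and booleans as strings.
CONVERTERS = {'INTEGER': int, 'BOOLEAN': lambda v: v == 'true'}


def decode_row(json_line, schema_fields):
  """Decodes one line of a JSON export against a schema.

  Fields present in the schema but absent from the row are backfilled
  with None, because the extract-to-JSON job drops null fields.
  """
  row = json.loads(json_line)
  for field in schema_fields:
    if field.name not in row:
      row[field.name] = None  # null field dropped by the export job
    elif field.type in CONVERTERS:
      row[field.name] = CONVERTERS[field.type](row[field.name])
  return row


# 'active' is absent from the row: the export job dropped the null value.
print(decode_row('{"user_id": "42"}', SCHEMA))
# → {'user_id': 42, 'active': None}
```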
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355189&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355189 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 06/Dec/19 13:53
Start Date: 06/Dec/19 13:53
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354842168

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, down to the `bq._start_query_job(...)` call in `estimate_size`)

Review comment: This one does not block, because dry_run is `True`. When dry_run is true, the job is not actually run. Instead, we get some processing statistics (one of them is the number of bytes read by the query).

> Is estimate_size() called during pipeline construction? Is it guaranteed to be called only (exactly) once?

I think a runner is
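The dry-run contract kamilwu describes can be sketched with a toy model: with `dry_run=True` the service only validates and costs the query, returning statistics such as `totalBytesProcessed`, and nothing executes. The client below is a fake written for this sketch; `FakeBigQueryWrapper`, `start_query_job`, and the 1 MiB figure are all invented, not the real `bigquery_tools` API.

```python
import collections

# Minimal stand-ins for the job objects; purely illustrative.
JobStatistics = collections.namedtuple('JobStatistics', 'totalBytesProcessed')
Job = collections.namedtuple('Job', 'statistics executed')


class FakeBigQueryWrapper(object):
  """Pretend client: validates and costs a query without running it."""

  BYTES_PER_QUERY = 1 << 20  # pretend the planner reports 1 MiB scanned

  def start_query_job(self, query, dry_run=False):
    # With dry_run=True the service only plans the query and returns
    # statistics such as totalBytesProcessed; the job never executes.
    return Job(statistics=JobStatistics(self.BYTES_PER_QUERY),
               executed=not dry_run)


def estimate_size(client, query):
  """Mirrors the size-estimation call pattern under review."""
  job = client.start_query_job(query, dry_run=True)
  return int(job.statistics.totalBytesProcessed)


client = FakeBigQueryWrapper()
print(estimate_size(client, 'SELECT * FROM dataset.table'))  # → 1048576
```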
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=355188&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-355188 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 06/Dec/19 13:45
Start Date: 06/Dec/19 13:45
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354838895

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, truncated at `def split(self,`)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354750&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354750 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354545674

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, truncated at `def split(self,`)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354752&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354752 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354582319

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, truncated at `def split(self,`)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354751&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354751 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354581717

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, down to `size = int(job.statistics.totalBytesProcessed)` in `estimate_size`)

Review comment: Is this input or output bytes?
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354749&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354749 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354545299

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, down to the `bq._start_query_job(self.project, self.query,` call in `estimate_size`)

Review comment: Does this block? Is estimate_size() called during pipeline construction? Is it guaranteed to be called only (exactly) once?
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354748&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354748 ]

ASF GitHub Bot logged work on BEAM-1440:

Author: ASF GitHub Bot
Created on: 05/Dec/19 22:38
Start Date: 05/Dec/19 22:38
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354545010

File path: sdks/python/apache_beam/io/gcp/bigquery.py
(quotes the same @@ -499,6 +509,189 @@ hunk shown in the first worklog entry above, down to `# TODO(BEAM-1082): Change the internal flag to be standard_sql`)

Review comment: Was this copied-pasted from somewhere? If so, can we share code?
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 354748) Time Spent: 15h (was: 14h 50m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 15h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This
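The hunk above introduces `_JsonToDictCoder`, whose core idea is a map from BigQuery type names to Python converters: the JSON export job renders scalars as strings, so each field is converted back according to its schema type, and fields the export dropped are restored as `None`. A minimal standalone sketch of that idea (a hypothetical simplified helper, not the Beam coder itself; `decode_row` and its flat `{name: type}` schema argument are illustrative assumptions):

```python
import decimal
import json

# Converter map mirroring the one in _JsonToDictCoder: BigQuery's JSON
# export renders every scalar value as a string.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, schema):
    """Decode one exported JSON row using a flat {name: type_name} schema."""
    row = json.loads(json_line)
    for name, type_name in schema.items():
        if name not in row:
            # The export job drops null fields; restore them as None.
            row[name] = None
        elif type_name in _CONVERTERS:
            row[name] = _CONVERTERS[type_name](row[name])
        # Unknown types (e.g. STRING) need no conversion, matching the
        # KeyError branch in the real coder.
    return row
```

The real coder additionally recurses into `RECORD` fields with the nested schema, which this flat sketch omits.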
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354565&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354565 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 18:33
Start Date: 05/Dec/19 18:33
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562256730

Run Python 3.5 PostCommit

Issue Time Tracking
---
Worklog Id: (was: 354565)
Time Spent: 14h 50m (was: 14h 40m)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> --
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
> Issue Type: New Feature
> Components: sdk-py-core
> Reporter: Chamikara Madhusanka Jayalath
> Assignee: Kamil Wasilewski
> Priority: Major
> Time Spent: 14h 50m
> Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for Python SDK [1].
> This can only be used by Dataflow runner.
> We should implement a Beam BigQuery source that implements
> iobase.BoundedSource [2] interface so that other runners that try to use
> Python SDK can read from BigQuery as well. Java SDK already has a Beam
> BigQuery source [3].
>
> [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354539&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354539 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 18:15
Start Date: 05/Dec/19 18:15
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562249250

As a final note, please create issues to switch to Avro, and to enable tests in other runners (e.g. the direct runner).

Issue Time Tracking
---
Worklog Id: (was: 354539)
Time Spent: 14h 40m (was: 14.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354533&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354533 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 18:04
Start Date: 05/Dec/19 18:04
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354465023

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##

@@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)


+FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _JsonToDictCoder(coders.Coder):
+  """A coder for a JSON string to a Python dict."""
+
+  def __init__(self, table_schema):
+    self.fields = self._convert_to_tuple(table_schema.fields)
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  @classmethod
+  def _convert_to_tuple(cls, table_field_schemas):
+    """Recursively converts the list of TableFieldSchema instances to the
+    list of tuples to prevent errors when pickling and unpickling
+    TableFieldSchema instances.
+    """
+    if not table_field_schemas:
+      return []
+
+    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
+                        x.type)
+            for x in table_field_schemas]
+
+  def decode(self, value):
+    value = json.loads(value)
+    return self._decode_with_schema(value, self.fields)
+
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields.
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:
+          # No need to do any conversion
+          pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _BigQuerySource(BoundedSource):
+  def __init__(self, gcs_location=None, table=None, dataset=None,
+               project=None, query=None, validate=False, coder=None,
+               use_standard_sql=False, flatten_results=True, kms_key=None):
+    if table is not None and query is not None:
+      raise ValueError('Both a BigQuery table and a query were specified.'
+                       ' Please specify only one of these.')
+    elif table is None and query is None:
+      raise ValueError('A BigQuery table or a query must be specified')
+    elif table is not None:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+      self.query = None
+      self.use_legacy_sql = True
+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql
+      self.use_legacy_sql = not use_standard_sql
+      self.table_reference = None
+
+    self.gcs_location = gcs_location
+    self.project = project
+    self.validate = validate
+    self.flatten_results = flatten_results
+    self.coder = coder or _JsonToDictCoder
+    self.kms_key = kms_key
+    self.split_result = None
+
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)
+
+      bq.clean_up_temporary_dataset(self.project)
+
+      return size
+
+  def split(self, desired_bundle_size,
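The `_convert_to_tuple` classmethod in the hunk above exists because the generated `TableFieldSchema` objects cause errors when pickled, so the schema is rewritten as nested namedtuples before the source gets serialized. A standalone sketch of the same recursion; `FakeTableFieldSchema` is a hypothetical stand-in for the real generated class, used here only so the example is self-contained:

```python
import collections

# Plain, picklable representation of one schema field; nested fields are
# themselves lists of FieldSchema tuples.
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


class FakeTableFieldSchema(object):
    """Hypothetical stand-in for the generated TableFieldSchema object."""

    def __init__(self, name, field_type, mode='NULLABLE', fields=()):
        self.name = name
        self.type = field_type
        self.mode = mode
        self.fields = list(fields)


def convert_to_tuple(table_field_schemas):
    """Recursively rewrite schema objects as picklable namedtuples."""
    if not table_field_schemas:
        return []
    return [FieldSchema(convert_to_tuple(x.fields), x.mode, x.name, x.type)
            for x in table_field_schemas]
```

Because namedtuples are plain tuples under the hood, the converted schema survives pickling and round-trips between pipeline construction time and worker execution time.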
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354493&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354493 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 17:34
Start Date: 05/Dec/19 17:34
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562233145

Run Python 3.5 PostCommit

Issue Time Tracking
---
Worklog Id: (was: 354493)
Time Spent: 14h 20m (was: 14h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354443&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354443 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 16:38
Start Date: 05/Dec/19 16:38
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562209713

I've tidied up the commit history a bit. Also, I've renamed `ReadFromBigQuery` to `_ReadFromBigQuery`; I forgot to do this in an earlier commit.

Issue Time Tracking
---
Worklog Id: (was: 354443)
Time Spent: 14h 10m (was: 14h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354439&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354439 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 16:35
Start Date: 05/Dec/19 16:35
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562208319

Run Python 2 PostCommit

Issue Time Tracking
---
Worklog Id: (was: 354439)
Time Spent: 14h (was: 13h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354438&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354438 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 16:34
Start Date: 05/Dec/19 16:34
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-562208201

Run Portable_Python PreCommit

Issue Time Tracking
---
Worklog Id: (was: 354438)
Time Spent: 13h 50m (was: 13h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=354147&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-354147 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 05/Dec/19 08:57
Start Date: 05/Dec/19 08:57
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r354177031

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353687&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353687 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 04/Dec/19 19:28
Start Date: 04/Dec/19 19:28
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353239750

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353684&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353684 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 04/Dec/19 19:27
Start Date: 04/Dec/19 19:27
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353938086

## File path: sdks/python/apache_beam/io/gcp/bigquery.py ##
@@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353680=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353680 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 04/Dec/19 19:26 Start Date: 04/Dec/19 19:26 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r353937613 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -1268,3 +1461,140 @@ def display_data(self): tableSpec) res['table'] = DisplayDataItem(tableSpec, label='Table') return res + + +@experimental() +class PassThroughThenCleanup(PTransform): + """A PTransform that invokes a DoFn after the input PCollection has been +processed. + """ + def __init__(self, cleanup_dofn): +self.cleanup_dofn = cleanup_dofn + + def expand(self, input): +class PassThrough(beam.DoFn): + def process(self, element): +yield element + +output = input | beam.ParDo(PassThrough()).with_outputs('cleanup_signal', +main='main') +main_output = output['main'] +cleanup_signal = output['cleanup_signal'] + +_ = (input.pipeline + | beam.Create([None]) + | beam.ParDo(self.cleanup_dofn, beam.pvalue.AsSingleton( + cleanup_signal))) + +return main_output + + +@experimental() +class ReadFromBigQuery(PTransform): Review comment: Yes, that will be great to compare the Native Dataflow source with the new Beam source. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 353680) Time Spent: 13h 10m (was: 13h)
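The control flow that the `PassThroughThenCleanup` transform quoted above sets up — elements pass through untouched, and a cleanup action fires once, only after the data has been seen — can be modeled without Beam as a generator. The function name here is illustrative, not a Beam API; the real transform achieves the ordering through an empty side output consumed as a side input.

```python
def pass_through_then_cleanup(elements, cleanup):
    """Yield every element unchanged, then invoke cleanup exactly once.

    In the real PTransform the 'cleanup signal' is an empty side output
    of the pass-through ParDo, used as a side input so the runner orders
    cleanup after processing; a plain generator models that ordering.
    """
    for element in elements:
        yield element
    cleanup()  # runs only after all elements have been consumed


events = []
out = list(pass_through_then_cleanup([1, 2, 3], lambda: events.append('cleanup')))
assert out == [1, 2, 3]
assert events == ['cleanup']
```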
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353500&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353500 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 04/Dec/19 14:50
Start Date: 04/Dec/19 14:50
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-561679198

Run Python PreCommit

Issue Time Tracking --- Worklog Id: (was: 353500) Time Spent: 13h (was: 12h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=353297&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-353297 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 04/Dec/19 08:42
Start Date: 04/Dec/19 08:42
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-561536467

Run Python PreCommit

Issue Time Tracking --- Worklog Id: (was: 353297) Time Spent: 12h 50m (was: 12h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352698&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352698 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 15:22
Start Date: 03/Dec/19 15:22
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-561216098

Run Python 2 PostCommit

Issue Time Tracking --- Worklog Id: (was: 352698) Time Spent: 12h 40m (was: 12.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352696&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352696 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 15:18
Start Date: 03/Dec/19 15:18
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353239750

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)
+
+
+FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _JsonToDictCoder(coders.Coder):
+  """A coder for a JSON string to a Python dict."""
+
+  def __init__(self, table_schema):
+    self.fields = self._convert_to_tuple(table_schema.fields)
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  @classmethod
+  def _convert_to_tuple(cls, table_field_schemas):
+    """Recursively converts the list of TableFieldSchema instances to the
+    list of tuples to prevent errors when pickling and unpickling
+    TableFieldSchema instances.
+    """
+    if not table_field_schemas:
+      return []
+
+    return [FieldSchema(cls._convert_to_tuple(x.fields), x.mode, x.name,
+                        x.type)
+            for x in table_field_schemas]
+
+  def decode(self, value):
+    value = json.loads(value)
+    return self._decode_with_schema(value, self.fields)
+
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:
+          # No need to do any conversion
+          pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _BigQuerySource(BoundedSource):
+  def __init__(self, gcs_location=None, table=None, dataset=None,
+               project=None, query=None, validate=False, coder=None,
+               use_standard_sql=False, flatten_results=True, kms_key=None):
+    if table is not None and query is not None:
+      raise ValueError('Both a BigQuery table and a query were specified.'
+                       ' Please specify only one of these.')
+    elif table is None and query is None:
+      raise ValueError('A BigQuery table or a query must be specified')
+    elif table is not None:
+      self.table_reference = bigquery_tools.parse_table_reference(
+          table, dataset, project)
+      self.query = None
+      self.use_legacy_sql = True
+    else:
+      self.query = query
+      # TODO(BEAM-1082): Change the internal flag to be standard_sql
+      self.use_legacy_sql = not use_standard_sql
+      self.table_reference = None
+
+    self.gcs_location = gcs_location
+    self.project = project
+    self.validate = validate
+    self.flatten_results = flatten_results
+    self.coder = coder or _JsonToDictCoder
+    self.kms_key = kms_key
+    self.split_result = None
+
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)
+
+      bq.clean_up_temporary_dataset(self.project)
+
+      return size
+
+  def split(self, desired_bundle_size,
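The decoding rules in the quoted `_JsonToDictCoder` can be condensed into a self-contained sketch: per-type converters, pass-through for unknown types (mirroring the `KeyError` fallback), and null-filling for fields the extract-to-JSON job dropped. The simplified `FieldSchema` below carries only a name and type and is an illustration, not the PR's actual namedtuple.

```python
import decimal
import json
from collections import namedtuple

# Simplified schema record for illustration (the PR's version also
# carries 'fields' and 'mode' for nested RECORDs).
FieldSchema = namedtuple('FieldSchema', 'name type')

CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, fields):
    row = json.loads(json_line)
    for field in fields:
        if field.name not in row:
            # The extract-to-JSON job drops null fields; restore as None.
            row[field.name] = None
            continue
        # Unknown types (e.g. STRING) are left as-is, mirroring the
        # KeyError fallback in _JsonToDictCoder.
        converter = CONVERTERS.get(field.type, lambda v: v)
        row[field.name] = converter(row[field.name])
    return row


fields = [FieldSchema('id', 'INTEGER'), FieldSchema('ok', 'BOOLEAN'),
          FieldSchema('note', 'STRING')]
row = decode_row('{"id": "7", "ok": "true"}', fields)
assert row == {'id': 7, 'ok': True, 'note': None}
```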
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352667&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352667 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 14:42
Start Date: 03/Dec/19 14:42
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353217656

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
[...]
+  def _decode_with_schema(self, value, schema_fields):
+    for field in schema_fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields
+        value[field.name] = None
+        continue
+
+      if field.type == 'RECORD':
+        value[field.name] = self._decode_with_schema(value[field.name],
+                                                     field.fields)
+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:

Review comment: Yes. I wanted to implement the same behavior as we already have when using default coder for the native BigQuerySource. There is a [test](https://github.com/apache/beam/blob/03f780c7329e0eca692baef44874056b7d263303/sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py#L232) that checks BigQuery data types conversions across both sources.

Issue Time Tracking --- Worklog Id: (was: 352667) Time Spent: 12h 20m (was: 12h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352643&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352643 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 14:15
Start Date: 03/Dec/19 14:15
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353202013

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -700,30 +675,19 @@ def _export_files(self, bq):
       bigquery.TableSchema instance, a list of FileMetadata instances
     """
     job_id = uuid.uuid4().hex
-    destination = self._get_destination_uri(self.gcs_bucket_name, job_id)
-    job_ref = bq.perform_extract_job([destination], job_id,
+    job_ref = bq.perform_extract_job([self.gcs_location], job_id,
                                      self.table_reference,
                                      bigquery_tools.ExportFileFormat.JSON,
                                      include_header=False)
     bq.wait_for_bq_job(job_ref)
-    metadata_list = FileSystems.match([destination])[0].metadata_list
+    metadata_list = FileSystems.match([self.gcs_location])[0].metadata_list

Review comment: Yes. The thing I used is called `Single wildcard URI` [1]. In this case, an extract job creates one or many files and all of them are created in the same directory.
[1] https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
Issue Time Tracking --- Worklog Id: (was: 352643) Time Spent: 12h (was: 11h 50m)
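The "single wildcard URI" mechanism the comment describes can be sketched as plain string formatting: one extract destination containing a single `*`, which BigQuery expands into as many files as the export needs, all in one directory. The helper name and path layout below are illustrative assumptions, not the PR's actual code.

```python
import uuid


def export_destination_uri(gcs_base, job_id=None):
    """Build an extract-job destination containing a single '*' wildcard.

    BigQuery expands the wildcard so one extract job can fan out into
    many files, all landing in the same directory; FileSystems.match on
    the same pattern then finds every produced file.
    """
    job_id = job_id or uuid.uuid4().hex
    return '{}/{}/bigquery-table-dump-*.json'.format(gcs_base.rstrip('/'), job_id)


uri = export_destination_uri('gs://my-bucket/tmp', job_id='abc123')
assert uri == 'gs://my-bucket/tmp/abc123/bigquery-table-dump-*.json'
assert uri.count('*') == 1  # BigQuery allows at most one wildcard per URI
```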
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352644&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352644 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 14:15
Start Date: 03/Dec/19 14:15
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353202269

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
[...]
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _BigQuerySource(BoundedSource):

Review comment: +1

Issue Time Tracking --- Worklog Id: (was: 352644) Time Spent: 12h 10m (was: 12h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352623&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352623 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 13:50
Start Date: 03/Dec/19 13:50
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353188057

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1268,3 +1461,140 @@ def display_data(self):
                                                  tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+@experimental()
+class PassThroughThenCleanup(PTransform):

Review comment: I don't think it needs to be public. I'll make it private then.

Issue Time Tracking --- Worklog Id: (was: 352623) Time Spent: 11h 50m (was: 11h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352622 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 13:49
Start Date: 03/Dec/19 13:49
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353187727

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -1268,3 +1461,140 @@ def display_data(self):
[...]
+@experimental()
+class ReadFromBigQuery(PTransform):

Review comment: I'm totally fine with it. Once this PR is merged, I'm going to make changes to the Chicago Taxi Example, so that the example would use this transform. That would be a great opportunity to check stability and measure performance of the transform.

Issue Time Tracking --- Worklog Id: (was: 352622) Time Spent: 11h 40m (was: 11.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=352611&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-352611 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 03/Dec/19 13:31
Start Date: 03/Dec/19 13:31
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r353178188

## File path: sdks/python/apache_beam/io/gcp/bigquery.py
## @@ -499,6 +509,189 @@ def reader(self, test_bigquery_client=None):
[...]
+  def estimate_size(self):
+    bq = bigquery_tools.BigQueryWrapper()
+    if self.table_reference is not None:
+      table = bq.get_table(self.table_reference.projectId,
+                           self.table_reference.datasetId,
+                           self.table_reference.tableId)
+      return int(table.numBytes)
+    else:
+      self._setup_temporary_dataset(bq)
+      job = bq._start_query_job(self.project, self.query,
+                                self.use_legacy_sql, self.flatten_results,
+                                job_id=uuid.uuid4().hex, dry_run=True,
+                                kms_key=self.kms_key)
+      size = int(job.statistics.totalBytesProcessed)
+
+      bq.clean_up_temporary_dataset(self.project)
+
+      return size
+
+  def split(self, desired_bundle_size,
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=350123&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-350123 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 27/Nov/19 00:44 Start Date: 27/Nov/19 00:44 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r351048663 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=350119&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-350119 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 27/Nov/19 00:33 Start Date: 27/Nov/19 00:33 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r351045732 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=350120&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-350120 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 27/Nov/19 00:33 Start Date: 27/Nov/19 00:33 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r351046099 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=350104&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-350104 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 23:24 Start Date: 26/Nov/19 23:24 Worklog Time Spent: 10m Work Description: ananvay commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#issuecomment-558860259 @robertwb /cc: @ananvay This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 350104) Time Spent: 11h (was: 10h 50m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 11h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349937&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349937 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r350905378 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@

+class _BigQuerySource(BoundedSource):

Review comment: Please rename it _CustomBigQuerySource - as we already have a BigQuerySource. To avoid confusion.

Issue Time Tracking --- Worklog Id: (was: 349937) Time Spent: 10h 40m (was: 10.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349938&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349938 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r350905744 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -1268,3 +1461,140 @@ def display_data(self):
                                  tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+@experimental()
+class PassThroughThenCleanup(PTransform):

Review comment: Does this class need to be public? If it should be public, it should be in a different file. If not, let's make it private and only use it in this file.

Issue Time Tracking --- Worklog Id: (was: 349938) Time Spent: 10h 40m (was: 10.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349936&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349936 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r350918239 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -499,6 +509,189 @@

+      else:
+        try:
+          converter = self._converters[field.type]
+          value[field.name] = converter(value[field.name])
+        except KeyError:

Review comment: Does this mean that for other data types, we pass them as they are? e.g. for datetime data, or other like that?

Issue Time Tracking --- Worklog Id: (was: 349936) Time Spent: 10h 40m (was: 10.5h)
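On the question in the review comment above: with the converter lookup guarded by `except KeyError`, any type absent from the map (STRING, TIMESTAMP, DATE, and so on) is left as the raw string produced by the JSON export. A minimal standalone illustration of that pattern (the converter table here is abbreviated and illustrative, not the patch's full table):

```python
converters = {'INTEGER': int, 'FLOAT': float}

def convert(field_type, value):
    # Mirrors the quoted pattern: unknown types fall through unconverted.
    try:
        return converters[field_type](value)
    except KeyError:
        return value

print(convert('INTEGER', '42'))                          # -> 42
print(convert('TIMESTAMP', '2019-11-26 18:45:00 UTC'))   # -> unchanged string
```

Whether passing timestamps through as strings is acceptable is exactly the design question the reviewer raises.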
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349940&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349940 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r346020240 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -700,30 +675,19 @@ def _export_files(self, bq):
       bigquery.TableSchema instance, a list of FileMetadata instances
     """
     job_id = uuid.uuid4().hex
-    destination = self._get_destination_uri(self.gcs_bucket_name, job_id)
-    job_ref = bq.perform_extract_job([destination], job_id,
+    job_ref = bq.perform_extract_job([self.gcs_location], job_id,
                                      self.table_reference,
                                      bigquery_tools.ExportFileFormat.JSON,
                                      include_header=False)
     bq.wait_for_bq_job(job_ref)
-    metadata_list = FileSystems.match([destination])[0].metadata_list
+    metadata_list = FileSystems.match([self.gcs_location])[0].metadata_list

Review comment: Is this enough to match the files in that location?

Issue Time Tracking --- Worklog Id: (was: 349940) Time Spent: 10h 50m (was: 10h 40m)
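Context for the matching question above: a BigQuery extract job can shard its output across many files, so the location handed to the matcher generally needs to be a glob pattern (e.g. ending in `*`) rather than a single exact URI. A small stdlib sketch of that distinction (the file names and bucket are hypothetical, and `fnmatch` stands in for the filesystem matcher):

```python
import fnmatch

# Hypothetical sharded output of one extract job.
exported = [
    'gs://bucket/tmp/job123/000000000000.json',
    'gs://bucket/tmp/job123/000000000001.json',
]

exact_uri = 'gs://bucket/tmp/job123/000000000000.json'
glob_uri = 'gs://bucket/tmp/job123/*.json'

# An exact URI matches only one shard; a glob matches all of them.
print([f for f in exported if fnmatch.fnmatch(f, exact_uri)])
print([f for f in exported if fnmatch.fnmatch(f, glob_uri)])
```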
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349939&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349939 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 18:45 Start Date: 26/Nov/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r350906429 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -1268,3 +1461,140 @@ def display_data(self):
                                  tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+@experimental()
+class PassThroughThenCleanup(PTransform):
+  """A PTransform that invokes a DoFn after the input PCollection has been
+  processed.
+  """
+  def __init__(self, cleanup_dofn):
+    self.cleanup_dofn = cleanup_dofn
+
+  def expand(self, input):
+    class PassThrough(beam.DoFn):
+      def process(self, element):
+        yield element
+
+    output = input | beam.ParDo(PassThrough()).with_outputs('cleanup_signal',
+                                                            main='main')
+    main_output = output['main']
+    cleanup_signal = output['cleanup_signal']
+
+    _ = (input.pipeline
+         | beam.Create([None])
+         | beam.ParDo(self.cleanup_dofn, beam.pvalue.AsSingleton(
+             cleanup_signal)))
+
+    return main_output
+
+
+@experimental()
+class ReadFromBigQuery(PTransform):

Review comment: This is looking great. I've discussed with Cham, and let's rename this as `_ReadFromBigQuery` (with underscore) to prevent users from picking it up before we have tested it. We have some tests that we'll run on it, and once we're confident of performance/functionality, we can remove the underscore, and rename to `ReadFromBigQuery`. WDYT?

Issue Time Tracking --- Worklog Id: (was: 349939) Time Spent: 10h 50m (was: 10h 40m)
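The `PassThroughThenCleanup` transform quoted above implements a common pattern: forward every element unchanged, then run a cleanup action (here, deleting temporary export files) only after the whole collection has been consumed. Outside Beam, the same idea can be sketched with a plain generator; this is illustrative only, since the real transform uses an empty side input as the completion signal rather than iterator exhaustion:

```python
def pass_through_then_cleanup(elements, cleanup):
    """Yield elements unchanged; invoke cleanup() after the last one."""
    for element in elements:
        yield element
    cleanup()  # runs only once iteration is exhausted

deleted = []
out = list(pass_through_then_cleanup([1, 2, 3],
                                     lambda: deleted.append('temp files')))
print(out, deleted)
```

The key property in both versions is ordering: cleanup must not run while downstream consumers may still need the data.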
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=349897&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-349897 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 26/Nov/19 17:27 Start Date: 26/Nov/19 17:27 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#issuecomment-558735446 Looking once more. Issue Time Tracking --- Worklog Id: (was: 349897) Time Spent: 10.5h (was: 10h 20m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=348410=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-348410 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 22/Nov/19 22:51
Start Date: 22/Nov/19 22:51
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-557723473
Run Python PreCommit
Issue Time Tracking --- Worklog Id: (was: 348410) Time Spent: 10h 20m (was: 10h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=348409=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-348409 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 22/Nov/19 22:51
Start Date: 22/Nov/19 22:51
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-557723317
Run Python PreCommit
Issue Time Tracking --- Worklog Id: (was: 348409) Time Spent: 10h 10m (was: 10h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=348408=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-348408 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 22/Nov/19 22:50
Start Date: 22/Nov/19 22:50
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-557723317
Run Python PreCommit
Issue Time Tracking --- Worklog Id: (was: 348408) Time Spent: 10h (was: 9h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=347572=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-347572 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 21/Nov/19 18:15
Start Date: 21/Nov/19 18:15
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-557208165
Resolved merge conflicts. @pabloem Could you take a look once again?
Issue Time Tracking --- Worklog Id: (was: 347572) Time Spent: 9h 50m (was: 9h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=341473=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-341473 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 11/Nov/19 21:03
Start Date: 11/Nov/19 21:03
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-552614792
I had been traveling. I'll take a look now.
Issue Time Tracking --- Worklog Id: (was: 341473) Time Spent: 9h 40m (was: 9.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339983=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339983 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 07/Nov/19 15:00
Start Date: 07/Nov/19 15:00
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r343697148

File path: sdks/python/apache_beam/io/gcp/bigquery.py (@@ -496,6 +506,189 @@), code under review:

```python
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


def _to_bool(value):
  return value == 'true'


def _to_decimal(value):
  return decimal.Decimal(value)


def _to_bytes(value):
  """Converts value from str to bytes on Python 3.x. Does nothing on
  Python 2.7."""
  return value.encode('utf-8')


class _JsonToDictCoder(coders.Coder):
  """A coder for a JSON string to a Python dict."""

  def __init__(self, table_schema):
    self.fields = self._convert_to_tuple(table_schema.fields)
    self._converters = {
        'INTEGER': int,
        'INT64': int,
        'FLOAT': float,
        'BOOLEAN': _to_bool,
        'NUMERIC': _to_decimal,
        'BYTES': _to_bytes,
    }

  @classmethod
  def _convert_to_tuple(cls, table_field_schemas):
    """Recursively converts the list of TableFieldSchema instances to the
    list of tuples to prevent errors when pickling and unpickling"""
```

Review comment: This error was quite interesting. It seems that it's impossible to serialize and deserialize nested `bigquery.TableFieldSchema` instances:

```python
from apache_beam.internal import pickler
from apache_beam.io.gcp.internal.clients import bigquery

obj = bigquery.TableFieldSchema(fields=[bigquery.TableFieldSchema()])
pickler.loads(pickler.dumps(obj))
```

This snippet triggers the following exception: `AttributeError: 'FieldList' object has no attribute '_FieldList__field'`. My workaround was to rewrite every TableFieldSchema instance as an equivalent tuple, which can be serialized and deserialized without problems.

Issue Time Tracking --- Worklog Id: (was: 339983) Time Spent: 9.5h (was: 9h 20m)
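The tuple-rewriting workaround described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: the `TableFieldSchema` class below is a hypothetical stand-in for the proto-backed `bigquery.TableFieldSchema`, since the point is only the recursive conversion to picklable namedtuples.

```python
import collections
import pickle

# Namedtuples pickle cleanly, unlike the proto-backed schema objects
# whose FieldList attribute breaks serialization.
FieldSchema = collections.namedtuple('FieldSchema', 'fields mode name type')


class TableFieldSchema(object):
  """Hypothetical stand-in for bigquery.TableFieldSchema."""

  def __init__(self, fields=(), mode='NULLABLE', name='', type='STRING'):
    self.fields = list(fields)
    self.mode = mode
    self.name = name
    self.type = type


def convert_to_tuple(table_field_schemas):
  """Recursively rewrite schema objects as plain namedtuples."""
  return [
      FieldSchema(convert_to_tuple(f.fields), f.mode, f.name, f.type)
      for f in table_field_schemas
  ]


# A nested (RECORD) schema: the case that fails to pickle in its raw form.
nested = TableFieldSchema(
    name='record', type='RECORD',
    fields=[TableFieldSchema(name='n', type='INTEGER')])
fields = convert_to_tuple([nested])
roundtripped = pickle.loads(pickle.dumps(fields))  # no AttributeError
```

The converted structure survives a pickle round trip with full fidelity, including nesting, which is all the coder needs when it is shipped to workers.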
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339797=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339797 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 07/Nov/19 08:17
Start Date: 07/Nov/19 08:17
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550972444
Run Java PreCommit
Issue Time Tracking --- Worklog Id: (was: 339797) Time Spent: 9h 20m (was: 9h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339446=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339446 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 16:10
Start Date: 06/Nov/19 16:10
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550380662
Run Java PreCommit
Issue Time Tracking --- Worklog Id: (was: 339446) Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339447=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339447 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 16:10
Start Date: 06/Nov/19 16:10
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550380716
Run Python 2 PostCommit
Issue Time Tracking --- Worklog Id: (was: 339447) Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339391=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339391 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 14:48
Start Date: 06/Nov/19 14:48
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550342694
Run Python 2 PostCommit
Issue Time Tracking --- Worklog Id: (was: 339391) Time Spent: 8h 50m (was: 8h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339328=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339328 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 12:45
Start Date: 06/Nov/19 12:45
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550293536
@pabloem Could you take a look once again? In addition to what you had suggested, I've added the functionality of removing the JSON file after it has been read.
Issue Time Tracking --- Worklog Id: (was: 339328) Time Spent: 8.5h (was: 8h 20m)
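The read-then-delete behavior mentioned above can be sketched, outside of Beam, roughly as follows. All names here are illustrative, not the PR's actual implementation; the point is only that the temporary export file is removed once its rows are in memory.

```python
import json
import os
import tempfile


def read_rows_and_remove(path):
  """Read all rows from an exported newline-delimited JSON file,
  then delete the file so temporary export results don't pile up."""
  with open(path) as f:
    rows = [json.loads(line) for line in f]
  os.remove(path)  # clean up once the contents are safely in memory
  return rows


# Illustrative usage: a throwaway file standing in for a BigQuery export.
fd, path = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'w') as f:
  f.write('{"number": 1}\n{"number": 2}\n')
rows = read_rows_and_remove(path)
```

After the call, `rows` holds the decoded records and the file no longer exists.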
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339329=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339329 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 12:45
Start Date: 06/Nov/19 12:45
Worklog Time Spent: 10m
Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-550293679
Run Python 3.7 PostCommit
Issue Time Tracking --- Worklog Id: (was: 339329) Time Spent: 8h 40m (was: 8.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=339324=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339324 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 06/Nov/19 12:41
Start Date: 06/Nov/19 12:41
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r343072498

File path: sdks/python/apache_beam/io/gcp/bigquery.py (@@ -496,6 +505,233 @@), code under review:

```python
SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')


def _to_bool(value):
  return value == 'true'


def _to_decimal(value):
  return decimal.Decimal(value)


def _to_bytes(value):
  """Converts value from str to bytes on Python 3.x. Does nothing on
  Python 2.7."""
  return value.encode('utf-8')


class _BigQueryRowCoder(coders.Coder):
```

Review comment: I looked at this and, sadly, it doesn't work in practice. For each field, `TableRowJsonCoder` returns a `JsonValue` instance, which doesn't contain the name of the given field. This makes it quite hard to find the appropriate field type and match a conversion function. That's why I decided to stay with my own coder.

Issue Time Tracking --- Worklog Id: (was: 339324) Time Spent: 8h 20m (was: 8h 10m)
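What such a hand-rolled row coder boils down to can be sketched like this. The names and the standalone `decode_row` helper are illustrative, not the PR's actual `_JsonToDictCoder` code, and it assumes, as the converter table in the diff suggests, that exported scalar values arrive as JSON strings:

```python
import decimal
import json

# Per-type conversion table, mirroring the one shown in the diff above.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda value: value == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, schema):
  """Decode one exported JSON row. `schema` is a list of (name, type)
  pairs; fields with unknown types are passed through unchanged."""
  raw = json.loads(json_line)
  return {
      name: _CONVERTERS.get(field_type, lambda value: value)(raw[name])
      for name, field_type in schema
  }


row = decode_row(
    '{"number": "3", "ratio": "0.5", "ok": "true"}',
    [('number', 'INTEGER'), ('ratio', 'FLOAT'), ('ok', 'BOOLEAN')])
```

Because the schema carries the field names alongside the types, matching each value to its converter is straightforward here, which is exactly what the name-less `JsonValue` instances from `TableRowJsonCoder` made hard.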
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=336716=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-336716 ]
ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 31/Oct/19 10:35
Start Date: 31/Oct/19 10:35
Worklog Time Spent: 10m
Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r341062618

File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py, diff under review:

```diff
@@ -86,23 +117,61 @@ def create_table(self, tablename):
     table_schema.fields.append(table_field)
     table = bigquery.Table(
         tableReference=bigquery.TableReference(
-            projectId=self.project,
-            datasetId=self.dataset_id,
-            tableId=tablename),
+            projectId=cls.project,
+            datasetId=cls.dataset_id,
+            tableId=table_name),
         schema=table_schema)
     request = bigquery.BigqueryTablesInsertRequest(
-        projectId=self.project, datasetId=self.dataset_id, table=table)
-    self.bigquery_client.client.tables.Insert(request)
+        projectId=cls.project, datasetId=cls.dataset_id, table=table)
+    cls.bigquery_client.client.tables.Insert(request)
     table_data = [
         {'number': 1, 'str': 'abc'},
         {'number': 2, 'str': 'def'},
         {'number': 3, 'str': u'你好'},
         {'number': 4, 'str': u'привет'}
     ]
-    self.bigquery_client.insert_rows(
-        self.project, self.dataset_id, tablename, table_data)
+    cls.bigquery_client.insert_rows(
+        cls.project, cls.dataset_id, table_name, table_data)

-  def create_table_new_types(self, table_name):
+  def get_expected_data(self):
+    return [
+        {'number': 1, 'str': 'abc'},
+        {'number': 2, 'str': 'def'},
+        {'number': 3, 'str': u'你好'},
+        {'number': 4, 'str': u'привет'}
+    ]
+
+  @skip(['PortableRunner', 'FlinkRunner'])
+  @attr('IT')
+  def test_native_source(self):
+    with beam.Pipeline(argv=self.args) as p:
+      result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
+          query=self.query, use_standard_sql=True)))
+      assert_that(result, equal_to(self.get_expected_data()))
+
+  @skip(['DirectRunner', 'TestDirectRunner'])
+  @attr('IT')
+  def test_iobase_source(self):
+    with beam.Pipeline(argv=self.args) as p:
+      result = (p | 'read' >> beam.io.ReadFromBigQuery(
+          query=self.query, use_standard_sql=True, project=self.project,
+          gcs_bucket_name='gs://temp-storage-for-end-to-end-tests'))
```

Review comment (on the `gcs_bucket_name` line): Done: https://issues.apache.org/jira/browse/BEAM-8528

Issue Time Tracking --- Worklog Id: (was: 336716) Time Spent: 8h 10m (was: 8h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335807&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335807 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 22:55
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-547665389

thanks!

Worklog Id: (was: 335807)
Time Spent: 8h (was: 7h 50m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335664&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335664 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 18:22
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r340250638

File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py

Review comment: I see. Can you file a bug for this?

Worklog Id: (was: 335664)
Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335564&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335564 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 15:25
Worklog Time Spent: 10m

Work Description: kamilwu commented on issue #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#issuecomment-547477815

@pabloem Thanks for your review. I've pushed the first batch of fixes. I'll keep you posted on further progress.

Worklog Id: (was: 335564)
Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335559&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335559 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 15:18
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r340142519

File path: sdks/python/apache_beam/io/gcp/bigquery.py

@@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)

+SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _BigQueryRowCoder(coders.Coder):

Review comment: Looks interesting. If I combine this coder with this function [1], maybe that would be the solution. I'll investigate this further; first I need to solve the problem that `bigquery.TableSchema` is unpicklable.

[1] https://github.com/apache/beam/blob/03f780c7329e0eca692baef44874056b7d263303/sdks/python/apache_beam/io/gcp/bigquery_tools.py#L892-L920

Worklog Id: (was: 335559)
Time Spent: 7.5h (was: 7h 20m)
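The pickling workaround mentioned in this comment can be sketched as follows. This is a rough illustration under stated assumptions, not the PR's actual code: it assumes the schema object exposes a `.fields` list whose items carry `fields`, `mode`, `name`, and `type` attributes, as `bigquery.TableSchema` does, and the helper name `to_picklable_fields` is invented here.

```python
import collections

# Picklable stand-in for the fields of a bigquery.TableSchema, mirroring the
# SchemaFields namedtuple that the PR introduces.
SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')


def to_picklable_fields(table_schema):
    """Translate an unpicklable schema object into plain namedtuples.

    Assumes `table_schema` exposes a `.fields` list whose items have
    `fields`, `mode`, `name` and `type` attributes.
    """
    return [SchemaFields(f.fields, f.mode, f.name, f.type)
            for f in table_schema.fields]
```

Because namedtuples defined at module level pickle cleanly, the converted field list can travel to workers even when the original schema message cannot.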
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335522&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335522 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 13:46
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r339969300

File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py

@@ -695,10 +769,12 @@ def get_or_create_table(
   def run_query(self, project_id, query, use_legacy_sql, flatten_results,
                 dry_run=False):
-    job_id, location = self._start_query_job(project_id, query,
-                                             use_legacy_sql, flatten_results,
-                                             job_id=uuid.uuid4().hex,
-                                             dry_run=dry_run)
+    job = self._start_query_job(project_id, query, use_legacy_sql,

Review comment: Yes. It returns the whole job object because I needed its `statistics` property. See usage: https://github.com/apache/beam/blob/03f780c7329e0eca692baef44874056b7d263303/sdks/python/apache_beam/io/gcp/bigquery.py#L655-L658

Worklog Id: (was: 335522)
Time Spent: 7h 20m (was: 7h 10m)
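The refactoring discussed in this comment can be illustrated with a small, hypothetical sketch. The `QueryJob` class and its fields are invented here for illustration; the real code deals with BigQuery API job messages.

```python
class QueryJob:
    """Hypothetical stand-in for the job message the BigQuery API returns."""

    def __init__(self, job_id, location, statistics):
        self.job_id = job_id
        self.location = location
        self.statistics = statistics


def start_query_job():
    # The real _start_query_job issues a request to the BigQuery jobs API;
    # here we fabricate a job so the calling pattern can be shown.
    return QueryJob('job-123', 'US', {'totalBytesProcessed': 1024})


def run_query():
    # Before the change, only (job_id, location) came back from the helper;
    # returning the whole object makes job.statistics available as well.
    job = start_query_job()
    return job.job_id, job.statistics['totalBytesProcessed']
```

Returning the richer object avoids widening the helper's return tuple every time a caller needs one more job property.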
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335514&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335514 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 13:12
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r340063953

File path: sdks/python/apache_beam/io/gcp/bigquery.py

@@ -1265,3 +1501,14 @@ def display_data(self):
         tableSpec)
     res['table'] = DisplayDataItem(tableSpec, label='Table')
     return res
+
+
+@experimental()
+class ReadFromBigQuery(PTransform):
+  def __init__(self, *args, **kwargs):

Review comment: +1

Worklog Id: (was: 335514)
Time Spent: 7h 10m (was: 7h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335513&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335513 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 13:11
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r340063754

File path: sdks/python/apache_beam/io/gcp/bigquery.py

@@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None):
         kms_key=self.kms_key)

+SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _BigQueryRowCoder(coders.Coder):
+  """A coder for a table row (represented as a dict) from a JSON string which
+  applies additional conversions.
+  """
+
+  def __init__(self, table_schema):
+    # bigquery.TableSchema type is unpicklable so we must translate it to a
+    # picklable type
+    self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type)
+                   for x in table_schema.fields]
+    self._converters = {
+        'INTEGER': int,
+        'INT64': int,
+        'FLOAT': float,
+        'BOOLEAN': _to_bool,
+        'NUMERIC': _to_decimal,
+        'BYTES': _to_bytes,
+    }
+
+  def decode(self, value):
+    value = json.loads(value)
+    for field in self.fields:
+      if field.name not in value:
+        # The field exists in the schema, but it doesn't exist in this row.
+        # It probably means its value was null, as the extract to JSON job
+        # doesn't preserve null fields
+        value[field.name] = None
+        continue
+
+      try:
+        converter = self._converters[field.type]
+        value[field.name] = converter(value[field.name])
+      except KeyError:
+        # No need to do any conversion
+        pass
+    return value
+
+  def is_deterministic(self):
+    return True
+
+  def to_type_hint(self):
+    return dict
+
+
+class _BigQuerySource(BoundedSource):
+  """Read data from BigQuery.
+
+  This source uses a BigQuery export job to take a snapshot of the table
+  on GCS, and then reads from each produced JSON file.
+
+  Do note that currently this source does not work with DirectRunner.
+
+  Args:
+    table (str, callable, ValueProvider): The ID of the table, or a callable
+      that returns it. The ID must contain only letters ``a-z``, ``A-Z``,
+      numbers ``0-9``, or underscores ``_``. If dataset argument is
+      :data:`None` then the table argument must contain the entire table
+      reference specified as: ``'DATASET.TABLE'``
+      or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one
+      argument representing an element to be written to BigQuery, and return
+      a TableReference, or a string table name as specified above.
+    dataset (str): The ID of the dataset containing this table or
+      :data:`None` if the table reference is specified entirely by the table
+      argument.
+    project (str): The ID of the project containing this table.
+    query (str): A query to be used instead of arguments table, dataset, and
+      project.
+    validate (bool): If :data:`True`, various checks will be done when source
+      gets initialized (e.g., is table present?). This should be
+      :data:`True` for most scenarios in order to catch errors as early as
+      possible (pipeline construction instead of pipeline execution). It
+      should be :data:`False` if the table is created during pipeline
+      execution by a previous step.
+    coder (~apache_beam.coders.coders.Coder): The coder for the table
+      rows. If :data:`None`, then the default coder is
+      :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`,
+      which will interpret every line in a file as a JSON serialized
+      dictionary. This argument needs a value only in special cases when
+      returning table rows as dictionaries is not desirable.
+    use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL
+      dialect for this query. The default value is :data:`False`.
+      If set to :data:`True`, the query will use BigQuery's updated SQL
+      dialect with improved standards compliance.
+      This parameter is ignored for table inputs.
+    flatten_results (bool): Flattens all nested and repeated fields in the
+      query results. The default value is :data:`True`.
+    kms_key (str): Experimental. Optional Cloud KMS key name for use when
+      creating new tables.
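The decoding behaviour quoted above can be distilled into a rough, self-contained sketch. The function name `decode_row` and the `(name, type)` field representation are simplifications introduced here, not the PR's API: a JSON line is parsed, typed values are converted, and schema fields absent from the row are restored as None, since the export job drops null fields.

```python
import decimal
import json

# Subset of the converters used by the coder above; BYTES handling is
# Python-version dependent and omitted in this sketch.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}


def decode_row(json_line, fields):
    """Decode one exported JSON line; `fields` is a list of (name, type)."""
    row = json.loads(json_line)
    for name, field_type in fields:
        if name not in row:
            # The field is in the schema but not in this row: the export-to-
            # JSON job does not preserve null fields, so restore it as None.
            row[name] = None
            continue
        converter = _CONVERTERS.get(field_type)
        if converter is not None:
            # Exported values arrive as strings; convert to the typed value.
            row[name] = converter(row[name])
    return row
```

For example, `decode_row('{"number": "3", "flag": "true"}', [('number', 'INTEGER'), ('flag', 'BOOLEAN'), ('str', 'STRING')])` yields `{'number': 3, 'flag': True, 'str': None}`.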
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335467&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335467 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 10:31
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r339997229

File path: sdks/python/apache_beam/io/gcp/bigquery.py
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335462&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335462 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 10:19
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r339992040

File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py

Review comment:
> Why are we skipping this for DirectRunner? This should work there, right?

Unfortunately, no. My solution doesn't work with the DirectRunner. The direct cause is that the `get_range_tracker` and `read` methods aren't implemented in my source (they raise a NotImplementedError exception). This is purposeful: the runner is expected to call `split` instead. See the Java implementation, which works the same way: https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQuerySourceBase.java

It seems that the DataflowRunner and Flink are able to handle these exceptions somehow, while the DirectRunner is not.

Worklog Id: (was: 335462)
Time Spent: 6h 40m (was: 6.5h)
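The split-only behaviour described in this comment can be sketched with minimal, illustrative classes. These are not the actual Beam or BigQuery source classes; the names and the one-sub-source-per-exported-file layout are assumptions made for the sketch.

```python
class SingleFileSource:
    """Sub-source that does know how to read itself (one exported file)."""

    def __init__(self, path):
        self._path = path

    def read_records(self):
        # Stand-in for opening the file and decoding its JSON rows.
        yield 'record from %s' % self._path


class SplitOnlyBoundedSource:
    """Source that expects the runner to call split() and read the resulting
    sub-sources, leaving get_range_tracker()/read() unimplemented, as the
    BigQuery source discussed above does."""

    def __init__(self, exported_files):
        self._exported_files = exported_files

    def split(self, desired_bundle_size):
        # One readable sub-source per file produced by the export job.
        return [SingleFileSource(f) for f in self._exported_files]

    def get_range_tracker(self, start_position, stop_position):
        # A runner that calls this on the top-level source fails here,
        # which is the DirectRunner symptom described in the comment.
        raise NotImplementedError('This source only supports split()')

    def read(self, range_tracker):
        raise NotImplementedError('This source only supports split()')
```

A runner that drives the source through `split` never touches the unimplemented methods; one that insists on reading the top-level source directly hits the NotImplementedError.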
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335457&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335457 ]

ASF GitHub Bot logged work on BEAM-1440:
Author: ASF GitHub Bot
Created on: 29/Oct/19 09:55
Worklog Time Spent: 10m

Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r339980081

File path: sdks/python/test-suites/portable/py37/build.gradle

@@ -30,3 +33,25 @@ task preCommitPy37() {
   dependsOn portableWordCountBatch
   dependsOn portableWordCountStreaming
 }
+
+task postCommitIT {
+  dependsOn 'installGcpTest'
+  dependsOn 'setupVirtualenv'
+  dependsOn ':runners:flink:1.8:job-server:shadowJar'
+
+  doLast {
+    def tests = [
+        "apache_beam.io.gcp.bigquery_read_it_test",
+    ]
+    def testOpts = ["--tests=${tests.join(',')}"]
+    def cmdArgs = mapToArgString([
+        "test_opts": testOpts,
+        "suite": "postCommitIT-flink-py37",
+        "pipeline_opts": "--runner=FlinkRunner --project=apache-beam-testing --environment_type=LOOPBACK",

Review comment: Yes, locally. I would phrase-trigger a PostCommit check, but it is getting aborted almost always: https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Python37/

Worklog Id: (was: 335457)
Time Spent: 6.5h (was: 6h 20m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335452=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335452 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:45 Start Date: 29/Oct/19 09:45 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339975153 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): +return [ +{'number': 1, 'str': 'abc'}, +{'number': 2, 'str': 'def'}, +{'number': 3, 'str': u'你好'}, +{'number': 4, 'str': u'привет'} +] + + @skip(['PortableRunner', 'FlinkRunner']) + @attr('IT') + def test_native_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource( + query=self.query, use_standard_sql=True))) + assert_that(result, 
equal_to(self.get_expected_data())) + + @skip(['DirectRunner', 'TestDirectRunner']) + @attr('IT') + def test_iobase_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.ReadFromBigQuery( + query=self.query, use_standard_sql=True, project=self.project, + gcs_bucket_name='gs://temp-storage-for-end-to-end-tests')) Review comment: > Why are we skipping this for DirectRunner? This should work there, right? Issue Time Tracking --- Worklog Id: (was: 335452) Time Spent: 6h (was: 5h 50m)
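The `@skip([...])` decorator in these tests gates each test on the active pipeline runner. A minimal sketch of how such a decorator can work — the real `skip` in the Beam test suite reads the runner from pipeline options; the module-level `RUNNER` variable here is a hypothetical stand-in:

```python
import functools

RUNNER = 'DirectRunner'  # in Beam this comes from the pipeline options

def skip(runners):
    """Turn the decorated test into a no-op when RUNNER is in `runners`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if RUNNER in runners:
                print('skipping %s on %s' % (fn.__name__, RUNNER))
                return None
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@skip(['DirectRunner', 'TestDirectRunner'])
def test_iobase_source():
    return 'ran'

result = test_iobase_source()  # skipped: RUNNER is DirectRunner
```

The review question above is precisely about whether `test_iobase_source` should carry `DirectRunner` in that skip list at all.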
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335454=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335454 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:45 Start Date: 29/Oct/19 09:45 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339975250 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): +return [ +{'number': 1, 'str': 'abc'}, +{'number': 2, 'str': 'def'}, +{'number': 3, 'str': u'你好'}, +{'number': 4, 'str': u'привет'} +] + + @skip(['PortableRunner', 'FlinkRunner']) + @attr('IT') + def test_native_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource( + query=self.query, use_standard_sql=True))) + assert_that(result, 
equal_to(self.get_expected_data())) + + @skip(['DirectRunner', 'TestDirectRunner']) + @attr('IT') + def test_iobase_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.ReadFromBigQuery( + query=self.query, use_standard_sql=True, project=self.project, + gcs_bucket_name='gs://temp-storage-for-end-to-end-tests')) Review comment: > Why are we skipping this for DirectRunner? This should work there, right? Issue Time Tracking --- Worklog Id: (was: 335454) Time Spent: 6h 20m (was: 6h 10m)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335449=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335449 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:43 Start Date: 29/Oct/19 09:43 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339974244 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): Review comment: +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 335449) Time Spent: 5h 40m (was: 5.5h)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335442&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335442 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:32 Start Date: 29/Oct/19 09:32 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339969300 ## File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py ## @@ -695,10 +769,12 @@ def get_or_create_table( def run_query(self, project_id, query, use_legacy_sql, flatten_results, dry_run=False): -job_id, location = self._start_query_job(project_id, query, - use_legacy_sql, flatten_results, - job_id=uuid.uuid4().hex, - dry_run=dry_run) +job = self._start_query_job(project_id, query, use_legacy_sql, Review comment: Yes. It returns the whole job object because I needed its `statistics` property. See usage: bigquery.py:655 Issue Time Tracking --- Worklog Id: (was: 335442) Time Spent: 5.5h (was: 5h 20m)
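The change discussed in that hunk replaces returning an unpacked `(job_id, location)` tuple with returning the whole job response, so callers can also reach fields such as `statistics` without a second API call. The trade-off can be shown with plain objects — the `Job`/`JobReference` namedtuples below are stand-ins for the BigQuery API messages, not the real client types:

```python
from collections import namedtuple

# Stand-ins for the BigQuery API job messages (illustrative shapes only).
JobReference = namedtuple('JobReference', ['jobId', 'location'])
Job = namedtuple('Job', ['jobReference', 'statistics'])

def start_query_job():
    """Stand-in for _start_query_job returning the full insert response."""
    return Job(jobReference=JobReference('job-123', 'US'),
               statistics={'totalBytesProcessed': 1024})

job = start_query_job()
# Callers that only needed the id/location still have them...
job_id, location = job.jobReference.jobId, job.jobReference.location
# ...and callers like the source's size estimation can now read statistics.
bytes_processed = job.statistics['totalBytesProcessed']
```

Returning the richer object keeps the wrapper's signature stable as more job fields become interesting, at the cost of making every call site reach through `jobReference`.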
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=335440=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-335440 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 29/Oct/19 09:30 Start Date: 29/Oct/19 09:30 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r339962428 ## File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py ## @@ -370,7 +383,37 @@ def _start_query_job(self, project_id, query, use_legacy_sql, flatten_results, jobReference=reference)) response = self.client.jobs.Insert(request) -return response.jobReference.jobId, response.jobReference.location +return response + + def wait_for_bq_job(self, job_reference, sleep_duration_sec=5, + max_retries=60): +"""Poll job until it is DONE. + +Args: + job_reference: bigquery.JobReference instance. + sleep_duration_sec: Specifies the delay in seconds between retries. + max_retries: The total number of times to retry. If equals to 0, +the function waits forever. + +Raises: + `RuntimeError`: If the job is FAILED or the number of retries has been +reached. +""" +retry = 0 +while True: + retry += 1 + job = self.get_job(job_reference.projectId, job_reference.jobId, + job_reference.location) + logging.info('Job status: %s', job.status.state) + if job.status.state == 'DONE' and job.status.errorResult: +raise RuntimeError("BigQuery job %s failed. Error Result: %s", Review comment: Oh, you're right. In that case maybe I'll use the `format` method This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 335440) Time Spent: 5h 20m (was: 5h 10m)
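The review point being conceded here is that `RuntimeError("... %s ...", job_id)` does not interpolate: logging-style `%s` substitution only happens inside `logging` calls, while the exception constructor just stores every argument in `args`. A quick demonstration, with the `str.format` fix the author mentions:

```python
# %-style placeholders are NOT filled in by the exception constructor;
# both strings end up verbatim in err.args.
err = RuntimeError("BigQuery job %s failed. Error Result: %s",
                   "job-123", "quotaExceeded")

# What the comment suggests instead: build the message first.
msg = "BigQuery job {} failed. Error Result: {}".format(
    "job-123", "quotaExceeded")
err2 = RuntimeError(msg)
```

With multiple args, `str(err)` renders the whole args tuple (placeholders included), which is why the un-formatted version produces confusing error text.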
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333650=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333650 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338724547 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
If :data:`None`, then the default coder is + :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`, + which will interpret every line in a file as a JSON serialized + dictionary. This argument needs a value only in special cases when + returning table rows as dictionaries is not desirable. +use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL + dialect for this query. The default value is :data:`False`. + If set to :data:`True`, the query will use BigQuery's updated SQL + dialect with improved standards compliance. + This parameter is ignored for table inputs. +flatten_results (bool): Flattens all nested and repeated fields in the + query results. The default value is :data:`True`. +kms_key (str): Experimental. Optional Cloud KMS key name for use when + creating new tables. +
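The coder in the hunk above does two things when decoding an exported JSON row: it backfills schema fields that the export job dropped (null values are not written to the JSON files), and it converts string-encoded values according to their BigQuery type. A standalone sketch of that decode step — the converter table mirrors the diff, while `decode_row` is a simplified stand-in for `_BigQueryRowCoder.decode`:

```python
import decimal
import json
from collections import namedtuple

# Minimal stand-in for the schema entries the coder iterates over.
SchemaField = namedtuple('SchemaField', 'name type')

# Per-type converters, as in the diff; STRING fields need no conversion.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}

def decode_row(line, fields):
    value = json.loads(line)
    for field in fields:
        if field.name not in value:
            # In the schema but absent from the row: the extract-to-JSON job
            # does not preserve null fields, so restore them as None.
            value[field.name] = None
            continue
        converter = _CONVERTERS.get(field.type)
        if converter is not None:
            value[field.name] = converter(value[field.name])
    return value

fields = [SchemaField('number', 'INTEGER'),
          SchemaField('active', 'BOOLEAN'),
          SchemaField('note', 'STRING')]
row = decode_row('{"number": "3", "active": "true"}', fields)
```

Dispatching through a dict of converters keeps the hot decode loop a single lookup per field instead of an if/elif chain over type names.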
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333648=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333648 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338725043 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -1265,3 +1501,14 @@ def display_data(self): tableSpec) res['table'] = DisplayDataItem(tableSpec, label='Table') return res + + +@experimental() +class ReadFromBigQuery(PTransform): + def __init__(self, *args, **kwargs): Review comment: Since `ReadFromBigQuery` is the user-facing transform, it should have all the Pydoc. That being said, I'll defer @chamikaramj whether we want to expose this already. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 333648) Time Spent: 4h 50m (was: 4h 40m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4h 50m > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. 
Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333649=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333649 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338706927 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
If :data:`None`, then the default coder is + :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`, + which will interpret every line in a file as a JSON serialized + dictionary. This argument needs a value only in special cases when + returning table rows as dictionaries is not desirable. +use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL + dialect for this query. The default value is :data:`False`. + If set to :data:`True`, the query will use BigQuery's updated SQL + dialect with improved standards compliance. + This parameter is ignored for table inputs. +flatten_results (bool): Flattens all nested and repeated fields in the + query results. The default value is :data:`True`. +kms_key (str): Experimental. Optional Cloud KMS key name for use when + creating new tables. +
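The decoding logic quoted in the diff above (per-type converters applied to each schema field, with missing fields treated as NULL because the JSON export job drops null fields) can be sketched in isolation. This is a hypothetical re-implementation for illustration, not the actual `_BigQueryRowCoder`; `decode_row` and `_CONVERTERS` are made-up names:

```python
import decimal
import json

# Per-type converters mirroring those in the quoted diff.  BigQuery's JSON
# export serializes integers and booleans as strings ('7', 'true'), so each
# value arrives as a str and is converted based on its schema type.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
    'BYTES': lambda v: v.encode('utf-8'),
}


def decode_row(line, schema):
    """Decode one exported JSON line into a dict, applying type conversions.

    `schema` is a list of (name, bq_type) pairs.  A field present in the
    schema but absent from the JSON line is set to None, since the extract
    job does not preserve null fields.  Types with no converter (e.g.
    STRING) are left untouched.
    """
    row = json.loads(line)
    for name, bq_type in schema:
        if name not in row:
            row[name] = None
            continue
        converter = _CONVERTERS.get(bq_type)
        if converter is not None:
            row[name] = converter(row[name])
    return row
```

The `dict.get` lookup plays the role of the `try/except KeyError` in the quoted code: unknown types simply pass through unconverted.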
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333651=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333651 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338724727 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
If :data:`None`, then the default coder is + :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`, + which will interpret every line in a file as a JSON serialized + dictionary. This argument needs a value only in special cases when + returning table rows as dictionaries is not desirable. +use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL + dialect for this query. The default value is :data:`False`. + If set to :data:`True`, the query will use BigQuery's updated SQL + dialect with improved standards compliance. + This parameter is ignored for table inputs. +flatten_results (bool): Flattens all nested and repeated fields in the + query results. The default value is :data:`True`. +kms_key (str): Experimental. Optional Cloud KMS key name for use when + creating new tables. +
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333652=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333652 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338726341 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): Review comment: I feel silly - but didn't we have a coder for this? I see that `TableRowJsonCoder` does not do the full tablerow to dict conversion... does it make sense to extend it? Or not really? [1] https://github.com/apache/beam/blob/12d07745835e1b9c1e824b83beeeadf63ab4b234/sdks/python/apache_beam/io/gcp/bigquery.py#L312-L349 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 333652) Time Spent: 5h (was: 4h 50m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 5h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=333647=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333647 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 24/Oct/19 18:28 Start Date: 24/Oct/19 18:28 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r338706242 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
If :data:`None`, then the default coder is + :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`, + which will interpret every line in a file as a JSON serialized + dictionary. This argument needs a value only in special cases when + returning table rows as dictionaries is not desirable. +use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL + dialect for this query. The default value is :data:`False`. + If set to :data:`True`, the query will use BigQuery's updated SQL + dialect with improved standards compliance. + This parameter is ignored for table inputs. +flatten_results (bool): Flattens all nested and repeated fields in the + query results. The default value is :data:`True`. +kms_key (str): Experimental. Optional Cloud KMS key name for use when + creating new tables. +
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332793=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332793 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337775537 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): Review comment: Make this a constant, and use it to create the table, and to match in the asserts. ``` TABLE_DATA = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332793) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4h 40m > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
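The reviewer's suggestion above, keeping the test data in a single class-level constant used both to populate the table and to build the expected output, removes the duplication between `create_table` and `get_expected_data`. A minimal sketch of that pattern (the class and method names are stand-ins for the real integration-test class; `insert_rows` abstracts the BigQuery client call):

```python
# -*- coding: utf-8 -*-


class BigQueryReadTests(object):
    # Single source of truth: used both to populate the table during setup
    # and to build the expected output for the pipeline assertions.
    TABLE_DATA = [
        {'number': 1, 'str': 'abc'},
        {'number': 2, 'str': 'def'},
        {'number': 3, 'str': u'\u4f60\u597d'},
        {'number': 4, 'str': u'\u043f\u0440\u0438\u0432\u0435\u0442'},
    ]

    @classmethod
    def populate_table(cls, insert_rows):
        # insert_rows stands in for bigquery_client.insert_rows(project,
        # dataset_id, table_name, rows); it receives the shared constant.
        insert_rows(cls.TABLE_DATA)

    def expected_data(self):
        # The assertion side reads the same constant, so the expected rows
        # can never drift out of sync with the inserted rows.
        return self.TABLE_DATA
```

With this shape, `assert_that(result, equal_to(self.expected_data()))` and the setup code are guaranteed to agree on the data.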
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332789=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332789 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337778293 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): +return [ +{'number': 1, 'str': 'abc'}, +{'number': 2, 'str': 'def'}, +{'number': 3, 'str': u'你好'}, +{'number': 4, 'str': u'привет'} +] + + @skip(['PortableRunner', 'FlinkRunner']) + @attr('IT') + def test_native_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource( + query=self.query, use_standard_sql=True))) + assert_that(result, 
equal_to(self.get_expected_data())) + + @skip(['DirectRunner', 'TestDirectRunner']) + @attr('IT') + def test_iobase_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.ReadFromBigQuery( + query=self.query, use_standard_sql=True, project=self.project, + gcs_bucket_name='gs://temp-storage-for-end-to-end-tests')) Review comment: Why are we skipping this for DirectRunner? This should work there, right? `gcs_bucket_name` may need to be passed testpipeline arguments, in case it runs in a project that does not have access to that bucket (we run it internally at Google). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332789) Time Spent: 4.5h (was: 4h 20m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4.5h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. 
> [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
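The `@skip(['PortableRunner', 'FlinkRunner'])` / `@skip(['DirectRunner', 'TestDirectRunner'])` decorators quoted in the test diff above gate each integration test on the runner under test, which is what the reviewer is questioning for the DirectRunner case. A hypothetical sketch of how such a runner-conditional skip decorator can work (this is a stand-in for illustration, not Beam's actual helper; `runner_name` is an assumed attribute on the test class):

```python
import functools
import unittest


def skip_on_runners(runners):
    """Skip the decorated test when the active runner is in `runners`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            # The real helper would inspect the test pipeline options;
            # here we read an assumed `runner_name` attribute instead.
            runner = getattr(self, 'runner_name', None)
            if runner in runners:
                raise unittest.SkipTest('not supported on %s' % runner)
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```

Under this scheme, dropping `'DirectRunner'` from the list (as the reviewer suggests, provided the GCS bucket is passed via test pipeline arguments) would let the test run there without any other change.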