Re: Reading from BigTable with Beam Python SDK

Sayak Paul Fri, 07 Jan 2022 01:07:31 -0800

We'd like to avoid the intermediate CSV serialization and operate directly
on the BigTable instance instead. And since we are using the Python SDK,
reading from Java is not an option for us.
Sayak Paul | sayak.dev
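
For illustration, operating directly on the BigTable instance from Python
might involve a row-flattening step like the one below. This is only a
minimal sketch: it assumes rows shaped like the `PartialRowData` objects
returned by `Table.read_rows()` in the `google-cloud-bigtable` client (a
`row_key` plus a nested `cells` mapping). The names `FlatCell`,
`flatten_row`, and `read_flat_cells` are hypothetical, and the fake table
exists only so the sketch runs without GCP credentials.

```python
from types import SimpleNamespace
from typing import Iterator, NamedTuple


class FlatCell(NamedTuple):
    """One Bigtable cell as a flat, schema-friendly record (hypothetical name)."""
    row_key: bytes
    family: str
    qualifier: bytes
    value: bytes


def flatten_row(row) -> Iterator[FlatCell]:
    """Yield one FlatCell per cell of a PartialRowData-shaped object.

    Assumes `row.row_key` is bytes and `row.cells` is shaped like
    {family: {qualifier: [cell, ...]}}, where each cell has a `.value`.
    """
    for family, columns in row.cells.items():
        for qualifier, cells in columns.items():
            for cell in cells:
                yield FlatCell(row.row_key, family, qualifier, cell.value)


def read_flat_cells(table, **read_kwargs) -> Iterator[FlatCell]:
    """Stream flattened cells from any object exposing `read_rows()`."""
    for row in table.read_rows(**read_kwargs):
        yield from flatten_row(row)


# Stand-in for a real table (assumption: same shape as the client's rows).
_fake_row = SimpleNamespace(
    row_key=b"user#1",
    cells={
        "profile": {
            b"name": [SimpleNamespace(value=b"Ada")],
            b"city": [SimpleNamespace(value=b"London")],
        }
    },
)
_fake_table = SimpleNamespace(read_rows=lambda **kwargs: iter([_fake_row]))

flat = list(read_flat_cells(_fake_table))
```

In a real pipeline, something like `read_flat_cells` would presumably be
called from inside a `DoFn`, with the table object obtained from the
`google-cloud-bigtable` client; that wiring is not shown here and is
untested.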




On Fri, Jan 7, 2022 at 2:32 PM Sofia’s World <[email protected]> wrote:

> Hello,
> My 2 cents (not sure if it makes sense for your use case):
> what about having the Python process read from BigTable and store the data
> in a bucket as CSV? Then you can read the CSV from Java.
>
> hth
>  marco
>
> On Fri, Jan 7, 2022 at 7:31 AM Chamikara Jayalath <[email protected]>
> wrote:
>
>> Irrespective of whether the Java transform is defined by a user or
>> available in Beam Java SDK, the APIs for using such a transform from Python
>> are the same.
>> In other words, there's no special support for using arbitrary Java
>> transforms in Beam from Python pipelines. We have to use the API mentioned
>> in the documentation I linked above to use Java transforms from Python in
>> either case.
>>
>> To set expectations correctly, using a complex Java IO connector
>> transform such as BigtableIO.Read from Python can be a bit involved. For
>> example,
>> (1) We have to make sure that the options needed to instantiate the
>> transform (for example, BigtableOptions) can be correctly instantiated on
>> the Python side.
>> (2) It seems the Bigtable read transform currently has the output type
>> "com.google.bigtable.v2.Row". This has to be mapped to a cross-language
>> compatible type so that Python can understand it (for example, Beam Rows).
>>
>> Thanks,
>> Cham
>>
>> On Thu, Jan 6, 2022 at 10:32 PM Sayak Paul <[email protected]> wrote:
>>
>>> My question still remains the same. I am not yet sure how to use an
>>> existing Java transform (like the BigTable IO reader in Java) from a
>>> Python pipeline. The examples take a user-defined sample transform and
>>> then show its usage.
>>>
>>> On Fri, 7 Jan, 2022, 11:10 Chamikara Jayalath, <[email protected]>
>>> wrote:
>>>
>>>> Actually this is the correct link for multi-language Python
>>>> documentation:
>>>> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>>>> We also have a quickstart guide which might be a better starting point:
>>>> https://beam.apache.org/documentation/sdks/python-multi-language-pipelines/
>>>>
>>>> We haven't looked into developing a cross-language wrapper for the Java
>>>> BigTable connector yet. I created
>>>> https://issues.apache.org/jira/browse/BEAM-13607 for tracking this.
>>>> It would be great if you could contribute to this.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>> On Thu, Jan 6, 2022 at 8:35 PM Sayak Paul <[email protected]>
>>>> wrote:
>>>>
>>>>> Luke, I studied the resources you provided. However, it's still a
>>>>> little unclear to me as to how I could use the BigTableIO
>>>>> <https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.html>
>>>>> in Java from a Python pipeline. The examples and documentation first
>>>>> implement a demo class in Java and then show how to use it.
>>>>>
>>>>> I was wondering if there was a guide on using the existing connectors
>>>>> (i.e., without defining them first) from Python pipelines. I am probably
>>>>> mistaken somewhere, so I'm happy to be corrected if that's the case.
>>>>>
>>>>> Sayak Paul | sayak.dev
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 6, 2022 at 10:35 PM Sayak Paul <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Thu, 6 Jan, 2022, 22:27 Luke Cwik, <[email protected]> wrote:
>>>>>>
>>>>>>> +1 on using cross language to get the Java Bigtable connector that
>>>>>>> already exists.
>>>>>>>
>>>>>>> You could also take a look at this other xlang documentation[1] and
>>>>>>> look at an existing implementation such as kafka[2] that is xlang.
>>>>>>>
>>>>>>> Finally, support was added for using many Java transforms via
>>>>>>> the class name and builder methods[3].
>>>>>>>
>>>>>>> 1: https://beam.apache.org/documentation/patterns/cross-language/
>>>>>>> 2:
>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/kafka.py
>>>>>>> 3: https://issues.apache.org/jira/browse/BEAM-12769
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 6, 2022 at 4:41 AM Sayak Paul <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi folks,
>>>>>>>>
>>>>>>>> My project needs to read data from Cloud BigTable. We are aware
>>>>>>>> that an IO connector for BigTable is available in the Java SDK, so we
>>>>>>>> could probably make use of the cross-language capabilities
>>>>>>>> <https://beam.apache.org/documentation/programming-guide/#1311-creating-cross-language-java-transforms>
>>>>>>>> of Beam and make it work. I am, however, looking for
>>>>>>>> guidance/resources/pointers that could help in building a Beam
>>>>>>>> pipeline in Python that reads data from Cloud BigTable. Any relevant
>>>>>>>> clue would be greatly appreciated.
>>>>>>>>
>>>>>>>> Sayak Paul | sayak.dev
>>>>>>>>
>>>>>>>>
