[ https://issues.apache.org/jira/browse/BEAM-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Delfour updated BEAM-3757:
-----------------------------------
Description:
Hi,

First issue is that the Beam 2.3.0 Python SDK is apparently not working on GCP. It gets stuck:
{noformat}
Workflow failed. Causes: (bf832d44290fbf41): The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
{noformat}
I tried twice.

Reverting to 2.2.0 usually works, but today, after more than an hour of processing with 30 workers, I got a failure with this in the logs:
{noformat}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    def start(self):
  File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.shuffle_source.reader() as reader:
  File "dataflow_worker/shuffle_operations.py", line 67, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    for key_values in reader:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 406, in __iter__
    for entry in entries_iterator:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 248, in next
    return next(self.iterator)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 206, in __iter__
    chunk, next_position = self.reader.Read(start_position, end_position)
  File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 138, in shuffle_client.PyShuffleReader.Read
IOError: Shuffle read failed: INTERNAL: Received RST_STREAM with error code 2 talking to my-dataflow-02271107-756f-harness-2p65:12346
{noformat}
I also get an informational message:
{noformat}
Refusing to split <dataflow_worker.shuffle.GroupedShuffleRangeTracker object at 0x7f03a00fe790> at '\x00\x00\x00\t\x1d\x14\x87\xa3\x00\x01': unstarted
{noformat}
For the flow: I am extracting data from BigQuery, cleaning it with pandas, exporting it as a CSV file, gzipping it, and uploading the compressed file to a bucket using decompressive transcoding (the CSV export, gzip compression and upload happen in the same "worker", as they are done in the same beam.DoFn). A sketch of this shape follows below.

PS: I can't find a reasonable way to export the logs from GCP, but I can privately send the log file of the run that I have on my machine (the log of the pipeline).
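For reference, here is a minimal sketch of the pipeline shape described above: a BigQuery read, a GroupByKey (the shuffle where the read fails), and a single DoFn that does the pandas cleaning, CSV export, gzip compression, and GCS upload with Content-Encoding set for decompressive transcoding. All names (bucket, query, the "partition" key field) are hypothetical placeholders, not taken from the actual job:
{code:python}
import gzip
import io

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage


class CleanExportUpload(beam.DoFn):
    """CSV export, gzip compression and GCS upload in one step,
    as described in the report."""

    def process(self, element):
        key, rows = element  # output of the GroupByKey below
        import pandas as pd  # keep heavy imports local to the worker

        df = pd.DataFrame(list(rows))
        # ... pandas cleaning would happen here ...

        # Write the CSV and gzip it in memory.
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            gz.write(df.to_csv(index=False).encode('utf-8'))

        # Upload with Content-Encoding: gzip so GCS can serve the object
        # decompressed (decompressive transcoding). Bucket name is a
        # placeholder; a real DoFn would create the client in start_bundle.
        bucket = storage.Client().bucket('my-output-bucket')
        blob = bucket.blob('exports/%s.csv' % key)
        blob.content_encoding = 'gzip'
        blob.upload_from_string(buf.getvalue(), content_type='text/csv')
        yield 'gs://my-output-bucket/exports/%s.csv' % key


def run():
    opts = PipelineOptions()  # --runner=DataflowRunner etc. on the command line
    with beam.Pipeline(options=opts) as p:
        (p
         | 'ReadBQ' >> beam.io.Read(beam.io.BigQuerySource(
               query='SELECT * FROM my_dataset.my_table'))  # hypothetical query
         | 'KeyByPartition' >> beam.Map(lambda row: (row['partition'], row))
         | 'Group' >> beam.GroupByKey()  # the shuffle step in the traceback
         | 'CleanExportUpload' >> beam.ParDo(CleanExportUpload()))
{code}
Setting Content-Encoding: gzip on the uploaded object is what decompressive transcoding refers to: GCS stores the object compressed but can serve it decompressed to clients that do not send Accept-Encoding: gzip.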
> Shuffle read failed using python 2.2.0
> --------------------------------------
>
>                 Key: BEAM-3757
>                 URL: https://issues.apache.org/jira/browse/BEAM-3757
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>    Affects Versions: 2.2.0
>         Environment: gcp, macos
>            Reporter: Jonathan Delfour
>            Assignee: Thomas Groh
>            Priority: Major
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)