Re: [Question] Invitation to Join Beam Slack Channel
I also want to be invited :) On Thu, Aug 5, 2021 at 1:45 AM Koosha Hosseiny wrote: > Hello there > Can I also be invited? > > Many thanks. > > > > > > > > > > -- > *From:* Bergmeier, Andreas > *Sent:* Thursday, August 5, 2021 06:48 > *To:* user@beam.apache.org > *Subject:* Re: [Question] Invitation to Join Beam Slack Channel > > Would not mind an invite either > -- > *From:* Kyle Weaver > *Sent:* Wednesday, 4 August 2021 23:44 > *To:* user@beam.apache.org > *Subject:* Re: [Question] Invitation to Join Beam Slack Channel > > Hi Jo, > > I invited you to join. It looks like Slack invite links expire after a > couple days, so the one you were using may be out of date. > > On Wed, Aug 4, 2021 at 2:36 PM Jo Alex wrote: > > Hi, currently I'm joining the Beam Summit 2021 and would like to learn > more by joining the Beam Slack Channel. But it seems I can't join with my > personal email and I don't have the @apache.org > <https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fapache.org%2F=04%7C01%7CKoosha.Hosseiny%40trivago.com%7Cf8e1981528f942e666f108d957cc60e3%7C688965da43a5418fae45331761010f00%7C0%7C0%7C637637357587281177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000=y0QfO%2F46MbwCh%2B9A%2F4ncgP%2FVMCHKhn5spUg0PVcAyG4%3D=0> > email either. I would like to ask if I can get invited to the Slack > channel, or maybe there is another way to join? Thanks! Regards > > -- Asif Iqbal Senior Software Engineer *BenchSci* *www.benchsci.com <http://www.benchsci.com>* *E: *aiq...@benchsci.com BenchSci helps scientists and organizations improve the speed and quality of their research, learn how <https://hubs.ly/H0R25f60>
SplittableDoFn for Extracting contents of zip/tar files
Hi there, I have a python pipeline that reads a PCollection of zip/tar file paths from BigQuery, makes them readable via applying ReadMatches, and then unzips the contents inside a ParDo. input # this is a PCollection of GCS paths of zip files | "Load ZIP archives" >> fileio.ReadMatches(skip_directories=True) | "Decompress zip archives" >> beam.ParDo(decompress) # decompress is the method that loops through the contents of the zip file and adds to emitted output Since the zip files are of variable sizes, and can contain arbitrary number of smaller files, some instances of the decompress method can take a pretty long time while some others will finish relatively early, resulting in an uneven load of computation. I am thinking of tackling this via converting the decompress to a SplittableDoFn. The basic structure of the decompressor looks like this: with ZipFile(archive) as compressed_data: for member in compressed_data.infolist(): try: decompressed_file = compressed_data.open(member) raw_contents = decompressed_file.read() output.append(raw_contents) finally: decompressed_file.close() I need some advice on this and have a few questions: - Does SplittableDoFn seem a reasonable approach here? - The plan I have in mind is to possibly write my own RestrictionTracker (inheriting ioBase.RestrictionTracker) and RestrictionProvider, where the provider would just yield the next file (member) of the compressed file. Does that sound like a reasonable approach? - Is it possible to share some code snippet that can be helpful? I didn't find too many examples for SDF in Python Any questions - don't hesitate to ask. Thanks in adavnce. - Asif -- Asif Iqbal Senior Software Engineer *BenchSci* *www.benchsci.com <http://www.benchsci.com>* *E: *aiq...@benchsci.com Did you know that $48 billion is lost to Avoidable Experiment Expenditure every year? Read our latest whitepaper <https://hubs.ly/H0rB8W50> to learn more.