Re: [Question] Invitation to Join Beam Slack Channel

2021-08-05 Thread Asif Iqbal
I also want to be invited :)

On Thu, Aug 5, 2021 at 1:45 AM Koosha Hosseiny 
wrote:

> Hello there
> Can I also be invited?
>
> Many thanks.
>
>
>
>
>
>
>
>
>
> --
> *From:* Bergmeier, Andreas 
> *Sent:* Thursday, August 5, 2021 06:48
> *To:* user@beam.apache.org 
> *Subject:* Re: [Question] Invitation to Join Beam Slack Channel
>
> Would not mind an invite either 
> --
> *From:* Kyle Weaver 
> *Sent:* Wednesday, 4 August 2021 23:44
> *To:* user@beam.apache.org 
> *Subject:* Re: [Question] Invitation to Join Beam Slack Channel
>
> Hi Jo,
>
> I invited you to join. It looks like Slack invite links expire after a
> couple days, so the one you were using may be out of date.
>
> On Wed, Aug 4, 2021 at 2:36 PM Jo Alex  wrote:
>
> Hi, currently I'm joining the Beam Summit 2021 and would like to learn
> more by joining the Beam Slack Channel. But it seems I can't join with my
> personal email and I don't have the @apache.org
> <https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fapache.org%2F=04%7C01%7CKoosha.Hosseiny%40trivago.com%7Cf8e1981528f942e666f108d957cc60e3%7C688965da43a5418fae45331761010f00%7C0%7C0%7C637637357587281177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000=y0QfO%2F46MbwCh%2B9A%2F4ncgP%2FVMCHKhn5spUg0PVcAyG4%3D=0>
> email either. I would like to ask if I can get invited to the Slack
> channel, or maybe there is another way to join? Thanks! Regards
>
>

-- 
Asif Iqbal

Senior Software Engineer
*BenchSci*
*www.benchsci.com <http://www.benchsci.com>*
*E: *aiq...@benchsci.com
BenchSci helps scientists and organizations improve the speed and quality
of their research, learn how <https://hubs.ly/H0R25f60>


SplittableDoFn for Extracting contents of zip/tar files

2021-04-21 Thread Asif Iqbal
Hi there,

I have a python pipeline that reads a PCollection of zip/tar file paths
from BigQuery, makes them readable via applying ReadMatches, and then
unzips the contents inside a ParDo.

 input # this is a PCollection of GCS paths of zip files
  | "Load ZIP archives" >> fileio.ReadMatches(skip_directories=True)
  | "Decompress zip archives" >> beam.ParDo(decompress) # decompress is
the method that loops through the contents of the zip file and adds to
emitted output

Since the zip files are of variable sizes, and can contain arbitrary number
of smaller files, some instances of the decompress method can take a pretty
long time while some others will finish relatively early, resulting in an
uneven load of computation. I am thinking of tackling this via converting
the decompress to a SplittableDoFn.

The basic structure of the decompressor looks like this:
with ZipFile(archive) as compressed_data:
for member in compressed_data.infolist():
try:
decompressed_file = compressed_data.open(member)
raw_contents = decompressed_file.read()
output.append(raw_contents)
finally:
 decompressed_file.close()

I need some advice on this and have a few questions:

- Does SplittableDoFn seem a reasonable approach here?
- The plan I have in mind is to possibly write my own RestrictionTracker
(inheriting ioBase.RestrictionTracker) and RestrictionProvider, where the
provider would just yield the next
file (member) of the compressed file. Does that sound like a reasonable
approach?
- Is it possible to share some code snippet that can be helpful? I didn't
find too many examples for SDF in Python

Any questions - don't hesitate to ask. Thanks in adavnce.

- Asif

-- 
Asif Iqbal

Senior Software Engineer
*BenchSci*
*www.benchsci.com <http://www.benchsci.com>*
*E: *aiq...@benchsci.com
Did you know that $48 billion is lost to Avoidable Experiment Expenditure
every year? Read our latest whitepaper <https://hubs.ly/H0rB8W50> to learn
more.