Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Reuven Lax
In many cases, the writer of the ParDo has no access to the GBK (e.g. the GBK is hidden inside an upstream PTransform that they cannot modify). This is the same reason why RequiresStableInput was made a property of the ParDo, because the GroupByKey is quite often inaccessible. The car analogy

Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Pablo Estrada
Hi Lucas! That makes sense. I saw a question for this on StackOverflow recently. Perhaps that was you? [1] - perhaps not, but then you're not the only one trying to do this. I do not know a lot about connecting to RDBs from Python - it seemed to me that you'd need to also install ODBC / JDBC

Re: NOTICE: New Python PreCommit jobs

2019-09-27 Thread Kenneth Knowles
Do things go wrong when nose is configured to use parallel execution? On Fri, Sep 27, 2019 at 5:09 PM Chad Dombrova wrote: > By the way, the outcome on this was that splitting the python precommit > job into one job per python version resulted in increasing the total test > completion time by

Re: Why is there no standard boolean coder?

2019-09-27 Thread Chad Dombrova
> It would still be a standard coder - the distinction I'm proposing is that > there are certain coders that _must_ be implemented by a new runner/sdk > (for example windowedvalue, varint, kv, ...) since they are important for > SDK - runner communication, but now we're starting to standardize

Re: Why is there no standard boolean coder?

2019-09-27 Thread Reuven Lax
+1. Would be a good contribution! Reuven On Fri, Sep 27, 2019 at 5:33 PM Brian Hulette wrote: > +1, thank you! > > Note: in my Row Coder PR I added a new section for "Additional Standard > Coders" - i.e. coders that have a URN, but aren't required for a new > runner/sdk to implement the beam

Re: Why is there no standard boolean coder?

2019-09-27 Thread Brian Hulette
It would still be a standard coder - the distinction I'm proposing is that there are certain coders that _must_ be implemented by a new runner/sdk (for example windowedvalue, varint, kv, ...) since they are important for SDK - runner communication, but now we're starting to standardize coders that

Re: Why is there no standard boolean coder?

2019-09-27 Thread Robert Bradshaw
Yes, go ahead and do this (though for your use case I'm hoping we'll be able to switch to schemas soon). On Fri, Sep 27, 2019 at 5:35 PM Chad Dombrova wrote: > > Would BooleanCoder continue to fall into this category? I was under the > impression we might make it a full-fledged standard coder

Re: Why is there no standard boolean coder?

2019-09-27 Thread Chad Dombrova
Would BooleanCoder continue to fall into this category? I was under the impression we might make it a full-fledged standard coder with this PR. On Fri, Sep 27, 2019 at 5:32 PM Brian Hulette wrote: > +1, thank you! > > Note: in my Row Coder PR I added a new section for "Additional Standard >

Re: Why is there no standard boolean coder?

2019-09-27 Thread Brian Hulette
+1, thank you! Note: in my Row Coder PR I added a new section for "Additional Standard Coders" - i.e. coders that have a URN, but aren't required for a new runner/sdk to implement the beam model: https://github.com/apache/beam/pull/9188/files#diff-f0d64c2cfc4583bfe2a7e5ee59818ae2R646 I think this

Re: Why is there no standard boolean coder?

2019-09-27 Thread Thomas Weise
+1 for adding the coder Please also add a test here: https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml On Fri, Sep 27, 2019 at 5:17 PM Chad Dombrova wrote: > Are there any dissenting votes to making a

Re: Why is there no standard boolean coder?

2019-09-27 Thread Chad Dombrova
Are there any dissenting votes to making a BooleanCoder a standard (portable) coder? I'm happy to make a PR to implement a BooleanCoder in python (and to add the Java BooleanCoder to the ModelCoderRegistrar) if everyone agrees that this is useful. -chad On Fri, Sep 27, 2019 at 3:32 PM Robert

Re: NOTICE: New Python PreCommit jobs

2019-09-27 Thread Chad Dombrova
By the way, the outcome on this was that splitting the python precommit job into one job per python version resulted in increasing the total test completion time by 66%, which is obviously not good. This is because we are using Gradle to run the python test tasks in parallel (the jenkins VMs

Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Eugene Kirpichov
I'm actually very surprised that to this day nobody has written a Python connector for the Python Database API, like JdbcIO. Do we maybe have a way to use JdbcIO from Python via the cross-language connectors stuff? On Fri, Sep 27, 2019 at 4:28 PM Lucas Magalhães < lucas.magalh...@paralelocs.com.br>
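For readers wondering what reading through the Python Database API might look like, here is a minimal sketch, not an existing Beam connector. It assumes only the standard DB-API calls (`cursor()`, `execute()`, `fetchmany()`); the names `read_rows` and `make_demo_connection` are hypothetical, and the stdlib `sqlite3` module stands in for a Postgres driver so the example runs anywhere. A generator like this could plausibly be called from the body of a DoFn.

```python
import sqlite3
from typing import Any, Callable, Iterator, Tuple


def make_demo_connection() -> sqlite3.Connection:
    """Hypothetical helper: an in-memory database standing in for Postgres."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "grace")])
    conn.commit()
    return conn


def read_rows(
    connect: Callable[[], Any], query: str, batch_size: int = 1000
) -> Iterator[Tuple]:
    """Yield rows from any DB-API connection, fetching in batches."""
    conn = connect()  # open the connection where the work happens
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        while True:
            rows = cursor.fetchmany(batch_size)  # bounded memory per fetch
            if not rows:
                break
            for row in rows:
                yield row
    finally:
        conn.close()  # always release the connection
```

Passing a connection factory rather than an open connection is a deliberate choice in this sketch: connection objects generally cannot be serialized and shipped to workers, so each worker would open its own.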

Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Lucas Magalhães
Hi guys. Sorry, I forgot to mention that I'm using the Python SDK. It seems that the Java SDK is more mature, but I have no skill in that language. I'm trying to extract data from Postgres (Cloud SQL), make some aggregations and save into BigQuery. On Fri, Sep 27, 2019 at 19:21, Pablo

Review Request - BEAM-7990 Add ReadFromParquetBatched to Python

2019-09-27 Thread Brian Hulette
Could someone take a look at https://github.com/apache/beam/pull/9361? I already have a review from Heejong, who wrote the original ReadFromParquet, but it still needs a committer's review. Brian

Re: Why is there no standard boolean coder?

2019-09-27 Thread Robert Bradshaw
I think boolean is useful to have. What I'm more skeptical of is adding standard types for variations like UnsignedInteger16, etc. that don't have natural representations in all languages. On Fri, Sep 27, 2019 at 2:46 PM Brian Hulette wrote: > > Some more context from an offline discussion I had

Re: NOTICE: New Python PreCommit jobs

2019-09-27 Thread Kyle Weaver
> Do we have good pypi caching? Building Python SDK harness containers takes 2 mins each (times 4, the number of versions) on my machine, even if nothing has changed. But we're already paying that cost, so I don't think splitting the jobs should make it any worse.

Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Pablo Estrada
Hi Lucas! Can you share more information about your use case? Java has JdbcIO. Maybe that's all you need? Or perhaps you're using Python SDK? Best -P. On Fri, Sep 27, 2019 at 3:08 PM Eugene Kirpichov wrote: > Hi Lucas, > Any reason why you can't use JdbcIO? > You almost certainly should *not*

Re: [UPDATE] Beam 2.16.0 Release Tracking

2019-09-27 Thread Mark Liu
Sorry for the late update. - I'm keeping track of release blockers on Jira and there is only one now (BEAM-8314). A fix is out and under review. - Preparation for doc and rc build is done.

Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Eugene Kirpichov
Hi Lucas, Any reason why you can't use JdbcIO? You almost certainly should *not* use BoundedSource, nor Splittable DoFn for this. BoundedSource is obsolete in favor of assembling your connector from regular transforms and/or using an SDF, and SDF is an extremely advanced feature whose primary

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Valentyn Tymofieiev
Congratulations, Alan. Well deserved. On Fri, Sep 27, 2019 at 2:09 PM Chamikara Jayalath wrote: > Congrats Alan!! > > On Fri, Sep 27, 2019 at 1:49 PM Jan Lukavský wrote: > >> Congrats Alan! >> On 9/27/19 10:22 PM, Mark Liu wrote: >> >> Congratulations Alan!!! >> >> On Fri, Sep 27, 2019 at

Re: Why is there no standard boolean coder?

2019-09-27 Thread Brian Hulette
Some more context from an offline discussion I had with +Robert Bradshaw a while ago: We both agreed all of the coders listed in BEAM-7996 should be implemented in Python, but didn't come to a conclusion on whether or not they should actually be _standard_ coders, versus just being implicitly

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
I'd suggest Stream instead of Iterator, it has the same semantics and a much better API. Still not sure what is wrong with letting the GBK decide this. I have an analogy - if I decide to buy a car, I have to decide *what* car I'm going to buy (by thinking about how I'm going to use it) *before*

Re: Why is there no standard boolean coder?

2019-09-27 Thread Kenneth Knowles
Yes, noted here: https://github.com/apache/beam/pull/9188/files#diff-f0d64c2cfc4583bfe2a7e5ee59818ae2R678 and that links to https://issues.apache.org/jira/browse/BEAM-7996 Kenn On Fri, Sep 27, 2019 at 12:57 PM Reuven Lax wrote: > Java has one, implemented as a byte coder. My guess is that

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Chamikara Jayalath
Congrats Alan!! On Fri, Sep 27, 2019 at 1:49 PM Jan Lukavský wrote: > Congrats Alan! > On 9/27/19 10:22 PM, Mark Liu wrote: > > Congratulations Alan!!! > > On Fri, Sep 27, 2019 at 12:55 PM Ning Kang wrote: > >> Congrats Alan! >> >> On Fri, Sep 27, 2019 at 12:02 PM Ankur Goenka wrote: >> >>>

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Shannon Duncan
Yes we do have a use case for specifying number of shards, but unfortunately I can't share it with the group. Shannon On Fri, Sep 27, 2019 at 2:14 PM Reuven Lax wrote: > Is there a reason that you need to explicitly specify the number of > shards? If you don't, then this extra shuffle will not

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Kenneth Knowles
Good point about sibling fusion requiring this. The type PTransform<PCollection<KV<K, V>>, PCollection<KV<K, Iterable<V>>>> already does imply that the output iterable can be iterated arbitrarily many times. I think this should remain the default for all the reasons mentioned. We could have opt-in to the weaker KV<K, Iterator<V>> version. Agree that this is

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Jan Lukavský
Congrats Alan! On 9/27/19 10:22 PM, Mark Liu wrote: Congratulations Alan!!! On Fri, Sep 27, 2019 at 12:55 PM Ning Kang wrote: Congrats Alan! On Fri, Sep 27, 2019 at 12:02 PM Ankur Goenka wrote: Congratulations Alan!

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Mark Liu
Congratulations Alan!!! On Fri, Sep 27, 2019 at 12:55 PM Ning Kang wrote: > Congrats Alan! > > On Fri, Sep 27, 2019 at 12:02 PM Ankur Goenka wrote: > >> Congratulations Alan! >> >> On Fri, Sep 27, 2019 at 11:17 AM Yichi Zhang wrote: >> >>> Congrats, Alan! >>> >>> On Fri, Sep 27, 2019 at 10:26

Re: Why is there no standard boolean coder?

2019-09-27 Thread Reuven Lax
Java has one, implemented as a byte coder. My guess is that nobody has gotten around to implementing it yet for portability. On Fri, Sep 27, 2019 at 12:44 PM Chad Dombrova wrote: > Hi all, > It seems a bit unfortunate that there isn’t a portable way to serialize a > boolean value. > > I’m
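As an illustration of the "byte coder" idea mentioned here, the following is a minimal sketch of a boolean encoding that uses exactly one byte. It is not Beam's actual coder class: `BooleanCoderSketch` is a hypothetical name, the 0x00/0x01 byte values are an assumption about the wire format, and a real Python implementation would presumably subclass the SDK's coder base class and be registered under a URN.

```python
class BooleanCoderSketch:
    """Hypothetical stand-in for a boolean coder; one byte per value."""

    def encode(self, value: bool) -> bytes:
        # Assumed wire format: 0x01 for True, 0x00 for False.
        return b"\x01" if value else b"\x00"

    def decode(self, encoded: bytes) -> bool:
        # Reject anything that is not exactly one of the two known bytes.
        if encoded not in (b"\x00", b"\x01"):
            raise ValueError("expected a single byte 0x00 or 0x01, got %r" % (encoded,))
        return encoded == b"\x01"
```

A fixed one-byte encoding keeps the format trivially portable, since every language can read and write a single byte without agreeing on anything else.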

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Ning Kang
Congrats Alan! On Fri, Sep 27, 2019 at 12:02 PM Ankur Goenka wrote: > Congratulations Alan! > > On Fri, Sep 27, 2019 at 11:17 AM Yichi Zhang wrote: > >> Congrats, Alan! >> >> On Fri, Sep 27, 2019 at 10:26 AM Robin Qiu wrote: >> >>> Congrats, Alan! >>> >>> On Fri, Sep 27, 2019 at 10:15 AM

Why is there no standard boolean coder?

2019-09-27 Thread Chad Dombrova
Hi all, It seems a bit unfortunate that there isn’t a portable way to serialize a boolean value. I’m working on porting my external PubsubIO PR over to use the improved schema-based external transform API in python, but because of this limitation I can’t use boolean values. For example, this

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Reuven Lax
Is there a reason that you need to explicitly specify the number of shards? If you don't, then this extra shuffle will not be performed. Reuven On Fri, Sep 27, 2019 at 12:12 PM Shannon Duncan wrote: > Interesting. Right now we are only doing batch processing so I hadn't > thought about the

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Reuven Lax
I think the behavior to make explicit is the need to reiterate, not the need to handle large results. How large of a result can be handled will always be dependent on the runner, and each runner will probably have a different definition of large keys. Reiteration however is a logical difference in

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Shannon Duncan
Interesting. Right now we are only doing batch processing so I hadn't thought about the windowing aspect. On Fri, Sep 27, 2019 at 12:10 PM Reuven Lax wrote: > Are you doing this in streaming with windowed writes? Window grouping does > not "happen" in Beam until a GroupByKey, so you do need the

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
Ok, I think I understand there might be some benefits of this. Then I'd propose we make this clear on the GBK. If we would support something like this:  PCollection input = ;  input.apply(GroupByKey.withLargeKeys()); then SparkRunner could expand this to repartitionAndSortWithinPartitions

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
> But - does it imply that it is actually required O(n^2) I meant O(n) iterations, O(n^2) operations on elements. On 9/27/19 8:31 PM, Jan Lukavský wrote: Okay, the self-join example is understandable. But - does it imply that it is actually required O(n^2) iterations (maybe caching can

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Ankur Goenka
Congratulations Alan! On Fri, Sep 27, 2019 at 11:17 AM Yichi Zhang wrote: > Congrats, Alan! > > On Fri, Sep 27, 2019 at 10:26 AM Robin Qiu wrote: > >> Congrats, Alan! >> >> On Fri, Sep 27, 2019 at 10:15 AM Hannah Jiang >> wrote: >> >>> Congrats Alan! >>> >>> On Fri, Sep 27, 2019 at 9:57 AM

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Reuven Lax
As I mentioned above, CoGroupByKey already takes advantage of this. Reiterating is not the most common use case, but it's definitely one that comes up. Also keep in mind that this API has supported reiterating for the past five years (since even before the SDK was donated to Apache). Therefore you

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
Okay, the self-join example is understandable. But - does it imply that it is actually required O(n^2) iterations (maybe caching can somehow help, but asymptotically, the complexity will be this)? If so, that seems to be prohibitively slow (for large inputs that don't fit into memory),

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Kenneth Knowles
CoGroupByKey is one example. To perform a CoGroupByKey based join requires multiple iterations (caching is key to getting performance). You could make up other calculations that require it, most of which would look like a self-join, like "output the largest difference between any two elements for

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Yichi Zhang
Congrats, Alan! On Fri, Sep 27, 2019 at 10:26 AM Robin Qiu wrote: > Congrats, Alan! > > On Fri, Sep 27, 2019 at 10:15 AM Hannah Jiang > wrote: > >> Congrats Alan! >> >> On Fri, Sep 27, 2019 at 9:57 AM Ruoyun Huang wrote: >> >>> Congratulations, Alan! >>> >>> >>> On Fri, Sep 27, 2019 at 9:55

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
I'd like to know the use-case. Why would you *need* to actually iterate the grouped elements twice? By definition the first iteration would have to extract some statistic (or subset of elements that must fit into memory). This statistic can then be used as another input for the second
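The two-pass pattern described in this message can be sketched in plain Python, outside of Beam. The function name `deviations_from_mean` is hypothetical; the point is only to show why a re-iterable container and a one-shot iterator behave differently when the grouped values are traversed twice.

```python
from typing import Iterable, List


def deviations_from_mean(values: Iterable[float]) -> List[float]:
    """Two passes over the same input: a statistic first, then its use."""
    total = 0.0
    count = 0
    for v in values:  # first pass: extract the statistic
        total += v
        count += 1
    if count == 0:
        return []
    mean = total / count
    # Second pass: only yields elements if `values` can be iterated again.
    return [v - mean for v in values]
```

Called with a list such as `[1.0, 2.0, 3.0]`, the second pass sees every element again. Called with `iter([1.0, 2.0, 3.0])`, the first loop exhausts the iterator and the second pass silently produces an empty result, which is the kind of behavior difference the thread is debating.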

Re: using avro instead of json for BigQueryIO.Write

2019-09-27 Thread Steve Niemitz
I put up a semi-WIP pull request https://github.com/apache/beam/pull/9665 for this. The initial results look good. I'll spend some time soon adding unit tests and documentation, but I'd appreciate it if someone could take a first pass over it. On Wed, Sep 18, 2019 at 6:14 PM Pablo Estrada

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Robin Qiu
Congrats, Alan! On Fri, Sep 27, 2019 at 10:15 AM Hannah Jiang wrote: > Congrats Alan! > > On Fri, Sep 27, 2019 at 9:57 AM Ruoyun Huang wrote: > >> Congratulations, Alan! >> >> >> On Fri, Sep 27, 2019 at 9:55 AM Rui Wang wrote: >> >>> Congrats! >>> >>> -Rui >>> >>> On Fri, Sep 27, 2019 at 9:54

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Hannah Jiang
Congrats Alan! On Fri, Sep 27, 2019 at 9:57 AM Ruoyun Huang wrote: > Congratulations, Alan! > > > On Fri, Sep 27, 2019 at 9:55 AM Rui Wang wrote: > >> Congrats! >> >> -Rui >> >> On Fri, Sep 27, 2019 at 9:54 AM Pablo Estrada wrote: >> >>> Yooh! : D >>> >>> On Fri, Sep 27, 2019 at 9:53 AM

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Reuven Lax
Are you doing this in streaming with windowed writes? Window grouping does not "happen" in Beam until a GroupByKey, so you do need the GroupByKey in that case. If you are not windowing but want a specific number of shards (though the general suggestion in that case is to not pick a specific

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Lukasz Cwik
Using a state variable to store the shard key introduces a GroupByKey within Dataflow to ensure that there is a strict ordering on state. Other runners insert similar materializations to guarantee this as well. Also a sufficiently powerful execution engine could do state processing for the

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Ruoyun Huang
Congratulations, Alan! On Fri, Sep 27, 2019 at 9:55 AM Rui Wang wrote: > Congrats! > > -Rui > > On Fri, Sep 27, 2019 at 9:54 AM Pablo Estrada wrote: > >> Yooh! : D >> >> On Fri, Sep 27, 2019 at 9:53 AM Yifan Zou wrote: >> >>> Congratulations, Alan! >>> >>> On Fri, Sep 27, 2019 at 9:18 AM

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Rui Wang
Congrats! -Rui On Fri, Sep 27, 2019 at 9:54 AM Pablo Estrada wrote: > Yooh! : D > > On Fri, Sep 27, 2019 at 9:53 AM Yifan Zou wrote: > >> Congratulations, Alan! >> >> On Fri, Sep 27, 2019 at 9:18 AM Ahmet Altay wrote: >> >>> Hi, >>> >>> Please join me and the rest of the Beam PMC in

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Pablo Estrada
Yooh! : D On Fri, Sep 27, 2019 at 9:53 AM Yifan Zou wrote: > Congratulations, Alan! > > On Fri, Sep 27, 2019 at 9:18 AM Ahmet Altay wrote: > >> Hi, >> >> Please join me and the rest of the Beam PMC in welcoming a new >> committer: Alan Myrvold >> >> Alan has been a long time Beam

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Yifan Zou
Congratulations, Alan! On Fri, Sep 27, 2019 at 9:18 AM Ahmet Altay wrote: > Hi, > > Please join me and the rest of the Beam PMC in welcoming a new > committer: Alan Myrvold > > Alan has been a long time Beam contributor. His contributions made Beam > more productive and friendlier [1] for all

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Reuven Lax
This should be an element in the compatibility matrix as well. On Fri, Sep 27, 2019 at 9:26 AM Kenneth Knowles wrote: > I am pretty surprised that we do not have a @Category(ValidatesRunner) > test in GroupByKeyTest that iterates multiple times. That is a major > oversight. We should have this

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Kenneth Knowles
I am pretty surprised that we do not have a @Category(ValidatesRunner) test in GroupByKeyTest that iterates multiple times. That is a major oversight. We should have this test, and it can be disabled by the SparkRunner's configuration. Kenn On Fri, Sep 27, 2019 at 9:24 AM Reuven Lax wrote: >

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Reuven Lax
The Dataflow version does not spill to disk. However Spark's design might require spilling to disk if you want that to be implemented properly. On Fri, Sep 27, 2019 at 9:08 AM David Morávek wrote: > Hi, > > Spark's GBK is currently implemented using `sortBy(key and > value).mapPartition(...)`

[ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Ahmet Altay
Hi, Please join me and the rest of the Beam PMC in welcoming a new committer: Alan Myrvold Alan has been a long time Beam contributor. His contributions made Beam more productive and friendlier [1] for all contributors with significant improvements to Beam release process, automation, and

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread David Morávek
Hi, Spark's GBK is currently implemented using `sortBy(key and value).mapPartition(...)` for non-merging windowing in order to support large keys and large scale shuffles. Merging windowing is implemented using standard GBK (underlying spark impl. uses ListCombiner + Hash Grouping), which is by

Kicking off Beam Meetup NYC

2019-09-27 Thread Austin Bennett
On the heels of the new Seattle Meetup (yesterday's event), announcing the kickoff of the first event in NYC. https://www.meetup.com/New-York-Apache-Beam/events/265128669/ We'll have Tyler Akidau sharing on Streaming SQL, and some talks from Oden Technologies (a fantastic example of Beam, using

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Shannon Duncan
Yes, specifically TextIO withNumShards(). On Fri, Sep 27, 2019 at 10:45 AM Reuven Lax wrote: > I'm not sure what you mean by "write out to a specific shard number." Are > you talking about FileIO sinks? > > Reuven > > On Fri, Sep 27, 2019 at 7:41 AM Shannon Duncan > wrote: > >> So when beam

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Reuven Lax
The Beam API was written to support multiple iterations, and there are definitely transforms that do so. I believe that CoGroupByKey may do this as well with the resulting iterator. I know that the Dataflow runner is able to handle iterators larger than available memory by paging them in from

Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Reuven Lax
I'm not sure what you mean by "write out to a specific shard number." Are you talking about FileIO sinks? Reuven On Fri, Sep 27, 2019 at 7:41 AM Shannon Duncan wrote: > So when beam writes out to a specific shard number, as I understand it > does a few things: > > - Assigns a shard key to each

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Shannon Duncan
I see two main options here. Create an in-memory Iterable as you do your first iteration. (poor implementation imo) Separate your iterations into separate DoFns and call them separately with the PCollection output from Shuffle. There are many different paths but finding the most parallel way is

Shuffling on shardnum, is it necessary?

2019-09-27 Thread Shannon Duncan
So when beam writes out to a specific shard number, as I understand it does a few things: - Assigns a shard key to each record (reduces parallelism) - Shuffles and Groups by the shard key to colocate all records - Writes out to each shard file within a single DoFn per key... When thinking about
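The three steps listed in this message can be sketched in plain Python to make the data flow concrete. This is not Beam's implementation: the function names are hypothetical, round-robin key assignment is an assumption made for determinism, and the `out-SSSSS-of-NNNNN` file naming is only illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def assign_shards(records: Iterable[str], num_shards: int) -> List[Tuple[int, str]]:
    """Step 1: attach a shard key to each record (round-robin, an assumption)."""
    return [(i % num_shards, record) for i, record in enumerate(records)]


def group_by_shard(keyed: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Step 2: the shuffle, colocating all records that share a shard key."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for shard, record in keyed:
        groups[shard].append(record)
    return dict(groups)


def shard_file_contents(groups: Dict[int, List[str]], num_shards: int) -> Dict[str, str]:
    """Step 3: one output 'file' per shard key (returned as name -> contents)."""
    return {
        "out-%05d-of-%05d" % (shard, num_shards): "\n".join(records)
        for shard, records in sorted(groups.items())
    }
```

For example, `shard_file_contents(group_by_shard(assign_shards(["a", "b", "c", "d", "e"], 2)), 2)` places `a`, `c`, `e` in the first file and `b`, `d` in the second. The grouping step is where parallelism is bounded: after it, at most `num_shards` writers can be active.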

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
+dev Lukasz, why do you think that users expect to be able to iterate grouped elements multiple times? Besides that it obviously suggests the 'Iterable'? The way that spark behaves is pretty much analogous to how MapReduce used to work - in certain cases it calls