Re: Breaking MAPREDUCE out of scalding-core — Re: [cascading3 branch] descriptions & parallelism

2016-11-04 Thread 'Piyush Narang' via Scalding Development
This is pretty cool! Shall take a look at the WIP PR as well.

Re: Breaking MAPREDUCE out of scalding-core — Re: [cascading3 branch] descriptions & parallelism

2016-11-04 Thread Cyrille Chépélov

Thanks!

That's PR #1617: https://github.com/twitter/scalding/pull/1617


Of note is
scalding-core-fabric-tests/src/fabric/scala/com/twitter/scalding/CoreTestFabric.scala,
which is really anything in
scalding-core/src/test/scala/com/twitter/scalding/CoreTest.scala
that needed to .runHadoop


The real fun starts in Mode.scala, Source.scala and FileSource.scala

-- Cyrille

Re: Breaking MAPREDUCE out of scalding-core — Re: [cascading3 branch] descriptions & parallelism

2016-11-04 Thread 'Oscar Boykin' via Scalding Development
Thanks for taking on this project. I'm excited about it.

Can you go ahead and make a WIP PR so we can see what the diff looks like
and start giving feedback?

I'll be reviewing the WIP PR carefully.

On Fri, Nov 4, 2016 at 8:43 AM Cyrille Chépélov 
wrote:

> Hi there,
>
> Some progress on the "separation of fabrics" project:
>
> TL;DR: I have a branch here
> https://github.com/cchepelov/scalding/tree/split-fabrics that is mostly
> working on *Hadoop*, *Hadoop2-MR1* *and Tez*, … and baby steps on *Flink*.
>
>
> *The Good *
>
>- scalding-core has dependencies on Hadoop for HDFS, but no longer has
>explicit dependencies on MAPREDUCE
>- One can switch between MAPREDUCE via the legacy hadoop1 API and
>MAPREDUCE via Cascading's hadoop2-mr1 fabric
>- Most tests run on all four available fabrics in addition to Local.
>That is: Legacy Hadoop, Hadoop2-MR1, Tez, *and Flink*.
>- Switching from a fabric to another is a matter of supplying the
>appropriate fabric jar (scalding-fabric-hadoop,  scalding-fabric-tez, etc.)
>in your assembly
>- Even the REPL seems to accept using a different fabric (!)
>- Having an explicit per-fabric bit of code within Scalding enables
>experimentation with more advanced things, such as implementing
>scalding-specific Cascading Planner graph transforms, as Chris advises.
>- I *think* I didn't break any widely-used API at the source level.
>
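As a rough illustration of the fabric-jar switch described above, selecting a fabric could look like this in an assembly's build. The scalding-fabric-* module names come from the thread itself; the group id and version below are placeholders, not published coordinates.

```scala
// build.sbt sketch: pick exactly one fabric module for your assembly.
// Version and group id are hypothetical; only the module names are from the thread.
libraryDependencies ++= Seq(
  "com.twitter" %% "scalding-core"       % "0.17.0-SNAPSHOT",
  // swap this single line to change fabrics:
  "com.twitter" %% "scalding-fabric-tez" % "0.17.0-SNAPSHOT"
  // e.g. "com.twitter" %% "scalding-fabric-hadoop" % "0.17.0-SNAPSHOT"
)
```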
>
> *The Bad *
>
>- I *think* I didn't break any widely-used API at the source level,
>but I haven't (yet) checked if any damage control should/can be done
>- A few tests still break in Tez. This is on things that I've lived
>with for a long time, but fixing those should be easier and a higher
>priority now. For now there seem to be two outstanding items left:
>1. mapping .withReducers all the way down to the level of parallelism in
>the TezChild node in charge of performing that processing, and 2. perhaps a
>planner bug, or perhaps a missing scalding-specific planner transform to
>handle jobs involving Sketched Joins *(that's on cascading 3.2-wip-6)*
>- Flink is not yet ready for prime time. At the moment, I'm building
>it using a local snapshot reflecting
>https://github.com/dataArtisans/cascading-flink/pull/70 — This is
>required as some of Cascading's internal interfaces changed a bit since
>3.1.
>Some of the tests are bound to fail for now, as cascading-flink cannot
>yet map some variants of hash joins (outer right hash joins, for instance).
>- Mode.scala is a mess and will need a little bit of clean-up
>- There are still a couple tests that are bound to fail
>- Any test that was doing pattern matching on the exact type of Mode
>(Test vs. Hadoop vs. Local vs. HadoopTest) *will* fail, and there is
>no solution
>- Tez and Flink *tests* seem quite slow. Not yet sure what's
>happening; it seems some of the code is simply waiting and waking up long
>after a given test job is complete.
>
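To make the Mode pattern-matching breakage above concrete, here is a minimal self-contained sketch (stub types, not scalding's real Mode hierarchy): any test that matches on the exact runtime type silently mishandles a newly introduced fabric mode.

```scala
// Stub hierarchy for illustration only; scalding's real Mode is richer.
sealed trait Mode
case object Local extends Mode
case object Hadoop extends Mode
case object HadoopTest extends Mode
case object Tez extends Mode // a new fabric-specific mode after the split

// The kind of exact-type dispatch the broken tests rely on:
def isHadoopLike(m: Mode): Boolean = m match {
  case Hadoop | HadoopTest => true
  case _                   => false // Tez lands here, so old assumptions fail
}
```

This is why, as the email says, there is no general fix: the old exact-type matches simply cannot anticipate fabric modes that did not exist when they were written.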
> *The Ugly*
>
>- Mode.scala is a mess and will *really* need a little bit of clean-up
>
>- we still need to compile scalding-core with a *provided* dependency
>on either cascading-hadoop or cascading-hadoop2-mr1. This is due to
>HadoopTap and friends (HDFS support). Ideally we could have a (perhaps
>hard?) dependency on cascading-hadoop2-io since everyone's using it
>(hadoop2-mr1, tez, flink), but we'd have to manage the case of
>cascading-hadoop (which bundles almost identical copies but cannot, by
>design, depend on cascading-hadoop2-io). I'm still slightly unsure of the best
>course of action; I'd like things in scalding-core to actually not compile
>if they still accidentally depend on MAPREDUCE. I'm unsure that's achievable
>as it stands.
>
>- I've tried to share the fabric-sensitive tests from scalding-core
>into a pool of tests that is shared and verified with all fabrics: this is
>scalding-core-fabric-tests
>
>Although Scalatest's internal discovery seems to be happy with running
>anything that looks like a test, the discovery module used by "sbt test" is
>different. It only looks at tests that are implemented within the current
>project, specifically ignoring tests inherited from dependencies.
>
>I failed to find a way to convince sbt to adopt scalatest's discovery
>pattern. As a result, I've moved the "shared" tests from scalding-core-fabric-tests
>into another subdirectory of src/, which is referenced by all four
>fabrics as their own. Consequently, this code is compiled 4 times, and
>IntelliJ can get confused and refuse to step into it.
>
>If there is an sbt guru around willing to give me a hand on this, I'd
>be really grateful.
>
>- Making counter implementation dependent on the fabric required
>passing a class *name* into fabric-specific properties, then using
>
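The counter mechanism sketched at the end of the quote, passing an implementation's class *name* through fabric-specific properties, can be illustrated as follows. The property key, trait, and class names here are assumptions for illustration, not scalding's actual API; the point is only that reflective loading keeps scalding-core free of compile-time links to any fabric jar.

```scala
// Hypothetical counter abstraction; not scalding's real interface.
trait CounterImpl {
  def increment(group: String, counter: String, amount: Long): Unit
}

// Fallback implementation that drops all counter updates.
class NoopCounters extends CounterImpl {
  def increment(group: String, counter: String, amount: Long): Unit = ()
}

// Resolve the implementation by *name* from fabric-supplied properties,
// so the core module never references fabric classes directly.
def loadCounterImpl(props: Map[String, String]): CounterImpl = {
  val className = props.getOrElse("scalding.counter.impl", classOf[NoopCounters].getName)
  Class.forName(className).getDeclaredConstructor().newInstance().asInstanceOf[CounterImpl]
}
```

A fabric jar would then set the (hypothetical) "scalding.counter.impl" property to its own implementation's class name.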

Re: [cascading3 branch] descriptions & parallelism

2016-10-12 Thread 'Oscar Boykin' via Scalding Development
sounds great!

On Tue, Oct 11, 2016 at 11:39 PM Cyrille Chépélov <
c...@transparencyrights.com> wrote:

> Oscar, Piyush,
>
> thanks for the feedback!
>
> At the moment, I'm not sure it's realistic to break the dependency
> on "hadoop" completely out of scalding-core. As an intermediate goal, I'd
> shoot for at least soft-removing the assumption that the *processing* is
> made on Hadoop, but the storage interface will pretty much remain HDFS for
> the time being (IOW, I'll leave Source essentially unchanged in
> scalding-core).
>
> Meanwhile, I'm taking the messages here and on the gitter channel as
> positive towards the principle of scalding-$FABRIC sub-modules, and will
> start working on that in the background.
>
>
> -- Cyrille
>
>
> On 12/10/2016 at 03:29, 'Oscar Boykin' via Scalding Development wrote:
>
> Generally, I think this is a good idea also (separate modules for
> fabrics).
>
> I agree that Mode and Job are a bit hairy in spots. I think we can remove
> some deprecated code if it makes life significantly easier, but source and
> binary compatibility should be kept as much as we can reasonably manage.
>
> I would actually really rather `buildFlow` be private[scalding] but maybe
> that is too much. Making it return a subclass of Flow seems like a fine
> idea to me at the moment.
>
> Breaking hadoop out of scalding-core seems pretty hard since `Source` has
> it baked in at a few spots. That said, the Source abstractions in scalding
> are not very great. If we could improve that (without removing support for
> the old stuff) it might be worth it. Many have complained about Source's
> design over the years, but we have not really had a full proposal that
> seems to address all the concerns.
>
> The desire for jobs to all look the same across all fabrics makes
> modularization a bit ugly.
>
> On Tue, Oct 11, 2016 at 2:23 PM 'Piyush Narang' via Scalding Development <
> scalding-dev@googlegroups.com> wrote:
>
> We ran into similar problems while trying to set the number of reducers
> while testing out Cascading3 on Tez. We hacked around it temporarily, but
> haven't yet cleaned up that code and put it out for review (we'll need to
> fork MR / Tez there as nodeConfigDef works for Tez but not Hadoop). Based
> on my understanding, so far we've tried to delegate as much of this to
> Cascading as we can but there seem to be a few places where we're doing
> some platform specific stuff in Scalding. Breaking up to create
> fabric-specific sub-modules seems like a nice idea to me. We might need to
> think through the right way to do this to ensure we don't break stuff.
> Would it make sense to spin up an issue and we can discuss on it?
>
> On Tue, Oct 11, 2016 at 10:42 AM, Cyrille Chépélov <
> c...@transparencyrights.com> wrote:
>
> Hi,
>
> I'm trying to tie a few loose ends in the way step descriptions (text
> typically passed via *.withDescriptions(...)*) and the desired level of
> parallelism (typically passed via *.withReducers(N)*) is pushed on the
> various fabrics.
>
> Right now:
>
>- Most of the scalding code base either ignores the back-end (good) or
>assumes the world is either Local or HadoopFlow (which covers Hadoop 1.x and MR1).
>As a consequence, a couple things don't yet work smoothly on Tez and I
>assume on Flink.
>- the descriptions are entirely dropped if not running on Hadoop1 or
>MR1
>- .withReducers sets a hadoop-specific property (mapred.reduce.tasks)
>at RichPipe#L41
>- the Tez fabric ignores .withReducers; and there is no other conduit
>(for now) to set the number of desired parts on the sinks. As a
>consequence, you can't run a tez DAG with a large level of parallelism and
>a small (single) number of output files (e.g. stats leading to a result
>file of a couple dozen lines); you must pick one and set
>*cascading.flow.runtime.gather.partitions.num*. There are workarounds,
>but they're quite ugly.
>- there are a few instances of "flow match { case HadoopFlow =>
>doSomething ; case _ => () }" scattered around the code
>- there's some heavily reflection-based code in Mode.scala
>which depends on jars not part of the scalding build process (and it's good
>that these jars stay out of the scalding-core build, e.g. Tez client
>libraries)
>- While it may be desirable to experiment with scalding-specific
>transform
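The .withReducers coupling described in the list above can be sketched with a stub (illustrative only, not scalding's real RichPipe): the reducer count is written under an MR-only configuration key that other fabrics never consult, which is exactly why Tez ignores it.

```scala
// Stub pipe for illustration; real scalding stores this on the Cascading pipe.
final case class StubPipe(props: Map[String, String] = Map.empty) {
  def withReducers(n: Int): StubPipe =
    copy(props = props + ("mapred.reduce.tasks" -> n.toString))
}

val p = StubPipe().withReducers(16)

// An MR fabric reads the value; a Tez runtime, looking for its own key
// (e.g. cascading.flow.runtime.gather.partitions.num), finds nothing:
val mrReducers     = p.props.get("mapred.reduce.tasks")
val tezParallelism = p.props.get("cascading.flow.runtime.gather.partitions.num")
```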