(Warning: I may just be rehashing Davor's email in different language, but I think it might help.)
I think we could use a few definitions here (aside: do we have a definitions page? I could not find one).

- A *primitive* is a fundamental component of the Beam programming model. It should mean the same thing in every language, and it should not be expressible in terms of other primitives. If you look at the Beam Capability Matrix <http://beam.incubator.apache.org/capability-matrix/> and the Beam Technical Vision <https://drive.google.com/a/google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc&usp=sharing#>, you will see things like GroupByKey and ParDo listed among the primitives.

- An *SDK* is a language-specific implementation of the Beam model. An SDK will usually have at least two runners: a "testing and experimenting" runner that is likely to be single-process and in-memory, and a "real" runner that scales to large datasets and may use multiple cores and be distributed. In some languages, an SDK may have several runners (Java: Direct, InProcess, Cloud Dataflow, Flink, Spark).

I think that Log does not fit well with the definition of a primitive. It is likely to have somewhat different definitions and semantics across languages, and we can already implement it as a composite PTransform containing a ParDo + DoFn (see the first sketch below).

Next, I do think that each SDK needs to have a standard way to log. This needs to go above and beyond a Logging primitive -- log messages inside DoFns or libraries of composite transforms, for instance, should all be loggable the same way and end up being processed in a coherent manner. This standardized logging should not be runner-specific, because we do not want library authors to have to specialize their logging code to the intended runner, but it should be "interceptable" by runners to provide specialized behavior. In Java, for instance, we recommend slf4j loggers, but each runner can register customized logging handlers.

Third, we get to the question of whether most SDKs should include a Log PTransform in the "standard library", along with, say, the ability to read and write text files. We could make a reasonable argument for yes, if there were common patterns such as Ismaël suggested, e.g., "log if some condition holds", perhaps with some frequency. We should think somewhat carefully about what these patterns are: it is easy to dramatically slow down a pipeline by logging too much, and it is hard to implement certain patterns in a global way. E.g., "log every 1 second" would naturally mean "1 second ... per process" if we had many processes, unless we have a distributed rate limiter (the third sketch below shows this per-process behavior).
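To make the composite point concrete, here is a minimal sketch of a pass-through Log transform built from ParDo + DoFn on top of slf4j. Everything here is illustrative, not a proposed API: the class and method names are made up, and I am assuming the org.apache.beam.sdk package layout and the current DoFn/PTransform signatures (still in flux after the code drop), so adjust to whatever the Java SDK actually ships.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /** Hypothetical composite: logs each element and passes it through unchanged. */
    public class Log<T> extends PTransform<PCollection<T>, PCollection<T>> {

      // slf4j logger, so runners can intercept/redirect it via their own handlers.
      private static final Logger LOG = LoggerFactory.getLogger(Log.class);

      public static <T> Log<T> info() {
        return new Log<T>();
      }

      @Override
      public PCollection<T> apply(PCollection<T> input) {
        // No new primitive needed: the composite is just a ParDo over the input.
        return input.apply(ParDo.of(new DoFn<T, T>() {
          @Override
          public void processElement(ProcessContext c) {
            LOG.info("{}", c.element()); // the side effect, routed through slf4j
            c.output(c.element());       // identity: emit the element unchanged
          }
        }));
      }
    }

Usage would just be .apply(Log.<String>info()) anywhere in a pipeline, and a runner that plugs in its own logging backend intercepts the output with no changes to user code.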
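If we did put a Log transform in the standard library, JB's fluent Log.withLevel(...) naming (quoted below) could sit on top of exactly the same composite. Again a sketch only -- the string-based level and all the names are illustrative, not a settled design:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /** Hypothetical fluent variant of the composite Log sketched above. */
    public class Log<T> extends PTransform<PCollection<T>, PCollection<T>> {

      private static final Logger LOG = LoggerFactory.getLogger(Log.class);

      private final String level;
      private final String message; // optional prefix; may be null

      private Log(String level, String message) {
        this.level = level;
        this.message = message;
      }

      public static <T> Log<T> withLevel(String level) {
        return new Log<T>(level, null);
      }

      public Log<T> withMessage(String message) {
        return new Log<T>(level, message);
      }

      @Override
      public PCollection<T> apply(PCollection<T> input) {
        return input.apply(ParDo.of(new DoFn<T, T>() {
          @Override
          public void processElement(ProcessContext c) {
            String text = (message == null ? "" : message + ": ") + c.element();
            // Dispatch on the requested level; slf4j keeps this runner-interceptable.
            if ("DEBUG".equals(level)) {
              LOG.debug(text);
            } else if ("WARN".equals(level)) {
              LOG.warn(text);
            } else {
              LOG.info(text);
            }
            c.output(c.element()); // still a pure pass-through
          }
        }));
      }
    }

e.g. .apply(Log.<String>withLevel("WARN").withMessage("after filter")).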
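And to make the frequency caveat concrete, here is a sketch of "log every Nth element". The counter is necessarily local to a single DoFn instance, so with many workers the effective global rate is roughly (number of workers) x (per-worker rate) -- exactly why a true "log every 1 second" would need a distributed rate limiter. As above, all names are illustrative:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    /** Hypothetical DoFn: logs roughly one element out of every N, per instance. */
    class SampledLogFn<T> extends DoFn<T, T> {

      private static final Logger LOG = LoggerFactory.getLogger(SampledLogFn.class);

      private final long everyN;
      // Per-instance state: each worker counts independently, so the global
      // logging rate grows with the number of workers.
      private transient long count;

      SampledLogFn(long everyN) {
        this.everyN = everyN;
      }

      @Override
      public void processElement(ProcessContext c) {
        if (count++ % everyN == 0) {
          LOG.info("sampled element: {}", c.element());
        }
        c.output(c.element()); // pass the element through unchanged
      }
    }

used as .apply(ParDo.of(new SampledLogFn<String>(1000))).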
Hope that makes sense / hope it helps,

Thanks!
Dan

On Sun, Mar 20, 2016 at 1:05 PM, Jean-Baptiste Onofré <[email protected]> wrote:

> Welcome aboard and great idea ;)
>
> Probably DoFn is easier and more straightforward in your case.
>
> Anyway, it would be a good addition to the SDK (as primitives), or at least in a "Data Integration DSL" (as I thought first).
>
> Regards
> JB
>
> On 03/20/2016 09:01 PM, Ismaël Mejía wrote:
>
>> Hello,
>>
>> I agree with you JB, Log is a more appropriate name for the case of 'print'; we can definitely create a richer transform with your ideas, and we will discuss the details later on when we start to work together.
>>
>> The more abstract case, which I call Debug since I didn't find a better name, is a general transform that can be the base of many others that produce side effects but don't change the data in the PTransform. That's why I consider it a different (more abstract) transform per se; I implemented the general predicate + function application just to prove my point, and the Log/print case was just a test of a specific case.
>>
>> Since I am new to the Dataflow model, I don't know which unintended consequences this transform can have (or which good practices a transform that produces side effects must take care of); additionally, I have not thought about how to support more advanced features of the model (e.g. side inputs/outputs). Any ideas?
>>
>> But well, this is my hello world in the Dataflow model, so we'll see what's to come :)
>>
>> -Ismaël
>>
>> On Sun, Mar 20, 2016 at 4:18 PM, Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> Hi,
>>
>> thanks for the update.
>>
>> IMHO, I would name the Debug transform Log:
>>
>> .apply(Log.withLevel("DEBUG"))
>> .apply(Log.withLevel("INFO").withPattern("%d %m ..."))
>> .apply(Log.withLevel("WARN").withMessage("Foo").withStream("System.out"))
>>
>> It would be more flexible and related to the actual behavior.
>>
>> I would mimic a bit the Camel log component, for instance.
>>
>> If you don't mind, I will do it with you.
>>
>> Thanks
>> Regards
>> JB
>>
>> On 03/20/2016 12:07 PM, Ismaël Mejía wrote:
>>
>> Hi,
>>
>> The code of the transform is here, in a playground for Beam experiments I created (it is a bit alpha for the moment, and it does not have comments):
>>
>> https://github.com/iemejia/beam-playground/blob/master/src/main/java/org/apache/beam/transforms/Debug.java
>>
>> Since my initial goal was more of a test scenario in the DirectPipelineRunner, I haven't considered yet more advanced logging capabilities and the possible issues of distribution (serialization, in particular of dependencies, as well as exceptions, etc.), but of course it is something I expect to improve if there is interest. Do you see some immediate things to improve in order to try it with the distributed runners? (I want to do this, as an excuse also to try the FlinkRunner.)
>>
>> Best,
>> -Ismael
>>
>> On Sun, Mar 20, 2016 at 11:13 AM, Jean-Baptiste Onofré <[email protected]> wrote:
>>
>> By the way, for the "Integration" DSL, in addition to an explicit debug transform, it would make sense to have an implicit "Tracer". It's something that I planned: it would allow us to have sampling on a PCollection if the pipeline tracer is enabled (like we do in a Camel route with the tracer).
>>
>> Regards
>> JB
>>
>> On 03/20/2016 10:14 AM, Ismaël Mejía wrote:
>>
>> Hello,
>>
>> I just started playing with Beam, and I wanted to debug what happens between transforms in pipelines, so I wrote a simple 'Debug' transform for this. The idea is to apply a function, based on a predicate, to any element in a collection without changing the collection -- in other words, a transform that does not transform but produces side effects.
>>
>> The idea is better illustrated with this simple example:
>>
>> .apply(FlatMapElements.via((String text) -> Arrays.asList(text.split(" ")))
>>     .withOutputType(new TypeDescriptor<String>() {}))
>> .apply(Debug
>>     .when((String s) -> s.startsWith("A"))
>>     .with((String s) -> {
>>         System.out.println(s);
>>         return null;
>>     }))
>> .apply(Filter.byPredicate((String text) -> text.length() > 5))
>> .apply(Debug.print()); // sugared method, same as above
>>
>> I think this can be useful (at least for debugging purposes). Is there something like this already in the SDK? If this is not the case, can you please give me some feedback/ideas to improve my transform?
>>
>> Thanks,
>> -Ismael
>>
>> ps. You can find the code of the first version of the transform here:
>> https://github.com/iemejia/beam-playground/blob/master/src/main/java/org/apache/beam/transforms/Debug.java
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com

--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com
