Re: execute() and collect()/print()/count()

2015-06-23 Thread Stephan Ewen
That would help to get around many cases, it still leaves some open, like
forgetting to create a sink after transformations after a collect() call.

Would probably be a good improvement over the status quo, though...

On Mon, Jun 22, 2015 at 9:37 PM, Alexander Alexandrov 
alexander.s.alexand...@gmail.com wrote:

 What about adding some state state to the DataBag internals that tracks the
 following conditions

 1. whether the last job execution was triggered by an enforcer API method
 like print() / collect();
 2. whether a DataSource / lazy operator was created after that;

 If 1 is true and 2 is false, a WARN can be displayed. Otherwise, we can
 still throw an error.


 2015-06-22 18:17 GMT+02:00 Stephan Ewen se...@apache.org:

  We have two situations to trade off here, and fixing one will make the
  other worse:
 
  1) env.execute() after collect() - see Max's mail
 
  2) env.execute() on empty sinks program. Not throwing an exception makes
  people wonder why nothing happens (if they write the program to just test
  whether it runs or if they want to measure time).
 
  Both choices make one behave nice and the other not. So far, the idea was
  that throwing an exception on empty sinks is that the error message will
  help people figure out what is wrong fast. Debugging why nothing happens
  can be slow.
 
 
  It is hard to say if we would not introduce another source of confusion
 by
  fixing one...
 
 
 
 
 
  On Mon, Jun 22, 2015 at 10:26 AM, Maximilian Michels m...@apache.org
  wrote:
 
   +1 for cleaning up the documentation
   +1 for adding a link to the documentation (should be a permalink)
   +1 for printing a warning instead of an exception
  
   On Sun, Jun 21, 2015 at 12:25 AM, Robert Metzger rmetz...@apache.org
   wrote:
  
We could also add a link to the documentation into the exception that
explains the behavior.
   
On Fri, Jun 19, 2015 at 5:52 AM, Chiwan Park chiwanp...@icloud.com
wrote:
   
 +1 for ignoring execute() call with warning.

 But I'm concerned for how the user catches the error in program
  without
 any data sinks.

 By the way, eager execution is not well documented in data sinks
   section
 but is in program
 skeleton section. [1] This makes the user’s confusion. We should
  clean
   up
 documents.
 There are many codes calling execute() method after print() method.
[2][3]

 We should add a description for count() method to documents too.

 [1]

   
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#data-sinks
 [2]

   
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#parallel-execution
 [3]

   
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators

 Regards,
 Chiwan Park

  On Jun 19, 2015, at 9:15 PM, Maximilian Michels m...@apache.org
wrote:
 
  Dear Flink community,
 
  I have stopped to count how many people on the user list and
 during
Flink
  trainings have asked why their Flink program throws an Exception
  when
 they
  just one to print a DataSet. The reason for this is that print()
  now
  executes eagerly, thus, executes the Flink program. Subsequent
  calls
   to
  execute() need to define new DataSinks and throw an exception
otherwise.
 
  We have recently introduced a flag in the ExecutionEnvironment
 that
 checks
  whether the user executed before (explicitly via execute() or
implicitly
  through collect()/print()/count()). That enabled us to print a
  nicer
  exception message. However, users either do not read the
 exception
 message
  or do not understand it. They do ask this question a lot.
 
  That's why I propose to ignore calls to execute() entirely if no
   sinks
 are
  defined. That will get rid of one of the core annoyances for
 Flink
 users. I
  know, that this is painfully for us programmers because we
  understand
how
  Flink works internally but let's step back once and see that it
wouldn't
 be
  so bad if execute didn't do anything in case of no new sinks.
 
  What would be the downside of this change? Users might call
  execute()
and
  wonder that nothing happens. We would then simply print a warning
   that
  their program didn't define any sinks. That is a big difference
 to
   the
  behavior before because users are scared of exceptions. If they
  just
get
 a
  warning they will double-check their program and investigate why
nothing
  happens. Most of the cases they do actually have defined sinks
 but
simply
  left a call to execute() when they were printing a DataSet.
 
  What are you opinions on this issue? I have opened a JIRA for
 this
  as
 well:
  https://issues.apache.org/jira/browse/FLINK-2249
  

Re: execute() and collect()/print()/count()

2015-06-23 Thread Maximilian Michels
@Stephan I understand your concerns that the user might wonder that nothing
happens when executing. However, in this case a warning will provide a hint
to the user that he didn't define any sinks. In the case where he
immediately calls execute() after an eager execution, the program is
actually executed and he still gets an Exception although everything
already ran. I think that's much worse because the user sees that something
executed and thinks it failed.

@Alexander I like your proposal. It adds a bit more logic to the
ExecutionEnvironment but I think that's acceptable for the sake of the user
experience.





On Mon, Jun 22, 2015 at 9:37 PM, Alexander Alexandrov 
alexander.s.alexand...@gmail.com wrote:

 What about adding some state state to the DataBag internals that tracks the
 following conditions

 1. whether the last job execution was triggered by an enforcer API method
 like print() / collect();
 2. whether a DataSource / lazy operator was created after that;

 If 1 is true and 2 is false, a WARN can be displayed. Otherwise, we can
 still throw an error.


 2015-06-22 18:17 GMT+02:00 Stephan Ewen se...@apache.org:

  We have two situations to trade off here, and fixing one will make the
  other worse:
 
  1) env.execute() after collect() - see Max's mail
 
  2) env.execute() on empty sinks program. Not throwing an exception makes
  people wonder why nothing happens (if they write the program to just test
  whether it runs or if they want to measure time).
 
  Both choices make one behave nice and the other not. So far, the idea was
  that throwing an exception on empty sinks is that the error message will
  help people figure out what is wrong fast. Debugging why nothing happens
  can be slow.
 
 
  It is hard to say if we would not introduce another source of confusion
 by
  fixing one...
 
 
 
 
 
  On Mon, Jun 22, 2015 at 10:26 AM, Maximilian Michels m...@apache.org
  wrote:
 
   +1 for cleaning up the documentation
   +1 for adding a link to the documentation (should be a permalink)
   +1 for printing a warning instead of an exception
  
   On Sun, Jun 21, 2015 at 12:25 AM, Robert Metzger rmetz...@apache.org
   wrote:
  
We could also add a link to the documentation into the exception that
explains the behavior.
   
On Fri, Jun 19, 2015 at 5:52 AM, Chiwan Park chiwanp...@icloud.com
wrote:
   
 +1 for ignoring execute() call with warning.

 But I'm concerned for how the user catches the error in program
  without
 any data sinks.

 By the way, eager execution is not well documented in data sinks
   section
 but is in program
 skeleton section. [1] This makes the user’s confusion. We should
  clean
   up
 documents.
 There are many codes calling execute() method after print() method.
[2][3]

 We should add a description for count() method to documents too.

 [1]

   
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#data-sinks
 [2]

   
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#parallel-execution
 [3]

   
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators

 Regards,
 Chiwan Park

  On Jun 19, 2015, at 9:15 PM, Maximilian Michels m...@apache.org
wrote:
 
  Dear Flink community,
 
  I have stopped to count how many people on the user list and
 during
Flink
  trainings have asked why their Flink program throws an Exception
  when
 they
  just one to print a DataSet. The reason for this is that print()
  now
  executes eagerly, thus, executes the Flink program. Subsequent
  calls
   to
  execute() need to define new DataSinks and throw an exception
otherwise.
 
  We have recently introduced a flag in the ExecutionEnvironment
 that
 checks
  whether the user executed before (explicitly via execute() or
implicitly
  through collect()/print()/count()). That enabled us to print a
  nicer
  exception message. However, users either do not read the
 exception
 message
  or do not understand it. They do ask this question a lot.
 
  That's why I propose to ignore calls to execute() entirely if no
   sinks
 are
  defined. That will get rid of one of the core annoyances for
 Flink
 users. I
  know, that this is painfully for us programmers because we
  understand
how
  Flink works internally but let's step back once and see that it
wouldn't
 be
  so bad if execute didn't do anything in case of no new sinks.
 
  What would be the downside of this change? Users might call
  execute()
and
  wonder that nothing happens. We would then simply print a warning
   that
  their program didn't define any sinks. That is a big difference
 to
   the
  behavior before because users are scared of exceptions. If they
  

Re: execute() and collect()/print()/count()

2015-06-22 Thread Stephan Ewen
We have two situations to trade off here, and fixing one will make the
other worse:

1) env.execute() after collect() - see Max's mail

2) env.execute() on empty sinks program. Not throwing an exception makes
people wonder why nothing happens (if they write the program to just test
whether it runs or if they want to measure time).

Both choices make one behave nice and the other not. So far, the idea was
that throwing an exception on empty sinks is that the error message will
help people figure out what is wrong fast. Debugging why nothing happens
can be slow.


It is hard to say if we would not introduce another source of confusion by
fixing one...





On Mon, Jun 22, 2015 at 10:26 AM, Maximilian Michels m...@apache.org wrote:

 +1 for cleaning up the documentation
 +1 for adding a link to the documentation (should be a permalink)
 +1 for printing a warning instead of an exception

 On Sun, Jun 21, 2015 at 12:25 AM, Robert Metzger rmetz...@apache.org
 wrote:

  We could also add a link to the documentation into the exception that
  explains the behavior.
 
  On Fri, Jun 19, 2015 at 5:52 AM, Chiwan Park chiwanp...@icloud.com
  wrote:
 
   +1 for ignoring execute() call with warning.
  
   But I'm concerned for how the user catches the error in program without
   any data sinks.
  
   By the way, eager execution is not well documented in data sinks
 section
   but is in program
   skeleton section. [1] This makes the user’s confusion. We should clean
 up
   documents.
   There are many codes calling execute() method after print() method.
  [2][3]
  
   We should add a description for count() method to documents too.
  
   [1]
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#data-sinks
   [2]
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#parallel-execution
   [3]
  
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators
  
   Regards,
   Chiwan Park
  
On Jun 19, 2015, at 9:15 PM, Maximilian Michels m...@apache.org
  wrote:
   
Dear Flink community,
   
I have stopped to count how many people on the user list and during
  Flink
trainings have asked why their Flink program throws an Exception when
   they
just one to print a DataSet. The reason for this is that print() now
executes eagerly, thus, executes the Flink program. Subsequent calls
 to
execute() need to define new DataSinks and throw an exception
  otherwise.
   
We have recently introduced a flag in the ExecutionEnvironment that
   checks
whether the user executed before (explicitly via execute() or
  implicitly
through collect()/print()/count()). That enabled us to print a nicer
exception message. However, users either do not read the exception
   message
or do not understand it. They do ask this question a lot.
   
That's why I propose to ignore calls to execute() entirely if no
 sinks
   are
defined. That will get rid of one of the core annoyances for Flink
   users. I
know, that this is painfully for us programmers because we understand
  how
Flink works internally but let's step back once and see that it
  wouldn't
   be
so bad if execute didn't do anything in case of no new sinks.
   
What would be the downside of this change? Users might call execute()
  and
wonder that nothing happens. We would then simply print a warning
 that
their program didn't define any sinks. That is a big difference to
 the
behavior before because users are scared of exceptions. If they just
  get
   a
warning they will double-check their program and investigate why
  nothing
happens. Most of the cases they do actually have defined sinks but
  simply
left a call to execute() when they were printing a DataSet.
   
What are you opinions on this issue? I have opened a JIRA for this as
   well:
https://issues.apache.org/jira/browse/FLINK-2249
   
Best,
Max
  
  
  
  
  
 



Re: execute() and collect()/print()/count()

2015-06-22 Thread Maximilian Michels
+1 for cleaning up the documentation
+1 for adding a link to the documentation (should be a permalink)
+1 for printing a warning instead of an exception

On Sun, Jun 21, 2015 at 12:25 AM, Robert Metzger rmetz...@apache.org
wrote:

 We could also add a link to the documentation into the exception that
 explains the behavior.

 On Fri, Jun 19, 2015 at 5:52 AM, Chiwan Park chiwanp...@icloud.com
 wrote:

  +1 for ignoring execute() call with warning.
 
  But I'm concerned for how the user catches the error in program without
  any data sinks.
 
  By the way, eager execution is not well documented in data sinks section
  but is in program
  skeleton section. [1] This makes the user’s confusion. We should clean up
  documents.
  There are many codes calling execute() method after print() method.
 [2][3]
 
  We should add a description for count() method to documents too.
 
  [1]
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#data-sinks
  [2]
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#parallel-execution
  [3]
 
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators
 
  Regards,
  Chiwan Park
 
   On Jun 19, 2015, at 9:15 PM, Maximilian Michels m...@apache.org
 wrote:
  
   Dear Flink community,
  
   I have stopped to count how many people on the user list and during
 Flink
   trainings have asked why their Flink program throws an Exception when
  they
   just one to print a DataSet. The reason for this is that print() now
   executes eagerly, thus, executes the Flink program. Subsequent calls to
   execute() need to define new DataSinks and throw an exception
 otherwise.
  
   We have recently introduced a flag in the ExecutionEnvironment that
  checks
   whether the user executed before (explicitly via execute() or
 implicitly
   through collect()/print()/count()). That enabled us to print a nicer
   exception message. However, users either do not read the exception
  message
   or do not understand it. They do ask this question a lot.
  
   That's why I propose to ignore calls to execute() entirely if no sinks
  are
   defined. That will get rid of one of the core annoyances for Flink
  users. I
   know, that this is painfully for us programmers because we understand
 how
   Flink works internally but let's step back once and see that it
 wouldn't
  be
   so bad if execute didn't do anything in case of no new sinks.
  
   What would be the downside of this change? Users might call execute()
 and
   wonder that nothing happens. We would then simply print a warning that
   their program didn't define any sinks. That is a big difference to the
   behavior before because users are scared of exceptions. If they just
 get
  a
   warning they will double-check their program and investigate why
 nothing
   happens. Most of the cases they do actually have defined sinks but
 simply
   left a call to execute() when they were printing a DataSet.
  
   What are you opinions on this issue? I have opened a JIRA for this as
  well:
   https://issues.apache.org/jira/browse/FLINK-2249
  
   Best,
   Max
 
 
 
 
 



Re: execute() and collect()/print()/count()

2015-06-20 Thread Robert Metzger
We could also add a link to the documentation into the exception that
explains the behavior.

On Fri, Jun 19, 2015 at 5:52 AM, Chiwan Park chiwanp...@icloud.com wrote:

 +1 for ignoring execute() call with warning.

 But I'm concerned for how the user catches the error in program without
 any data sinks.

 By the way, eager execution is not well documented in data sinks section
 but is in program
 skeleton section. [1] This makes the user’s confusion. We should clean up
 documents.
 There are many codes calling execute() method after print() method. [2][3]

 We should add a description for count() method to documents too.

 [1]
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#data-sinks
 [2]
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#parallel-execution
 [3]
 http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators

 Regards,
 Chiwan Park

  On Jun 19, 2015, at 9:15 PM, Maximilian Michels m...@apache.org wrote:
 
  Dear Flink community,
 
  I have stopped to count how many people on the user list and during Flink
  trainings have asked why their Flink program throws an Exception when
 they
  just one to print a DataSet. The reason for this is that print() now
  executes eagerly, thus, executes the Flink program. Subsequent calls to
  execute() need to define new DataSinks and throw an exception otherwise.
 
  We have recently introduced a flag in the ExecutionEnvironment that
 checks
  whether the user executed before (explicitly via execute() or implicitly
  through collect()/print()/count()). That enabled us to print a nicer
  exception message. However, users either do not read the exception
 message
  or do not understand it. They do ask this question a lot.
 
  That's why I propose to ignore calls to execute() entirely if no sinks
 are
  defined. That will get rid of one of the core annoyances for Flink
 users. I
  know, that this is painfully for us programmers because we understand how
  Flink works internally but let's step back once and see that it wouldn't
 be
  so bad if execute didn't do anything in case of no new sinks.
 
  What would be the downside of this change? Users might call execute() and
  wonder that nothing happens. We would then simply print a warning that
  their program didn't define any sinks. That is a big difference to the
  behavior before because users are scared of exceptions. If they just get
 a
  warning they will double-check their program and investigate why nothing
  happens. Most of the cases they do actually have defined sinks but simply
  left a call to execute() when they were printing a DataSet.
 
  What are you opinions on this issue? I have opened a JIRA for this as
 well:
  https://issues.apache.org/jira/browse/FLINK-2249
 
  Best,
  Max







execute() and collect()/print()/count()

2015-06-19 Thread Maximilian Michels
Dear Flink community,

I have stopped to count how many people on the user list and during Flink
trainings have asked why their Flink program throws an Exception when they
just one to print a DataSet. The reason for this is that print() now
executes eagerly, thus, executes the Flink program. Subsequent calls to
execute() need to define new DataSinks and throw an exception otherwise.

We have recently introduced a flag in the ExecutionEnvironment that checks
whether the user executed before (explicitly via execute() or implicitly
through collect()/print()/count()). That enabled us to print a nicer
exception message. However, users either do not read the exception message
or do not understand it. They do ask this question a lot.

That's why I propose to ignore calls to execute() entirely if no sinks are
defined. That will get rid of one of the core annoyances for Flink users. I
know, that this is painfully for us programmers because we understand how
Flink works internally but let's step back once and see that it wouldn't be
so bad if execute didn't do anything in case of no new sinks.

What would be the downside of this change? Users might call execute() and
wonder that nothing happens. We would then simply print a warning that
their program didn't define any sinks. That is a big difference to the
behavior before because users are scared of exceptions. If they just get a
warning they will double-check their program and investigate why nothing
happens. Most of the cases they do actually have defined sinks but simply
left a call to execute() when they were printing a DataSet.

What are you opinions on this issue? I have opened a JIRA for this as well:
https://issues.apache.org/jira/browse/FLINK-2249

Best,
Max


Re: execute() and collect()/print()/count()

2015-06-19 Thread Chiwan Park
+1 for ignoring execute() call with warning.

But I'm concerned for how the user catches the error in program without any 
data sinks.

By the way, eager execution is not well documented in data sinks section but is 
in program
skeleton section. [1] This makes the user’s confusion. We should clean up 
documents.
There are many codes calling execute() method after print() method. [2][3]

We should add a description for count() method to documents too.

[1] 
http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#data-sinks
[2] 
http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#parallel-execution
[3] 
http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#iteration-operators

Regards,
Chiwan Park

 On Jun 19, 2015, at 9:15 PM, Maximilian Michels m...@apache.org wrote:
 
 Dear Flink community,
 
 I have stopped to count how many people on the user list and during Flink
 trainings have asked why their Flink program throws an Exception when they
 just one to print a DataSet. The reason for this is that print() now
 executes eagerly, thus, executes the Flink program. Subsequent calls to
 execute() need to define new DataSinks and throw an exception otherwise.
 
 We have recently introduced a flag in the ExecutionEnvironment that checks
 whether the user executed before (explicitly via execute() or implicitly
 through collect()/print()/count()). That enabled us to print a nicer
 exception message. However, users either do not read the exception message
 or do not understand it. They do ask this question a lot.
 
 That's why I propose to ignore calls to execute() entirely if no sinks are
 defined. That will get rid of one of the core annoyances for Flink users. I
 know, that this is painfully for us programmers because we understand how
 Flink works internally but let's step back once and see that it wouldn't be
 so bad if execute didn't do anything in case of no new sinks.
 
 What would be the downside of this change? Users might call execute() and
 wonder that nothing happens. We would then simply print a warning that
 their program didn't define any sinks. That is a big difference to the
 behavior before because users are scared of exceptions. If they just get a
 warning they will double-check their program and investigate why nothing
 happens. Most of the cases they do actually have defined sinks but simply
 left a call to execute() when they were printing a DataSet.
 
 What are you opinions on this issue? I have opened a JIRA for this as well:
 https://issues.apache.org/jira/browse/FLINK-2249
 
 Best,
 Max