Re: Reading null value from datasets

2015-10-23 Thread Maximilian Michels
Hi Guido,

This depends on your use case but you may read those values as type String
and treat them accordingly.
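
For example (just a sketch, with made-up field names and path):

import org.apache.flink.api.scala._

// Columns that may be empty are declared as String here.
case class Person(name: String, lastName: String, age: String)

val env = ExecutionEnvironment.getExecutionEnvironment
val people: DataSet[Person] = env.readCsvFile[Person]("/path/to/people.csv")

// An empty CSV field arrives as "", so you can filter on it or convert it
// to whatever marker suits you:
val withMissingAge = people.filter(_.age.isEmpty)
val parsedAges = people.map(p => if (p.age.isEmpty) -1 else p.age.toInt)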

Cheers,
Max

On Fri, Oct 23, 2015 at 1:59 PM, Guido  wrote:

> Hello,
> I would like to ask if there is any particular way to read or treat
> null values (e.g. Name, Lastname,, Age..) in a dataset using readCsvFile,
> without being forced to ignore them.
>
> Thanks for your time.
> Guido
>
>


Error running a Hadoop job from the web interface

2015-10-23 Thread Flavio Pompermaier
Hi to all,
I'm trying to run a job from the web interface but I get this error:

java.lang.RuntimeException: java.io.FileNotFoundException: JAR entry core-site.xml not found in /tmp/webclient-jobs/EntitonsJsonizer.jar
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2334)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2187)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2104)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:853)
    at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:2088)
    at org.apache.hadoop.mapred.JobConf.(JobConf.java:446)
    at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:175)
    at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:156)
    at it.okkam.flink.entitons.io.utils.ParquetThriftEntitons.readEntitons(ParquetThriftEntitons.java:42)
    at it.okkam.flink.entitons.io.utils.ParquetThriftEntitons.readEntitonsWithId(ParquetThriftEntitons.java:73)
    at org.okkam.entitons.EntitonsJsonizer.readAtomQuads(EntitonsJsonizer.java:235)
    at org.okkam.entitons.EntitonsJsonizer.main(EntitonsJsonizer.java:119)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:497)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:395)
    at org.apache.flink.client.program.OptimizerPlanEnvironment.getOptimizedPlan(OptimizerPlanEnvironment.java:80)
    at org.apache.flink.client.program.Client.getOptimizedPlan(Client.java:220)
    at org.apache.flink.client.CliFrontend.info(CliFrontend.java:412)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:977)
    at org.apache.flink.client.web.JobSubmissionServlet.doGet(JobSubmissionServlet.java:171)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
    at org.eclipse.jetty.server.Server.handle(Server.java:352)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: JAR entry core-site.xml not found in /tmp/webclient-jobs/EntitonsJsonizer.jar
    at sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:140)
    at sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:150)
    at java.net.URL.openStream(URL.java:1037)
    at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2163)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2234)



I checked the jar and it contains the core-site.xml file. Am I forced to
configure the Hadoop classpath in my Flink cluster config files?

Best,
Flavio


Re: reading csv file from null value

2015-10-23 Thread Maximilian Michels
Hi Philip,

How about making the empty field of type String? Then you can read the CSV
into a DataSet and treat the empty string as a null value. Not very nice
but a workaround. As of now, Flink deliberately doesn't support null values.
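
For your concrete schema, a sketch of that workaround could look like this
(untested; the path is a placeholder and _sales is now read as String):

import org.apache.flink.api.scala._

// Same fields as your WebClick, but with _sales read as String so that an
// empty field ("") can stand in for null.
case class WebClick(_click_date: Long, _click_time: Long, _sales: String,
                    _item: Int, _page: Int, _user: Int)

val env = ExecutionEnvironment.getExecutionEnvironment
val webClickPath = "/path/to/web_clicks.dat"  // placeholder

val webClicks: DataSet[WebClick] = env.readCsvFile[WebClick](
  webClickPath,
  fieldDelimiter = "|",
  includedFields = Array(0, 1, 2, 3, 4, 5))

// "where _sales is null" then becomes a filter on the empty string:
val clicksWithoutSales = webClicks.filter(_._sales.isEmpty)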

Regards,
Max

On Thu, Oct 22, 2015 at 4:30 PM, Philip Lee  wrote:

> Hi,
>
> I am trying to load a dataset that contains some null values by using
> readCsvFile().
>
> // e.g  _date|_click|_sales|_item|_web_page|_user
>
> case class WebClick(_click_date: Long, _click_time: Long, _sales: Int,
>                     _item: Int, _page: Int, _user: Int)
>
> private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {
>   env.readCsvFile[WebClick](
>     webClickPath,
>     fieldDelimiter = "|",
>     includedFields = Array(0, 1, 2, 3, 4, 5)
>     // lenient = true
>   )
> }
>
>
> Well, I know there is an option to ignore malformed values, but I have to
> read the dataset even though it has null values.
>
> As shown below, the dataset looks like this (the third column is null):
> 37794|24669||16705|23|54810
> but I have to read the null value as well, because I have to use a filter or
> where function ( _sales == null ).
>
> Is there any detailed suggestion on how to do this?
>
> Thanks,
> Philip
>
>
>
>
>
>
>
> --
>
> ==
>
> *Hae Joon Lee*
>
>
> Now, in Germany,
>
> M.S. Candidate, Interested in Distributed System, Iterative Processing
>
> Dept. of Computer Science, Informatik in German, TUB
>
> Technical University of Berlin
>
>
> In Korea,
>
> M.S. Candidate, Computer Architecture Laboratory
>
> Dept. of Computer Science, KAIST
>
>
> Rm# 4414 CS Dept. KAIST
>
> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>
>
> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
>
> ==
>


Re: Error running a Hadoop job from the web interface

2015-10-23 Thread Maximilian Michels
Hi Flavio,

Which version of Flink are you using?

Cheers,
Max

On Fri, Oct 23, 2015 at 2:45 PM, Flavio Pompermaier 
wrote:

> Hi to all,
> I'm trying to run a job from the web interface but I get this error:
>
> [stack trace snipped; identical to the one in the original message above]
>
>
>
> I checked the jar and it contains the core-site.xml file. Am I forced to
> configure the Hadoop classpath in my Flink cluster config files?
>
> Best,
> Flavio
>


Re: Running continuously on yarn with kerberos

2015-10-23 Thread Maximilian Michels
Hi Niels,

Thank you for your question. Flink relies entirely on the Kerberos
support of Hadoop. So your question could also be rephrased to "Does
Hadoop support long-term authentication using Kerberos?". And the
answer is: Yes!

While Hadoop uses Kerberos tickets to authenticate users with services
initially, the authentication process continues differently
afterwards. Instead of saving the ticket to authenticate on a later
access, Hadoop creates its own security tokens (DelegationToken) that
it passes around. These are authenticated against Kerberos periodically. To
my knowledge, the tokens have a life span identical to the Kerberos
ticket maximum life span. So be sure to set the maximum life span very
high for long streaming jobs. The renewal time, on the other hand, is
not important because Hadoop abstracts this away using its own
security tokens.

I'm afraid there is no Kerberos how-to yet. If you are on Yarn, then
it is sufficient to authenticate the client with Kerberos. On a Flink
standalone cluster you need to ensure that, initially, all nodes are
authenticated with Kerberos using the kinit tool.
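
If you prefer a keytab over interactive kinit, a rough sketch of the
programmatic route through Hadoop's UserGroupInformation looks like this.
The principal and paths are placeholders, Flink itself does not manage this
for you in this version, and whether it fits your setup is something you
would have to verify:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Tell Hadoop's security layer to use Kerberos.
val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)

// Logging in from a keytab lets Hadoop re-login automatically when the
// ticket expires, instead of relying on a ticket obtained via kinit.
// Principal and keytab path below are placeholders.
UserGroupInformation.loginUserFromKeytab(
  "flink-user@EXAMPLE.COM",
  "/etc/security/keytabs/flink-user.keytab")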

Feel free to ask if you have more questions and let us know about any
difficulties.

Best regards,
Max



On Thu, Oct 22, 2015 at 2:06 PM, Niels Basjes  wrote:
> Hi,
>
> I want to write a long-running (i.e. never stopped) streaming Flink
> application on a Kerberos-secured Hadoop/Yarn cluster. My application needs
> to do things with files on HDFS and with HBase tables on that cluster, so having
> the correct Kerberos tickets is very important. The stream is to be ingested
> from Kafka.
>
> One of the things with Kerberos is that the tickets expire after a
> predetermined time. My knowledge about Kerberos is very limited, so I hope
> you guys can help me.
>
> My question is actually quite simple: Is there a how-to somewhere on how to
> correctly run a long-running Flink application with Kerberos that includes a
> solution for the Kerberos ticket timeout?
>
> Thanks
>
> Niels Basjes


Re: Reading null value from datasets

2015-10-23 Thread Shiti Saxena
For a similar problem where we wanted to preserve and track null entries,
we loaded the CSV as a DataSet[Array[Object]] and then transformed it into a
DataSet[Row] using a custom RowSerializer
(https://gist.github.com/Shiti/d0572c089cc08654019c) which handles null.

The Table API (which supports null) can then be used on the resulting
DataSet[Row].
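
In outline, the conversion step looks roughly like this (a sketch only: the
Row class assumed here is the 0.10-era org.apache.flink.api.table.Row, and
the serializer / type-information wiring comes from the gist above):

import org.apache.flink.api.table.Row

// Copy one raw record into a Row, keeping null entries as they are.
def toRow(values: Array[Object]): Row = {
  val row = new Row(values.length)
  for (i <- values.indices) {
    row.setField(i, values(i)) // null is preserved here
  }
  row
}
// Mapping the DataSet[Array[Object]] with toRow then needs the custom
// RowSerializer / TypeInformation[Row] from the gist as its implicit type info.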

On Fri, Oct 23, 2015 at 7:40 PM, Maximilian Michels  wrote:

> Hi Guido,
>
> This depends on your use case but you may read those values as type String
> and treat them accordingly.
>
> Cheers,
> Max
>
> On Fri, Oct 23, 2015 at 1:59 PM, Guido  wrote:
>
>> Hello,
>> I would like to ask if there were any particular ways to read or treat
>> null (e.g. Name, Lastname,, Age..) value in a dataset using readCsvFile,
>> without being forced to ignore them.
>>
>> Thanks for your time.
>> Guido
>>
>>
>


Re: reading csv file from null value

2015-10-23 Thread Shiti Saxena
For a similar problem where we wanted to preserve and track null entries,
we loaded the CSV as a DataSet[Array[Object]] and then transformed it into a
DataSet[Row] using a custom RowSerializer
(https://gist.github.com/Shiti/d0572c089cc08654019c) which handles null.

The Table API (which supports null) can then be used on the resulting
DataSet[Row].


On Fri, Oct 23, 2015 at 7:38 PM, Maximilian Michels  wrote:

> Hi Philip,
>
> How about making the empty field of type String? Then you can read the CSV
> into a DataSet and treat the empty string as a null value. Not very nice
> but a workaround. As of now, Flink deliberately doesn't support null values.
>
> Regards,
> Max
>
>
> On Thu, Oct 22, 2015 at 4:30 PM, Philip Lee  wrote:
>
>> Hi,
>>
>> I am trying to load the dataset with the part of null value by using
>> readCsvFile().
>>
>> // e.g  _date|_click|_sales|_item|_web_page|_user
>>
>> case class WebClick(_click_date: Long, _click_time: Long, _sales: Int, 
>> _item: Int,_page: Int, _user: Int)
>>
>> private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] 
>> = {
>>
>>   env.readCsvFile[WebClick](
>> webClickPath,
>> fieldDelimiter = "|",
>> includedFields = Array(0, 1, 2, 3, 4, 5),
>> // lenient = true
>>   )
>> }
>>
>>
>> Well, I know there is an option to ignore malformed value, but I have to
>> read the dataset even though it has null value.
>>
>> as it follows, dataset (third column is null) looks like
>> 37794|24669||16705|23|54810
>> but I have to read null value as well because I have to use filter or
>> where function ( _sales == null )
>>
>> Is there any detail suggestion to do it?
>>
>> Thanks,
>> Philip
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> ==
>>
>> *Hae Joon Lee*
>>
>>
>> Now, in Germany,
>>
>> M.S. Candidate, Interested in Distributed System, Iterative Processing
>>
>> Dept. of Computer Science, Informatik in German, TUB
>>
>> Technical University of Berlin
>>
>>
>> In Korea,
>>
>> M.S. Candidate, Computer Architecture Laboratory
>>
>> Dept. of Computer Science, KAIST
>>
>>
>> Rm# 4414 CS Dept. KAIST
>>
>> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>>
>>
>> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
>>
>> ==
>>
>
>


Re: Session Based Windows

2015-10-23 Thread Aljoscha Krettek
Hi Paul,
the key-based state should now be fixed in the current 0.10-SNAPSHOT builds, in
case you want to continue playing around with it.

Cheers,
Aljoscha
> On 21 Oct 2015, at 19:40, Aljoscha Krettek  wrote:
> 
> Hi Paul,
> good to hear that the windowing works for you.
> 
> With the key-based state I'm afraid you found a bug. The problem is that the
> state backend is not properly set to the right key when the window is
> evaluated. I will look into fixing this ASAP before the 0.10 release.
> 
> Cheers,
> Aljoscha
>> On 21 Oct 2015, at 19:32, Hamilton, Paul  wrote:
>> 
>> Hi Aljoscha,
>> 
>> Thanks a lot for your Trigger implementation; it definitely helped provide
>> some direction. It appears to be working well for our use case. One
>> thing I have noticed, now that I have pulled the state API changes in, is
>> that key-based state within a window function does not appear to be
>> working. Perhaps I am not using it correctly now that the API has
>> changed. Previously we had done something like this within the
>> RichWindowFunction:
>> 
>> @Override
>> public void open(final Configuration parameters) throws Exception {
>>   state = getRuntimeContext().getOperatorState("state", new StatePojo(), true);
>> }
>> 
>> Based on the API changes I switched it to:
>> 
>> @Override
>> public void open(final Configuration parameters) throws Exception {
>>   state = getRuntimeContext().getKeyValueState("state", StatePojo.class, new StatePojo());
>> }
>> 
>> 
>> But the state doesn’t seem to be partitioned based on the key. I haven’t
>> had much time to play around with it, so it’s certainly possible that I
>> messed something up while refactoring for the API change. I will look at
>> it further when I get a chance, but if you have any thoughts they are much
>> appreciated.
>> 
>> 
>> Thanks,
>> Paul Hamilton
>> 
>> 
>> On 10/17/15, 6:39 AM, "Aljoscha Krettek"  wrote:
>> 
>>> Hi Paul,
>>> it’s good to see people interested in this. I sketched a Trigger that
>>> should fit your requirements:
>>> https://gist.github.com/aljoscha/a7c6f22548e7d24bc4ac
>>> 
>>> You can use it like this:
>>> 
>>> DataStream<> input = …
>>> DataStream<> result = input
>>>   .keyBy("session-id")
>>>   .window(GlobalWindows.create())
>>>   .trigger(new SessionTrigger(timeout, maxElements))
>>>   .apply(new MyWindowFunction())
>>> 
>>> The Trigger uses the new state API that I’m currently introducing in a
>>> new Pull Request. It should be merged very soon, before the 0.10 release.
>>> 
>>> This implementation has one caveat, though. It cannot deal with elements
>>> that belong to different sessions that arrive intermingled with other
>>> sessions. The reason is that Flink does not yet support merging the
>>> windows that the WindowAssigner assigns as, for example, the Cloud
>>> Dataflow API supports. This means that elements cannot be assigned to
>>> session windows, instead the workaround with the GlobalWindow has to be
>>> used. I want to tackle this for the release after 0.10, however.
>>> 
>>> Please let us know if you need more information. I’m always happy to help
>>> in these interesting cases at the bleeding edge of what is possible. :-)
>>> 
>>> Cheers,
>>> Aljoscha
>>> 
 On 16 Oct 2015, at 19:36, Hamilton, Paul 
 wrote:
 
 Hi,
 
 I am attempting to make use of the new window APIs in streaming to
 implement a session based window and am not sure if the currently
 provided
 functionality handles my use case.  Specifically what I want to do is
 something conceptually similar to a "Sessions.withGapDuration(…)" window
 in Google DataFlow.
 
 Assuming the events are keyed by session id.  I would like to use the
 event time and the watermarking functionality to trigger a window after
 the "end of a session" (no events for a given session received within x
 amount of time).  With watermarking this would mean trigger when a
 watermark is seen that is > (the time of the last event + session
 timeout). Also I want to perform an early triggering of the window
 after a
 given number of events have been received.
 
 Is it currently possible to do this with the current combination of
 window
 assigners and triggers?  I am happy to write custom triggers etc, but
 wanted to make sure it wasn't already available before going down that
 road.
 
 Thanks,
 
 Paul Hamilton
 Hybris Software
 
 
>>> 
>> 
>