Re: Reading null value from datasets
Hi Guido,

This depends on your use case, but you may read those values as type String and treat them accordingly.

Cheers,
Max

On Fri, Oct 23, 2015 at 1:59 PM, Guido wrote:
> Hello,
> I would like to ask if there are any particular ways to read or treat
> null values (e.g. Name, Lastname,, Age..) in a dataset using readCsvFile,
> without being forced to ignore them.
>
> Thanks for your time.
> Guido
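The String workaround above can be sketched without Flink at all. This is a plain-Scala illustration, not Flink API; the names `Person`, `toOpt`, and `parseLine` are hypothetical. The idea: read every field as String, then map empty fields to `Option` so downstream code can handle the missing values explicitly.

```scala
// Sketch (plain Scala, no Flink dependency): read all fields as String,
// then turn empty fields into None instead of dropping the record.
case class Person(name: Option[String], lastName: Option[String], age: Option[Int])

def toOpt(s: String): Option[String] =
  if (s.trim.isEmpty) None else Some(s.trim)

def parseLine(line: String): Person = {
  // limit -1 keeps trailing empty fields produced by the delimiter
  val f = line.split(",", -1)
  Person(toOpt(f(0)), toOpt(f(1)), toOpt(f(2)).map(_.toInt))
}
```

With Flink this parsing would sit in a `map` after reading the line (or the String-typed fields) with `readCsvFile`.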
Error running a Hadoop job from the web interface
Hi to all,

I'm trying to run a job from the web interface but I get this error:

java.lang.RuntimeException: java.io.FileNotFoundException: JAR entry core-site.xml not found in /tmp/webclient-jobs/EntitonsJsonizer.jar
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2334)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2187)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2104)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:853)
    at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:2088)
    at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:446)
    at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:175)
    at org.apache.hadoop.mapreduce.Job.getInstance(Job.java:156)
    at it.okkam.flink.entitons.io.utils.ParquetThriftEntitons.readEntitons(ParquetThriftEntitons.java:42)
    at it.okkam.flink.entitons.io.utils.ParquetThriftEntitons.readEntitonsWithId(ParquetThriftEntitons.java:73)
    at org.okkam.entitons.EntitonsJsonizer.readAtomQuads(EntitonsJsonizer.java:235)
    at org.okkam.entitons.EntitonsJsonizer.main(EntitonsJsonizer.java:119)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:497)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:395)
    at org.apache.flink.client.program.OptimizerPlanEnvironment.getOptimizedPlan(OptimizerPlanEnvironment.java:80)
    at org.apache.flink.client.program.Client.getOptimizedPlan(Client.java:220)
    at org.apache.flink.client.CliFrontend.info(CliFrontend.java:412)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:977)
    at org.apache.flink.client.web.JobSubmissionServlet.doGet(JobSubmissionServlet.java:171)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
    at org.eclipse.jetty.server.Server.handle(Server.java:352)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: JAR entry core-site.xml not found in /tmp/webclient-jobs/EntitonsJsonizer.jar
    at sun.net.www.protocol.jar.JarURLConnection.connect(JarURLConnection.java:140)
    at sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:150)
    at java.net.URL.openStream(URL.java:1037)
    at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2163)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2234)

I checked the jar and it contains the core-site.xml file... Am I forced to configure the Hadoop classpath in my Flink cluster config files?

Best,
Flavio
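If bundling the Hadoop XML files inside the job jar keeps failing, the usual alternative is to not bundle them at all and point the cluster at a Hadoop configuration directory instead. A configuration sketch (paths are examples; the exact mechanism depends on the Flink/Hadoop setup):

```shell
# First double-check that the entry really is at the jar root, not in a
# subdirectory (Configuration.loadResource looks it up by the exact name):
unzip -l /tmp/webclient-jobs/EntitonsJsonizer.jar | grep core-site.xml

# Alternative: keep core-site.xml out of the jar and export the standard
# Hadoop config location before starting the Flink cluster, e.g.:
export HADOOP_CONF_DIR=/etc/hadoop/conf
```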
Re: reading csv file with null values
Hi Philip,

How about making the empty field of type String? Then you can read the CSV into a DataSet and treat the empty string as a null value. Not very nice, but a workaround. As of now, Flink deliberately doesn't support null values.

Regards,
Max

On Thu, Oct 22, 2015 at 4:30 PM, Philip Lee wrote:
> Hi,
>
> I am trying to load a dataset that contains null values by using
> readCsvFile().
>
> // e.g. _date|_click|_sales|_item|_web_page|_user
>
> case class WebClick(_click_date: Long, _click_time: Long, _sales: Int,
>   _item: Int, _page: Int, _user: Int)
>
> private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {
>   env.readCsvFile[WebClick](
>     webClickPath,
>     fieldDelimiter = "|",
>     includedFields = Array(0, 1, 2, 3, 4, 5)
>     // , lenient = true
>   )
> }
>
> Well, I know there is an option to ignore malformed values, but I have to
> read the dataset even though it has null values.
>
> The dataset looks as follows (the third column is null):
> 37794|24669||16705|23|54810
> but I have to read the null value as well, because I have to use a filter
> or where function ( _sales == null ).
>
> Is there any detailed suggestion on how to do this?
>
> Thanks,
> Philip
>
> --
> ==
> Hae Joon Lee
>
> Now, in Germany,
> M.S. Candidate, Interested in Distributed Systems, Iterative Processing
> Dept. of Computer Science (Informatik in German), TUB
> Technical University of Berlin
>
> In Korea,
> M.S. Candidate, Computer Architecture Laboratory
> Dept. of Computer Science, KAIST
>
> Rm# 4414 CS Dept. KAIST
> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>
> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
> ==
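For Philip's concrete schema, the String workaround can be sketched in plain Scala (no Flink dependency; `parseClick` is an illustrative helper, not Flink API): model the possibly-empty `_sales` column as `Option[Int]`, so the impossible `_sales == null` check becomes `_.sales.isEmpty` in a filter.

```scala
// Sketch (plain Scala): the nullable _sales column becomes Option[Int].
case class WebClick(clickDate: Long, clickTime: Long, sales: Option[Int],
                    item: Int, page: Int, user: Int)

def parseClick(line: String): WebClick = {
  // "|" must be escaped because String.split takes a regex;
  // limit -1 keeps empty fields instead of discarding them.
  val f = line.split("\\|", -1)
  WebClick(f(0).toLong, f(1).toLong,
    if (f(2).isEmpty) None else Some(f(2).toInt),
    f(3).toInt, f(4).toInt, f(5).toInt)
}

val clicks = Seq(
  "37794|24669||16705|23|54810",  // third column is null
  "37794|24670|5|16705|23|54810"
).map(parseClick)

val nullSales = clicks.filter(_.sales.isEmpty)
```

In a Flink job the same `filter` would run on the DataSet after the `map(parseClick)` step.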
Re: Error running a Hadoop job from the web interface
Hi Flavio,

Which version of Flink are you using?

Cheers,
Max

On Fri, Oct 23, 2015 at 2:45 PM, Flavio Pompermaier wrote:
> Hi to all,
> I'm trying to run a job from the web interface but I get this error:
>
> java.lang.RuntimeException: java.io.FileNotFoundException: JAR entry
> core-site.xml not found in /tmp/webclient-jobs/EntitonsJsonizer.jar
> [full stack trace snipped]
>
> I checked the jar and it contains the core-site.xml file... Am I forced to
> configure the Hadoop classpath in my Flink cluster config files?
>
> Best,
> Flavio
Re: Running continuously on yarn with kerberos
Hi Niels,

Thank you for your question. Flink relies entirely on the Kerberos support of Hadoop, so your question could also be rephrased as: "Does Hadoop support long-term authentication using Kerberos?" And the answer is: yes!

While Hadoop uses Kerberos tickets to authenticate users with services initially, the authentication process continues differently afterwards. Instead of saving the ticket to authenticate on a later access, Hadoop creates its own security tokens (DelegationToken) that it passes around. These are authenticated against Kerberos periodically. To my knowledge, the tokens have a life span identical to the maximum Kerberos ticket life span, so be sure to set the maximum life span very high for long streaming jobs. The renewal time, on the other hand, is not important, because Hadoop abstracts it away using its own security tokens.

I'm afraid there is no Kerberos how-to yet. If you are on YARN, it is sufficient to authenticate the client with Kerberos. On a standalone Flink cluster you need to ensure that, initially, all nodes are authenticated with Kerberos using the kinit tool.

Feel free to ask if you have more questions, and let us know about any difficulties.

Best regards,
Max

On Thu, Oct 22, 2015 at 2:06 PM, Niels Basjes wrote:
> Hi,
>
> I want to write a long-running (i.e. never stop it) streaming Flink
> application on a Kerberos-secured Hadoop/YARN cluster. My application needs
> to do things with files on HDFS and HBase tables on that cluster, so having
> the correct Kerberos tickets is very important. The stream is to be ingested
> from Kafka.
>
> One of the things with Kerberos is that the tickets expire after a
> predetermined time. My knowledge about Kerberos is very limited, so I hope
> you guys can help me.
>
> My question is actually quite simple: Is there a how-to somewhere on how to
> correctly run a long-running Flink application with Kerberos that includes a
> solution for the Kerberos ticket timeout?
>
> Thanks,
> Niels Basjes
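The "set the maximum life span very high" advice above touches two places: the client requesting the ticket and the KDC capping what can be granted. A configuration sketch, assuming MIT Kerberos (option names, realm, and file locations are examples and vary by distribution):

```shell
# Authenticate the submitting client before launching the YARN session,
# requesting a long (and renewable) ticket:
kinit -l 7d -r 30d user@EXAMPLE.COM

# The KDC caps these values per principal; in the KDC's kdc.conf:
#   [realms]
#     EXAMPLE.COM = {
#       max_life = 7d 0h 0m 0s
#       max_renewable_life = 30d 0h 0m 0s
#     }
```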
Re: Reading null value from datasets
For a similar problem where we wanted to preserve and track null entries, we load the CSV as a DataSet[Array[Object]] and then transform it into a DataSet[Row] using a custom RowSerializer (https://gist.github.com/Shiti/d0572c089cc08654019c) which handles null. The Table API (which supports null) can then be used on the resulting DataSet[Row].

On Fri, Oct 23, 2015 at 7:40 PM, Maximilian Michels wrote:
> Hi Guido,
>
> This depends on your use case, but you may read those values as type String
> and treat them accordingly.
>
> Cheers,
> Max
>
> On Fri, Oct 23, 2015 at 1:59 PM, Guido wrote:
>> Hello,
>> I would like to ask if there are any particular ways to read or treat
>> null values (e.g. Name, Lastname,, Age..) in a dataset using readCsvFile,
>> without being forced to ignore them.
>>
>> Thanks for your time.
>> Guido
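The shape of that Array[Object]-to-Row approach can be sketched independently of Flink's actual Row and RowSerializer classes (which differ; the `Row` and `csvToRow` below are illustrative only): a row is just an array of fields where any slot may be null, plus a null-aware accessor.

```scala
// Sketch (plain Scala): a minimal Row-like holder where any field may be
// null, mirroring the idea of loading CSV fields as Objects first.
class Row(val fields: Array[Any]) {
  def productArity: Int = fields.length
  def isNullAt(i: Int): Boolean = fields(i) == null
  def apply(i: Int): Any = fields(i)
}

def csvToRow(line: String): Row =
  // limit -1 keeps trailing empty fields; empty fields become null
  new Row(line.split(",", -1).map(f => if (f.isEmpty) null else f))
```

A null-aware serializer then only has to record, per field, a "present" flag before the value, which is essentially what the linked gist does.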
Re: reading csv file with null values
For a similar problem where we wanted to preserve and track null entries, we load the CSV as a DataSet[Array[Object]] and then transform it into a DataSet[Row] using a custom RowSerializer (https://gist.github.com/Shiti/d0572c089cc08654019c) which handles null. The Table API (which supports null) can then be used on the resulting DataSet[Row].

On Fri, Oct 23, 2015 at 7:38 PM, Maximilian Michels wrote:
> Hi Philip,
>
> How about making the empty field of type String? Then you can read the CSV
> into a DataSet and treat the empty string as a null value. Not very nice,
> but a workaround. As of now, Flink deliberately doesn't support null values.
>
> Regards,
> Max
>
> On Thu, Oct 22, 2015 at 4:30 PM, Philip Lee wrote:
>> Hi,
>>
>> I am trying to load a dataset that contains null values by using
>> readCsvFile().
>>
>> // e.g. _date|_click|_sales|_item|_web_page|_user
>>
>> case class WebClick(_click_date: Long, _click_time: Long, _sales: Int,
>>   _item: Int, _page: Int, _user: Int)
>>
>> private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {
>>   env.readCsvFile[WebClick](
>>     webClickPath,
>>     fieldDelimiter = "|",
>>     includedFields = Array(0, 1, 2, 3, 4, 5)
>>     // , lenient = true
>>   )
>> }
>>
>> Well, I know there is an option to ignore malformed values, but I have to
>> read the dataset even though it has null values.
>>
>> The dataset looks as follows (the third column is null):
>> 37794|24669||16705|23|54810
>> but I have to read the null value as well, because I have to use a filter
>> or where function ( _sales == null ).
>>
>> Is there any detailed suggestion on how to do this?
>>
>> Thanks,
>> Philip
>>
>> --
>> ==
>> Hae Joon Lee
>>
>> Now, in Germany,
>> M.S. Candidate, Interested in Distributed Systems, Iterative Processing
>> Dept. of Computer Science (Informatik in German), TUB
>> Technical University of Berlin
>>
>> In Korea,
>> M.S. Candidate, Computer Architecture Laboratory
>> Dept. of Computer Science, KAIST
>>
>> Rm# 4414 CS Dept. KAIST
>> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>>
>> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
>> ==
Re: Session Based Windows
Hi Paul,

the key-based state should now be fixed in the current 0.10-SNAPSHOT builds, if you want to continue playing around with it.

Cheers,
Aljoscha

> On 21 Oct 2015, at 19:40, Aljoscha Krettek wrote:
>
> Hi Paul,
> good to hear that the windowing works for you.
>
> With the key-based state I'm afraid you found a bug. The problem is that the
> state backend is not properly set to the right key when the window is
> evaluated. I will look into fixing this ASAP before the 0.10 release.
>
> Cheers,
> Aljoscha
>
>> On 21 Oct 2015, at 19:32, Hamilton, Paul wrote:
>>
>> Hi Aljoscha,
>>
>> Thanks a lot for your Trigger implementation, it definitely helped provide
>> some direction. It appears to be working well for our use case. One thing
>> I have noticed, now that I have pulled in the state API changes, is that
>> key-based state within a window function does not appear to be working.
>> Perhaps I am not using it correctly now that the API has changed.
>> Previously we had done something like this within the RichWindowFunction:
>>
>> @Override
>> public void open(final Configuration parameters) throws Exception {
>>   state = getRuntimeContext().getOperatorState("state", new StatePojo(), true);
>> }
>>
>> Based on the API changes I switched it to:
>>
>> @Override
>> public void open(final Configuration parameters) throws Exception {
>>   state = getRuntimeContext().getKeyValueState("state", StatePojo.class, new StatePojo());
>> }
>>
>> But the state doesn't seem to be partitioned based on the key. I haven't
>> had much time to play around with it, so it's certainly possible that I
>> messed something up while refactoring to the API change. I will look at
>> it further when I get a chance, but if you have any thoughts they are much
>> appreciated.
>>
>> Thanks,
>> Paul Hamilton
>>
>> On 10/17/15, 6:39 AM, "Aljoscha Krettek" wrote:
>>
>>> Hi Paul,
>>> it's good to see people interested in this. I sketched a Trigger that
>>> should fit your requirements:
>>> https://gist.github.com/aljoscha/a7c6f22548e7d24bc4ac
>>>
>>> You can use it like this:
>>>
>>> DataStream<> input = ...
>>> DataStream<> result = input
>>>   .keyBy("session-id")
>>>   .window(GlobalWindows.create())
>>>   .trigger(new SessionTrigger(timeout, maxElements))
>>>   .apply(new MyWindowFunction())
>>>
>>> The Trigger uses the new state API that I'm currently introducing in a
>>> new pull request. It should be merged very soon, before the 0.10 release.
>>>
>>> This implementation has one caveat, though. It cannot deal with elements
>>> belonging to different sessions that arrive intermingled with other
>>> sessions. The reason is that Flink does not yet support merging the
>>> windows that the WindowAssigner assigns, as, for example, the Cloud
>>> Dataflow API supports. This means that elements cannot be assigned to
>>> session windows; instead, the workaround with the GlobalWindow has to be
>>> used. I want to tackle this for the release after 0.10, however.
>>>
>>> Please let us know if you need more information. I'm always happy to help
>>> in these interesting cases at the bleeding edge of what is possible. :-)
>>>
>>> Cheers,
>>> Aljoscha
>>>
>>>> On 16 Oct 2015, at 19:36, Hamilton, Paul wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am attempting to make use of the new window APIs in streaming to
>>>> implement a session-based window and am not sure if the currently
>>>> provided functionality handles my use case. Specifically, what I want
>>>> to do is something conceptually similar to a
>>>> "Sessions.withGapDuration(...)" window in Google Dataflow, assuming the
>>>> events are keyed by session id.
>>>>
>>>> I would like to use the event time and the watermarking functionality
>>>> to trigger a window after the "end of a session" (no events for a given
>>>> session received within x amount of time). With watermarking this would
>>>> mean: trigger when a watermark is seen that is > (the time of the last
>>>> event + session timeout). Also, I want to perform an early triggering
>>>> of the window after a given number of events have been received.
>>>>
>>>> Is it currently possible to do this with the current combination of
>>>> window assigners and triggers? I am happy to write custom triggers
>>>> etc., but wanted to make sure it wasn't already available before going
>>>> down that road.
>>>>
>>>> Thanks,
>>>> Paul Hamilton
>>>> Hybris Software