Re: Managed Keyed state update

2018-08-14 Thread Fabian Hueske
Hi, It is recommended to always call update(). State modifications by modifying objects is only possible because the heap based backends do not serialize or copy records to avoid additional costs. Hence, this is rather a side effect than a provided API. As soon as you change the state backend,
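The advice to always call update() can be illustrated with a small plain-Java model. This is only a sketch and not Flink's state API: ListValueStore, ReferenceStore and CopyingStore are hypothetical names, and the copying store merely stands in for a backend that serializes state on every access.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StateUpdateSketch {

    /** Minimal state interface; hypothetical, not part of Flink. */
    interface ListValueStore {
        List<String> value(String key);
        void update(String key, List<String> newValue);
    }

    /** Hands out the stored object itself, like a heap backend that avoids copies. */
    static final class ReferenceStore implements ListValueStore {
        private final Map<String, List<String>> data = new HashMap<>();

        public List<String> value(String key) {
            return data.computeIfAbsent(key, k -> new ArrayList<>());
        }

        public void update(String key, List<String> newValue) {
            data.put(key, newValue);
        }
    }

    /** Copies on every read and write, standing in for a serializing backend. */
    static final class CopyingStore implements ListValueStore {
        private final Map<String, List<String>> data = new HashMap<>();

        public List<String> value(String key) {
            return new ArrayList<>(data.getOrDefault(key, new ArrayList<>()));
        }

        public void update(String key, List<String> newValue) {
            data.put(key, new ArrayList<>(newValue));
        }
    }

    /** Appends one element to the state of key "k" and returns the stored size afterwards. */
    static int sizeAfterAppend(ListValueStore store, boolean callUpdate) {
        List<String> current = store.value("k");
        current.add("event");
        if (callUpdate) {
            store.update("k", current);
        }
        return store.value("k").size();
    }

    public static void main(String[] args) {
        System.out.println("reference store, no update(): " + sizeAfterAppend(new ReferenceStore(), false));
        System.out.println("copying store, no update():   " + sizeAfterAppend(new CopyingStore(), false));
        System.out.println("copying store, with update(): " + sizeAfterAppend(new CopyingStore(), true));
    }
}
```

In this model the in-place modification only survives in the reference-holding store; with the copying store the change is lost unless update() is called, which is why relying on the side effect breaks when the backend changes.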

Re: Kerberos Configuration Does Not Apply To Krb5LoginModule

2018-08-13 Thread Fabian Hueske
Hi Paul, Maybe Aljoscha (in CC) can help you with this question. AFAIK, he has some experience with Flink and Kerberos. Best, Fabian 2018-08-13 14:51 GMT+02:00 Paul Lam : > Hi, > > I built Flink from the latest 1.5.x source code, and got some strange > outputs from the command line when

Re: Introduce Barriers in stream source

2018-08-13 Thread Fabian Hueske
Hi, It is sufficient to implement the CheckpointedFunction interface. Since SourceFunctions emit records in a separate thread, you need to ensure that no record is emitted while the snapshotState method is called. Flink provides a lock to synchronize data emission and state snapshotting. See the
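The locking idea can be sketched in plain Java. This is a model under assumptions, not Flink code: in a real SourceFunction the lock would be the one Flink hands out (SourceContext.getCheckpointLock()), whereas here a private Object plays that role, and CheckpointLockSketch with its methods is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckpointLockSketch {
    // Stands in for the lock that Flink provides to the source (assumption for this sketch).
    private final Object lock = new Object();
    private final List<Long> emitted = new ArrayList<>();
    private long nextOffset = 0;

    /** Emits one record and advances the offset atomically with respect to snapshots. */
    void emitNext() {
        synchronized (lock) {
            emitted.add(nextOffset);
            nextOffset++;
        }
    }

    /** Returns {offset, number of emitted records} as seen under the lock. */
    long[] snapshot() {
        synchronized (lock) {
            return new long[] {nextOffset, emitted.size()};
        }
    }

    /** Emits from a separate thread while snapshotting, and checks every snapshot is consistent. */
    static boolean snapshotsConsistent(int records) {
        CheckpointLockSketch source = new CheckpointLockSketch();
        Thread emitter = new Thread(() -> {
            for (int i = 0; i < records; i++) {
                source.emitNext();
            }
        });
        emitter.start();
        boolean ok = true;
        while (emitter.isAlive()) {
            long[] s = source.snapshot();
            ok &= s[0] == s[1];
        }
        try {
            emitter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        long[] last = source.snapshot();
        return ok && last[0] == records && last[1] == records;
    }

    public static void main(String[] args) {
        System.out.println("snapshots consistent: " + snapshotsConsistent(100_000));
    }
}
```

Because emission and snapshotting take the same lock, a snapshot can never observe an offset that is ahead of or behind the records emitted so far.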

Re: Tuning checkpoint

2018-08-13 Thread Fabian Hueske
Hi Mingliang, let me answer your second question first: > Another question is about the alignment buffer, I thought it was only used for multiple input stream cases. But for keyed process function , what is actually aligned? When a task sends records to multiple downstream tasks (task not

Re: JDBCInputFormat and SplitDataProperties

2018-08-13 Thread Fabian Hueske
ong configuration of the cluster; there was only 1 > task manager with 1 slot. > > If I submit a job with "flink run -p 24 ...", will the job hang until at > least 24 slots are available? > > Regards, > Alexis. > > On Fri, 10 Aug 2018, 14:01 Fabian Hueske wrote: >

Re: Flink socketTextStream UDP connection

2018-08-13 Thread Fabian Hueske
Hi, ExecutionEnvironment.socketTextStream is deprecated and it is very likely that it will be removed because of its limited use. I would recommend having a look at the implementation of the SourceFunction [1] and adapting it to your needs. Best, Fabian [1]

Re: flink requires table key when insert into upsert table sink

2018-08-10 Thread Fabian Hueske
Hi Henry, The problem is that the table that results from the query does not have a unique key. You can only use an upsert sink if the table has a (composite) unique key. Since this is not the case, you cannot use an upsert sink. However, you can implement a RetractStreamTableSink which allows to

Re: Small-files source - partitioning based on prefix of file

2018-08-10 Thread Fabian Hueske
Hi Averell, Conceptually, you are right. Checkpoints are taken at every operator at the same "logical" time. It is not important that each operator checkpoints at the same wallclock time. Instead, they need to take a checkpoint when they have processed the same input. This is implemented with

Re: JDBCInputFormat and SplitDataProperties

2018-08-10 Thread Fabian Hueske
upBy(0, 1) >> .reduceGroup(groupReducer) >> .withForwardedFields("_1") >> .output(outputFormat) >> >> It seems to work well, and the semantic annotation does remove a hash >> partition from the execution plan. >> >> Regards, >> Ale

Re: Flink Rebalance

2018-08-10 Thread Fabian Hueske
Hi, Elias and Paul have good points. I think the performance degradation is mostly due to the lack of function chaining in the rebalance case. If all steps are just map functions, they can be chained in the no-rebalance case. That means records are passed via function calls. If you add rebalancing,

Re: Dataset.distinct - Question on deterministic results

2018-08-10 Thread Fabian Hueske
Hi Will, The distinct operator is implemented as a groupBy(distinctKeys) and a ReduceFunction that returns the first argument. Hence, it depends on the order in which the records are processed by the ReduceFunction. Flink does not maintain a deterministic order because it is quite expensive in
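The order dependence described here can be reproduced outside of Flink. The following plain-Java sketch mimics a grouping followed by a reduce that returns its first argument; DistinctOrderSketch and distinctByKey are hypothetical names, and records are modelled as {key, payload} string pairs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DistinctOrderSketch {

    /** Keeps the first payload seen per key, like a reduce that returns its first argument. */
    static Map<String, String> distinctByKey(List<String[]> records) {
        Map<String, String> firstPerKey = new LinkedHashMap<>();
        for (String[] record : records) {
            // merge() keeps the existing value when the key is already present.
            firstPerKey.merge(record[0], record[1], (first, second) -> first);
        }
        return firstPerKey;
    }

    public static void main(String[] args) {
        List<String[]> oneOrder = Arrays.asList(new String[] {"a", "x"}, new String[] {"a", "y"});
        List<String[]> otherOrder = Arrays.asList(new String[] {"a", "y"}, new String[] {"a", "x"});
        System.out.println(distinctByKey(oneOrder));
        System.out.println(distinctByKey(otherOrder));
    }
}
```

Both inputs contain the same records, yet the surviving payload for key "a" differs, because the result depends on which record reaches the reduce step first.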

Re: UTF-16 support for TextInputFormat

2018-08-10 Thread Fabian Hueske
dian and the caller indicates > UTF-16BE, Flink should rewrite the charsetName as UTF-16LE. > > I hope this makes sense and that I haven't been testing incorrectly or > misreading the code. > > Thank you, > David > > On Thu, Aug 9, 2018 at 4:04 AM Fabian Hueske wro

Re: Small-files source - partitioning based on prefix of file

2018-08-10 Thread Fabian Hueske
Hi Averell, One comment regarding what you said: > As my files are small, I think there would not be much benefit in checkpointing file offset state. Checkpointing is not about efficiency but about consistency. If the position in a split is not checkpointed, your application won't operate with

Re: Table API, custom window

2018-08-09 Thread Fabian Hueske
Hi, regarding the plans. There are no plans to support custom window assigners and evictors. There were some thoughts about supporting different result update strategies that could be used to return early results or update results in case of late data. However, these features are currently not

Re: JDBCInputFormat and SplitDataProperties

2018-08-09 Thread Fabian Hueske
takes two parameters: > partitionNumber and totalNumberOfPartitions. Should I assume that there are > 2 splits divided into 24 partitions? > > Regards, > Alexis. > > > > On Wed, Aug 8, 2018 at 11:57 AM Fabian Hueske wrote: > >> Hi Alexis, >> >> First o

Re: UTF-16 support for TextInputFormat

2018-08-09 Thread Fabian Hueske
Hi David, Did you try to set the encoding on the TextInputFormat with TextInputFormat tif = ... tif.setCharsetName("UTF-16"); Best, Fabian 2018-08-08 17:45 GMT+02:00 David Dreyfus : > Hello - > > It does not appear that Flink supports a charset encoding of "UTF-16". It > particular, it
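For context on the charset name passed to setCharsetName, the following plain-Java sketch shows how the JDK's own "UTF-16" charset treats byte order marks compared to "UTF-16LE". It only exercises java.nio.charset and makes no claim about how TextInputFormat uses the charset internally; Utf16Sketch is a hypothetical helper class.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf16Sketch {

    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "UTF-16" encodes big-endian and prepends a byte order mark; "UTF-16LE" writes no mark.
        byte[] withBom = encode("A", StandardCharsets.UTF_16);
        byte[] littleEndian = encode("A", StandardCharsets.UTF_16LE);
        System.out.println("UTF-16 bytes: " + withBom.length + ", UTF-16LE bytes: " + littleEndian.length);

        // When decoding, "UTF-16" follows the byte order mark if one is present.
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(decode(littleEndianWithBom, StandardCharsets.UTF_16));
    }
}
```

The difference matters whenever a reader splits or prefixes the raw bytes itself, since the mark is part of the byte stream and not of the text.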

Re: State in the Scala DataStream API

2018-08-08 Thread Fabian Hueske
Hi Juan, The state will be purged if you return None instead of a Some. However, this only happens when the function is called for a specific key, i.e., state won't be automatically removed after some time. If this is your use case, you have to implement a ProcessFunction and use timers to

Re: Listing in Powered By Flink directory

2018-08-08 Thread Fabian Hueske
Thanks Amit! I've added Limeroad to the list with your description. Best, Fabian 2018-08-08 14:12 GMT+02:00 amit.jain : > Hi Fabian, > > We at Limeroad, are using Flink for multiple use-cases ranging from ETL > jobs, ClickStream data processing, real-time dashboard to CEP. Could you > list us

Listing in Powered By Flink directory

2018-08-08 Thread Fabian Hueske
Hi everybody, The Flink community maintains a directory of organizations and projects that use Apache Flink [1]. Please reply to this thread if you'd like to add an entry to this list. Thanks, Fabian [1] https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink

Re: JDBCInputFormat and SplitDataProperties

2018-08-08 Thread Fabian Hueske
Hi Alexis, First of all, I think you can leverage the partitioning and sorting properties of the data returned by the database using SplitDataProperties. However, please be aware that SplitDataProperties are a rather experimental feature. If used without query parameters, the JDBCInputFormat

Re: Need help regarding Flink Batch Application

2018-08-08 Thread Fabian Hueske
The code or the execution plan (ExecutionEnvironment.getExecutionPlan()) of the job would be interesting. 2018-08-08 10:26 GMT+02:00 Chesnay Schepler : > What have you tried so far to increase performance? (Did you try different > combinations of -yn and -ys?) > > Can you provide us with your

Re: Filter-Join Ordering Issue

2018-08-08 Thread Fabian Hueske
I've created FLINK-10100 [1] to track the problem and suggest a solution and workaround. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-10100 2018-08-08 10:39 GMT+02:00 Fabian Hueske : > Hi Dylan, > > Yes, that's a bug. > As you can see from the plan, the parti

Re: Accessing source table data from hive/Presto

2018-08-08 Thread Fabian Hueske
> Thanks for the reply. I was mainly thinking of the usecase of streaming > job. > In the approach to port to Flink's SQL API, is it possible to read parquet > data from S3 and register table in flink? > > > On Tue, Aug 7, 2018 at 1:05 PM, Fabian Hueske wrote: > >> Hi Mugun

Re: Filter-Join Ordering Issue

2018-08-08 Thread Fabian Hueske
Hi Dylan, Yes, that's a bug. As you can see from the plan, the partitioning step is pushed past the Filter. This is possible, because the optimizer knows that a Filter function cannot modify the data (it only removes records). A workaround should be to implement the filter as a FlatMapFunction.

Re: Small-files source - partitioning based on prefix of file

2018-08-08 Thread Fabian Hueske
Hi Averell, As Vino said, checkpoints store the state of all operators of an application. The state of a monitoring source function is the position in the currently read split and all splits that have been received and are currently pending. In case of a recovery, the splits are recovered and

Re: Accessing source table data from hive/Presto

2018-08-07 Thread Fabian Hueske
Hi Mugunthan, this depends on the type of your job. Is it a batch or a streaming job? Some queries could be ported to Flink's SQL API as suggested by the link that Hequn posted. In that case, the query would be executed in Flink. Other options are to use a JDBC InputFormat or persisting the

Re: Converting a DataStream into a Table throws error

2018-08-06 Thread Fabian Hueske
cannot be applied to (String, org.apache.flink.streaming. >>>> api.environment.StreamExecutionEnvironment, Symbol, Symbol, Symbol, >>>> Symbol) >>>> [error] tableEnv.registerDataStream("table1", streamExecEnv, 'key, >>>> 'ticker, 'timeissued

Re: Event Time Session Window does not trigger..

2018-08-06 Thread Fabian Hueske
Hi, By setting the time characteristic to EventTime, you enable the internal handling of record timestamps and watermarks. In contrast to EventTime, ProcessingTime does not require any additional data. You can use both, EventTime and ProcessingTime in the same application and

Re: Description of Flink event time processing

2018-08-02 Thread Fabian Hueske
discussion of your document. Elias, do you want to put your document into Markdown and open a PR for the documentation? Thanks, Fabian 2018-07-31 18:16 GMT+02:00 Fabian Hueske : > Hi Elias, > > Sorry for the delay. I just made a pass over the document. > I think it is very good. > >

Re: Multiple output operations in a job vs multiple jobs

2018-08-02 Thread Fabian Hueske
Hi, Paul is right. Which and how much data is stored in state for a window depends on the type of the function that is applied on the windows: - ReduceFunction: Only the reduced value is stored - AggregateFunction: Only the accumulator value is stored - WindowFunction or ProcessWindowFunction:

Re: Converting a DataStream into a Table throws error

2018-08-01 Thread Fabian Hueske
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all res

Re: Converting a DataStream into a Table throws error

2018-08-01 Thread Fabian Hueske
Hi I think you are mixing Java and Scala dependencies. org.apache.flink.streaming.api.datastream.DataStream is the DataStream of the Java DataStream API. You should use the DataStream of the Scala DataStream API. Best, Fabian 2018-08-01 14:01 GMT+02:00 Mich Talebzadeh : > Hi, > > I believed I

Re: Small-files source - partitioning based on prefix of file

2018-07-31 Thread Fabian Hueske
Hi Averell, please find my answers inlined. Best, Fabian 2018-07-31 13:52 GMT+02:00 Averell : > Hi Fabian, > > Thanks for the information. I will try to look at the change to that > complex > logic that you mentioned when I have time. That would save one more shuffle > (from 1 to 0), wouldn't

Re: Description of Flink event time processing

2018-07-31 Thread Fabian Hueske
: > Fabian, > > You have any time to review the changes? > > On Thu, Jul 19, 2018 at 2:19 AM Fabian Hueske wrote: > >> Hi Elias, >> >> Thanks for the update! >> I'll try to have another look soon. >> >> Best, Fabian >> >> 2018-07-11 1:3

Re: Event time didn't advance because of some idle slots

2018-07-31 Thread Fabian Hueske
Hi, If you are using a custom source, you can call SourceContext.markAsTemporarilyIdle() to indicate that a task is currently not producing new records [1]. Best, Fabian 2018-07-31 8:50 GMT+02:00 Reza Sameei : > It's not a real solution; but why you don't change the parallelism for > your

Re: Small-files source - partitioning based on prefix of file

2018-07-31 Thread Fabian Hueske
Hi Averell, The records emitted by the monitoring tasks are "just" file splits, i.e., meta information that defines which data to read from where. The reader tasks receive these splits and process them by reading the corresponding files. You could of course partition the splits based on the file

Re: watermark VS window trigger

2018-07-31 Thread Fabian Hueske
Hi, Watermarks are not holding back records. Instead they define the event-time at an operator (as Vino said) and can trigger the processing of data if the logic of an operator is based on time. For example, a window operator can emit complete results for a window once the time passed the
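A small plain-Java model may help to picture this. It is a sketch under assumptions and not Flink's window operator: TumblingCounter is a hypothetical class that counts records per tumbling window and emits a window once the watermark has reached the window's last timestamp (end minus one).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WatermarkWindowSketch {

    static final class TumblingCounter {
        private final long size;
        private final TreeMap<Long, Integer> countsByStart = new TreeMap<>();

        TumblingCounter(long size) {
            this.size = size;
        }

        /** Records are accepted immediately; the watermark does not hold them back. */
        void onRecord(long timestamp) {
            long start = timestamp - Math.floorMod(timestamp, size);
            countsByStart.merge(start, 1, Integer::sum);
        }

        /** Emits and drops every window whose last timestamp is covered by the watermark. */
        List<String> onWatermark(long watermark) {
            List<String> results = new ArrayList<>();
            Iterator<Map.Entry<Long, Integer>> it = countsByStart.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, Integer> window = it.next();
                long end = window.getKey() + size;
                if (end - 1 > watermark) {
                    break; // later windows are not complete yet
                }
                results.add("[" + window.getKey() + "," + end + ")=" + window.getValue());
                it.remove();
            }
            return results;
        }
    }

    public static void main(String[] args) {
        TumblingCounter counter = new TumblingCounter(10);
        counter.onRecord(1);
        counter.onRecord(5);
        counter.onRecord(12);
        System.out.println(counter.onWatermark(8));
        System.out.println(counter.onWatermark(9));
    }
}
```

The records enter the operator as they arrive; only the emission of the window result waits for the watermark.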

Re: Questions on Unbounded number of keys

2018-07-30 Thread Fabian Hueske
Hi Chang, The state handle objects are not created per key but just once per function instance. Instead they route state accesses to the backend (JVM heap or RocksDB) for the currently active key. Best, Fabian 2018-07-30 12:19 GMT+02:00 Chang Liu : > Hi Andrey, > > Thanks for your reply. My
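The routing described here can be modelled in a few lines of plain Java. This is only an illustration with hypothetical names (Backend, CounterHandle), not Flink's ValueState implementation: one handle object is created up front, and each access is resolved against whichever key is currently active.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KeyedStateHandleSketch {

    /** Holds the per-key values and the currently active key (hypothetical backend model). */
    static final class Backend {
        final Map<String, Long> values = new HashMap<>();
        String currentKey;
    }

    /** A single handle that routes every access to the backend entry of the active key. */
    static final class CounterHandle {
        private final Backend backend;

        CounterHandle(Backend backend) {
            this.backend = backend;
        }

        Long value() {
            return backend.values.get(backend.currentKey);
        }

        void update(long newValue) {
            backend.values.put(backend.currentKey, newValue);
        }
    }

    /** Counts occurrences per key using one handle for all keys. */
    static Map<String, Long> countPerKey(List<String> keys) {
        Backend backend = new Backend();
        CounterHandle handle = new CounterHandle(backend); // created once, not per key
        for (String key : keys) {
            backend.currentKey = key; // the runtime would set this before invoking the function
            Long current = handle.value();
            handle.update(current == null ? 1L : current + 1L);
        }
        return backend.values;
    }

    public static void main(String[] args) {
        System.out.println(countPerKey(Arrays.asList("a", "b", "a")));
    }
}
```

The number of handle objects therefore stays constant no matter how many keys exist; only the backend grows with the key space.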

Re: Working out through individual messages in Flink

2018-07-30 Thread Fabian Hueske
ich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be lia

Re: S3 file source - continuous monitoring - many files missed

2018-07-25 Thread Fabian Hueske
Hi, First of all, the ticket reports a bug (or improvement or feature suggestion) such that others are aware of the problem and understand its cause. At some point it might be picked up and implemented. In general, there is no guarantee whether or when this happens, but the Flink community is of

Re: S3 file source - continuous monitoring - many files missed

2018-07-25 Thread Fabian Hueske
Hi, Thanks for creating the Jira issue. I'm not sure if I would consider this a blocker but it is certainly an important problem to fix. Anyway, in the original version Flink checkpoints the modification timestamp up to which all files have been read (or at least up to which point it *thinks* to

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Fabian Hueske
Hi, The problem is that Flink tracks which files it has read by remembering the modification time of the file that was added (or modified) last. We use the modification time, to avoid that we have to remember the names of all files that were ever consumed, which would be expensive to check and
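How tracking by modification time can miss files is easy to see in a plain-Java model. This sketch is not the actual monitoring source; Monitor is a hypothetical class that remembers only the largest modification time seen so far and, on each scan, picks up files that are strictly newer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ModTimeTrackingSketch {

    static final class Monitor {
        private long maxSeenModTime = Long.MIN_VALUE;

        /** Returns the files of the listing that are newer than everything seen before. */
        List<String> scan(Map<String, Long> listing) {
            List<String> picked = new ArrayList<>();
            long newMax = maxSeenModTime;
            for (Map.Entry<String, Long> file : listing.entrySet()) {
                if (file.getValue() > maxSeenModTime) {
                    picked.add(file.getKey());
                    newMax = Math.max(newMax, file.getValue());
                }
            }
            maxSeenModTime = newMax;
            return picked;
        }
    }

    public static void main(String[] args) {
        Monitor monitor = new Monitor();

        Map<String, Long> firstListing = new TreeMap<>();
        firstListing.put("a", 10L);
        firstListing.put("c", 30L);
        System.out.println(monitor.scan(firstListing));

        // "b" becomes visible only now, although its modification time is older than "c".
        Map<String, Long> secondListing = new TreeMap<>(firstListing);
        secondListing.put("b", 20L);
        System.out.println(monitor.scan(secondListing));
    }
}
```

A file that shows up in a listing after a newer file has already been processed falls below the remembered timestamp and is never read, which matches the symptom reported in this thread.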

Re: Question regarding State in full outer join

2018-07-24 Thread Fabian Hueske
Hi Darshan, The join implementation in SQL / Table API does what is demanded by the SQL semantics. Hence, what results to emit and also what data to store (state) to compute these results is pretty much given. You can think of the semantics of the join as writing both streams into a relational

Re: IoT Use Case, Problem and Thoughts

2018-07-23 Thread Fabian Hueske
ing). > > I can see the point of making the checkpoint triggering more flexible and > giving some control to the user. In contrast to savepoints, checkpoints are > considered for recovery. My question here would be, what would be the > triggering condition in your case (other than

Re: Re: Bootstrapping the state

2018-07-23 Thread Fabian Hueske
Hi Henkka, You might want to consider implementing a dedicated job for state bootstrapping that uses the same operator UUID and state names. That might be easier than integrating the logic into your regular job. I think you have to use the monitoring file source because AFAIK it won't be

Re: Parallelism and keyed streams

2018-07-23 Thread Fabian Hueske
Hi, Flink guarantees order only within a partition. For example, if you have the program map_1 -> map_2 and both map functions run with parallelism 4, the order of records in each of the 4 partitions is not changed. In case of a shuffle (such as a keyBy or change in parallelism) records are

Re: Keeping only latest row by key?

2018-07-19 Thread Fabian Hueske
Hi James, Yes, that should also do the trick. Best, Fabian 2018-07-19 16:06 GMT+02:00 Porritt, James : > It looks like the following gives me the result I’m interested in: > > > > batchEnv > > .createInput(dataset) > > .groupBy("id") > >

Re: Parallel stream partitions

2018-07-19 Thread Fabian Hueske
Hi Nick, What Ken said is correct, but let me add two more things. 1) State Usually, you only need to partition (keyBy()) the data if you want to process tuples with the same key together. Therefore, it is necessary to hold some tuples or intermediate results (like partial or running

Re: Description of Flink event time processing

2018-07-19 Thread Fabian Hueske
>>> I think you did not enable access for comments for the link. Would >>> you mind enabling comments for the google doc? >>> >>> Thanks, >>> Rong >>> >>> >>> On Thu, Jul 5, 2018 at 8:39 AM Fabian Hueske wrote: >

Re: Production readiness of Flink Job Stop Service

2018-07-19 Thread Fabian Hueske
Hi Chirag, Stop with savepoint is not mentioned in the 1.5.0 release notes [1]. Since it's a frequently requested feature, I'm pretty sure that it would have been mentioned if it was added. Best, Fabian [1] http://flink.apache.org/news/2018/05/25/release-1.5.0.html 2018-07-19 8:39 GMT+02:00

Re: Why data didn't enter the time window in EventTime mode

2018-07-19 Thread Fabian Hueske
Hi Soheil, Hequn is right. This might be an issue with advancing event-time. You can monitor that by checking the watermarks in the web dashboard or print-debug it with a ProcessFunction which can lookup the current watermark. Best, Fabian 2018-07-19 3:30 GMT+02:00 Hequn Cheng : > Hi Soheil, >

Re: Race between window assignment and same window timeout

2018-07-19 Thread Fabian Hueske
Hi Shay, This sounds very much like the off-by-one bug described by FLINK-9857 [1]. The problem was identified in another recent user ml thread and fixed for Flink 1.5.2 and 1.6.0. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-9857 2018-07-18 19:00 GMT+02:00 Andrey Zagrebin : >

[ANNOUNCE] Program for Flink Forward Berlin 2018 has been announced

2018-07-17 Thread Fabian Hueske
Hi everyone, I'd like to announce the program for Flink Forward Berlin 2018. The program committee [1] assembled a program of about 50 talks on use cases, operations, ecosystem, tech deep dive, and research topics. The conference will host speakers from Airbnb, Amazon, Google, ING, Lyft,

Re: Flink on Mesos: containers question

2018-07-16 Thread Fabian Hueske
Hi Alexei, Till (in CC) is familiar with Flink's Mesos support in 1.4.x. Best, Fabian 2018-07-13 15:07 GMT+02:00 NEKRASSOV, ALEXEI : > Can someone please clarify how Flink on Mesos in containerized? > > > > On 5-node Mesos cluster I started Flink (1.4.2) with two Task Managers. > Mesos shows

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-16 Thread Fabian Hueske
Hi Gerard, Thanks for reporting this issue. I'm pulling in Nico and Piotr who have been working on the networking stack lately and might have some ideas regarding your issue. Best, Fabian 2018-07-13 13:00 GMT+02:00 Zhijiang(wangzhijiang999) < wangzhijiang...@aliyun.com>: > Hi Gerard, > > I

Re: State sharing across trigger and evictor

2018-07-16 Thread Fabian Hueske
Hi, I don't think that is possible. The Evictor interface does not provide access to a state store, so there is no way to access state. Best, Fabian 2018-07-10 13:26 GMT+02:00 Jayant Ameta : > Hi, > I'm using the GlobalWindow with a custom CountTrigger (similar to the > CountTrigger provided

Re: A use-case for Flink and reactive systems

2018-07-06 Thread Fabian Hueske
ver, a second PoC I was considering is related to Flink CEP. Let's >>> say I am elaborating sensor data, I want to have a rule which is working on >>> the following principle: >>> - If the temperature is more than 40 >>> - If the temperature yesterday at noon was more

Re: Slide Window Compute Optimization

2018-07-06 Thread Fabian Hueske
Hi Yennie, You might want to have a look at the OVER windows of Flink's Table API or SQL [1]. An OVER window computes an aggregate (such as a count) for each incoming record over a range of previous events. For example the query: SELECT ip, successful, COUNT(*) OVER (PARTITION BY ip, successful

Re: Description of Flink event time processing

2018-07-05 Thread Fabian Hueske
Hi Elias, Thanks for the great document! I made a pass over it and left a few comments. I think we should definitely add this to the documentation. Thanks, Fabian 2018-07-04 10:30 GMT+02:00 Fabian Hueske : > Hi Elias, > > I agree, the docs lack a coherent discussion of event time

Re: Dynamic Rule Evaluation in Flink

2018-07-05 Thread Fabian Hueske
Hi, > Flink doesn't support connecting multiple streams with heterogeneous schema This is not correct. Flink is very well able to connect streams with different schema. However, you cannot union two streams with different schema. In order to reconfigure an operator with changing rules, you can

Re: [Table API/SQL] Finding missing spots to resolve why 'no watermark' is presented in Flink UI

2018-07-05 Thread Fabian Hueske
ble.getSchema.getTypes) > tableEnv.toRetractStream[Row](outTable).print() > > > Thanks again, > Jungtaek Lim (HeartSaVioR) > > [1] https://issues.apache.org/jira/browse/FLINK-9742 > > On Wed, Jul 4, 2018 at 10:03 PM, Fabian Hueske wrote: > >> Hi, >> >> Glad you

Re: A use-case for Flink and reactive systems

2018-07-05 Thread Fabian Hueske
Hi Yersinia, The main idea of an event-driven application is to hold the state (i.e., the account data) in the streaming application and not in an external database like Couchbase. This design is very scalable (state is partitioned) and avoids look-ups from the external database because all state

Re: [Table API/SQL] Finding missing spots to resolve why 'no watermark' is presented in Flink UI

2018-07-04 Thread Fabian Hueske
("eventTime"), > new BoundedOutOfOrderTimestamps(Time.minutes(1).toMilliseconds) > ) > .build() > > Thanks again! > Jungtaek Lim (HeartSaVioR) > > On Wed, Jul 4, 2018 at 8:18 PM, Fabian Hueske wrote: > >> Hi Jungtaek, >> >> If it is "only&q

Re: Let BucketingSink roll file on each checkpoint

2018-07-04 Thread Fabian Hueske
Hi Xilang, I thought about this again. The bucketing sink would need to roll on event-time intervals (similar to the current processing time rolling) which are triggered by watermarks in order to support consistency. However, it would also need to maintain a write ahead log of all received rows

Re: [Table API/SQL] Finding missing spots to resolve why 'no watermark' is presented in Flink UI

2018-07-04 Thread Fabian Hueske
e tricky (as the semantic of SQL query is not for > multiple outputs). > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Wed, Jul 4, 2018 at 5:49 PM, Fabian Hueske wrote: > >> Hi Jungtaek, >> >> Flink 1.5.0 features a TableSource for Kafka and JSON [1], incl. >> ti

Re: How to implement Multi-tenancy in Flink

2018-07-04 Thread Fabian Hueske
Hi Ahmad, Some tricks that might help to bring down the effort per tenant if you run one job per tenant (or key per tenant): - Pre-aggregate records in a 5 minute Tumbling window. However, pre-aggregation does not work for FoldFunctions. - Implement the window as a custom ProcessFunction that

Re: Flink memory management in table api

2018-07-04 Thread Fabian Hueske
ording to above conversation flink will persist state forever for non > windowed operations. I want to know how flink persiat the state i.e. > Database or file system or in memory etc. > > On Wed, 4 Jul 2018 at 2:12 PM, Fabian Hueske wrote: > >> Hi Amol, >> >> The m

Re: [Table API/SQL] Finding missing spots to resolve why 'no watermark' is presented in Flink UI

2018-07-04 Thread Fabian Hueske
Hi Jungtaek, Flink 1.5.0 features a TableSource for Kafka and JSON [1], incl. timestamp & watermark generation [2]. It would be great if you could let us know if that addresses your use case and, if not, what's missing or not working. So far Table API / SQL does not have support for late-data side

Re: Flink memory management in table api

2018-07-04 Thread Fabian Hueske
Hi Amol, The memory consumption depends on the query/operation that you are doing. Time-based operations like group-window-aggregations, over-window-aggregations, or window-joins can automatically clean up their state once data is no longer needed. Operations such as non-windowed aggregations

Re: Why should we use an evictor operator in flink window

2018-07-04 Thread Fabian Hueske
Hi, The Evictor is useful if you want to remove some elements from the window state but not all. This also implies that a window is evaluated multiple times because otherwise you could just filter in the user function (as you suggested) and purge the whole window afterwards. Evictors are

Re: Description of Flink event time processing

2018-07-04 Thread Fabian Hueske
Hi Elias, I agree, the docs lack a coherent discussion of event time features. Thank you for this write up! I just skimmed your document and will provide more detailed feedback later. It would be great to add such a page to the documentation. Best, Fabian 2018-07-03 3:07 GMT+02:00 Elias Levy :

Re: How to partition within same physical node in Flink

2018-07-04 Thread Fabian Hueske
:37 GMT+02:00 ashish pok : > Thanks Fabian! It sounds like KeyGroup will do the trick if that can be > made publicly accessible. > > On Monday, July 2, 2018, 5:43:33 AM EDT, Fabian Hueske > wrote: > > > Hi Ashish, hi Vijay, > > Flink does not distinguish between di

Re: run time error java.lang.NoClassDefFoundError: org/apache/flink/streaming/connectors/kafka/FlinkKafkaConsumer09

2018-07-04 Thread Fabian Hueske
Looking at the other threads, I assume you solved this issue. The problem should have been that FlinkKafkaConsumer09 is not included in the flink-connector-kafka-0.11 module, because it is the connector for Kafka 0.9 and not Kafka 0.11. Best, Fabian 2018-07-02 11:20 GMT+02:00 Mich Talebzadeh :

Re: Passing type information to JDBCAppendTableSink

2018-07-04 Thread Fabian Hueske
There is also the SQL:2003 MERGE statement that can be used to implement UPSERT logic. It is a bit verbose but supported by Derby [1]. Best, Fabian [1] https://issues.apache.org/jira/browse/DERBY-3155 2018-07-04 10:10 GMT+02:00 Fabian Hueske : > Hi Chris, > > MySQL (and maybe o

Re: Passing type information to JDBCAppendTableSink

2018-07-03 Thread Fabian Hueske
Hi, In addition to what Rong said: - The types look OK. - You can also use Types.STRING, and Types.LONG instead of BasicTypeInfo.xxx - Beware that in the failure case, you might have multiple entries in the database table. Some databases support an upsert syntax which (together with key or

Re: Let BucketingSink roll file on each checkpoint

2018-07-03 Thread Fabian Hueske
Hi Xilang, Let me try to summarize your requirements. If I understood you correctly, you are not only concerned about the exactly-once guarantees but also need a consistent view of the data. The data in all files that are finalized need to originate from a prefix of the stream, i.e., all records

Re: Regarding external metastore like HIVE

2018-07-03 Thread Fabian Hueske
Hi, The docs explain that the ExternalCatalog interface *can* be used to implement a catalog for HCatalog or Metastore. However, there is no such implementation in Flink yet. You would need to implement such a catalog connector yourself. I think there would be quite a few people interested in

Re: Kafka Avro Table Source

2018-07-03 Thread Fabian Hueske
Hi Will, The community is currently working on improving the Kafka Avro integration for Flink SQL. There's a PR [1]. If you like, you could try it out and give some feedback. Timo (in CC) has been working on Kafka Avro and should be able to help with any specific questions. Best, Fabian [1]

Re: java.lang.NoSuchMethodError: org.apache.kafka.clients.consumer.KafkaConsumer.assign

2018-07-03 Thread Fabian Hueske
Hi Mich, FlinkKafkaConsumer09 is the connector for Kafka 0.9.x. Have you tried to use FlinkKafkaConsumer011 instead of FlinkKafkaConsumer09? Best, Fabian 2018-07-02 22:57 GMT+02:00 Mich Talebzadeh : > This is becoming very tedious. > > As suggested I changed the kafka dependency from > >

Re: The program didn't contain a Flink job

2018-07-03 Thread Fabian Hueske
Hi, Let me summarize: 1) Sometimes you get the error message "org.apache.flink.client.program.ProgramMissingJobException: The program didn't contain a Flink job." when submitting a program through the YarnClusterClient 2) The logs and the dashboard state that the job ran successfully 3) The job

Re: How to partition within same physical node in Flink

2018-07-02 Thread Fabian Hueske
e physical partitioning in a way where physical partiotioning happens >> first by parent key and localize grouping by child key, is there a need to >> using custom partitioner? Obviously we can keyBy twice but was wondering if >> we can minimize the re-partition stress. >> >

Re: How to partition within same physical node in Flink

2018-06-28 Thread Fabian Hueske
ne. > > I guess I might have to use a ThreadPool within each Slot(cam partition) > to work on each seq# ?? > > TIA > > On Tue, Jun 26, 2018 at 1:06 AM Fabian Hueske wrote: > >> Hi, >> >> keyBy() does not work hierarchically. Each keyBy() overrides the pre

Re: DataSet with Multiple reduce Actions

2018-06-28 Thread Fabian Hueske
Hi Osh, You can certainly apply multiple reduce functions on a DataSet; however, you should make sure that the data is only partitioned and sorted once. Moreover, you would end up with multiple data sets that you need to join afterwards. I think the easier approach is to wrap your functions in a

Re: Over Window Not Processing Messages

2018-06-28 Thread Fabian Hueske
ta is >>>> loaded into the system before the watermark advances. At that point the >>>> checkpoints stall indefinitely with a couple of the tasks in the 'over' >>>> operator never acknowledging. Any thoughts on what would cause that? Or how >>>

Re: Over Window Not Processing Messages

2018-06-27 Thread Fabian Hueske
Hi, The OVER window operator can only emit results when the watermark advances, due to SQL semantics which define that all records with the same timestamp need to be processed together. Can you check if the watermarks make sufficient progress? Btw. did you observe state size or IO issues? The

Re: high-availability.storageDir clean up?

2018-06-27 Thread Fabian Hueske
Hi Elias, Till (in CC) is familiar with Flink's HA implementation. He might be able to answer your question. Thanks, Fabian 2018-06-25 23:24 GMT+02:00 Elias Levy : > I noticed in one of our cluster that they are relatively old > submittedJobGraph* and completedCheckpoint* files. I was

Re: env.setStateBackend deprecated in 1.5 for FsStateBackend

2018-06-27 Thread Fabian Hueske
Hi, You can just add a cast to StateBackend to get rid of the deprecation warning: env.setStateBackend((StateBackend) new FsStateBackend("hdfs://myhdfsmachine:9000/flink/checkpoints")); Best, Fabian 2018-06-27 5:47 GMT+02:00 Rong Rong : > Hmm. > > If you have a wrapper function like this,
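Why a cast can silence the warning is a matter of Java overload resolution, which the following self-contained sketch shows. It assumes that the warning stems from a deprecated overload taking the more specific type; Backend, AbstractBackend, FsBackend and setBackend are hypothetical stand-ins, not Flink classes.

```java
final class OverloadCastSketch {

    interface Backend {}

    abstract static class AbstractBackend implements Backend {}

    static final class FsBackend extends AbstractBackend {}

    /** Older overload for the abstract class; selected when the static type is more specific. */
    @Deprecated
    static String setBackend(AbstractBackend backend) {
        return "deprecated overload";
    }

    /** Newer overload for the interface type. */
    static String setBackend(Backend backend) {
        return "current overload";
    }

    static String withoutCast() {
        // The compiler picks the most specific applicable overload: setBackend(AbstractBackend).
        return setBackend(new FsBackend());
    }

    static String withCast() {
        // With the static type Backend only the interface overload is applicable.
        return setBackend((Backend) new FsBackend());
    }

    public static void main(String[] args) {
        System.out.println(withoutCast());
        System.out.println(withCast());
    }
}
```

The object passed is the same in both calls; only the static type at the call site changes, and with it the overload the compiler binds to.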

Re: [Flink-9407] Question about proposed ORC Sink !

2018-06-27 Thread Fabian Hueske
Hi Sagar, That's more a question for the ORC community, but AFAIK, the top-level type is always a struct because it needs to wrap the fields, e.g., struct(name:string, age:int) Best, Fabian 2018-06-26 22:38 GMT+02:00 sagar loke : > @zhangminglei, > > Question about the schema for ORC format: >

Re: Measure Latency from source to sink

2018-06-26 Thread Fabian Hueske
Hi, Measuring latency is tricky and you have to be careful about what you measure. Aggregations like window operators make things even more difficult because you need to decide which timestamp(s) to forward (smallest?, largest?, all?) Depending on the operation, the measurement code might even

Re: How to partition within same physical node in Flink

2018-06-26 Thread Fabian Hueske
nt >> slots/threads on the same Task Manager instance(aka cam1 partition) using >> keyBy(seq#) & setParallelism() ? Can *forward* Strategy be used to >> achieve this ? >> >> TIA >> >> >> On Mon, Jun 25, 2018 at 1:03 AM Fabian Hueske wrote: >>

Re: Few question about upgrade from 1.4 to 1.5 flink ( some very basic )

2018-06-25 Thread Fabian Hueske
Hi Vishal, 1. I don't think a rolling update is possible. Flink 1.5.0 changed the process orchestration and how the processes communicate. IMO, the way to go is to start a Flink 1.5.0 cluster, take a savepoint of the running job, start from the savepoint on the new cluster and shut the old job down. 2.

Re: How to partition within same physical node in Flink

2018-06-25 Thread Fabian Hueske
Hi, Flink distributes task instances to slots and does not expose physical machines. Records are partitioned to task instances by hash partitioning. It is also not possible to guarantee that the records in two different operators are sent to the same slot. Sharing information by side-passing it
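A simplified sketch of what hash partitioning means here. It assumes a plain hashCode-modulo scheme, which is only an illustration and not Flink's actual key assignment; the class and method names are made up for this example. The point is that the target instance is a function of the key and the parallelism only, so nothing in the mapping refers to a slot or a physical machine:

```java
public class HashPartitionSketch {

    // Maps a key to one of `parallelism` task instances.
    // Simplified: the real assignment inside Flink is more involved.
    public static int targetInstance(Object key, int parallelism) {
        // floorMod keeps the result in [0, parallelism) even for negative hash codes.
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        String[] keys = {"cam1", "cam2", "cam3", "cam1"};
        for (String key : keys) {
            // The same key always maps to the same instance index.
            System.out.println(key + " -> instance " + targetInstance(key, parallelism));
        }
    }
}
```

Two operators that each apply such a mapping produce instance indexes, not slots, which is why records with the same key in different operators are not guaranteed to meet in the same slot.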

Re: Custom Watermarks with Flink

2018-06-25 Thread Fabian Hueske
Hi, I would not encode this information in watermarks. Watermarks are rather an internal mechanism to reason about event-time. Flink also generates watermarks internally. This makes the behavior less predictable. You could either inject special metadata records (which Flink handles just like

Re: Strictly use TLSv1.2

2018-06-22 Thread Fabian Hueske
Great, thank you! 2018-06-22 10:16 GMT+02:00 Vinay Patil : > Hi Fabian, > > Created a JIRA ticket : https://issues.apache.org/jira/browse/FLINK-9643 > > Regards, > Vinay Patil > > > On Fri, Jun 22, 2018 at 1:25 PM Fabian Hueske wrote: > >> Hi Vinay, >&

Re: Strictly use TLSv1.2

2018-06-22 Thread Fabian Hueske
Hi Vinay, This looks like a bug. Would you mind creating a Jira ticket [1] for this issue? Thank you very much, Fabian [1] https://issues.apache.org/jira/projects/FLINK 2018-06-21 9:25 GMT+02:00 Vinay Patil : > Hi, > > I have deployed Flink 1.3.2 and enabled SSL settings. From the ssl debug >

Re: SQL Do Not Support Custom Trigger

2018-06-22 Thread Fabian Hueske
Hi, Although this solution looks straightforward, custom triggers cannot be added that easily. The problem is that a window operator with a Trigger that emits early results produces updates, i.e., results that have been emitted might be updated later. The default Trigger only emits the final

Re: How to use broadcast variables in data stream

2018-06-21 Thread Fabian Hueske
seems not supported in Flink-1.3. > I found this in Flink-1.3: > Broadcasting > DataStream → DataStream > > Broadcasts elements to every partition. > > dataStream.broadcast(); > > But I don’t know how to convert it to list and get it in stream context. > > In June 2018

Re: How to use broadcast variables in data stream

2018-06-21 Thread Fabian Hueske
Hi, if the list is static and not too large, you can pass it as a parameter to the function. Function objects are serialized (using Java's default serialization) and shipped to the workers for execution. If the data is dynamic, you might want to have a look at Broadcast state [1]. Best, Fabian
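A minimal sketch of the static-list case in plain Java, without Flink on the classpath. AllowListFilter is a hypothetical stand-in for a user function (for example a filter), and roundTrip only simulates the shipping step with Java's default serialization. The list is handed to the constructor, stored in a field, and travels with the function object:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaticListFunction {

    // Hypothetical stand-in for a user function; not a Flink interface.
    public static class AllowListFilter implements Serializable {
        private static final long serialVersionUID = 1L;

        // The static list is ordinary serializable state of the function object.
        private final ArrayList<String> allowed;

        public AllowListFilter(List<String> allowed) {
            this.allowed = new ArrayList<>(allowed);
        }

        public boolean filter(String value) {
            return allowed.contains(value);
        }
    }

    // Simulates shipping: serialize the function object and read it back.
    public static AllowListFilter roundTrip(AllowListFilter function) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(function);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (AllowListFilter) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("function object is not serializable", e);
        }
    }

    public static void main(String[] args) {
        AllowListFilter shipped = roundTrip(new AllowListFilter(Arrays.asList("a", "b")));
        System.out.println(shipped.filter("a"));
        System.out.println(shipped.filter("z"));
    }
}
```

Every field of such a function has to be serializable, and a large list is copied along with each shipped function object, which is why this only suits lists that are static and not too large.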

Re: # of active session windows of a streaming job

2018-06-21 Thread Fabian Hueske
ger.clear(), not > Trigger.onClose(). > > Best, > - Dongwon > > > On Wed, Jun 20, 2018 at 7:30 PM, Chesnay Schepler > wrote: > >> Checkpointing of metrics is a manual process. >> The operator must write the current value into state, retrieve it on >> re

Re: Debug job execution from savepoint

2018-06-21 Thread Fabian Hueske
Hi Manuel, I had a look and couldn't find a way to do it. However, this sounds like a very useful feature to me. Would you mind creating a Jira issue [1] for that? Thanks, Fabian [1] https://issues.apache.org/jira/projects/FLINK 2018-06-18 16:23 GMT+02:00 Haddadi Manuel : > Hi all, > > > I
