Re: Need to understand the execution model of the Flink

2018-02-20 Thread Darshan Singh
… :34 AM, Fabian Hueske <fhue...@gmail.com> wrote: "No, there is no size or cardinality estimation happening at the moment. Best, Fabian" 2018-02-19 21:56 GMT+01:00 Darshan Singh <darshan.m...@gmail.com>: "Thanks, is there a metric or other way to …"

Need to understand the execution model of the Flink

2018-02-18 Thread Darshan Singh
Hi, I would like to understand the execution model. 1. I have a CSV file which is, say, 10 GB. 2. I created a table from this file. 3. Now I have created filtered tables on this, say 10 of them. 4. Now I created a writeToSink for each of these 10 filtered tables. Now my question is: are these 10 …
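For illustration, a minimal DataSet-API sketch of the pattern being described (paths, schema and filter values are made up, and the original job uses the Table API, so this is only an approximation):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class MultiSinkSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One logical source: a large CSV with (id, category) columns (made-up schema).
        DataSet<Tuple2<Integer, String>> input = env
                .readCsvFile("file:///data/big.csv")
                .types(Integer.class, String.class);

        // Ten filtered views of the same input, each with its own sink.
        for (int i = 0; i < 10; i++) {
            final String category = "cat-" + i;
            input.filter(t -> category.equals(t.f1))
                 .writeAsCsv("file:///out/" + category);
        }

        // One execute() submits a single job; whether the CSV is scanned once or ten
        // times depends on how the plan is translated (see the replies in this thread).
        env.execute("multi-sink sketch");
    }
}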

Re: Need to understand the execution model of the Flink

2018-02-18 Thread Darshan Singh
… "… step 2 table and read them out again. Don't ask *me* about what happens in failure scenarios... I have myself not figured that out yet. HTH, Niclas" On Mon, Feb 19, 2018 at 3:11 AM, Darshan Singh <darshan.m...@gmail.com> wrote: "H…"

Re: Need to understand the execution model of the Flink

2018-02-19 Thread Darshan Singh
The underlying problem is that the SQL optimizer cannot translate queries with multiple sinks. Instead, each sink is individually translated and the optimizer does not know that common execution paths could be shared. Best, Fabian. 2018-02-19 2:19 GMT+01:00 Darshan Singh <darshan.m...@gmail.com>: …

Re: Need to understand the execution model of the Flink

2018-02-19 Thread Darshan Singh
… "…agers? The web UI can help to diagnose such problems. Best, Fabian" 2018-02-19 11:22 GMT+01:00 Darshan Singh <darshan.m...@gmail.com>: "Thanks Fabian for such a detailed explanation. I am using a dataset in between, so I guess the CSV is read once. N…"

flink command line with save point job is not being submitted

2018-07-30 Thread Darshan Singh
I am trying to submit a job with a savepoint/checkpoint and it is failing with the error below. Without the -s flag it works fine. Am I missing something here? Thanks. >bin/flink run -d -c st -s file:///tmp/db/checkpoint/ ./target/poc-1.0-SNAPSHOT-jar-with-dependencies.jar Starting execution of …

Limit on number of files to read for Dataset

2018-08-13 Thread Darshan Singh
Hi guys, is there a limit on the number of files a Flink DataSet can read? My question is: will there be any sort of issue if I have, say, millions of files to read to create a single dataset? Thanks
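For what it's worth, reading a whole directory tree of files into one DataSet looks roughly like this (the path is made up; the recursive-enumeration flag is, as far as I recall, how nested directories get picked up):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class ManyFilesSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Enable recursive enumeration so every file under the directory tree is read.
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        DataSet<String> lines = env
                .readTextFile("hdfs:///data/many-small-files")   // made-up path
                .withParameters(parameters);

        System.out.println("total lines: " + lines.count());
    }
}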

Re: Limit on number of files to read for Dataset

2018-08-14 Thread Darshan Singh
… "Furthermore, if you have them stored on HDFS then the bottleneck is the namenode, which will have to answer millions of requests. The latter point will change in future Hadoop versions with http://ozone.hadoop.apache.org/" On 13. Aug 2018, at 21:01, Dar…

Re: Limit on number of files to read for Dataset

2018-08-14 Thread Darshan Singh
…0 Jörn Franke: "It causes more overhead (processes etc.) which might make it slower. Furthermore, if you have them stored on HDFS then the bottleneck is the namenode, which will have to answer millions of requests. The latter point will change in future Hadoop ve…"

Introduce Barriers in stream source

2018-08-13 Thread Darshan Singh
Hi, I am implementing a source; I want to use checkpointing and would like to restore the job from these external checkpoints. I used Kafka for my tests and it worked fine. However, I would like to know: if I have my own source, what do I need to do? I am sure that I will need to implement …
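For a custom source, the usual route is to implement CheckpointedFunction next to the SourceFunction, roughly like the sketch below; the single Long offset stands in for whatever read position the real source tracks:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class CheckpointedCounterSource extends RichParallelSourceFunction<Long>
        implements CheckpointedFunction {

    private volatile boolean running = true;
    private long offset = 0L;                      // stand-in for the real read position
    private transient ListState<Long> offsetState;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (running) {
            // Emit and advance the offset under the checkpoint lock so snapshots
            // never see a half-updated position.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(offset);
                offset++;
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        offsetState.clear();
        offsetState.add(offset);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("offset", Long.class));
        // On restore (from a checkpoint or savepoint) the previous offset comes back here.
        for (Long restored : offsetState.get()) {
            offset = restored;
        }
    }
}

Restoring from an externalized checkpoint then works the same way as for the Kafka source: the saved positions are handed back in initializeState().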

Re: Backpressure? for Batches

2018-08-29 Thread Darshan Singh
… "… upstream finishes, so the slower downstream will not block the upstream from running; then the backpressure may not exist in this case. Best, Zhijiang" From: Darshan Singh, Sent: Wednesday, 29 August 2018, 16:20 …

Backpressure? for Batches

2018-08-29 Thread Darshan Singh
I faced an issue with back pressure in streams, and I was wondering if we could face the same with batches as well. In theory it should be possible. But in the Web UI's backpressure tab for batches, I saw that it just showed the task status and no status like "OK" etc. So I was …

Kryo Serialization Issue

2018-08-22 Thread Darshan Singh
Hi, I am using a map function on a data stream which has one column, i.e. a JSON string. The map function simply uses the Jackson mapper to convert the String to an ObjectNode and also assigns a key based on one of the values in the ObjectNode. The code seems to work fine for 2-3 minutes as expected, and then …
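A rough sketch of such a map function (the userId field is made up). Note that ObjectNode is not a Flink POJO, so records fall back to Kryo serialization, which is typically where an error surfaces once a malformed record slips through:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class JsonToObjectNode extends RichMapFunction<String, ObjectNode> {

    // Created in open() rather than shipped with the closure; one mapper per subtask.
    private transient ObjectMapper mapper;

    @Override
    public void open(Configuration parameters) {
        mapper = new ObjectMapper();
    }

    @Override
    public ObjectNode map(String value) throws Exception {
        ObjectNode node = (ObjectNode) mapper.readTree(value);   // throws on malformed JSON
        JsonNode id = node.get("userId");                        // made-up key field
        node.put("key", id == null ? "unknown" : id.asText());
        return node;
    }
}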

Re: Kryo Serialization Issue

2018-08-27 Thread Darshan Singh
… "… have been a corrupted message in your decoding, for example a malformed JSON string. -- Rong [1] https://issues.apache.org/jira/browse/FLINK-8836" On Wed, Aug 22, 2018 at 8:41 AM Darshan Singh wrote: "Hi, I am using a map function …"

Re: Backpressure? for Batches

2018-08-29 Thread Darshan Singh
… "… record was emitted by the previous operator. As such, back-pressure exists in batch just like in streaming." On 29.08.2018 11:39, Darshan Singh wrote: "Thanks, my job is simple. I am using the Table API: 1. Read from HDFS. 2. Deserialize JSON to POJO and …"

Question regarding State in full outer join

2018-07-23 Thread Darshan Singh
Hi, I was looking at the new full outer join. This seems to be working fine for my use case; however, I have a question regarding the state size. I have 2 streams, each of which will have hundreds of millions of unique keys. Also, each of these will get updated values for keys hundreds of times per day. As per my …
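If unbounded state growth is the concern, the Table API's idle state retention can bound it. A sketch, assuming Flink 1.6-era APIs and already-registered tables a and b (both placeholders):

import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.StreamQueryConfig;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class OuterJoinRetentionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        // Tables "a" and "b" are assumed to be registered elsewhere (placeholders).
        Table joined = tableEnv.sqlQuery("SELECT * FROM a FULL OUTER JOIN b ON a.k = b.k");

        // Bound the join state: keys idle for more than 24h may be cleaned up,
        // state younger than 12h is always kept.
        StreamQueryConfig qConfig = tableEnv.queryConfig();
        qConfig.withIdleStateRetentionTime(Time.hours(12), Time.hours(24));

        tableEnv.toRetractStream(joined, Row.class, qConfig).print();
        env.execute("outer join with idle state retention");
    }
}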

Queryable State

2018-09-05 Thread Darshan Singh
Hi all, I was playing with queryable state. As a queryable stream cannot be modified, how do I use the output of, say, my reduce function for further processing? Below is one example. I can see that I am processing the data twice: once for the queryable stream and once for my original stream. That …
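As a point of reference, the duplication being described looks roughly like this (made-up tuple schema): the keyed stream is reduced once for downstream processing and exposed once more via asQueryableState(), which is a terminal operation:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;

public class QueryableStateSketch {
    public static void sketch(DataStream<Tuple2<String, Long>> input) {
        ReduceFunction<Tuple2<String, Long>> sum =
                (a, b) -> Tuple2.of(a.f0, a.f1 + b.f1);

        KeyedStream<Tuple2<String, Long>, Tuple> keyed = input.keyBy(0);

        // Path 1: running totals that the rest of the pipeline keeps processing.
        DataStream<Tuple2<String, Long>> totals = keyed.reduce(sum);
        totals.print();

        // Path 2: the same reduction exposed as queryable state. asQueryableState()
        // is terminal, so the reduce logic really is applied a second time.
        keyed.asQueryableState(
                "totals",
                new ReducingStateDescriptor<>("totals", sum, input.getType()));
    }
}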

Getting runtime context from scalar and table functions

2018-04-04 Thread Darshan Singh
Hi, I would like to use accumulators with table/scalar functions. However, I am not able to figure out how to get the runtime context from inside a scalar function's open method. The only thing I see is the FunctionContext, which cannot provide the runtime context. I tried using …
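FunctionContext indeed does not hand out the RuntimeContext, so accumulators are out of reach there, but it does expose a metric group; a small sketch of counting inside a scalar function that way (names made up):

import org.apache.flink.metrics.Counter;
import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

public class NormalizeAndCount extends ScalarFunction {

    private transient Counter nullInputs;

    @Override
    public void open(FunctionContext context) throws Exception {
        // No RuntimeContext (and hence no accumulators) here, but metrics are available.
        nullInputs = context.getMetricGroup().counter("nullInputs");
    }

    public String eval(String value) {
        if (value == null) {
            nullInputs.inc();
            return null;
        }
        return value.trim().toLowerCase();
    }
}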

Any metrics to get the shuffled and intermediate data in flink

2018-04-13 Thread Darshan Singh
Hi, is there any useful metric in Flink which tells me that a given operator read, say, 1 GB of data and shuffled (or anything else) and wrote (in case it was written to temp or anywhere else), say, 1 or 2 GB of data? One of my jobs is failing on disk space, and there are many sort, group, and join …

How to rebalance a table without converting to dataset

2018-04-13 Thread Darshan Singh
Hi, I have a table and I want to rebalance the data so that each partition is equal. I can convert to a dataset, rebalance, and then convert back to a table. I couldn't find any rebalance on the Table API. Does anyone have a better idea for rebalancing table data? Thanks
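For reference, the DataSet round-trip mentioned above is short; I am not aware of a rebalance directly on the Table API either. A sketch, assuming a BatchTableEnvironment:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class RebalanceTableSketch {
    public static Table rebalance(BatchTableEnvironment tableEnv, Table table) {
        // Table -> DataSet -> rebalance() -> Table; field names may need to be re-declared.
        DataSet<Row> rows = tableEnv.toDataSet(table, Row.class);
        return tableEnv.fromDataSet(rows.rebalance());
    }
}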

Re: Any metrics to get the shuffled and intermediate data in flink

2018-04-13 Thread Darshan Singh
… "… state per stage, which would be nice to have as well. Michael" On Apr 13, 2018, at 6:37 AM, Darshan Singh <darshan.m...@gmail.com> wrote: "Hi, is there any useful metric in Flink which tells me that a given operator read, say, …"

Re: Getting runtime context from scalar and table functions

2018-04-04 Thread Darshan Singh
… once the job is done. The number of these strings is very small. I thought I could create a custom accumulator which would work more like a list, and in the end I would get all values. I guess I will need to keep trying something. Thanks. On Wed, Apr 4, 2018 at 9:01 PM, Darshan Singh <darshan.m...@gmail.…

bad data output

2018-03-29 Thread Darshan Singh
Hi, I have a dataset in which almost 99% of the data is correct. As of now, if some data is bad I just ignore it, log it, and return only the correct data. I do this inside a map function. The part which decides whether data is correct or not is the expensive one. Now I want to store the bad data …

Re: bad data output

2018-04-03 Thread Darshan Singh
… "…s is up to you to benchmark and see if it fits you). Thanks, Kostas [1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/side_output.html" On Mar 29, 2018, at 8:53 PM, Darshan Singh <darshan.m...@gmail.com> wrote: …
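The side-output approach from the linked page looks roughly like this; it is a DataStream-API feature, so this assumes a streaming job rather than a DataSet, and the validation and paths are placeholders:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class BadDataSideOutput {

    // The OutputTag must be an anonymous subclass (note the {}) so type info is kept.
    static final OutputTag<String> BAD_DATA = new OutputTag<String>("bad-data") {};

    public static void sketch(DataStream<String> input) {
        SingleOutputStreamOperator<String> good = input.process(
                new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) {
                        if (isValid(value)) {             // the expensive check from the question
                            out.collect(value);           // main output: clean records
                        } else {
                            ctx.output(BAD_DATA, value);  // side output: bad records
                        }
                    }
                });

        good.getSideOutput(BAD_DATA).writeAsText("file:///tmp/bad-records");  // made-up path
        good.writeAsText("file:///tmp/good-records");                          // made-up path
    }

    private static boolean isValid(String value) {
        return value != null && !value.isEmpty();   // stand-in for the real validation
    }
}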

how to query the output of the scalar table function

2018-03-30 Thread Darshan Singh
Hi, I am not able to find the best way to query the output of a scalar table function. Suppose I have a table which has a column col1 which is a string. I have a scalar function that returns a POJO {col1V1 String, col1V2 String, col1V3 String}. I am using the following: table.select("sf(col1) …
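One option that I believe works here is to alias the composite result and expand it with the Table API's flatten() expression, or pull out a single field with get(). A sketch, assuming sf is the registered scalar function from the question:

import org.apache.flink.table.api.Table;

public class FlattenSketch {
    public static Table expand(Table table) {
        return table
                .select("sf(col1) as r")
                .select("r.flatten()");     // one column per field of the returned POJO
    }

    public static Table singleField(Table table) {
        // Or read just one field of the composite result.
        return table.select("sf(col1).get('col1V1') as v1");
    }
}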

flink custom stream source

2018-10-02 Thread Darshan Singh
Hi, I am creating a new custom source for reading some streaming data which has different streams. So I assign streams to each task slot and then read them. This works fine, but in some cases I have fewer streams than task slots, and in that case some of the workers are not assigned any streams and …
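A rough sketch of that assignment inside a parallel source: each subtask picks its share using its subtask index, and a subtask that ends up with nothing can mark itself idle instead of returning (the record-reading part is a placeholder):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class PartitionedStreamSource extends RichParallelSourceFunction<String> {

    private final List<String> allStreams;        // e.g. shard/stream names to read
    private volatile boolean running = true;

    public PartitionedStreamSource(List<String> allStreams) {
        this.allStreams = allStreams;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();

        // Deterministically assign streams to subtasks; a subtask may end up with none.
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < allStreams.size(); i++) {
            if (i % parallelism == subtask) {
                mine.add(allStreams.get(i));
            }
        }

        if (mine.isEmpty()) {
            // Nothing to read for this subtask: mark it idle and keep it alive rather
            // than returning, since a FINISHED subtask affects checkpointing.
            ctx.markAsTemporarilyIdle();
        }

        while (running) {
            for (String stream : mine) {
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect("record from " + stream);   // placeholder for the real read
                }
            }
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}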

Queryable State

2018-09-04 Thread Darshan Singh
Hi all, I was playing with queryable state. As a queryable stream cannot be modified, how do I use the output of, say, my reduce function for further processing? Below is one example. I am sure I have done it wrong :). I am using the reduce function twice; or do I need to use a rich reduce function and use …

Unit/Integration test for stateful source

2018-09-20 Thread Darshan Singh
Hi, I am writing a stateful source very similar to KafkaBaseConsumer but not as generic. I was looking at how we can write unit tests and integration tests for this. I looked at the Kafka-connector-based unit test cases. It seems that there are too many external things at play here, like lots of …