Re: Using Custom Partitioner in Streaming with parallelism 1 adding latency

2017-06-15 Thread sohimankotia
You are right, Aljoscha. The job graph is split after introducing the partitioner. I was under the impression that if parallelism is set, everything will be chained together. Can you explain how data will flow for map -> partitioner -> flatmap if parallelism, or it would be great if you point me to the right

Re: Guava version conflict

2017-06-15 Thread Flavio Pompermaier
Hi Gordon, any news on this? On Mon, Jun 12, 2017 at 9:54 AM, Tzu-Li (Gordon) Tai wrote: > This seems like a shading problem then. > I’ve tested this again with Maven 3.0.5, even without building against CDH > Hadoop binaries the flink-dist jar contains non-shaded Guava

Re: Stateful streaming question

2017-06-15 Thread Flavio Pompermaier
Hi Aljoscha, we're still investigating possible solutions here. Yes, as you correctly said, there are links between data of different keys, so we can proceed with the next job only once we are 100% sure that all input data has been consumed and no other data will be read until this last jobs

Re: Streaming use case: Row enrichment

2017-06-15 Thread Flavio Pompermaier
Hi Aljoscha, thanks for the great suggestions, I wasn't aware of AsyncDataStream.unorderedWait and BucketingSink setBucketer(). Most probably that's exactly what I was looking for... (I just need to find the time to test it. Just one last question: what are you referring to with "you could use a
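A rough sketch of how those two pieces could fit together in a Flink 1.3-era job; the InputRow/EnrichedRow types, the ParseRow mapper, the asynchronous Solr/ES lookup (queryExternalService) and the HDFS paths are hypothetical placeholders, not the actual code from this thread:

```java
import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

// Fragment meant to sit inside the job's main(); row types and ParseRow are placeholders.
DataStream<InputRow> rows = env.readTextFile("hdfs:///input/rows").map(new ParseRow());

// Asynchronously query the external service for each row; results may arrive out of order.
DataStream<EnrichedRow> enriched = AsyncDataStream.unorderedWait(
    rows,
    new AsyncFunction<InputRow, EnrichedRow>() {
        @Override
        public void asyncInvoke(InputRow row, AsyncCollector<EnrichedRow> collector) {
            // fire the Solr/ES lookup (hypothetical) and complete the collector in its callback
            queryExternalService(row).thenAccept(extra ->
                collector.collect(Collections.singletonList(new EnrichedRow(row, extra))));
        }
    },
    30, TimeUnit.SECONDS,   // per-request timeout
    100);                   // max in-flight requests

// Write the enriched rows back to HDFS, bucketed by date via setBucketer().
BucketingSink<EnrichedRow> sink = new BucketingSink<>("hdfs:///output/enriched");
sink.setBucketer(new DateTimeBucketer<EnrichedRow>("yyyy-MM-dd"));
enriched.addSink(sink);
```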

Re: Unable to use Flink RocksDB state backend due to endianness mismatch

2017-06-15 Thread Ziyad Muhammed
Hi Stefan, I could solve the issue by building frocksdb with a patch for the ppc architecture. However, this patch is already applied to the latest version of rocksdb, whereas frocksdb seems not to have been updated in a while. It would be nice to have it updated with this patch. Thanks Ziyad On Tue, Jun 6,

Re: Process event with last 1 hour, 1week and 1 Month data

2017-06-15 Thread shashank agarwal
Thanks Aljoscha Krettek I will try the same. On Thu, Jun 15, 2017 at 3:11 PM, Aljoscha Krettek wrote: > Hi, > > How would you evaluate such a query? I think the answer could be that you > have to keep all that older data around so that you can evaluate when a new > event

Re: coGroup exception or something else in Gelly job

2017-06-15 Thread Kaepke, Marc
Hi Greg, I wanted to ask if there is any news about the implementation or opportunities? Thanks and best regards, Marc On 12.06.2017 at 19:28, Kaepke, Marc wrote: I’m working on an implementation of SemiClustering [1]. I used two

Re: Using Custom Partitioner in Streaming with parallelism 1 adding latency

2017-06-15 Thread Aljoscha Krettek
Hi, Before adding the partitioner all your functions are chained together. That is, everything is executed in one Thread and sending elements from one function to the next is essentially just a method call. By introducing a partitioner you break this chain and therefore your job now has to send

Using Custom Partitioner in Streaming with parallelism 1 adding latency

2017-06-15 Thread sohimankotia
Hi, I have a streaming job which is running with parallelism 1 as of now. (This job will run with parallelism > 1 in the future.) So I have added a custom partitioner to partition the data based on one tuple field. The flow is: source -> map -> partitioner -> flatmap -> sink The partitioner is
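For reference, a minimal sketch of that flow, assuming a Flink 1.3-era DataStream job over tuples; the source, map, flatmap and sink functions and the choice of partitioning on tuple field 0 are hypothetical placeholders:

```java
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);

env.addSource(new MySource())                       // source
   .map(new EnrichMapper())                         // map (chained with the source)
   .partitionCustom(new Partitioner<String>() {     // custom partitioner on tuple field 0;
       @Override                                    // this introduces an exchange and breaks
       public int partition(String key, int numPartitions) {   // operator chaining
           return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
       }
   }, 0)
   .flatMap(new ExplodeFlatMap())                   // flatmap (now in a separate chain)
   .addSink(new MySink());                          // sink

env.execute("custom-partitioner-example");
```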

Re: User self resource file.

2017-06-15 Thread yunfan123
So your suggestion is that I create an archive of all the files in the resources. Then I get this file from the distributed cache and extract it to a path. And I use this path as my resource path? But when should I clear the temp path? -- View this message in context:

Re: User self resource file.

2017-06-15 Thread Aljoscha Krettek
I think the code that submits the job can create an archive of all the files in the “resources”, thus making sure that they stay together. This file would then be placed in the distributed cache. When executing, the contents of the archive can be extracted again and used, since they still
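A minimal sketch of that approach, assuming the batch ExecutionEnvironment's distributed cache; the HDFS path, the archive name and the unzipToTempDir()/runScript() helpers are hypothetical:

```java
import java.io.File;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

// At submission time: the client registers one archive containing all resources.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.registerCachedFile("hdfs:///jobs/my-job/resources.zip", "resources");

// On the task managers: a rich function fetches the archive from the distributed
// cache and extracts it into a job-local temporary directory before processing.
public class UsesResources extends RichMapFunction<String, String> {

    private transient File resourceDir;

    @Override
    public void open(Configuration parameters) throws Exception {
        File archive = getRuntimeContext().getDistributedCache().getFile("resources");
        resourceDir = unzipToTempDir(archive);   // hypothetical helper; delete the dir in close()
    }

    @Override
    public String map(String value) throws Exception {
        // scripts can reference each other by relative paths under resourceDir
        return runScript(value, resourceDir);    // hypothetical
    }
}
```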

Re: Streaming use case: Row enrichment

2017-06-15 Thread Aljoscha Krettek
Ok, just trying to make sure I understand everything. You have this: 1. a bunch of data in HDFS that you want to enrich; 2. an external service (Solr/ES) that you query for enriching the data rows stored in 1; 3. you need to store the enriched rows in HDFS again. I think you could just do this

Re: User self resource file.

2017-06-15 Thread yunfan123
Yes. My resource files are Python or other scripts that reference each other by relative paths. What I want is for all the resource files of one job to be placed in one directory, and the resource files of different jobs can't be placed in the same directory. The DistributedCache cannot guarantee this. -- View this

Re: Storm topology running in flink.

2017-06-15 Thread yunfan123
That returns a String specifying the resource path. Any suggestion about this? What I want is to copy the resources to a specific path on the task manager, and pass that specific path to my operator. -- View this message in context:

Re: User self resource file.

2017-06-15 Thread Aljoscha Krettek
I’m sensing this is related to your other question about adding a method to the RuntimeContext. Would it be possible to extract the resources from the Jar when submitting the program and place them in the distributed cache? Files can be registered using

Re: Storm topology running in flink.

2017-06-15 Thread Aljoscha Krettek
What would getUserResourceDir() return? In general, Flink devs are very reluctant to add new methods to such central interfaces because it’s not easy to fix them if they turn out to be broken or unneeded. Best, Aljoscha > On 15. Jun 2017, at 12:40, yunfan123 wrote: > > It

Re: union followed by timestamp assignment / watermark generation

2017-06-15 Thread Aljoscha Krettek
Hi, Yes, I can’t think of cases right now where placing the extractor after a union makes sense. In general, I think it’s always best to place the timestamp extractor as close to the sources (or in the sources, for Kafka) as possible. Right now it would be quite hard (and probably a bit hacky)
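To illustrate the "as close to the sources (or in the sources, for Kafka)" advice, a hedged sketch against the Flink 1.3-era Kafka consumer; the topics, event type, deserialization schema and the 10-second out-of-orderness bound are hypothetical:

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

Properties kafkaProps = new Properties();   // bootstrap.servers, group.id, ...

// One extractor, attached to each consumer so timestamps and watermarks are
// generated inside the Kafka sources rather than after the union.
BoundedOutOfOrdernessTimestampExtractor<MyEvent> extractor =
    new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
        @Override
        public long extractTimestamp(MyEvent event) {
            return event.getEventTime();   // hypothetical event-time field
        }
    };

FlinkKafkaConsumer010<MyEvent> consumerA =
    new FlinkKafkaConsumer010<>("topic-a", new MyEventSchema(), kafkaProps);
consumerA.assignTimestampsAndWatermarks(extractor);

FlinkKafkaConsumer010<MyEvent> consumerB =
    new FlinkKafkaConsumer010<>("topic-b", new MyEventSchema(), kafkaProps);
consumerB.assignTimestampsAndWatermarks(extractor);

// Both inputs already carry timestamps/watermarks, so no extractor is needed after the union.
DataStream<MyEvent> unioned = env.addSource(consumerA).union(env.addSource(consumerB));
```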

Re: union followed by timestamp assignment / watermark generation

2017-06-15 Thread Petr Novotnik
Hello Aljoscha, Fortunately, I found the program in Google's caches :) I've attached it below for reference. I'm stunned by how accurately you have hit the point given the few pieces of information I left in the original text. +1 Yes, it's exactly as you explained. Can you think of a scenario where

Re: Storm topology running in flink.

2017-06-15 Thread yunfan123
It seems OK, but flink-storm does not support Storm's codeDir. I'm working on making flink-storm support the codeDir. To support the code dir, I have to add a new function such as getUserResourceDir() (which may return null) to the Flink RuntimeContext. I know this may be a big change. What do you think of

Re: How to divide streams on key basis and deliver them

2017-06-15 Thread AndreaKinn
Thank you a lot Carst, Flink runs at a higher level than I imagined. I will try some experiments! -- View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-divide-streams-on-key-basis-and-deliver-them-tp13743p13755.html Sent from the

Re: Storm topology running in flink.

2017-06-15 Thread Aljoscha Krettek
Hi, I’m afraid the Storm compatibility layer has not been touched in a while and was never more than a pet project. Do you have an actual use case for running Storm topologies on top of Flink? To be honest, I think it might be easier to simply port spouts to Flink SourceFunctions and bolts to
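As a sketch of what porting a spout to a Flink SourceFunction could look like; the pollExternalSystem() call stands in for whatever the spout's nextTuple() did and is not part of the original discussion:

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class PortedSpoutSource implements SourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            String record = pollExternalSystem();   // hypothetical: the spout's former nextTuple() logic
            if (record != null) {
                // emit under the checkpoint lock so checkpoints see a consistent state
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(record);
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```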

Re: Process event with last 1 hour, 1week and 1 Month data

2017-06-15 Thread Aljoscha Krettek
Hi, How would you evaluate such a query? I think the answer could be that you have to keep all that older data around so that you can evaluate when a new event arrives. In Flink, you could use a ProcessFunction for that and use a MapState that keeps events bucketed into one-week intervals.
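A rough sketch of that idea, using Flink 1.3-era keyed state; Event, Result and the evaluate() call are hypothetical placeholders, and the function assumes a keyed stream with event-time timestamps already assigned:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class LookbackProcessFunction extends ProcessFunction<Event, Result> {

    private static final long WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    private transient MapState<Long, List<Event>> eventsByWeek;

    @Override
    public void open(Configuration parameters) {
        eventsByWeek = getRuntimeContext().getMapState(new MapStateDescriptor<Long, List<Event>>(
            "eventsByWeek",
            TypeInformation.of(Long.class),
            TypeInformation.of(new TypeHint<List<Event>>() {})));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Result> out) throws Exception {
        long weekBucket = ctx.timestamp() / WEEK_MS;   // bucket events into one-week intervals

        List<Event> bucket = eventsByWeek.get(weekBucket);
        if (bucket == null) {
            bucket = new ArrayList<>();
        }
        bucket.add(event);
        eventsByWeek.put(weekBucket, bucket);

        // Evaluate the 1-hour / 1-week / 1-month views against the relevant buckets;
        // buckets older than a month can be dropped here or via timers.
        out.collect(evaluate(event, eventsByWeek));    // hypothetical evaluation
    }
}
```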

Re: Fink: KafkaProducer Data Loss

2017-06-15 Thread Aljoscha Krettek
Hi Ninad, I discussed a bit with Gordon and we have some follow-up questions and some theories as to what could be happening. Regarding your first case (the one where data loss is observed): How do you ensure that you only shut down the brokers once Flink has read all the data that you expect

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-15 Thread Kurt Young
I think the only way is adding more managed memory. The large record handler only takes effect on the reduce side, where it is used by the merge sorter. According to the exception, it is thrown during the combining phase, which only uses an in-memory sorter that doesn't have the large record handling mechanism.

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-15 Thread Stephan Ewen
Here are some pointers: - You would rather need MORE managed memory, not less, because the sorter uses that. - We added the "large record handler" to the sorter for exactly these use cases. Can you check in the code whether it is enabled? You'll have to go through a bit of the code to see

Re: How to divide streams on key basis and deliver them

2017-06-15 Thread Carst Tankink
Ugh, accidentally pressed send already…. When you run your application, Flink will map your logical/application topology onto a number of task slots (documented in more detail here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/job_scheduling.html). Basically,

Re: How to divide streams on key basis and deliver them

2017-06-15 Thread Carst Tankink
Hi, Let me try to explain this from another user’s perspective ☺ When you run your application, Flink will map your logical/application topology onto a number of task slots (documented in more detail here: