RE: Kyro Intermittent Exception for Large Data

2016-02-18 Thread Ken Krugler
fink use Java serialization ? > > Thanks a lot. > > ERROR 1 > > > > ERROR 2 > > > > > -- > Welly Tambunan > Triplelands > > http://weltam.wordpress.com > http://www.triplelands.com -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Input on training exercises

2016-03-18 Thread Ken Krugler
Hi list, What's the right way to provide input on the training exercises? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Javadoc version

2016-03-18 Thread Ken Krugler
javadocs are slightly out of > sync, but usually we only merge fixes to the "release-x.y." branches, so > there should never be inconsistencies, just fixes. > > > > On Tue, Mar 15, 2016 at 5:34 PM, Ken Krugler <kkrugler_li...@transpac.com> > wrote: >

Javadoc version

2016-03-15 Thread Ken Krugler
Perusing the docs, and noticed this... https://ci.apache.org/projects/flink/flink-docs-release-1.0/api/java/ says "flink 1.0-SNAPSHOT API" I assume this shouldn't be called the snapshot version. -- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunl

RE: S3 Timeouts with lots of Files Using Flink 0.10.2

2016-03-19 Thread Ken Krugler
InputFormat$InputSplitOpenThread.run(FileInputFormat.java:841) > > at > org.apache.flink.api.common.io.FileInputFormat$InputSplitOpenThread.waitForCompletion(FileInputFormat.java:890) > > at > org.apache.flink.api.common.io.FileInputFormat.open(FileInputFormat.java:676)

Re: Integrate Flink with S3 on EMR cluster

2016-04-04 Thread Ken Krugler
InputFormat.java:450) > > at > org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:57) > > at > org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:156) > > ... 23 more > > > > Somehow, it's still tries to use EMRFS (which may be a valid thing?), but it > is failing to initialize. I don't know enough about EMRFS/S3 interop so don't > know how diagnose it further. > > I run Flink 1.0.0 compiled for Scala 2.11. > > Any advice on how to make it work is highly appreciated. > > > > Thanks, > > Timur > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ken Krugler
e job is created via the Cascading-Flink planner. Fabian would know best. — Ken > > Cheers, > Aljoscha > > On Thu, 28 Apr 2016 at 00:52 Ken Krugler <kkrugler_li...@transpac.com > <mailto:kkrugler_li...@transpac.com>> wrote: > Hi all, > > In trying out

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ken Krugler
t; appear as a FAILED job in the web frontend). Is it really 15? Same as above, the Flink web UI is gone once the job has failed. Any suggestions for how to check the actual parallelism in this type of transient YARN environment? Thanks, — Ken > On Thu, Apr 28, 2016 at 12:52 AM, Ken Krugle

Re: Checking actual config values used by TaskManager

2016-04-29 Thread Ken Krugler
onf that was actually deployed with tasks is critical. — Ken > > On Thu, Apr 28, 2016 at 9:00 PM, Ken Krugler <kkrugler_li...@transpac.com > <mailto:kkrugler_li...@transpac.com>> wrote: > Hi all, > > I’m running jobs on EMR via YARN, and wondering how to check exactl

Re: Command line arguments getting munged with CLI?

2016-04-27 Thread Ken Krugler
/commons-cli/>) it seems that double vs > single dash could make a difference, so you could try it. > > > On Tue, Apr 26, 2016 at 11:45 AM, Ken Krugler <kkrugler_li...@transpac.com > <mailto:kkrugler_li...@transpac.com>> wrote: > Hi all, > > I’m running this

Reducing parallelism leads to NoResourceAvailableException

2016-04-27 Thread Ken Krugler
with: NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism… But the change was to reduce parallelism - why would that now cause this problem? Thanks, — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data

Wildcards with --classpath parameter in CLI

2016-04-26 Thread Ken Krugler
Hi all, If I want to include all of the jars in a directory, I thought I could do --classpath file://http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Anyone going to ApacheCon Big Data in Vancouver?

2016-04-28 Thread Ken Krugler
ter-way-for-faster-workflows-ken-krugler-scale-unlimited> on my experience with using the Flink planner for Cascading.

EMR vCores and slot allocation

2016-04-28 Thread Ken Krugler
/config.html <https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html>, the recommended configuration would then be 4 slots/TaskManager, yes? Thanks, — Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions &

Re: How to measure Flink performance

2016-05-13 Thread Ken Krugler
;>> Regards >>> Prateek >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-measure-Flink-performance-tp6741p6863.html >>> >>>

Re: Checking actual config values used by TaskManager

2016-05-04 Thread Ken Krugler
anager.tmp.dirs, as an example. Thanks, — Ken > On Fri, Apr 29, 2016 at 3:18 PM, Ken Krugler > <kkrugler_li...@transpac.com> wrote: >> Hi Timur, >> >> On Apr 28, 2016, at 10:40pm, Timur Fayruzov <timur.fairu...@gmail.com> >> wrote: >> >>

Re: Processing millions of messages in milliseconds real time -- Architecture guide required

2016-04-21 Thread Ken Krugler
nhancer and then finally a processor. > I thought about using data cache as well for serving the data > The data cache should have the capability to serve the historical data in > milliseconds (may be upto 30 days of data) > > > -- > Thanks > Deepak

Current alternatives for async I/O

2016-10-08 Thread Ken Krugler
proach? Thanks, — Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Iterations vs. combo source/sink

2016-09-30 Thread Ken Krugler
s.apache.org/jira/browse/FLINK-3257> > [2] > https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html#using-the-keyvalue-state-interface > > <https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html#using-the-key

Re: TimelyFlatMapFunction and DataStream

2016-11-02 Thread Ken Krugler
the collector. So unkeyed data, yes. Though it wasn’t clear from the docs if/how I would set up a timer that regularly fires (say every 100ms). In any case I can keep using my current approach of having a “tickler” Tuple0 stream that I use with CoFlatMapFunctions. Regards, — Ken > On Wed,

TimelyFlatMapFunction and DataStream

2016-11-01 Thread Ken Krugler
using it with a DataStream.flatMap(xxx) call. Thanks, — Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Testing iterative data flows

2016-10-26 Thread Ken Krugler
really “done” in the context of a test. Are there any other approaches with current versions of Flink that would be better than an arbitrary timeout? Thanks, — Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & train

Iterating over keys in state backend

2017-04-26 Thread Ken Krugler
Is there a way to iterate over all of the key/value entries in the state backend, from within the operator that’s making use of the same? E.g. I’ve got a ReducingState, and on a timed interval (inside of the onTimer method) I need to iterate over all KV state and emit the N “best” entries.

Re: Iterating over keys in state backend

2017-04-27 Thread Ken Krugler
xture of RAM and disk. But having to checkpoint it isn’t trivial. So I thought that if there was a way to (occasionally) iterate over the keys in the state backend, I could get what I needed with the minimum effort. But sounds like that’s not possible currently. Thanks, — Ken >>

Re: Iterating over keys in state backend

2017-05-01 Thread Ken Krugler
d at each watermark, > elements with timestamps smaller than the watermark are processed. > > Hope this helps, > Kostas > >> On Apr 28, 2017, at 4:08 AM, Ken Krugler <kkrugler_li...@transpac.com >> <mailto:kkrugler_li...@transpac.com>> wrote: >> >&

Unusual log message - Emitter thread got interrupted

2017-10-04 Thread Ken Krugler
avoid this situation? Or is this expected and just the way things are currently? Just FYI, my topology is here: https://s3.amazonaws.com/su-public/flink-crawler+topology.pdf <https://s3.amazonaws.com/su-public/flink-crawler+topology.pdf> Thanks, — Ken ---------- Ken Krugle

Monitoring job w/LocalStreamEnvironment

2017-10-12 Thread Ken Krugler
Hi all, With an iteration-based workflow, it’s helpful to be able to monitor the job counters and explicitly terminate when the test has completed. I didn’t see support for async job creation, though. So I extended LocalStreamEnvironment to add an executeAsync(), which returns the

Re: Unusual log message - Emitter thread got interrupted

2017-10-09 Thread Ken Krugler
; I don't have much experience with streaming iterations. >> Maybe Aljoscha (in CC) has an idea what is happening and if it can be >> prevented. >> >> Best, Fabian >> >> 2017-10-05 1:33 GMT+02:00 Ken Krugler <kkrugler_li...@transpac.com >> <ma

Scale by the Bay Flink talks

2017-11-12 Thread Ken Krugler
Hi all, I’m going to be at Scale by the Bay later this week, and am looking forward to hearing about Flink during the formal talks. But I was wondering if anyone knows whether there will also be informal talks/meetups on Flink during the event. Thanks, — Ken -- Ken

Issue with back pressure and AsyncFunction

2017-11-10 Thread Ken Krugler
hands off the tuple to this AsyncIO-Emitter-Thread thread, which is why none of my code (either AsyncFunctions or threads in my pool doing async stuff) shows up in the thread dump. And I’m assuming that the back pressure calculation isn’t associating these threads with the

Use of AggregateFunction's merge() method

2018-05-04 Thread Ken Krugler
I’m trying to figure out when/why the AggregateFunction.merge() method is called in a streaming job, to ensure I’ve implemented it properly. The

Re: Stashing key with AggregateFunction

2018-05-04 Thread Ken Krugler
/windows.html#processwindowfunction-with-incremental-aggregation > > <https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/windows.html#processwindowfunction-with-incremental-aggregation> > > 2018-05-03 19:53 GMT+02:00 Ken Krugler <kkrugler_li..

Re: pre-initializing global window state

2018-05-07 Thread Ken Krugler
Hi Jelmer, Three comments, if I understand your use case correctly… 1. I would first try using RockDB with incremental checkpointing , before deciding that an alternative approach is required. 2. Have you considered

Flink 1.5 release timing

2018-05-09 Thread Ken Krugler
should block the release, and of those, only one isn’t in progress - that’s FLINK-9190. <https://issues.apache.org/jira/browse/FLINK-9190> ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Stashing key with AggregateFunction

2018-05-03 Thread Ken Krugler
he key from the record and set it on the accumulator, so I can use it in the getResult() call. Is this expected, or am I miss-using the functionality? Thanks, — Ken ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Flink, Solr, Hadoop,

Problems caused by use of hadoop classpath bash command

2018-01-24 Thread Ken Krugler
Hi all, With Flink 1.4 and FLINK-7477 , I ran into a problem with jar versions for HttpCore, when using the AWS SDK to read from S3. I believe the issue is that even when setting classloader.resolve-order to child-first in flink-conf.yaml, the

Avoiding deadlock with iterations

2018-01-24 Thread Ken Krugler
Hi all, We’ve run into deadlocks with two different streaming workflows that have iterations. In both cases, the issue is with fan-out; if any operation in the loop can emit more records than consumed, eventually a network buffer fills up, and then everyone in the iteration loop is blocked.

Re: Getting Key from keyBy() in ProcessFunction

2018-02-02 Thread Ken Krugler
verything you need. — Ken PS - Check out https://www.slideshare.net/dataArtisans/apache-flink-training-datastream-api-processfunction <https://www.slideshare.net/dataArtisans/apache-flink-training-datastream-api-processfunction> -- Ken Krugler http://www.scaleunlimited

Spurious warning in logs about flink-queryable-state-runtime

2018-01-31 Thread Ken Krugler
Hi all, In unit tests that use the LocalFilinkMiniCluster, with Flink 1.4, I now get this warning in my logs: > 18/01/31 13:28:19 WARN query.QueryableStateUtils:76 - Could not load > Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime > is not in the classpath. Please

Re: Getting Key from keyBy() in ProcessFunction

2018-02-04 Thread Ken Krugler
ValueState for every element if the > key is already set or just set it every time again the processElement method > is invoked. > > Best, > Jürgen > > On 02.02.2018 18:37, Ken Krugler wrote: >> Hi Jürgen, >> >>> On Feb 2, 2018, at 6:24 AM, Jürgen Thoma

Re: Iterating over state entries

2018-02-19 Thread Ken Krugler
ost efficient approach, yes? Thanks, — Ken > 2018-02-19 0:10 GMT+01:00 Ken Krugler <kkrugler_li...@transpac.com > <mailto:kkrugler_li...@transpac.com>>: > Hi there, > > I’ve got a MapState where I need to iterate over the entries. > > This currently isn’t supported

Re: A "per operator instance" window all ?

2018-02-19 Thread Ken Krugler
hen to use a >>>>> time window. >>>>> >>>>> In my example above, the will be 4. And >>>>> all my keys will be 0, 1, 2 or 3. >>>>> >>>>> The issue with this approach is that due to the way the operatorIdx is >>>>> computed based on the key, it does not distribute well my processing: >>>>> >>>>> when this partitioning logic from the "KeyGroupRangeAssignment" class is >>>>> applied >>>>> /** >>>>> * Assigns the given key to a parallel operator index. >>>>> * >>>>> * @param key the key to assign >>>>> * @param maxParallelism the maximum supported parallelism, aka the >>>>> number of key-groups. >>>>> * @param parallelism the current parallelism of the operator >>>>> * @return the index of the parallel operator to which the given key >>>>> should be routed. >>>>> */ >>>>> public static int assignKeyToParallelOperator(Object key, int >>>>> maxParallelism, int parallelism) { >>>>> return computeOperatorIndexForKeyGroup(maxParallelism, >>>>> parallelism, assignToKeyGroup(key, maxParallelism)); >>>>> } >>>>> >>>>> /** >>>>> * Assigns the given key to a key-group index. >>>>> * >>>>> * @param key the key to assign >>>>> * @param maxParallelism the maximum supported parallelism, aka the >>>>> number of key-groups. >>>>> * @return the key-group to which the given key is assigned >>>>> */ >>>>> public static int assignToKeyGroup(Object key, int maxParallelism) { >>>>> return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism); >>>>> } >>>>> key 0, 1, 2 and 3 are only assigned to operator 2 and 3 (so 2 over my 4 >>>>> operators will not have anything to do) >>>>> >>>>> >>>>> So, what will be the best way to deal with that? >>>>> >>>>> >>>>> >>>>> Thank you in advance for your support. >>>>> -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Which test cluster to use for checkpointing tests?

2018-02-23 Thread Ken Krugler
Hi all, For testing checkpointing, is it possible to use LocalFlinkMiniCluster? Asking because I’m not seeing checkpoint calls being made to my custom function (implements ListCheckpointed) when I’m running with LocalFlinkMiniCluster. Though I do see entries like this logged: 18/02/23

Re: Which test cluster to use for checkpointing tests?

2018-02-26 Thread Ken Krugler
nting. Anything else I should do to debug whether checkpointing is operating as expected? In the logs, at DEBUG level, I don’t see any errors or warnings related to this. Thanks, — Ken > > > Nico > > On 23/02/18 21:42, Ken Krugler wrote: >> Hi all, >

Re: Flink keyed stream windows

2018-08-13 Thread Ken Krugler
ique users every 30 minutes. > > Since I have a lot of unique users(rpm 1.5 million), how to use Flink's timed > windows on keyed stream to solve this problem. > > Please help! > > Thanks, > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

OneInputStreamOperatorTestHarness.snapshot doesn't include timers?

2018-08-15 Thread Ken Krugler
context.timerService().currentProcessingTime() + _period; LOGGER.info("Setting initial timer for {}", firstTimestamp); context.timerService().registerProcessingTimeTimer(firstTimestamp); } out.collect(inp

Re: OneInputStreamOperatorTestHarness.snapshot doesn't include timers?

2018-08-16 Thread Ken Krugler
org/jira/browse/FLINK-10159 > <https://issues.apache.org/jira/browse/FLINK-10159> > > Piotrek > >> On 16 Aug 2018, at 00:24, Ken Krugler > <mailto:kkrugler_li...@transpac.com>> wrote: >> >> Hi all, >> >> It looks to me like the OperatorSu

Re: Parallel stream partitions

2018-07-17 Thread Ken Krugler
> I need to remove the crossover in the middle box, so [1] -> [1] -> [1] and > [2] -> [2] -> [2], instead of [1] -> [1] -> [1 or 2] . > > Nick -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: [DISCUSS] Inverted (child-first) class loading

2018-03-09 Thread Ken Krugler
I can’t believe I’m suggesting this, but perhaps the Elasticsearch “Hammer of Thor” (aka “jar hell”) approach would be appropriate here. Basically they prevent a program from running if there are duplicate classes on the classpath. This causes headaches when you really need a different version

Re: [DISCUSS] Inverted (child-first) class loading

2018-03-12 Thread Ken Krugler
> - forbid duplicate classes > - parent first conflict resolution > - child first conflict resolution > > Having number one as the default and let the error message suggest options > two and three as options would definitely make users aware of the issue... > &g

Re: keyBy and parallelism

2018-04-12 Thread Ken Krugler
I’m not sure I understand the actual use case, but … Using a rebalance() to randomly distribute keys to operators is what I think you’d need to do to support “even if I have less keys that slots, I wants each slot to take his share in the work” So it sounds like you want to (a) broadcast all

Re: Classloading issues after changing to 1.4

2018-04-14 Thread Ken Krugler
When we transitioned from 1.3 to 1.4, we ran into some class loader issues. Though we weren’t using any sophisticated class loader helicopter stunts :) Specifically… 1. Re-worked our pom.xml to set up shading to better mirror what the 1.4 example pom was doing. 2. Enabled child-first

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Ken Krugler
Hi Chesnay, Don’t know if this helps, but I’d run into this as well, though I haven’t hooked up YourKit to analyze exactly what’s causing the memory problem. E.g. after about 3.5 hours running locally, it failed with memory issues. In the TaskManager logs, I start seeing exceptions in my

Re: Substasks - Uneven allocation

2018-04-18 Thread Ken Krugler
Hi Pedro, That’s interesting, and something we’d like to be able to control as well. I did a little research, and it seems like (with some stunts) there could be a way to achieve this via CoLocationConstraint

Re: data enrichment with SQL use case

2018-04-16 Thread Ken Krugler
use flink backend (ROKSDB) and > create read/write through > macanisem ? > > Thanks > > miki > > > > On Mon, Apr 16, 2018 at 2:45 AM, Ken Krugler <kkrugler_li...@transpac.com > <mailto:kkrugler_li...@transpac.com>> wrote: > If the SQL data i

Re: data enrichment with SQL use case

2018-04-15 Thread Ken Krugler
elect the table MD > then convert the kafka stream to table and join the data by the stream key . > > At the end i need to map the joined data to a new POJO and send it to > elesticserch . > > Any suggestions or different ways to solve this use case ? >

Confusing debug level log output with Flink 1.5

2018-04-18 Thread Ken Krugler
Hi Till, I just saw https://issues.apache.org/jira/browse/FLINK-9215 I’ve been trying out 1.5, and noticed similar output in my logs, e.g. 18/04/18 17:33:47 DEBUG slotpool.SlotPool:751 - Releasing slot with slot request id

Re: Multi threaded operators?

2018-04-23 Thread Ken Krugler
Hi Alex, Given that you’re hitting a DB, the approach of using multi-threaded access from a CoFlatMapFunction or AsyncFunction makes sense - you don’t want to try to abuse Flink’s parallelism. I’ve done it both ways, so either is an option. If you use an AsyncFunction, you get the benefit of

Re: InterruptedException when async function is cancelled

2018-04-17 Thread Ken Krugler
java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L358 > > <https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L358> > > > Am 21.03.18 um 23:29 schrieb Ken Krugler: >> Hi all, >>

Re: Multiple Async IO

2018-04-03 Thread Ken Krugler
Hi Maxim, If reducing latency is the goal, then option #1 seems better. Though you’d need additional logic inside of your AsyncFunction to run all 20 queries in parallel. I’d also consider a third option... Use a FlatMapFunction to create 20 copies of the event (assuming it’s not large),

ListCheckpointed function - what happens prior to restoreState() being called?

2018-03-19 Thread Ken Krugler
’t see this typically being done. Thanks, — Ken ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: ListCheckpointed function - what happens prior to restoreState() being called?

2018-03-21 Thread Ken Krugler
ache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L296 > > <https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L296> &g

InterruptedException when async function is cancelled

2018-03-21 Thread Ken Krugler
it? Thanks, — Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Error running on Hadoop 2.7

2018-03-22 Thread Ken Krugler
Hi Ashish, Are you using Flink 1.4? If so, what does the “hadoop classpath” command return from the command line where you’re trying to start the job? Asking because I’d run into issues with https://issues.apache.org/jira/browse/FLINK-7477 ,

Re: Which test cluster to use for checkpointing tests?

2018-03-02 Thread Ken Krugler
Hi Stephan, Thanks for the update. So is support for “running checkpoints with closed sources” part of FLIP-15 , or something separate? Regards, — Ken > On Mar 1, 2018, at 9:07 AM, Stephan Ewen

Re: data enrichment with SQL use case

2018-04-25 Thread Ken Krugler
ementation of this > approach, but there a some workarounds that help in certain cases. > > Note that Flink's SQL support does not add advantages for the either of both > approaches. You should use the DataStream API (and possible ProcessFunctions). > > I'd go for the first ap

Re: data enrichment with SQL use case

2018-04-25 Thread Ken Krugler
ill do record by record processing you have the option of controlling > the timing as needed for your use case. > > Michael > >> On Apr 25, 2018, at 1:57 PM, Ken Krugler <kkrugler_li...@transpac.com >> <mailto:kkrugler_li...@transpac.com>> wrote: >> >> Hi

Re: CsvInputFormat - read header line first

2018-10-31 Thread Ken Krugler
2 times. Is there any better way to do > this? Please suggest. > > -- > Thank you. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Join Dataset in stream

2018-11-15 Thread Ken Krugler
(mutable). I could do this on a batch job, but i will > have to triger it each time and the input are more like a slow stream, but > the computing need to be fast can i do this on a stream way? is there any > better solution ? > Thx -- Ken

Re: Writing to S3

2018-11-15 Thread Ken Krugler
he.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:296) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703) > at java.lang.Thread.run(Thread.java:748) -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: In-Memory Lookup in Flink Operators

2018-09-29 Thread Ken Krugler
>> state backend(RocksDB in my case). >> 2) Having a dedicated refresh thread for each subtask instance(possibly, >> every Task Manager having multiple refresh thread) >> >> Am i thinking in the right direction? Or missing something very obvious? It >> confusing. >> >> Any leads are much appreciated. Thanks in advance. >> >> Cheers, >> Chirag -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Regarding implementation of aggregate function using a ProcessFunction

2018-09-29 Thread Ken Krugler
>> RuntimeContext that allows you to access the state API). >> >> Thanks, vino. >> >> Gaurav Luthra > <mailto:gauravluthra6...@gmail.com>> 于2018年9月28日周五 下午1:38写道: >> Hi, >> >> As we are aware, Currently we cannot use RichAggregateFunction in >> aggregate() method upon windowed stream. So, To access the state in your >> customAggregateFunction, you can implement it using a ProcessFuntion. >> This issue is faced by many developers. >> So, someone must have implemented or tried to implement it. So, kindly share >> your feedback on this. >> As I need to implement this. >> >> Thanks & Regards >> Gaurav Luthra -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Errors in QueryableState sample code?

2018-09-19 Thread Ken Krugler
thing seemed OK, but this doesn’t feel like the right way to solve that problem :) 2. The call to response.get() returns a ValueState>, not the Tuple2 itself. So it seems like there’s a missing “.value()”. Regards, — Ken -- Ken Krugler +1 530-210-6378 http://www.scale

Re: Null Pointer Exception

2018-11-16 Thread Ken Krugler
roundReceive(AkkaProtocolTransport.scala:285) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at akka.dispatch.Mailbox.run(Mailbox.scala:224) > at akka.dispatch.M

Re: Issue with counter metrics for large number of keys

2019-01-16 Thread Ken Krugler
er of references of counter metrics you have heard > from anyone using metrics? > > Thanks & Regards > Gaurav Luthra > Mob:- +91-9901945206 > > > On Thu, Jan 17, 2019 at 9:04 AM Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > I think

Re: Question about key group / key state & parallelism

2018-12-12 Thread Ken Krugler
ean, are the state always managed by the same > instance, or does this depends on the available instance at the moment ? > > "During execution each parallel instance of a keyed operator works with the > keys for one or more Key Groups." > -> this is related, does &qu

Re: How many times Flink initialize an operator?

2018-12-11 Thread Ken Krugler
y element? > stream.map(new MapFunction[I,O]).addSink(discard) > > Hao Sun > Team Lead > 1019 Market St. 7F > San Francisco, CA 94103 -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Iterations and back pressure problem

2018-12-24 Thread Ken Krugler
Hi Sergey, As Andrey noted, it’s a known issue with (currently) no good solution. I talk a bit about how we worked around it on slide 26 of my Flink Forward talk <https://www.slideshare.net/FlinkForward/flink-forward-san-francisco-2018-ken-krugler-building-a-scalable-focused-web-craw

Re: S3A AWSS3IOException from Flink's BucketingSink to S3

2018-12-09 Thread Ken Krugler
does not restart and resume/recover >> from the last successful checkpoint. >> >> What is the cause for this and how can it be resolved? Also, how can the job >> be configured to restart/recover from the last successful checkpoint instead >> of staying in the FAILING state? > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: OutOfMemoryError while doing join operation in flink

2018-11-23 Thread Ken Krugler
(RecordWriter.java:131) > at > org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107) > at > org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65) > > > From the exception view in flink jo

Re: About the issue caused by flink's dependency jar package submission method

2018-11-20 Thread Ken Krugler
he.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:224) > ... 4 more > > is there anyone know about this? > > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ > <http://apache-flink-u

Re: windowAll and AggregateFunction

2019-01-09 Thread Ken Krugler
w operator gets a slice of the unique values. — Ken > On Wed, Jan 9, 2019, 10:22 PM Ken Krugler <mailto:kkrugler_li...@transpac.com> wrote: > Hi there, > > You should be able to use a regular time-based window(), and emit the > HyperLogLog binary data as your result, which the

Re: windowAll and AggregateFunction

2019-01-09 Thread Ken Krugler
e function). Instead, it sends all data to single instance and > > call add function there. > > > > Is here any way to make flink behave like this? I mean calculate partial > > results after consuming from kafka with paralelism of sources without > > shuffling(so

Re: Question regarding Streaming Resources

2018-09-12 Thread Ken Krugler
k internal task manager is having > some strategy to utilize them for other new streams that are coming? > Regards > Bhaskar -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com <http://www.scaleunlimited.com/> Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Test harness for validating proper checkpointing of custom SourceFunction

2018-09-12 Thread Ken Krugler
). What’s the recommended approach for this? We can of course fire up the workflow with checkpointing, and add some additional logic that kills the job after a checkpoint has happened, etc. But it seems like there should be a better way. Thanks, — Ken -- Ken Krugler +1 530-210

Re: Question regarding Streaming Resources

2018-09-12 Thread Ken Krugler
Hi Bhaskar, > On 2018/09/12 20:42:22, Ken Krugler wrote: >> Hi Bhaskar, >> >> I assume you don’t have 1000 streams, but rather one (keyed) stream with >> 1000 different key values, yes? >> >> If so, then this one stream is physically partitioned based

Re: Best pattern for achieving stream enrichment (side-input) from a large static source

2019-01-26 Thread Ken Krugler
simple chart for convenience, > > Thanks you very much, > > Nimrod. > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Best pattern for achieving stream enrichment (side-input) from a large static source

2019-01-26 Thread Ken Krugler
r issue with a UnionedSources <https://github.com/ScaleUnlimited/flink-streaming-kmeans/blob/master/src/main/java/com/scaleunlimited/flinksources/UnionedSources.java> source function, but I haven’t validated that it handles checkpointing correctly. — Ken > > And than

Re: How to load Avro file in a Dataset

2019-01-27 Thread Ken Krugler
My > question is can Flink detect the Avro file schema automatically? How can I > load Avro file without any predefined class? ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Use different functions for different signal values

2019-04-02 Thread Ken Krugler
uld use for each > signal set a special function. E.g. Signal1, Signal2 ==> function1, Signal3, > Signal4 ==> function2. > What is the recommended way to implement this pattern? > > Thanks! > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimite

Re: Partitioning key range

2019-04-08 Thread Ken Krugler
00 and 1001, only one partition receives all of the >> upstream data in ksA. >> Is there any way to get information about key ranges for each downstream >> partitions? >> Or is there any way to overcome this issue? >> We can assume that I know all possible keys (in this case

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-03-01 Thread Ken Krugler
roblem does not seem to be new, but I was unable to find any practical > solution in the documentation. > > Best regards, > Arnaud > > > > > > > L'intégrité de ce message n'étant pas assurée sur internet, la société > expéditrice ne peut être tenue resp

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-03-04 Thread Ken Krugler
INZ, Arnaud wrote: > > Hi, > > My source checkpoint is actually the file list. But it's not trivially small > as I may have hundreds of thousand of files, with long filenames. > My sink checkpoint is a smaller hdfs file list with current size. > > Message d'origine

Re: Set partition number of Flink DataSet

2019-03-14 Thread Ken Krugler
to reducer the number of parallel sinks, and > may also try sortPartition() so each sink could write files one by one. > Looking forward to your solution. :) > > Thanks, > Qi > >> On Mar 14, 2019, at 2:54 AM, Ken Krugler > <mailto:kkrugler_li...@transpac.com>> wrote:

Re: Set partition number of Flink DataSet

2019-03-13 Thread Ken Krugler
t of all possible bucket values. I’m actually dealing with something similar now, so I might have a solution to share soon. — Ken > I will check Blink and give it a try anyway. > > Thank you, > Qi > >> On Mar 12, 2019, at 11:58 PM, Ken Krugler > <mailto:kkrugler_l

Re: Batch jobs stalling after initial progress

2019-03-13 Thread Ken Krugler
he task earlier in the DAG from the parquet output task shows the back > pressure status as "OK", the one earlier is shown with back pressure status > "High" > > Are there any specific logs I should enable to get more information on this? > Has anyone else seen

Re: Async Function Not Generating Backpressure

2019-03-19 Thread Ken Krugler
;> >> Job 1 - without async function in front of Cassandra >> Job 2 - with async function in front of Cassandra >> >> Job 1 is backpressured because Cassandra cannot handle all the writes and >> eventually slows down the source rate to 6.5k/s. >> Job 2

Re: Async Function Not Generating Backpressure

2019-03-19 Thread Ken Krugler
2 is slightly backpressured but was able to run at 14k/s. > > Is the AsyncFunction somehow not reporting the backpressure correctly? > > Thanks, > Seed -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Set partition number of Flink DataSet

2019-03-11 Thread Ken Krugler
her way we can achieve similar result? Thank you! > > Qi -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Set partition number of Flink DataSet

2019-03-12 Thread Ken Krugler
ve to use > setParallelism() to set the output partition/file number, but when the > partition number is too large (~100K), the parallelism would be too high. Is > there any other way to achieve this? > > Thanks, > Qi > >> On Mar 11, 2019, at 11:22 PM, Ken Krugler > <

  1   2   >