Re: Need help in understanding PojoSerializer

2024-03-20 Thread Ken Krugler
ese fields and even when I have disabled > generics types. > why I am getting message that it will be processed as GenericType? > > Any help in understanding what I need to do to ensure all the fields of my > object are handled using PojoSerializer. > > Thanks > Sachin > > -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink & Pinot

Re: Completeablefuture in a flat map operator

2024-02-19 Thread Ken Krugler
> sleep or wait for the future) it works. > Can anyone explain what going on behind the screen and if possible any hints > for a working solution. > > Thanks in advance > > Med venlig hilsen / Best regards > Lasse Nedergaard > -- Ken Krugle

Re: Conditional multi collect in flink

2023-12-04 Thread Ken Krugler
a different flatMap and then convert and push using collector2 > > Thanks, > Tauseef -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink & Pinot

Re: Question regarding asyncIO timeout

2023-09-05 Thread Ken Krugler
are transient and we were > expecting a retry will fix the issue. > We tried using the asyncIO retry strategy but it doesn't seem to help much. > `AsyncDataStream.orderedWaitWithRetry` > > Do you have any suggestions on how to better reduce these timeout errors? > ---

Re: In Flink, is there a way to merge two streams in stateful manner

2023-08-29 Thread Ken Krugler
Forget the link: Hybrid Source[1] >> >> [1] >> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/datastream/hybridsource/ >> >> Hang Ruan mailto:ruanhang1...@gmail.com>> >> 于2023年8月11日周五 10:14写道: >>> Hi, Muazim. >>> >

Re: In Flink, is there a way to merge two streams in stateful manner

2023-08-10 Thread Ken Krugler
anks and regards > > On Thu, 10 Aug, 2023, 10:59 pm Ken Krugler, <mailto:kkrugler_li...@transpac.com>> wrote: >> Hi Muazim, >> >> In Flink, a stream of data (unless bounded) is assumed to never end. >> >> So in your example below, this means stream 2

Re: In Flink, is there a way to merge two streams in stateful manner

2023-08-10 Thread Ken Krugler
g data of stream 1 should be emitted before stream2. > I tried to store the data in ListState in KeyedProcessFunction but I am not > able to access state outside proccessElement(). > Is there any way I could achieve this? > Thanks and regards > -- Ken Krug

Re: Using HybridSource

2023-07-06 Thread Ken Krugler
Váry <mailto:peter.vary.apa...@gmail.com>> wrote: > Was it a conscious decision that HybridSource only accept Sources, and does > not allow mapping functions applied to them before combining them? > > On Tue, Jul 4, 2023, 23:53 Ken Krugler <mailto:kkrugler_li...@transpac.c

Re: Using HybridSource

2023-07-04 Thread Ken Krugler
of different nature. The > KafkaSource returns a protobuf event while the CSV is a POJO with just 3 > fields. > > We could hack the kafkasource implementation and then in the > valuedeserializer do the mapping from protobuf to the CSV POJO but that seems > rather hackish. Is there a way more elegant to unify both datatypes from both > sources using Hybrid Source? > > thanks > Oscar -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Flink bulk and record file source format metrices

2023-06-16 Thread Ken Krugler
I as below for parquet records > reading. > > FileSource.FileSourceBuilder source = > FileSource.forRecordStreamFormat(streamformat, path); > source.monitorContinuously(Duration.ofMillis(1)); > > Want to log/generate metrices for corrupt records and

Re: Watermark idleness and alignment - are they exclusive?

2023-06-15 Thread Ken Krugler
uld > I find more information on this? > > -- > Piotr Domagalski -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: A WordCount job using DataStream API but behave like the batch WordCount example

2023-04-30 Thread Ken Krugler
atMap(B -> some Tuple2.of(C, 1)) > .keyBy(t, t.f0) // a.k.a. C > .sum(1) > .map(Tuple2.of(C, ) -> d) > ; > > So just illustrative, and I am not writing a WordCount job either. > > - Luke > > > On Sat, Apr 29, 2023 at 10:31 PM Ken Krugler <mailto:k

Re: A WordCount job using DataStream API but behave like the batch WordCount example

2023-04-29 Thread Ken Krugler
I can get expected result with > .collect(Collectors.groupingByConcurrent(Function.identity(), > Collectors.counting())) > with the same stream using the Java Stream API. > Is there any other reason that causes it, and what should I do to get a > stream with only one element per key? > -lxiong > >

Re: Handling JSON Serialization without Kryo

2023-03-21 Thread Ken Krugler
Kryo with the existing job > leveraging JsonObjects (via Gson) is horrific (~1-2 records/second) and can't > keep up with the speed of the producers, which is the impetus behind > reevaluating the serialization. > > I'll explore this a bit more. > > Thanks, > >

Re: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-04 Thread Ken Krugler
e we don't have full control over the producers & input topics. > > Regarding the addition of a more flexible way to take advantage of > pre-partitioned sources like in FLIP-186, would you suggest I forward this > chain over to the dev Flink mailing list? > > Thanks, > Tomm

Re: Avoiding data shuffling when reading pre-partitioned data from Kafka

2023-03-04 Thread Ken Krugler
; straightforward addition, so it may be best handled by someone with more > internal knowledge. > > Thanks, > Tommy -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

consume_receipts.py and user_cart_event.py

2023-01-09 Thread Ken Krugler
Hi Gordon, I seem to remember you talking about these helper functions, to poll and write to Kinesis, as part of your StateFun shopping cart demo. But I didn’t see them anywhere…was I imagining things? Thanks, — Ken -- Ken Krugler http://www.scaleunlimited.com Custom

Using filesystem plugin with MiniCluster

2023-01-03 Thread Ken Krugler
e’s a way to specify the PluginManager for the MiniClusterConfiguration, but I don’t see that in Flink 1.15.x So is there a workaround to allow me to run a test from inside of my IDE, using the MiniCluster, that reads from S3? Thanks, — Ken ------ Ken Kru

Re: Slow restart from savepoint with large broadcast state when increasing parallelism

2022-12-16 Thread Ken Krugler
help to confirm whether this is the case. > > In addition, in those TMs where the restarting was slow, do you see anything > suspicious in the logs, e.g., reconnecting? > > Thanks > Jun > > > > > 发自我的手机 > > > 原始邮件 > 发件人: Ken Kru

Slow restart from savepoint with large broadcast state when increasing parallelism

2022-12-14 Thread Ken Krugler
-- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Registering serializer for RowData

2022-12-12 Thread Ken Krugler
For example, PojoTypeInfo or RowTypeInfo? > > Best regards, > Yuxia > > 发件人: "Ken Krugler" > 收件人: "User" > 发送时间: 星期三, 2022年 12 月 07日 上午 9:11:17 > 主题: Registering serializer for RowData > > Hi there, > > I’m using the Hudi sink to writ

Registering serializer for RowData

2022-12-06 Thread Ken Krugler
) … So I’m wondering if the Flink table code configures this serializer, and I need to do the same in my Java API-based workflow. Thanks, — Ken PS - This is with Flink 1.15.1 and Hudi 0.12.0 -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot

Re: Best Practice for Querying Flink State

2022-08-29 Thread Ken Krugler
list. > 3. Flink State API. This requires additional development. > > I am wondering what are some best practices applied in production. For me, I > really hope there is one product that 1. let me query the flink state using > SQL 2. decouple with flink job > &g

How to include path information in data extracted from text files with FileSource

2022-08-15 Thread Ken Krugler
really ugly hacks to TextLineFormat, where it would reverse engineer the FSDataInputStream to try to find information about the original file, but feels very fragile. Any suggestions? Thanks! — Ken ------ Ken Krugler http://www.scaleunlimited.com Custom big data solutions

Re: Source vs SourceFunction and testing

2022-05-24 Thread Ken Krugler
/apache/flink-training/blob/master/common/src/test/java/org/apache/flink/training/exercises/testing/ParallelTestSource.java#L26 > > <https://github.com/apache/flink-training/blob/master/common/src/test/java/org/apache/flink/training/exercises/testing/ParallelTestSource.java#L26>

Re: Source vs SourceFunction and testing

2022-05-24 Thread Ken Krugler
xercises/ridecleansing/RideCleansingIntegrationTest.java#L61> > > -- > Piotr Domagalski -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Setting boundedness for legacy Hadoop sequence file sources

2022-05-03 Thread Ken Krugler
, — Ken -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Controlling group partitioning with DataStream

2022-03-18 Thread Ken Krugler
Scheduler-Futureimprovements > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-187%3A+Adaptive+Batch+Job+Scheduler#FLIP187:AdaptiveBatchJobScheduler-Futureimprovements> > > Best, > Guowei > > > On Wed, Mar 9, 2022 at 8:44 AM Ken Krugler <mailto:kkrugl

Correct way to cleanly shut down StateFun Harness in test code

2022-03-17 Thread Ken Krugler
2.2 -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Controlling group partitioning with DataStream

2022-03-08 Thread Ken Krugler
et-distributed-to-each-task-slots-separately-acc> > I think then you could technically create your partitioner - though little > bit cubersome - by mapping your existing keys to new keys who will have then > an output to the desired > group & slot. > > Hope this may help,

Controlling group partitioning with DataStream

2022-03-04 Thread Ken Krugler
question to https://stackoverflow.com/questions/71357833/equivalent-of-dataset-groupby-withpartitioner-for-datastream <https://stackoverflow.com/questions/71357833/equivalent-of-dataset-groupby-withpartitioner-for-datastream> Thanks, — Ken ------ Ken Krugle

Exception thrown during batch job execution on YARN even though job succeeded

2021-09-30 Thread Ken Krugler
blame didn’t show most lines as being modified by “Rufus Refactor”…sigh) ------ Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Running Flink Dataset jobs Sequentially

2021-07-14 Thread Ken Krugler
you said write a > custom mapPartition that writes to files, did you actually write the file > inside the mapPartition function itself? So the Flink DAG ends at > mapPartition? Did you notice any performance issues as a result of this? > > Thanks again, > Jason > > On

Re: Running Flink Dataset jobs Sequentially

2021-07-09 Thread Ken Krugler
there a way to tell Flink > to run it sequentially? I tried moving the execution environment inside the > loop but it seems like it only runs the job on the first directory. I'm > running this on AWS Kinesis Data Analytics, so it's a bit hard for me to > submit new jobs.

Re: Memory Usage - Total Memory Usage on UI and Metric

2021-07-02 Thread Ken Krugler
g this with 3 yarn containers(2 tasks in each > container), total parallelism as 6. As soon as one container fails with this > error, the job re-starts. However, within minutes other 2 containers also > fail with the same error one by one. > > Thanks, > Hemant -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Error with extracted type from custom partitioner key

2021-06-12 Thread Ken Krugler
xample how you would > like to cann partitionCustom()? > > Regards, > Timo > > On 04.06.21 15:38, Ken Krugler wrote: >> Hi all, >> I'm using Flink 1.12 and a custom partitioner/partitioning key (batch mode, >> with a DataSet) to do a better job of distributing data to ta

Error with extracted type from custom partitioner key

2021-06-04 Thread Ken Krugler
, and have Flink treat it as a non-POJO? This seemed to be working fine. • Is it a bug in Flink that the extracted field from the key is being used as the expected type for partitioning? Thanks! — Ken -- Ken Krugler http://www.scaleunlimited.com Custom big data

Re: Flink auto-scaling feature and documentation suggestions

2021-05-05 Thread Ken Krugler
wastage of resources, even with operator chaining in > place. > > That's why I think more toggles are needed to make current auto-scaling > truly shine. > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabbl

Re: SocketException: Too many open files

2020-09-25 Thread Ken Krugler
nment. > > I have updated the limits.conf and also set the value of file-max > (fs.file-max = 2097152) on the master node as well as on all worker nodes > and still getting the same issue. > > Thanks > Sateesh > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Speeding up CoGroup in batch job

2020-09-17 Thread Ken Krugler
h SSD, so that we > can spread the load across all SSDs. > - If you are saying the CPU load is 40% on a TM, we have to assume we are IO > bound: Is it the network or the disk(s)? > > I hope this is some helpful inspiration for improving the performance. > > > On F

Re: Use of slot sharing groups causing workflow to hang

2020-09-09 Thread Ken Krugler
w to a specific slot sharing > group, it may require more slots to run the workflow than before. > Could you share logs of the ResourceManager and SlotManager, I think > there are more clues in it. > > Best, > Yangze Guo > > On Thu, Sep 3, 2020 at 4:39 AM Ken Krugler <

Re: Should the StreamingFileSink mark the files "finished" when all bounded input sources are depleted?

2020-09-07 Thread Ken Krugler
der know and delete the message > immediately. > - -- Ken Krugler http://www.scaleunlimited.com <http://www.scaleunlimited.com/> custom big data solutions & training Hadoop, Cascading, Cassandra & Solr -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Speeding up CoGroup in batch job

2020-09-04 Thread Ken Krugler
for improving the performance of a CoGroup? Thanks! — Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Bug: Kafka producer always writes to partition 0, because KafkaSerializationSchemaWrapper does not call open() method of FlinkKafkaPartitioner

2020-09-03 Thread Ken Krugler
afkaSerializationSchemaWrapper#serialize is later called by >> the FlinkKafkaProducer, FlinkFiexedPartitioner#partition would always >> return 0, because parallelInstanceId is not properly initialized. >> >> >> Eventually, all of the data would go only to partition 0 of the given Kafka >> topic, creating severe data skew in the sink. >> >> >> -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Use of slot sharing groups causing workflow to hang

2020-09-02 Thread Ken Krugler
I see that issue is still open, so wondering what Til and Konstantin have to say about it. ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Change in sub-task id assignment from 1.9 to 1.10?

2020-08-06 Thread Ken Krugler
post 1.8 Thanks, — Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Issue with single job yarn flink cluster HA

2020-08-05 Thread Ken Krugler
played. And this stays the > same even after 30 minutes. If I leave the job without yarn kill, it stays > the same forever. > Based on your suggestions till now, I guess it might be some zookeeper > problem. If that is the case, what can I lookout for in zookeeper to figure > ou

Re: Flink Logging on EMR

2020-07-02 Thread Ken Krugler
rn-site.xml > > >yarn.log-aggregation-enable >true > > > It might be the case that i can only see the logs through yarn once the > application completes/finishes/fails > > Thanks > Sateesh > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Simple stateful polling source

2020-06-07 Thread Ken Krugler
>>> > } >>> > >>> > @Override >>> > public void run(SourceContext ctx) throws Exception { >>> >final Object lock = ctx.getCheckpointLock(); >>> >Client httpClient = getHttpClient(); >>> >try { >>> > pollingThread = new MyPollingThread.Builder(baseUrl, >>> > httpClient)// >>> > .setStartDate(startDateStr, datePatternStr)// >>> > .build(); >>> > // start the polling thread >>> > new Thread(pr).start(); >>> > (etc) >>> > } >>> > >>> > Is this the correct approach or did I misunderstood how stateful >>> > source functions work? >>> > >>> > Best, >>> > Flavio >>> >>> >>> >> >> > -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Flink restart strategy on specific exception

2020-05-12 Thread Ken Krugler
gt; > > > > Your Personal Data: We may collect and process information about you that may > be subject to data protection laws. For more information about how we use and > disclose your personal data, how we protect your information, our legal basis > to use your information, you

Re: Flink restart strategy on specific exception

2020-05-12 Thread Ken Krugler
; > > > Your Personal Data: We may collect and process information about you that may > be subject to data protection laws. For more information about how we use and > disclose your personal data, how we protect your information, our legal basis > to use your information, your r

Re: Restore from savepoint with Iterations

2020-05-04 Thread Ken Krugler
e not light (around half billion every day) :) > >> On May 4, 2020, at 10:13 PM, Ken Krugler > <mailto:kkrugler_li...@transpac.com>> wrote: >> >> Hi Ashish, >> >> Wondering if you’re running into the gridlock problem I mention on slide #25 >> h

Re: Restore from savepoint with Iterations

2020-05-04 Thread Ken Krugler
Hi Ashish, Wondering if you’re running into the gridlock problem I mention on slide #25 here: https://www.slideshare.net/FlinkForward/flink-forward-san-francisco-2018-ken-krugler-building-a-scalable-focused-web-crawler-with-flink <https://www.slideshare.net/FlinkForward/flink-forward-

Status of FLINK-12692 (Support disk spilling in HeapKeyedStateBackend)

2020-01-29 Thread Ken Krugler
Hi Yu Li, It looks like this stalled out a bit, from May of last year, and won’t make it into 1.10. I’m wondering if there’s a version in Blink (as a completely separate state backend?) that could be tried out? Thanks, — Ken -- Ken Krugler http

Re: StreamingFileSink doesn't close multipart uploads to s3?

2020-01-10 Thread Ken Krugler
1/part-0-1" > >> }, > >> { > >> "Initiated": "2019-12-06T21:04:15.000Z", > >> "Key": "2019-12-06--21/part-0-2" > >> }, > >> { > >>

How to assign a UID to a KeyedStream?

2020-01-09 Thread Ken Krugler
ollowing the keyBy(), as a KeyedStream creates the PartitionTransformation without a UID. Any insight into setting the UID properly here? Or should StreamGraphGenerator.transform() skip the no-uid check for PartitionTransformation, since that’s not an operator with state? Thanks, — Ken -

How to assign a UID to a KeyedStream?

2020-01-09 Thread Ken Krugler
put is a PartitionTransformation. I don’t see a way to set the UID following the keyBy(), as a KeyedStream creates the PartitionTransformation without a UID. Any insight into setting the UID properly here? Or should StreamGraphGenerator.transform() skip the no-uid check for PartitionT

Re: Batch mode with Flink 1.8 unstable?

2019-09-18 Thread Ken Krugler
y it is > not an advertised feature (it only works for streaming so far). > > The goal is that this works in the 1.9 release (aka the batch fixup release) > > (3) Hang in Processing > > I think a thread dump (jstack) from the TMs would be helpful to diagnose that. > There are kn

Potential block size issue with S3 binary files

2019-08-28 Thread Ken Krugler
it applies to all output files in the job. — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Best way to compute the difference between 2 datasets

2019-07-22 Thread Ken Krugler
tly in the > local environment? If that is the case, is there any other alternative > environment recommended for development in a single machine, where I won't be > experiencing these issues with those operations? Should I expect the function > `minussWithSortPartition` above to run

Re: Disk full problem faced due to the Flink tmp directory contents

2019-07-10 Thread Ken Krugler
it stored to the aforementioned directory? > Why does the respective files have such an enormous size? > How can we limit the size of the data written to the respective directory? > Is there any way to delete such files automatically when not needed yet? > > Thanks in advanc

Re: Job tasks are not balance among taskmanagers

2019-07-02 Thread Ken Krugler
nt taskManagers, just as > Example 2 below: > > > But it came out that all 3 tasks are all located at the same taskmanager. > <3503f...@0bbe000a.a3d21a5d.jpg> > > <2608f...@99c54575.a3d21a5d.jpg> > > Why? -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Batch mode with Flink 1.8 unstable?

2019-07-01 Thread Ken Krugler
tForChannel(PartitionRequestClientFactory.java:196) > > Where the TM it’s trying to connect to is the one that was released and > hasn’t been restarted yet. > > 3. Hang in processing > > Sometimes it finishes the long-running (10 hour) operator, and then the two > downstream operators get stuck (these have a different parallelism, so > there’s a rebalance) > > In the most recent example of this, they processed about 20% of the data > emitted by the long running operator. There are no errors in any of the logs. > The last real activity in the jobmanager.log shows that all of the downstream > operators were deployed... > > 2019-06-22 14:58:36,648 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph- CHAIN Map > (Packed features) -> Map (Key Extractor) (7/32) > (4a13a1d471c0ed5c2d9e66d2e4a98fd9) switched from DEPLOYING to RUNNING. > > Then nothing anywhere, until this msg starts appearing in the log file every > 5 seconds or so… > > 2019-06-22 22:56:11,303 INFO > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Updating with > new AMRMToken > > > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Random errors reading binary files in batch workflow

2019-07-01 Thread Ken Krugler
-- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Batch mode with Flink 1.8 unstable?

2019-07-01 Thread Ken Krugler
is > not an advertised feature (it only works for streaming so far). > > The goal is that this works in the 1.9 release (aka the batch fixup release) > > (3) Hang in Processing > > I think a thread dump (jstack) from the TMs would be helpful to diagnose that. > There are

Batch mode with Flink 1.8 unstable?

2019-06-23 Thread Ken Krugler
Hi all, I’ve been running a somewhat complex batch job (in EMR/YARN) with Flink 1.8.0, and it regularly fails, but for varying reasons. Has anyone else had stability with 1.8.0 in batch mode and non-trivial workflows? Thanks, — Ken 1. TimeoutException getting input splits The batch job

Re: Use Partitioner to forward messages to subtask by index

2019-06-21 Thread Ken Krugler
a message to > partition 1 would send it to subtask 0. > > Thanks, > Josh -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Unable to set S3 like object storage for state backend.

2019-06-20 Thread Ken Krugler
.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:234) > at > org.apache.flink.fs.s3.common.AbstractS3FileSystemFactory.create(AbstractS3FileSystemFactory.java:124) > ... 23 more > > > It seems like the way I setup the state backed causes this exception ie. > > env.setStateBackend(new > FsStateBackend("s3://aip_featuretoolkit/checkpoints/")) > > How can I resolve this issue, are S3 like object stores supported by 1.7.2 ? > > Thanks, > Vishwas -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: privacy preserving on streaming Kmeans in Flink

2019-06-14 Thread Ken Krugler
a library for K means in Flink? Because I didn't see it on > FlinkML? > >How we can I implement streaming kmeans clustering on Flink? > > If anybody know , can you please explain some of the key steps that I have > to follow?* -- Ken Krugler +1 530-2

Re: StreamingFileSink in version 1.8

2019-06-11 Thread Ken Krugler
.StreamTask.invoke(StreamTask.java:289) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) > at java.lang.Thread.run(Thread.java:748) -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Troubleshooting java.io.IOException: Connecting the channel failed: Connecting to remote task manager has failed. This might indicate that the remote task manager has been lost.

2019-06-10 Thread Ken Krugler
Unsafe$1.run(AbstractNioChannel.java:267) > ... 7 more > > It happens for all task manager. What seems to be the problem here and how > can I troubleshoot it? > > The task managers all seem active and show up on the web ui. > > Thanks, > Harshith -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Getting java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError when stopping/canceling job.

2019-05-16 Thread Ken Krugler
executed in vertx.io > <http://vertx.io/> code after the local task has been stopped and the class > loader for the user code has been unloaded. Maybe from some daemon thread > pool. > > Best, > Andrey > > > On Wed, May 15, 2019 at 4:58 PM John Smith <mailto:

Re: Error While Initializing S3A FileSystem

2019-05-15 Thread Ken Krugler
-Djavax.net >> <http://djavax.net/>.ssl.trustStore=/etc/ssl/java-certs/cacerts -XX:+UseG1GC >> -Xms5530M -Xmx5530M -XX:MaxDirectMemorySize=8388607T >> -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties >> -Dlogback.configurationFile=file:/opt/flink/conf/logb

Re: Error While Initializing S3A FileSystem

2019-05-15 Thread Ken Krugler
ache.flink.runtime.taskexecutor.TaskManagerRunner --configDir > /opt/flink/conf > > and i see that `FluentPropertyBeanIntrospector` is contained within the > following two jars: > > flink-s3-fs-hadoop-1.7.2.jar:org/apache/flink/fs/shaded/hadoop3/org/apache/commons/beanutils/

Re: Write simple text into hdfs

2019-04-29 Thread Ken Krugler
t; thing ? > > Many thanks. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Partitioning key range

2019-04-08 Thread Ken Krugler
00 and 1001, only one partition receives all of the >> upstream data in ksA. >> Is there any way to get information about key ranges for each downstream >> partitions? >> Or is there any way to overcome this issue? >> We can assume that I know all possible keys (in this case

Re: Use different functions for different signal values

2019-04-02 Thread Ken Krugler
uld use for each > signal set a special function. E.g. Signal1, Signal2 ==> function1, Signal3, > Signal4 ==> function2. > What is the recommended way to implement this pattern? > > Thanks! > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimite

Re: Async Function Not Generating Backpressure

2019-03-21 Thread Ken Krugler
ine previous steps like Kafka fetching and > Cassandra IO but I am also not sure about this explanation. > > Best, > Andrey > > > On Tue, Mar 19, 2019 at 8:05 PM Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > Hi Seed, > > I was assuming the Cassa

Re: Async Function Not Generating Backpressure

2019-03-19 Thread Ken Krugler
;> >> Job 1 - without async function in front of Cassandra >> Job 2 - with async function in front of Cassandra >> >> Job 1 is backpressured because Cassandra cannot handle all the writes and >> eventually slows down the source rate to 6.5k/s. >> Job 2

Re: Async Function Not Generating Backpressure

2019-03-19 Thread Ken Krugler
2 is slightly backpressured but was able to run at 14k/s. > > Is the AsyncFunction somehow not reporting the backpressure correctly? > > Thanks, > Seed -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Set partition number of Flink DataSet

2019-03-14 Thread Ken Krugler
to reducer the number of parallel sinks, and > may also try sortPartition() so each sink could write files one by one. > Looking forward to your solution. :) > > Thanks, > Qi > >> On Mar 14, 2019, at 2:54 AM, Ken Krugler > <mailto:kkrugler_li...@transpac.com>> wrote:

Re: Batch jobs stalling after initial progress

2019-03-13 Thread Ken Krugler
he task earlier in the DAG from the parquet output task shows the back > pressure status as "OK", the one earlier is shown with back pressure status > "High" > > Are there any specific logs I should enable to get more information on this? > Has anyone else seen

Re: Set partition number of Flink DataSet

2019-03-13 Thread Ken Krugler
t of all possible bucket values. I’m actually dealing with something similar now, so I might have a solution to share soon. — Ken > I will check Blink and give it a try anyway. > > Thank you, > Qi > >> On Mar 12, 2019, at 11:58 PM, Ken Krugler > <mailto:kkrugler_l

Re: Set partition number of Flink DataSet

2019-03-12 Thread Ken Krugler
ve to use > setParallelism() to set the output partition/file number, but when the > partition number is too large (~100K), the parallelism would be too high. Is > there any other way to achieve this? > > Thanks, > Qi > >> On Mar 11, 2019, at 11:22 PM, Ken Krugler > <

Re: Side Output from AsyncFunction

2019-03-11 Thread Ken Krugler
be overcomplicated. > > Am I missing something? Any help/ideas are much appreciated! > > Cheers, > Mike Pryakhin > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Set partition number of Flink DataSet

2019-03-11 Thread Ken Krugler
her way we can achieve similar result? Thank you! > > Qi -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-03-04 Thread Ken Krugler
INZ, Arnaud wrote: > > Hi, > > My source checkpoint is actually the file list. But it's not trivially small > as I may have hundreds of thousand of files, with long filenames. > My sink checkpoint is a smaller hdfs file list with current size. > > Message d'origine

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-03-01 Thread Ken Krugler
roblem does not seem to be new, but I was unable to find any practical > solution in the documentation. > > Best regards, > Arnaud > > > > > > > L'intégrité de ce message n'étant pas assurée sur internet, la société > expéditrice ne peut être tenue resp

Re: Starting Flink cluster and running a job

2019-02-19 Thread Ken Krugler
"config file: " && grep '^[^\n#]' >> "$FLINK_HOME/conf/flink-conf.yaml" >> exec $(drop_privs_cmd) "$FLINK_HOME/bin/taskmanager.sh" >> start-foreground >> else >> sed -i -e "s/jobmanager.rpc.address: >> lo

Re: ElasticSearchSink - retrying doesn't work in ActionRequestFailureHandler

2019-02-13 Thread Ken Krugler
gt; Maybe Gordon meant 1.7.2 rc2? > > Thanks and regards, > Averell > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: ElasticSearchSink - retrying doesn't work in ActionRequestFailureHandler

2019-02-13 Thread Ken Krugler
Averell > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Per-workflow configurations for an S3-related property

2019-02-08 Thread Ken Krugler
Hi all, When running in EMR, we’re encountering the oh-so-common HTTP timeout that’s caused by the connection pool being too small (see below) I’d found one SO answer that said to bump fs.s3.maxConnections for the EMR S3 filesystem implementation.

Re: Regarding json/xml/csv file splitting

2019-02-04 Thread Ken Krugler
mp2 rec and reads emp3 and emp4 > # Operator3 ignores partial emp4 and reads emp5 > Record delimiter is used to skip partial record and identifying new record. > > > > > > -- > Thank you, > Madan. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: How to load Avro file in a Dataset

2019-01-27 Thread Ken Krugler
My > question is can Flink detect the Avro file schema automatically? How can I > load Avro file without any predefined class? ---------- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Best pattern for achieving stream enrichment (side-input) from a large static source

2019-01-26 Thread Ken Krugler
r issue with a UnionedSources <https://github.com/ScaleUnlimited/flink-streaming-kmeans/blob/master/src/main/java/com/scaleunlimited/flinksources/UnionedSources.java> source function, but I haven’t validated that it handles checkpointing correctly. — Ken > > And than

Re: Best pattern for achieving stream enrichment (side-input) from a large static source

2019-01-26 Thread Ken Krugler
simple chart for convenience, > > Thanks you very much, > > Nimrod. > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Issue with counter metrics for large number of keys

2019-01-16 Thread Ken Krugler
er of references of counter metrics you have heard > from anyone using metrics? > > Thanks & Regards > Gaurav Luthra > Mob:- +91-9901945206 > > > On Thu, Jan 17, 2019 at 9:04 AM Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > I think

Re: windowAll and AggregateFunction

2019-01-09 Thread Ken Krugler
w operator gets a slice of the unique values. — Ken > On Wed, Jan 9, 2019, 10:22 PM Ken Krugler <mailto:kkrugler_li...@transpac.com> wrote: > Hi there, > > You should be able to use a regular time-based window(), and emit the > HyperLogLog binary data as your result, which the

Re: windowAll and AggregateFunction

2019-01-09 Thread Ken Krugler
e function). Instead, it sends all data to single instance and > > call add function there. > > > > Is here any way to make flink behave like this? I mean calculate partial > > results after consuming from kafka with paralelism of sources without > > shuffling(so

Re: Iterations and back pressure problem

2018-12-24 Thread Ken Krugler
Hi Sergey, As Andrey noted, it’s a known issue with (currently) no good solution. I talk a bit about how we worked around it on slide 26 of my Flink Forward talk <https://www.slideshare.net/FlinkForward/flink-forward-san-francisco-2018-ken-krugler-building-a-scalable-focused-web-craw

  1   2   >