Need Help/Code Examples with reading/writing Parquet File with Flink ?

2018-04-17 Thread sohimankotia
Hi, I have a file in HDFS in the format file.snappy.parquet. Can someone please point me to, or help with, a code example for reading Parquet files? -Sohi
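
One possible approach, sketched below, is to read the file through Parquet's Avro input format wrapped in Flink's Hadoop compatibility layer. This is only a minimal sketch: it assumes the flink-hadoop-compatibility and parquet-avro dependencies are on the classpath, and the HDFS path is hypothetical.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.parquet.avro.AvroParquetInputFormat;

    public class ReadParquetSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            Job job = Job.getInstance();
            // Hypothetical path; Snappy decompression is handled transparently by Parquet.
            FileInputFormat.addInputPath(job, new Path("hdfs:///data/file.snappy.parquet"));

            // Wrap Parquet's Avro-based Hadoop input format so Flink can read it as a DataSet.
            HadoopInputFormat<Void, GenericRecord> parquetInput = new HadoopInputFormat<>(
                    new AvroParquetInputFormat<GenericRecord>(), Void.class, GenericRecord.class, job);

            DataSet<Tuple2<Void, GenericRecord>> records = env.createInput(parquetInput);
            records.first(10).print();
        }
    }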

Re: Unsure how to further debug - operator threads stuck on java.lang.Thread.State: WAITING

2018-04-17 Thread Miguel Coimbra
Hello James, Thanks for the information. I noticed something suspicious as well: I have chains of operators where the first operator will ingest the expected number of records but will not emit any, leaving the following operator empty in a "RUNNING" state. For example: I will get back if I

Re: assign time attribute after first window group when using Flink SQL

2018-04-17 Thread Ivan Wang
Thanks Fabian. I tried to use "rowtime" and Flink throws the exception below: Exception in thread "main" org.apache.flink.table.api.ValidationException: SlidingGroupWindow('w2, 'end, 150.rows, 1.rows) is invalid: Event-time grouping windows on row intervals in a stream environment are currently

Re: Unsure how to further debug - operator threads stuck on java.lang.Thread.State: WAITING

2018-04-17 Thread James Yu
Miguel, my colleague and I ran into the same problem yesterday. We were expecting Flink to get 4 inputs from Kafka and write the inputs to Cassandra, but the operators got stuck after the 1st input was written into Cassandra. This is what the DAG looks like: Source: Custom Source -> Map -> (Sink: Unnamed,

Re: Unsure how to further debug - operator threads stuck on java.lang.Thread.State: WAITING

2018-04-17 Thread Miguel Coimbra
Chesnay, following your suggestions I got access to the web interface and also took a closer look at the debugging logs. I have noticed one problem regarding the web interface port - it keeps changing now and then during my Java program's execution. Not sure if that is due to my program

Re: Flink/Kafka POC performance issue

2018-04-17 Thread TechnoMage
Also, I note that none of the operations show any back-pressure issues, and the records out of the Kafka connector slow down to a crawl. Are there any known issues with Kafka throughput that could be the issue rather than Flink? I have a Java program that monitors the test that reads all

Re: Flink/Kafka POC performance issue

2018-04-17 Thread TechnoMage
Also, I note some messages in the log about my Java class not being a valid POJO because it is missing accessors for a field. Would this impact performance significantly? Michael > On Apr 17, 2018, at 12:54 PM, TechnoMage wrote: > > No checkpoints are active. > I will
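
For context, a class that does not qualify as a POJO falls back to Flink's generic (Kryo) serializer, which is usually slower. Below is a minimal sketch of what the POJO rules require (field names are hypothetical): a public class, a public no-argument constructor, and every field either public or exposed through a getter/setter pair.

    public class MyEvent {
        public String id;            // public field: fine as-is

        private long timestamp;      // private field: needs a getter and a setter

        public MyEvent() {}          // no-argument constructor is required

        public long getTimestamp() { return timestamp; }
        public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
    }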

Re: Flink/Kafka POC performance issue

2018-04-17 Thread TechnoMage
No checkpoints are active. I will try that backend. Yes, I am using a JSONObject subclass for most of the intermediate state, with JSON strings in and out of Kafka. I will look at the config page for how to enable that. Thank you, Michael > On Apr 17, 2018, at 12:51 PM, Stephan Ewen

Re: Flink/Kafka POC performance issue

2018-04-17 Thread Stephan Ewen
A few ideas how to start debugging this: - Try deactivating checkpoints. Without checkpoints, no work goes into persisting RocksDB data to the checkpoint store. - Try to swap RocksDB for the FsStateBackend - that reduces serialization cost for moving data between heap and off-heap (RocksDB). - Do
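
A minimal sketch of those two knobs in Java (the checkpoint directory and the toy pipeline are hypothetical):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BackendSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpointing is off unless enabled explicitly, so simply do not call
            // env.enableCheckpointing(...) while measuring raw throughput.

            // Heap-based FsStateBackend: state lives on the JVM heap, so there is no
            // serialization on every state access as with RocksDB.
            env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

            env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2))
               .keyBy(0)
               .sum(1)
               .print();

            env.execute("state backend sketch");
        }
    }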

Re: InterruptedException when async function is cancelled

2018-04-17 Thread Stephan Ewen
Agreed. It is fixed in 1.5 and in the 1.4.x branch. The fix came after 1.4.2, so it is not released as of now. On Tue, Apr 17, 2018 at 7:47 PM, Ken Krugler wrote: > Hi Timo, > > [Resending from an address the Apache list server likes…] > > I discussed this with Till

Re: CaseClassSerializer and/or TraversableSerializer may still not be threadsafe?

2018-04-17 Thread Stephan Ewen
Thanks for reporting this, also thanks for checking out that this works with RocksDB and also with synchronous checkpoints. I would assume that this issue lies not in the serializer itself, but in accidental sharing in the FsStateBackend async snapshots. Do you know if the issue still exists in

Re: InterruptedException when async function is cancelled

2018-04-17 Thread Ken Krugler
Hi Timo, [Resending from an address the Apache list server likes…] I discussed this with Till during Flink Forward, and he said it looks like the expected result when cancelling, as that will cause all operators to be interrupted, which in turn generates the stack trace I’m seeing. As to

CaseClassSerializer and/or TraversableSerializer may still not be threadsafe?

2018-04-17 Thread joshlemer
Hello all, I am running Flink 1.4.0 on Amazon EMR, and find that asynchronous snapshots fail when using the Filesystem back-end. Synchronous snapshots succeed, and RocksDB snapshots succeed (both async and sync), but async Filesystem snapshots fail with this error:

Re: Flink/Kafka POC performance issue

2018-04-17 Thread TechnoMage
Memory use is steady throughout the job, but the CPU utilization drops off a cliff. I assume this is because it becomes I/O bound shuffling managed state. Are there any metrics on managed state that can help in evaluating what to do next? Michael > On Apr 17, 2018, at 7:11 AM, Michael Latta

Flink job testing with

2018-04-17 Thread Chauvet, Thomas
Hi everybody, I would like to test a Kafka / Flink process in Scala. I would like to proceed as in the integration testing documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/testing.html#integration-testing with Kafka as source and sink. For example, I have
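
One lighter-weight pattern from that documentation page, sketched here in Java: swap the Kafka source and sink for an in-memory source and a static collecting sink for the test (class and value names are hypothetical):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class PipelineIT {

        // Collects results into a static list so the test can inspect them after execute().
        public static class CollectSink implements SinkFunction<String> {
            static final List<String> values = Collections.synchronizedList(new ArrayList<>());
            @Override
            public void invoke(String value) {
                values.add(value);
            }
        }

        @Test
        public void pipelineProducesExpectedOutput() throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(1);
            CollectSink.values.clear();

            env.fromElements("a", "b", "c")          // stands in for the Kafka source
               .map(new MapFunction<String, String>() {
                   @Override
                   public String map(String value) {
                       return value.toUpperCase();
                   }
               })
               .addSink(new CollectSink());          // stands in for the Kafka sink

            env.execute();
            assertEquals(Arrays.asList("A", "B", "C"), CollectSink.values);
        }
    }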

Re: Flink/Kafka POC performance issue

2018-04-17 Thread Michael Latta
Thanks for the suggestion. The task manager is configured for 8GB of heap, and gets to about 8.3 total. Other Java processes (job manager and Kafka) add a few more. I will check it again, but the instances have 16GB, the same as my laptop, which completes the test in <90 min. Michael Sent from my

Re: How to configure the reporting interval for the flink metric monitoring.

2018-04-17 Thread Chesnay Schepler
You can configure the interval by setting metrics.reporter.<name>.interval as described in the documentation. On 17.04.2018 13:40, Ganesh Manal wrote: HI, When I configure the counter ( in default system
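
For example, with a hypothetical reporter name "grph", the relevant flink-conf.yaml entries might look like this:

    # The reporter name "grph" is arbitrary; it just ties the keys together.
    metrics.reporter.grph.class: org.apache.flink.metrics.graphite.GraphiteReporter
    metrics.reporter.grph.host: localhost
    metrics.reporter.grph.port: 2003
    metrics.reporter.grph.interval: 10 SECONDS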

How to configure the reporting interval for the flink metric monitoring.

2018-04-17 Thread Ganesh Manal
Hi, when I configure a counter (in the default system metrics), I can see the counter being monitored on the reporting tool (Graphite in my case), but the default reporting interval is 60 seconds. Is there a way to configure the interval for metric reporting? Thanks & Regards, Ganesh Manal

Re: State-machine-based search logic in Flink ?

2018-04-17 Thread Fabian Hueske
Hi Esa, What do you mean by "individual searches in the Table API"? There is some work (a pending PR [1]) to integrate the MATCH_RECOGNIZE clause (SQL 2016) [2] into Flink's SQL which basically adds a SQL syntax for the CEP library. Best, Fabian [1] https://github.com/apache/flink/pull/4502 [2]
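
Until that lands, the CEP library can already express this kind of state-machine logic directly. A minimal, self-contained sketch in Java, with hypothetical event values ("login" followed at some point by "purchase"):

    import java.util.List;
    import java.util.Map;
    import org.apache.flink.cep.CEP;
    import org.apache.flink.cep.PatternSelectFunction;
    import org.apache.flink.cep.PatternStream;
    import org.apache.flink.cep.pattern.Pattern;
    import org.apache.flink.cep.pattern.conditions.SimpleCondition;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CepSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> events = env.fromElements("login", "browse", "purchase");

            // State machine: a "login" event followed (not necessarily immediately) by a "purchase".
            Pattern<String, String> pattern = Pattern.<String>begin("start")
                    .where(new SimpleCondition<String>() {
                        @Override
                        public boolean filter(String e) {
                            return e.equals("login");
                        }
                    })
                    .followedBy("end")
                    .where(new SimpleCondition<String>() {
                        @Override
                        public boolean filter(String e) {
                            return e.equals("purchase");
                        }
                    });

            PatternStream<String> matches = CEP.pattern(events, pattern);
            matches.select(new PatternSelectFunction<String, String>() {
                @Override
                public String select(Map<String, List<String>> match) {
                    return match.get("start").get(0) + " -> " + match.get("end").get(0);
                }
            }).print();

            env.execute("CEP sketch");
        }
    }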

State-machine-based search logic in Flink ?

2018-04-17 Thread Esa Heikkinen
Hi, I am not sure I have understood everything, but is it possible to build some kind of state-machine-based search logic, for example on top of the individual searches in the Table API (using CsvTableSource)? Best, Esa

rest.port is reset to 0 by YarnEntrypointUtils

2018-04-17 Thread Dongwon Kim
Hi, I'm trying to launch a dispatcher on top of YARN by executing "yarn-session.sh" on the command line. To access the REST endpoint from outside the cluster, I need to assign a port from an allowed range. YarnEntrypointUtils, however, sets rest.port to 0 for random binding. Is there any reason

Re: How to rebalance a table without converting to dataset

2018-04-17 Thread Shuyi Chen
Hi Darshan, thanks for raising the problem. We do have a similar use of rebalancing in Flink SQL, where we want to rebalance the Kafka input with more partitions to increase parallelism in streaming. As Fabian suggests, rebalancing is not relational algebra. The closest use of the operation I can

Re: Kafka topic partition skewness causes watermark not being emitted

2018-04-17 Thread Juho Autio
A possible workaround while waiting for FLINK-5479, if someone is hitting the same problem: we chose to send "heartbeat" messages periodically to all topics & partitions found on our Kafka. We do that through the service that normally writes to our Kafka. This way every partition always has some
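
A minimal sketch of such a heartbeat sender (broker address, topic name, and period are hypothetical; the consuming job must recognize and filter out the heartbeat records, and their timestamps need to be usable by the watermark extractor):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.PartitionInfo;

    public class HeartbeatSender {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    // One heartbeat per partition so that no partition stays idle
                    // and watermarks keep advancing for every consumer subtask.
                    for (PartitionInfo p : producer.partitionsFor("my-topic")) {
                        producer.send(new ProducerRecord<>(p.topic(), p.partition(), null, "heartbeat"));
                    }
                    producer.flush();
                    Thread.sleep(10_000L);
                }
            }
        }
    }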