Re: Small-files source - partitioning based on prefix of file

2018-08-09 Thread vino yang
Hi Averell, In this case, I think you may need to extend Flink's existing source. First, read your large tar.gz file; once it has been decompressed, use multiple threads to read the records in the source, and then parse the data format (map/flatMap might be a suitable operator; you can
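
A minimal sketch of the decompression step Vino describes, assuming Apache Commons Compress is on the classpath (the file path and the hand-off to parsing threads are illustrative, not his exact suggestion):

    import java.io.{BufferedInputStream, FileInputStream}
    import java.util.zip.GZIPInputStream
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
    import org.apache.commons.compress.utils.IOUtils

    val tarIn = new TarArchiveInputStream(
      new GZIPInputStream(new BufferedInputStream(
        new FileInputStream("/path/to/big-file.tar.gz"))))
    var entry = tarIn.getNextTarEntry
    while (entry != null) {
      if (!entry.isDirectory) {
        // the stream is positioned at the current entry; read its bytes and
        // hand them to a worker thread that gunzips and parses the small file
        val bytes = new Array[Byte](entry.getSize.toInt)
        IOUtils.readFully(tarIn, bytes)
      }
      entry = tarIn.getNextTarEntry
    }
    tarIn.close()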

Re: Standalone cluster instability

2018-08-09 Thread Shailesh Jain
Hi, I hit a similar issue yesterday: the task manager died suspiciously, with no error logs in the task manager logs, but I see the following exceptions in the job manager logs: 2018-08-05 18:03:28,322 ERROR akka.remote.Remoting - Association to

Re: Small-files source - partitioning based on prefix of file

2018-08-09 Thread Averell
Hi Fabian, Vino, I have one more question, for which I initially planned to create a new thread, but now I think it is better to ask here: I need to process one big tar.gz file which contains multiple small gz files. What is the best way to do this? I am thinking of having one single thread process

Re: Small-files source - partitioning based on prefix of file

2018-08-09 Thread Averell
Thank you Vino and Fabian for your help in answering my questions. As my files are small, I think there would not be much benefit in checkpointing file offset state. Thanks and regards, Averell -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Flink Rebalance

2018-08-09 Thread Paul Lam
Hi Antonio, AFAIK, there are two reasons for this: 1. Rebalancing itself adds latency because it takes time to redistribute the elements. 2. Rebalancing also messes up the order in the Kafka topic partitions, and often makes an event-time window wait longer to trigger in case you're using

Re: Flink Rebalance

2018-08-09 Thread antonio saldivar
Hello, I'm sending ~450 elements per second (the values are in milliseconds, start to end). I went from this with rebalance:

    +------------+
    | AVGWINDOW  |
    +------------+
    | 32131.0853 |
    +------------+

to this without rebalance:

    +------------+
    | AVGWINDOW  |
    +------------+

Re: Flink Rebalance

2018-08-09 Thread Elias Levy
What do you consider a lot of latency? The rebalance will require serializing / deserializing the data as it gets distributed. Depending on the complexity of your records and the efficiency of your serializers, that could have a significant impact on your performance. On Thu, Aug 9, 2018 at

Flink Rebalance

2018-08-09 Thread antonio saldivar
Hello, does anyone know why adding "rebalance()" to my .map steps adds a lot of latency compared to not having rebalance? I have 44 Kafka partitions in my topic and 44 Flink task managers. The execution plan looks like this when I add rebalance, but it is adding a lot of latency: kafka-src ->
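
For reference, a minimal sketch of the two pipelines being compared here (the source and map function are stand-ins, not Antonio's actual Kafka job):

    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(44)

    // stand-in for the 44-partition Kafka source
    val stream: DataStream[String] = env.socketTextStream("localhost", 9999)

    // with rebalance(): every record is redistributed round-robin, which adds
    // a serialization/network hop and breaks per-partition ordering
    val shuffled = stream.rebalance.map(_.toUpperCase)

    // without it, the map chains to the source and records stay in-process
    val chained = stream.map(_.toUpperCase)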

Re: Getting compilation error in Array[TypeInformation]

2018-08-09 Thread Mich Talebzadeh
Thanks, those suggestions helped. Dr Mich Talebzadeh

Dataset.distinct - Question on deterministic results

2018-08-09 Thread Will Bastian
I'm operating on a data set with some challenges to overcome. They are: 1. there may be multiple entries for a single key, and 2. for a single key, there may be multiple unique value-tuples. For example:

    key, val1, val2, val3
    1,   0,    0,    0
    1,   0,    0,    0
    1,
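
A small sketch of the two behaviours in question, with illustrative data modelled on the example above:

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements((1, 0, 0, 0), (1, 0, 0, 0), (1, 1, 2, 3))

    // distinct over the whole tuple is deterministic: the duplicate
    // (1,0,0,0) rows collapse and (1,1,2,3) survives
    data.distinct().print()

    // distinct on the key field alone keeps one arbitrary tuple per key, so
    // which of key 1's two unique value-tuples survives is not guaranteed
    data.distinct(0).print()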

Re: Getting compilation error in Array[TypeInformation]

2018-08-09 Thread Timo Walther
Hi Mich, I strongly recommend reading a good Scala programming tutorial before writing to a mailing list. As the error indicates, you are missing generic parameters. If you don't know the parameter, use `Array[TypeInformation[_]]` or `TableSink[_]`. For the types class you need to import the
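
A short sketch of the fix Timo points at, using the field names from the priceTable code below (the concrete field types are an assumption):

    import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}

    // an existential type parameter lets the array hold differently-typed fields
    val fieldTypes: Array[TypeInformation[_]] =
      Array(Types.STRING, Types.STRING, Types.STRING, Types.FLOAT)
    val fieldNames = Array("key", "ticker", "timeissued", "price")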

Getting compilation error in Array[TypeInformation]

2018-08-09 Thread Mich Talebzadeh
This is the code in Scala:

    val tableEnv = TableEnvironment.getTableEnvironment(streamExecEnv)
    tableEnv.registerDataStream("priceTable", splitStream, 'key, 'ticker, 'timeissued, 'price)
    val result = tableEnv.scan("priceTable").filter('ticker === "VOD" && 'price > 99.0).select('key,

Re: Table API, custom window

2018-08-09 Thread Fabian Hueske
Hi, regarding the plans: there are no plans to support custom window assigners and evictors. There were some thoughts about supporting different result-update strategies that could be used to return early results or update results in case of late data. However, these features are currently not

Re: Table API, custom window

2018-08-09 Thread Timo Walther
Hi Oleksandr, currently, we don't support custom windows for the Table API. The Table & SQL APIs try to solve the most common cases, but for more specific logic we recommend the DataStream API. Regards, Timo On 09.08.18 at 14:15, Oleksandr Nitavskyi wrote: Hello guys, I am curious, is there a
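
For completeness, a hedged sketch of the DataStream route Timo recommends, where the assigner/trigger/evictor slots are all pluggable (the input stream, its type, and the concrete trigger/evictor choices are assumptions, not Oleksandr's actual job):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
    import org.apache.flink.streaming.api.windowing.evictors.CountEvictor
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger

    val events: DataStream[(String, Long)] = ???  // assumed input stream

    val windowed = events
      .keyBy(_._1)
      .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
      .trigger(ContinuousEventTimeTrigger.of(Time.seconds(10)))  // custom trigger slot
      .evictor(CountEvictor.of(1000))                            // custom evictor slot
      .reduce((a, b) => (a._1, a._2 + b._2))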

Re: UTF-16 support for TextInputFormat

2018-08-09 Thread David Dreyfus
Hi Fabian, Thank you for taking my email. TextInputFormat.setCharsetName("UTF-16") appears to set the private variable TextInputFormat.charsetName. It doesn't appear to cause additional behavior that would help interpret UTF-16 data. The method I've tested is calling

2 node cluster reading file from ftp

2018-08-09 Thread Mohan mohan
Hi, Context: # Started cluster with 2 nodes: node1 (master & slave), node2 (slave). # Source path to CsvInputFormat is like "ftp://test:test@x.x.x.x:5678/source.csv" and DataSet parallelism is 2. # Path to FileOutputFormat is like "ftp://test:test@x.x.x.x:5678/target.csv" and

Table API, custom window

2018-08-09 Thread Oleksandr Nitavskyi
Hello guys, I am curious: is there a way to define custom windows (assigners/triggers/evictors) for the Table/SQL Flink API? The documentation seems to be silent about this; are there plans for it? Or should we go with the DataStream API in case we need this kind of functionality? Thanks

Re: Heap Problem with Checkpoints

2018-08-09 Thread Piotr Nowojski
Hi, Thanks for getting back with more information. Apparently this is a known JDK bug since 2003 and it is still not resolved: https://bugs.java.com/view_bug.do?bug_id=4872014 https://bugs.java.com/view_bug.do?bug_id=6664633
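
A minimal reproduction of the JDK behaviour behind those bug reports (file names are illustrative):

    import java.io.File

    // each deleteOnExit() call adds the path to a static set in
    // java.io.DeleteOnExitHook, and nothing ever removes it before JVM exit,
    // so a long-running job that keeps creating temp files grows the heap
    for (_ <- 0 until 1000000) {
      val f = File.createTempFile("checkpoint-", ".tmp")
      f.deleteOnExit() // entry retained until the JVM exits
      f.delete()       // deletes the file, but not the hook entry
    }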

Re: Window state with rocksdb backend

2018-08-09 Thread Stefan Richter
Hi, it is not quite clear to me what your window function is doing, so sharing some (pseudo) code would be helpful. Is it ever calling an update function for the state you are trying to modify? From the information I have, it seems not to be the case, and that is a wrong use of the API, which

Re: Window state with rocksdb backend

2018-08-09 Thread Aljoscha Krettek
Hi, the objects you get in the WindowFunction are not supposed to be mutated. Any changes to them are not guaranteed to be synced back to the state backend. Why are you modifying the objects? Maybe there's another way of achieving what you want to do. Best, Aljoscha > On 9. Aug 2018, at
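
The contract Aljoscha describes, sketched with keyed state (an illustrative KeyedProcessFunction, not mingliang's window job): with RocksDB, reads are deserialized copies, so a change is only persisted through an explicit update() on a state handle.

    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction
    import org.apache.flink.util.Collector

    class RunningCount extends KeyedProcessFunction[String, String, Long] {
      @transient private var count: ValueState[java.lang.Long] = _

      override def open(parameters: Configuration): Unit = {
        count = getRuntimeContext.getState(
          new ValueStateDescriptor("count", classOf[java.lang.Long]))
      }

      override def processElement(value: String,
          ctx: KeyedProcessFunction[String, String, Long]#Context,
          out: Collector[Long]): Unit = {
        val next = Option(count.value()).map(_.longValue()).getOrElse(0L) + 1
        count.update(next) // without this call, RocksDB never sees the change
        out.collect(next)
      }
    }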

Re: [ANNOUNCE] Apache Flink 1.6.0 released

2018-08-09 Thread vino yang
Congratulations! Great work! Till, thank you for advancing the smooth release of Flink 1.6. Vino. Till Rohrmann wrote on Thu, Aug 9, 2018 at 7:21 PM: > The Apache Flink community is very happy to announce the release of Apache > Flink 1.6.0. > > Apache Flink® is an open-source stream processing framework

[ANNOUNCE] Apache Flink 1.6.0 released

2018-08-09 Thread Till Rohrmann
The Apache Flink community is very happy to announce the release of Apache Flink 1.6.0. Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications. The release is available for download at:

Re: Heap Problem with Checkpoints

2018-08-09 Thread Ayush Verma
Hello Piotr, I work with Fabian and have been investigating the memory leak associated with the issues mentioned in this thread. I took a heap dump of our master node and noticed that there was >1 GB (and growing) worth of entries in the set 'files' in class java.io.DeleteOnExitHook. Almost all the

Re: JDBCInputFormat and SplitDataProperties

2018-08-09 Thread Alexis Sarda
Hi Fabian, Thanks a lot for the help. The Scala DataSet, at least in version 1.5.0, declares javaSet as private[flink], so I cannot access it directly. Nevertheless, I managed to get around it by using the Java environment: val env = org.apache.flink.api.java.ExecutionEnvironment.
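
Alexis's workaround, sketched: build the source with the Java API so the DataSource (and its SplitDataProperties) is directly accessible (the connection settings and row type are illustrative):

    import org.apache.flink.api.common.typeinfo.BasicTypeInfo
    import org.apache.flink.api.java.ExecutionEnvironment
    import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
    import org.apache.flink.api.java.typeutils.RowTypeInfo

    val env = ExecutionEnvironment.getExecutionEnvironment
    val source = env.createInput(
      JDBCInputFormat.buildJDBCInputFormat()
        .setDrivername("org.postgresql.Driver")
        .setDBUrl("jdbc:postgresql://host:5432/db")
        .setQuery("SELECT id, name FROM t")
        .setRowTypeInfo(new RowTypeInfo(
          BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO))
        .finish())

    // createInput on the Java environment returns a DataSource directly,
    // so its SplitDataProperties are reachable without touching javaSet
    val sdp = source.getSplitDataProperties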

Re: Using sensitive configuration/credentials

2018-08-09 Thread vino yang
Hi Chesnay, Oh, I did not know about this feature. Is there any more description of it in the official Flink documentation? Thanks, vino. Chesnay Schepler wrote on Thu, Aug 9, 2018 at 4:29 PM: > If you change the name of your configuration key to include "secret" or > "password" it should be hidden from the logs and UI. > > On

Window state with rocksdb backend

2018-08-09 Thread 祁明良
Hi all, This is mingliang. I got a problem with the RocksDB backend. I'm currently using a 15-minute SessionWindow which also fires every 10 s; there's no pre-aggregation, so the input of the WindowFunction would be the whole iterator of input objects. For the window operator, I assume this collection is

Re: Using sensitive configuration/credentials

2018-08-09 Thread Chesnay Schepler
If you change the name of your configuration key to include "secret" or "password", it should be hidden from the logs and UI. On 09.08.2018 04:28, vino yang wrote: Hi Matt, Flink is currently enhancing its security; for example, data transmission can currently be configured with SSL mode [1].
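
A hedged sketch of the check behind this, assuming GlobalConfiguration.isSensitive, which the 1.5/1.6 code base consults when logging configuration values (key names are made up):

    import org.apache.flink.configuration.GlobalConfiguration

    // masking is decided by a substring match on the key name
    GlobalConfiguration.isSensitive("mydb.connection.password") // true  -> masked in logs/UI
    GlobalConfiguration.isSensitive("mydb.connection.user")     // false -> shown in clear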

Re: JDBCInputFormat and SplitDataProperties

2018-08-09 Thread Fabian Hueske
Hi Alexis, The Scala API does not expose a DataSource object but only a Scala DataSet which wraps the Java object. You can get the SplitDataProperties from the Scala DataSet as follows:

    val dbData: DataSet[...] = ???
    val sdp = dbData.javaSet.asInstanceOf[DataSource[_]].getSplitDataProperties

So

Re: UTF-16 support for TextInputFormat

2018-08-09 Thread Fabian Hueske
Hi David, Did you try to set the encoding on the TextInputFormat with

    TextInputFormat tif = ...
    tif.setCharsetName("UTF-16");

Best, Fabian 2018-08-08 17:45 GMT+02:00 David Dreyfus: > Hello - > > It does not appear that Flink supports a charset encoding of "UTF-16". In > particular, it
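
Fabian's suggestion spelled out as a compilable snippet (the path is illustrative; whether setCharsetName actually changes the decoding is exactly what David questions above):

    import org.apache.flink.api.java.io.TextInputFormat
    import org.apache.flink.core.fs.Path

    val tif = new TextInputFormat(new Path("file:///data/input.txt"))
    tif.setCharsetName("UTF-16")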

Re: Could not cancel job (with savepoint) "Ask timed out"

2018-08-09 Thread vino yang
Hi Juho, We use the REST client API triggerSavepoint(); this API returns a CompletableFuture, and then we call its get() method. You can understand this as me waiting synchronously for it to complete. Because cancelWithSavepoint actually waits synchronously for the savepoint to complete, and then executes
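
Vino's flow, sketched against the 1.5/1.6 ClusterClient (the savepoint directory is made up, and jobId/client are assumed to be in scope):

    import org.apache.flink.api.common.JobID

    // jobId: JobID of the running job; client: e.g. a RestClusterClient
    val future = client.triggerSavepoint(jobId, "hdfs:///savepoints") // CompletableFuture[String]
    val savepointPath = future.get() // block until the savepoint is complete
    client.cancel(jobId)             // only cancel after the savepoint succeeded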

Re: Could not cancel job (with savepoint) "Ask timed out"

2018-08-09 Thread Juho Autio
Thanks for the suggestion. Is the separate savepoint triggering async? Would you then separately poll for the savepoint's completion before executing cancel? If additional polling is needed, then I would say that for my purpose it's still easier to call cancel with savepoint and simply ignore the