Multiclass classification example

2016-10-18 Thread Kürşat Kurt
Hi,


I am trying to learn the Flink ML library.

Where can I find a detailed multiclass classification example?
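
FlinkML's SVM is a binary classifier, so as far as I know multiclass means
wrapping it in a one-vs-rest scheme yourself. A minimal sketch, assuming the
FlinkML API as of Flink 1.1; the data, the parameters and the three-class
setup are illustrative, not a tested recipe:

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.classification.SVM
    import org.apache.flink.ml.common.LabeledVector
    import org.apache.flink.ml.math.DenseVector

    object OneVsRestExample {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Toy training data with three classes, labelled 0.0, 1.0 and 2.0.
        val training = env.fromElements(
          LabeledVector(0.0, DenseVector(0.1, 0.2)),
          LabeledVector(1.0, DenseVector(0.8, 0.9)),
          LabeledVector(2.0, DenseVector(0.5, 0.1)))

        // One binary SVM per class: the current class becomes +1, the rest -1.
        val models = (0 until 3).map { c =>
          val binary = training.map(lv =>
            LabeledVector(if (lv.label == c) 1.0 else -1.0, lv.vector))
          val svm = SVM()
            .setIterations(100)
            .setStepsize(0.1)
            .setRegularization(0.001)
            .setOutputDecisionFunction(true) // raw distances, comparable across models
          svm.fit(binary)
          svm
        }

        // Score a point with every model and pick the class with the largest score.
        val point = DenseVector(0.7, 0.8)
        val scores = models.map(_.predict(env.fromElements(point)).collect().head._2)
        println(s"predicted class: ${scores.indexOf(scores.max)}")
      }
    }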



Re: Issue while restarting from SavePoint

2016-10-18 Thread Anirudh Mallem
Hi,
The issue seems to be connected with trying to restart the job in detached
mode. The stack trace is as follows:

-bash-3.2$ bin/flink run -d -s jobmanager://savepoints/1 -c 
com.tfs.rtdp.precompute.Flink.FlinkTest 
/tmp/flink-web-upload-2155906f-be54-47f3-b9f7-7f6f0f54f74b/448724f9-f69f-455f-99b9-c57289657e29_uber-flink-test-1.0-SNAPSHOT.jar
 file:///home/amallem/capitalone.properties
Cluster configuration: Standalone cluster with JobManager at /10.64.119.90:33167
Using address 10.64.119.90:33167 to connect to JobManager.
JobManager web interface address http://10.64.119.90:8081
Starting execution of program
Submitting Job with JobID: 6c5596772627d3c9366deaa0c47ab0ad. Returning after 
job submission.


 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program 
execution failed: JobManager did not respond within 60000 milliseconds
at 
org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:432)
at 
org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:93)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:378)
at 
org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:75)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:323)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:777)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:253)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1005)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1048)
Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did 
not respond within 60000 milliseconds
at 
org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:227)
at 
org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:429)
... 8 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
[60000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at scala.concurrent.Await.result(package.scala)
at 
org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:223)
... 9 more

When running it without detached mode, the JobManager responds, but the job
fails because it thinks that the structure of the job was modified.

-bash-3.2$ bin/flink run -s jobmanager://savepoints/1 -c 
com.tfs.rtdp.precompute.Flink.FlinkTest 
/tmp/flink-web-upload-2155906f-be54-47f3-b9f7-7f6f0f54f74b/448724f9-f69f-455f-99b9-c57289657e29_uber-flink-test-1.0-SNAPSHOT.jar
 file:///home/amallem/capitalone.properties
Cluster configuration: Standalone cluster with JobManager at /10.64.119.90:33167
Using address 10.64.119.90:33167 to connect to JobManager.
JobManager web interface address http://10.64.119.90:8081
Starting execution of program
Submitting job with JobID: fe7bf69676a5127eecb392e3a0743c6d. Waiting for job 
completion.
Connected to JobManager at 
Actor[akka.tcp://flink@10.64.119.90:33167/user/jobmanager#-2047134036]
10/18/2016 14:56:11 Job execution switched to status FAILING.
org.apache.flink.runtime.execution.SuppressRestartsException: Unrecoverable 
failure. This suppresses job restarts. Please check the stack trace for the 
root cause.
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1305)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1291)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1291)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.IllegalStateException: Failed to rollback to savepoint 
jobmanager://savepoints/1. Cannot map old state for task 
4ca3b7c2641c4299a3378fab220c4e5c to the new program. This indicates that the 
program has been changed in a non-compatible way after the savepoint.
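
For the record: this failure usually means the auto-generated operator IDs of
the new job no longer match the ones stored in the savepoint, e.g. because the
topology changed. Assigning explicit IDs with uid() makes the mapping stable
across job changes; a minimal DataStream sketch (the operator names are
illustrative):

    import org.apache.flink.streaming.api.scala._

    object UidExample {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        env.socketTextStream("localhost", 9999)
          .map(line => (line, 1)).uid("tokenizer")   // stable ID for this operator
          .keyBy(0)
          .sum(1).uid("running-count")               // its state maps back on restore
          .print()

        env.execute("uid example")
      }
    }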

Issue while restarting from SavePoint

2016-10-18 Thread Anirudh Mallem
Hi,
I am relatively new to Flink and I was experimenting with the savepoints 
feature. I have an HA cluster running with 1 master and 4 workers. The 
flink-conf.yaml is as follows:

#==
# Common
#==
jobmanager.rpc.address: stable-stream-master01.app.shared.int.sv2.247-inc.net

# The port where the JobManager's main actor system listens for messages.
jobmanager.rpc.port: 6123

# The heap size for the JobManager JVM
jobmanager.heap.mb: 512

# The heap size for the TaskManager JVM
taskmanager.heap.mb: 2048

# The number of task slots that each TaskManager offers. Each slot runs one
# parallel pipeline.
taskmanager.numberOfTaskSlots: 8

# Specify whether TaskManager memory should be allocated when starting up
# (true) or when memory is required in the memory manager (false)
taskmanager.memory.preallocate: false

# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 1

env.java.home: /usr/local/java

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 10 s
#==
# Web Frontend
#==

# The port under which the web-based runtime monitor listens.
# A value of -1 deactivates the web server.

jobmanager.web.port: 8081

# Flag to specify whether job submission is enabled from the web-based
# runtime monitor. Uncomment to disable.

#jobmanager.web.submit.enable: false

#==
# Streaming state checkpointing
#==

# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends: jobmanager, filesystem, 
#
state.backend: filesystem


# Directory for storing checkpoints in a Flink-supported filesystem
# Note: State backend must be accessible from the JobManager and all
# TaskManagers.
# Use "hdfs://" for HDFS setups, "file://" for UNIX/POSIX-compliant file
# systems (or any local file system under Windows), or "S3://" for S3 file
# system.
#
 state.backend.fs.checkpointdir: file:///home/amallem/
 state.savepoints.dir: file:///home/amallem/save/

#==
# Master High Availability (required configuration)
#==

# The list of ZooKeeper quorum peers that coordinate the high-availability
# setup. This must be a list of the form:
# "host1:clientPort,host2[:clientPort],..." (default clientPort: 2181)
#
 recovery.mode: zookeeper
#
 recovery.zookeeper.quorum: 
stable-stream-zookeeper01.app.shared.int.net:2181,stable-stream-zookeeper02.app.shared.int.net:2181
#
# Note: You need to set the state backend to 'filesystem' and the checkpoint
# directory (see above) before configuring the storageDir.
#
 recovery.zookeeper.storageDir: file:///home/amallem/recovery
 recovery.zookeeper.path.root: /flink
 recovery.zookeeper.path.namespace: /cluster_one

Query: I am able to take a savepoint and then cancel my current job, but when 
I try to start the job again using the savepoint, I get the error stating 
“ProgramInvocationException: JobManager did not respond within 60000 ms”. I 
checked ZooKeeper and everything seems fine. I also followed the 
recommendations in the following post: 
http://stackoverflow.com/questions/36625742/cant-deploy-flow-to-ha-cluster-of-apache-flink-using-flink-cli
Is there any configuration I am missing to enable restarting of the job? Any 
help will be appreciated. Thanks.

Regards,
Anirudh
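
For reference, the usual savepoint round trip via the CLI looks as follows;
the job ID and paths are illustrative, and "bin/flink savepoint" prints the
path to pass to -s:

    bin/flink savepoint 6c5596772627d3c9366deaa0c47ab0ad
    bin/flink cancel 6c5596772627d3c9366deaa0c47ab0ad
    bin/flink run -s <savepoint-path> -c com.example.MyJob /path/to/job.jar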


Re: Read Apache Kylin from Apache Flink

2016-10-18 Thread Till Rohrmann
Great to see, Alberto. Thanks for sharing it with the community :-)

Cheers,
Till

On Tue, Oct 18, 2016 at 7:40 PM, Alberto Ramón 
wrote:

> Hello
>
> I made a small contribution / manual about:
> "How-to Read Apache Kylin data from Apache Flink With Scala"
>
>
>
>
> For any suggestions, feel free to contact me
>
> Thanks, Alberto
>


Read Apache Kylin from Apache Flink

2016-10-18 Thread Alberto Ramón
Hello

I made a small contribution / manual about:
"How-to Read Apache Kylin data from Apache Flink With Scala"




For any suggestions, feel free to contact me

Thanks, Alberto


Re: Queryable state using JDBC

2016-10-18 Thread Till Rohrmann
Sorry, this seems to be a private repository. Then you could take a look at
QueryableStateITCase in the Flink code base.

Cheers,
Till

On Tue, Oct 18, 2016 at 4:33 PM, Alberto Ramón 
wrote:

> I don't have access, or the link doesn't work :(
>
> 2016-10-18 16:15 GMT+02:00 Till Rohrmann :
>
>> Oh, now I understand. Sorry for my confusion. Flink does not offer a JDBC
>> interface to query its state. The only way to do that is to use Flink's
>> proprietary interface. Since this feature is highly experimental, there is
>> not a lot of documentation out there. I can recommend this GitHub repo
>> [1], where you can see an example of how to access Flink state.
>>
>> [1] https://github.com/dataArtisans/queryable-state-demo
>>
>> Cheers,
>> Till
>>
>> On Tue, Oct 18, 2016 at 3:22 PM, Alberto Ramón > > wrote:
>>
>>> Hi  :)
>>>
>>> SQL Client ---> JDBC ---> Flink (queryable state)
>>>
>>> (I don't want to read a JDBC source from Flink)
>>>
>>> I want the opposite: to see Flink as a "realtime database" from my SQL Client
>>> ... for example: Tableau ---> Flink states
>>>
>>> 2016-10-18 14:52 GMT+02:00 Till Rohrmann :
>>>
 Hi Alberto,

 have you checked out Flink's JDBCInputFormat? As far as I can tell,
 Kylin has support for JDBC and, thus, you should be able to read from it
 with this input format.

 Cheers,
 Till

 On Tue, Oct 18, 2016 at 11:28 AM, Alberto Ramón <
 a.ramonporto...@gmail.com> wrote:

> Hello
>
> I'm investigating Flink + Calcite, StreamSQL, and Queryable State.
> Is it possible to connect to Kylin using a SQL client via JDBC?
> (I always see API examples)
>
> BR
>


>>>
>>
>
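
For the archive, a rough sketch of what the demo repo [1] does, written
against the experimental 1.2-SNAPSHOT API. The class names, constructors and
serialization helpers moved around a lot between versions, so every call below
is an assumption to verify against your Flink version:

    // Job side: expose the latest value per key under the name "word-counts".
    import org.apache.flink.streaming.api.scala._

    object QueryableJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.socketTextStream("localhost", 9999)
          .map(word => (word, 1L))
          .keyBy(_._1)
          .asQueryableState("word-counts") // the name the client will ask for
        env.execute("queryable state job")
      }
    }

    // Client side: fetch the serialized state value for a single key.
    import org.apache.flink.api.common.{ExecutionConfig, JobID}
    import org.apache.flink.api.scala.createTypeInformation
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.runtime.query.QueryableStateClient
    import org.apache.flink.runtime.query.netty.message.KvStateRequestSerializer
    import org.apache.flink.runtime.state.{VoidNamespace, VoidNamespaceSerializer}

    object StateClient {
      def main(args: Array[String]): Unit = {
        // The configuration must point at the JobManager of the running cluster.
        val client = new QueryableStateClient(new Configuration())
        val keySerializer =
          createTypeInformation[String].createSerializer(new ExecutionConfig)
        val serializedKey = KvStateRequestSerializer.serializeKeyAndNamespace(
          "flink", keySerializer, VoidNamespace.INSTANCE, VoidNamespaceSerializer.INSTANCE)

        // Scala Future holding the serialized value of "word-counts" for the
        // key "flink"; the job ID is illustrative.
        val future = client.getKvState(
          JobID.fromHexString("6c5596772627d3c9366deaa0c47ab0ad"),
          "word-counts", "flink".hashCode, serializedKey)
      }
    }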


Re: Queryable state using JDBC

2016-10-18 Thread Till Rohrmann
Oh, now I understand. Sorry for my confusion. Flink does not offer a JDBC
interface to query its state. The only way to do that is to use Flink's
proprietary interface. Since this feature is highly experimental, there is
not a lot of documentation out there. I can recommend this GitHub repo
[1], where you can see an example of how to access Flink state.

[1] https://github.com/dataArtisans/queryable-state-demo

Cheers,
Till

On Tue, Oct 18, 2016 at 3:22 PM, Alberto Ramón 
wrote:

> Hi  :)
>
> SQL Client ---> JDBC ---> Flink (queryable state)
>
> (I don't want to read a JDBC source from Flink)
>
> I want the opposite: to see Flink as a "realtime database" from my SQL Client
> ... for example: Tableau ---> Flink states
>
> 2016-10-18 14:52 GMT+02:00 Till Rohrmann :
>
>> Hi Alberto,
>>
>> have you checked out Flink's JDBCInputFormat? As far as I can tell, Kylin
>> has support for JDBC and, thus, you should be able to read from it with
>> this input format.
>>
>> Cheers,
>> Till
>>
>> On Tue, Oct 18, 2016 at 11:28 AM, Alberto Ramón <
>> a.ramonporto...@gmail.com> wrote:
>>
>>> Hello
>>>
>>> I'm investigating Flink + Calcite, StreamSQL, and Queryable State.
>>> Is it possible to connect to Kylin using a SQL client via JDBC?
>>> (I always see API examples)
>>>
>>> BR
>>>
>>
>>
>


Re: Queryable state using JDBC

2016-10-18 Thread Alberto Ramón
Hi  :)

SQL Client ---> JDBC ---> Flink (queryable state)

(I don't want to read a JDBC source from Flink)

I want the opposite: to see Flink as a "realtime database" from my SQL Client
... for example: Tableau ---> Flink states

2016-10-18 14:52 GMT+02:00 Till Rohrmann :

> Hi Alberto,
>
> have you checked out Flink's JDBCInputFormat? As far as I can tell, Kylin
> has support for JDBC and, thus, you should be able to read from it with
> this input format.
>
> Cheers,
> Till
>
> On Tue, Oct 18, 2016 at 11:28 AM, Alberto Ramón  > wrote:
>
>> Hello
>>
>> I'm investigating Flink + Calcite, StreamSQL, and Queryable State.
>> Is it possible to connect to Kylin using a SQL client via JDBC?
>> (I always see API examples)
>>
>> BR
>>
>
>


Re: Distributing Tasks over Task manager

2016-10-18 Thread Jürgen Thomann

Hi Robert,

Did you already have a chance to look at it? If you need more information, 
just let me know.


Regards,
Jürgen

On 12.10.2016 21:12, Jürgen Thomann wrote:


Hi Robert,

Thanks for your suggestions. We are using the DataStream API, and I 
tried disabling chaining completely, but that didn't help.


I attached the plan and, to add some context, it starts with a Kafka 
source followed by a map operation (parallelism 4). The next map is 
the expensive part, with a parallelism of 18, which produces a Tuple2 
that is used for splitting. Starting here the parallelism is always 2, 
except for the sink with 1. Both resulting streams have two maps, a 
filter, one more map, and end with an 
assignTimestampsAndWatermarks. If there is a small box in the 
picture, it is a filter operation; otherwise it goes directly to a 
keyBy, timewindow and apply operation, followed by a sink.


If one task manager contains more subtasks of the expensive map than 
any other task manager, everything later in the stream runs on 
the same task manager. If two task managers have the same number of 
subtasks, the following tasks with a parallelism of 2 are distributed 
over the two task managers.


It is also interesting that the task managers have 6 task slots configured, 
and although the expensive part has 6 subtasks on one task manager, 
everything later in the flow still runs on this task manager. This 
also happens if operator chaining is disabled.


Best,
Jürgen


On 12.10.2016 17:43, Robert Metzger wrote:

Hi Jürgen,

Are you using the DataStream or the DataSet API?
Maybe the operator chaining is causing too many operations to be 
"packed" into one task. Check out this documentation page: 
https://ci.apache.org/projects/flink/flink-docs-master/dev/datastream_api.html#task-chaining-and-resource-groups 

You could try to disable chaining completely to see if that resolves 
the issue (you'll probably pay for this by having more serialization 
overhead and network traffic).


If my suggestions don't help, can you post a screenshot of your job 
plan (from the web interface) here, so that we see what operations 
you are performing?


Regards,
Robert
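
For reference, the knobs mentioned above live on the DataStream API. A
minimal sketch (the parallelism values mirror the job described above,
everything else is illustrative); note that these control chaining and slot
sharing, not which TaskManager a slot is placed on:

    import org.apache.flink.streaming.api.scala._

    object ChainingKnobs {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // Knob 1: no chaining anywhere (costs extra serialization and network traffic).
        env.disableOperatorChaining()

        env.socketTextStream("localhost", 9999)
          .map(_.toUpperCase).setParallelism(18)
          .slotSharingGroup("expensive")   // Knob 2: these subtasks get their own slots
          .map(s => (s, 1)).setParallelism(2)
          .slotSharingGroup("default")     // downstream returns to the default group
          .print()

        env.execute("chaining and slot sharing knobs")
      }
    }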



Re: Queryable state using JDBC

2016-10-18 Thread Till Rohrmann
Hi Alberto,

have you checked out Flink's JDBCInputFormat? As far as I can tell, Kylin
has support for JDBC and, thus, you should be able to read from it with
this input format.

Cheers,
Till

On Tue, Oct 18, 2016 at 11:28 AM, Alberto Ramón 
wrote:

> Hello
>
> I'm investigating Flink + Calcite, StreamSQL, and Queryable State.
> Is it possible to connect to Kylin using a SQL client via JDBC?
> (I always see API examples)
>
> BR
>
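
A sketch of the suggestion above, assuming the flink-jdbc builder as of Flink
1.1 together with Kylin's JDBC driver. The URL, project, credentials, query
and result row type are all illustrative; the cast is needed because the 1.1
builder returns a raw type:

    import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
    import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
    import org.apache.flink.api.java.tuple.{Tuple2 => JTuple2}
    import org.apache.flink.api.java.typeutils.TupleTypeInfo
    import org.apache.flink.api.scala._

    object KylinViaJdbc {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Kylin's JDBC driver speaks SQL against its REST endpoint.
        val kylinInput = JDBCInputFormat.buildJDBCInputFormat()
          .setDrivername("org.apache.kylin.jdbc.Driver")
          .setDBUrl("jdbc:kylin://kylin-host:7070/learn_kylin")
          .setUsername("ADMIN")
          .setPassword("KYLIN")
          .setQuery("SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt")
          .finish()
          .asInstanceOf[JDBCInputFormat[JTuple2[String, java.lang.Double]]]

        implicit val rowType: TypeInformation[JTuple2[String, java.lang.Double]] =
          new TupleTypeInfo[JTuple2[String, java.lang.Double]](
            BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.DOUBLE_TYPE_INFO)

        env.createInput(kylinInput).print()
      }
    }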


Re: ContinuousFileMonitoringFunction - deleting file after processing

2016-10-18 Thread Kostas Kloudas
Hi Maciek,

I agree with you that 1ms is often too long :P

This is the reason why I will open a discussion to have
all the ideas / requirements / shortcomings in a single place.
This way the community can track and influence what
is coming next. 

Hopefully I will do it in the afternoon and I will send you 
the discussion thread.

Cheers,
Kostas

> On Oct 18, 2016, at 11:43 AM, Maciek Próchniak  wrote:
> 
> Hi Kostas,
> 
> thanks for the quick answer.
> 
> I wouldn't dare to delete files in the InputFormat if they were split and 
> processed in parallel...
> 
> As for using notifyCheckpointComplete - thanks for the suggestion, it looks 
> pretty interesting, I'll try it out. Although I wonder a bit if 
> relying only on the modification timestamp is enough - many things may happen 
> in one ms :)
> 
> thanks,
> 
> maciek
> 
> 
> On 18/10/2016 11:14, Kostas Kloudas wrote:
>> Hi Maciek,
>> 
>> Just a follow-up on the previous email: given that splits are read in
>> parallel, when the ContinuousFileMonitoringFunction forwards the last
>> split, it does not mean that the final split is going to be processed
>> last. If the node it gets assigned to is fast enough, then it may be
>> processed faster than others.
>> 
>> This assumption only holds if you have a parallelism of 1.
>> 
>> Cheers,
>> Kostas
>> 
>>> On Oct 18, 2016, at 11:05 AM, Kostas Kloudas  
>>> wrote:
>>> 
>>> Hi Maciek,
>>> 
>>> Currently this functionality is not supported, but this seems like a good 
>>> addition.
>>> Actually, given that the feature is rather new, we were thinking of opening 
>>> a discussion
>>> in the dev mailing list in order to
>>> 
>>> i) discuss some current limitations of the Continuous File Processing source
>>> ii) see how people use it and adjust our features accordingly
>>> 
>>> I will let you know as soon as I open this thread.
>>> 
>>> By the way, for your use case, we should probably have a callback in 
>>> notifyCheckpointComplete()
>>> that will inform the source that a given checkpoint was successfully 
>>> performed, and then
>>> we can purge the already processed files. This can be a good solution.
>>> 
>>> Thanks,
>>> Kostas
>>> 
 On Oct 18, 2016, at 9:40 AM, Maciek Próchniak  wrote:
 
 Hi,
 
 we want to monitor an hdfs (or local) directory, read csv files that appear 
 and after successful processing - delete them (mainly to not run out of 
 disk space...)
 
 I'm not quite sure how to achieve this with the current implementation. 
 Previously, when we read binary data (unsplittable files), we made a small 
 hack and deleted them
 
 in our FileInputFormat - but now we want to use splits, and detecting which 
 split is 'the last one' is no longer so obvious - of course it's also 
 problematic when it comes to checkpointing...
 
 So my question is - is there an idiomatic way of deleting processed files?
 
 
 thanks,
 
 maciek
 
>> 
> 
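
A sketch of the callback idea discussed in this thread: delete a file only
once a checkpoint taken after its emission has completed globally, so a
restore can never need a file that is already gone. It leans on the
Checkpointed and CheckpointListener interfaces as of Flink 1.1; a real
implementation would also track modification timestamps the way
ContinuousFileMonitoringFunction does, so treat this as orientation only:

    import scala.collection.mutable

    import org.apache.flink.core.fs.{FileSystem, Path}
    import org.apache.flink.runtime.state.CheckpointListener
    import org.apache.flink.streaming.api.checkpoint.Checkpointed
    import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

    class DeletingFileMonitor(dir: String)
        extends RichSourceFunction[String]
        with Checkpointed[Array[String]]
        with CheckpointListener {

      @volatile private var running = true
      private val emitted = mutable.Buffer[String]()         // since the last snapshot
      private val pending = mutable.Map[Long, Seq[String]]() // checkpoint id -> files
      private val seen = mutable.Set[String]()

      override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
        val fs = FileSystem.get(new Path(dir).toUri)
        while (running) {
          for (status <- fs.listStatus(new Path(dir)) if !status.isDir) {
            val file = status.getPath.toString
            if (seen.add(file)) {
              // Emit and record under the checkpoint lock so snapshots are consistent.
              ctx.getCheckpointLock.synchronized {
                ctx.collect(file)
                emitted += file
              }
            }
          }
          Thread.sleep(1000)
        }
      }

      override def cancel(): Unit = running = false

      override def snapshotState(checkpointId: Long, timestamp: Long): Array[String] = {
        pending(checkpointId) = emitted.toList // everything emitted before this barrier
        emitted.clear()
        pending.values.flatten.toArray         // files still awaiting deletion
      }

      // Simplification: restored files are only marked as seen, not re-emitted.
      override def restoreState(state: Array[String]): Unit = seen ++= state

      override def notifyCheckpointComplete(checkpointId: Long): Unit = {
        // The checkpoint is globally complete, so downstream state covers these files.
        val fs = FileSystem.get(new Path(dir).toUri)
        for (files <- pending.remove(checkpointId); f <- files) {
          fs.delete(new Path(f), false)
        }
      }
    }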



Re: ContinuousFileMonitoringFunction - deleting file after processing

2016-10-18 Thread Maciek Próchniak

Hi Kostas,

thanks for the quick answer.

I wouldn't dare to delete files in the InputFormat if they were split and 
processed in parallel...


As for using notifyCheckpointComplete - thanks for the suggestion, it looks 
pretty interesting, I'll try it out. Although I wonder a bit if 
relying only on the modification timestamp is enough - many things may 
happen in one ms :)


thanks,

maciek


On 18/10/2016 11:14, Kostas Kloudas wrote:

Hi Maciek,

Just a follow-up on the previous email: given that splits are read in
parallel, when the ContinuousFileMonitoringFunction forwards the last split,
it does not mean that the final split is going to be processed last. If the
node it gets assigned to is fast enough, then it may be processed faster than
others.

This assumption only holds if you have a parallelism of 1.

Cheers,
Kostas


On Oct 18, 2016, at 11:05 AM, Kostas Kloudas  
wrote:

Hi Maciek,

Currently this functionality is not supported, but this seems like a good 
addition.
Actually, given that the feature is rather new, we were thinking of opening a 
discussion
in the dev mailing list in order to

i) discuss some current limitations of the Continuous File Processing source
ii) see how people use it and adjust our features accordingly

I will let you know as soon as I open this thread.

By the way, for your use case, we should probably have a callback in 
notifyCheckpointComplete()
that will inform the source that a given checkpoint was successfully performed 
and then
we can purge the already processed files. This can be a good solution.

Thanks,
Kostas


On Oct 18, 2016, at 9:40 AM, Maciek Próchniak  wrote:

Hi,

we want to monitor an hdfs (or local) directory, read csv files that appear 
and after successful processing - delete them (mainly to not run out of disk 
space...)

I'm not quite sure how to achieve this with the current implementation. 
Previously, when we read binary data (unsplittable files), we made a small 
hack and deleted them

in our FileInputFormat - but now we want to use splits, and detecting which 
split is 'the last one' is no longer so obvious - of course it's also 
problematic when it comes to checkpointing...

So my question is - is there an idiomatic way of deleting processed files?


thanks,

maciek







Queryable state using JDBC

2016-10-18 Thread Alberto Ramón
Hello

I'm investigating about Flink + Calcite, StreamSQL, Queryable State
Is possible connect to Kylin using SQL Client *via JDBC *?
(I always see API examples)

BR


Re: "Slow ReadProcessor" warnings when using BucketSink

2016-10-18 Thread static-max
Hi Robert,

thanks for your reply. I also didn't find anything helpful on Google.

I checked all GC times; they look OK. Here are the GC times for the JobManager
(the job has been running fine for 5 days):

Collector     Count  Time
PS-MarkSweep      3  1s
PS-Scavenge    5814  2m 12s

I have no window or any computation, just reading from Kafka and directly
writing to HDFS.

I can also run a terasort or teragen in parallel without any problems.

Best,
Max

2016-10-12 11:32 GMT+02:00 Robert Metzger :

> Hi,
> I haven't seen this error before. Also, I didn't find anything helpful
> searching for the error on Google.
>
> Did you check the GC times also for Flink? Is your Flink job doing any
> heavy tasks (like maintaining large windows, or other operations involving
> a lot of heap space?)
>
> Regards,
> Robert
>
>
> On Tue, Oct 11, 2016 at 10:51 AM, static-max 
> wrote:
>
>> Hi,
>>
>> I have a low-throughput job (approx. 1000 messages per minute) that
>> consumes from Kafka and writes directly to HDFS. After an hour or so, I get
>> the following warnings in the Task Manager log:
>>
>> 2016-10-10 01:59:44,635 WARN  org.apache.hadoop.hdfs.DFSClient
>>- Slow ReadProcessor read fields took 30001ms
>> (threshold=30000ms); ack: seqno: 66 reply: SUCCESS reply: SUCCESS reply:
>> SUCCESS downstreamAckTimeNanos: 1599276 flag: 0 flag: 0 flag: 0, targets:
>> [DatanodeInfoWithStorage[Node1, Node2, Node3]]
>> 2016-10-10 02:04:44,635 WARN  org.apache.hadoop.hdfs.DFSClient
>>- Slow ReadProcessor read fields took 30002ms
>> (threshold=30000ms); ack: seqno: 13 reply: SUCCESS reply: SUCCESS reply:
>> SUCCESS downstreamAckTimeNanos: 2394027 flag: 0 flag: 0 flag: 0, targets:
>> [DatanodeInfoWithStorage[Node1, Node2, Node3]]
>> 2016-10-10 02:05:14,635 WARN  org.apache.hadoop.hdfs.DFSClient
>>- Slow ReadProcessor read fields took 30001ms
>> (threshold=30000ms); ack: seqno: 17 reply: SUCCESS reply: SUCCESS reply:
>> SUCCESS downstreamAckTimeNanos: 2547467 flag: 0 flag: 0 flag: 0, targets:
>> [DatanodeInfoWithStorage[Node1, Node2, Node3]]
>>
>> I have not found any errors or warnings at the datanodes or the namenode.
>> Every other application using HDFS performs fine. I have very little load,
>> and network latency is fine also. I also checked GC and disk I/O.
>>
>> The files written are very small (only a few MB), so writing the blocks
>> should be fast.
>>
>> The threshold is exceeded by only 1 or 2 ms, which makes me wonder.
>>
>> Does anyone have an idea where to look next or how to fix these warnings?
>>
>
>
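
For what it's worth, the 30000 ms in those warnings is the HDFS client's own
reporting threshold, dfs.client.slow.io.warning.threshold.ms (default 30000).
If the pipeline is otherwise healthy, raising it on the client side quiets
the log; a sketch for hdfs-site.xml, where the 60000 value is illustrative:

    <property>
      <name>dfs.client.slow.io.warning.threshold.ms</name>
      <value>60000</value>
    </property>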


Re: Flink SQL Stream Parser based on calcite

2016-10-18 Thread PedroMrChaves
Thank you.
Great presentation about the high-level translation process.

Regards,
Pedro





Re: ContinuousFileMonitoringFunction - deleting file after processing

2016-10-18 Thread Kostas Kloudas
Hi Maciek,

Just a follow-up on the previous email: given that splits are read in
parallel, when the ContinuousFileMonitoringFunction forwards the last split,
it does not mean that the final split is going to be processed last. If the
node it gets assigned to is fast enough, then it may be processed faster than
others.

This assumption only holds if you have a parallelism of 1.

Cheers,
Kostas

> On Oct 18, 2016, at 11:05 AM, Kostas Kloudas  
> wrote:
> 
> Hi Maciek,
> 
> Currently this functionality is not supported, but this seems like a good 
> addition.
> Actually, given that the feature is rather new, we were thinking of opening a 
> discussion 
> in the dev mailing list in order to 
> 
> i) discuss some current limitations of the Continuous File Processing source
> ii) see how people use it and adjust our features accordingly
> 
> I will let you know as soon as I open this thread.
> 
> By the way, for your use case, we should probably have a callback in 
> notifyCheckpointComplete()
> that will inform the source that a given checkpoint was successfully 
> performed, and then
> we can purge the already processed files. This can be a good solution.
> 
> Thanks,
> Kostas
> 
>> On Oct 18, 2016, at 9:40 AM, Maciek Próchniak  wrote:
>> 
>> Hi,
>> 
>> we want to monitor an hdfs (or local) directory, read csv files that appear 
>> and after successful processing - delete them (mainly to not run out of disk 
>> space...)
>> 
>> I'm not quite sure how to achieve this with the current implementation. 
>> Previously, when we read binary data (unsplittable files), we made a small 
>> hack and deleted them
>> 
>> in our FileInputFormat - but now we want to use splits, and detecting which 
>> split is 'the last one' is no longer so obvious - of course it's also 
>> problematic when it comes to checkpointing...
>> 
>> So my question is - is there an idiomatic way of deleting processed files?
>> 
>> 
>> thanks,
>> 
>> maciek
>> 
> 



Re: Flink Metrics

2016-10-18 Thread Aljoscha Krettek
https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/metrics.html
Or this, if you prefer Flink 1.1:
https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/metrics.html

On Mon, 17 Oct 2016 at 19:16 amir bahmanyari  wrote:

> Hi colleagues,
> Is there a link that describes Flink metrics & provides an example of how to
> utilize them, please?
> I really appreciate it...
> Cheers
>
> --
> *From:* Till Rohrmann 
> *To:* user@flink.apache.org
> *Cc:* d...@flink.apache.org
> *Sent:* Monday, October 17, 2016 12:52 AM
> *Subject:* Re: Flink Metrics
>
> Hi Govind,
>
> I think the DropwizardMeterWrapper implementation is just a reference
> implementation where it was decided to report the one-minute rate. You can
> define your own meter class, which allows you to configure the rate interval
> accordingly.
>
> Concerning Timers, I think nobody requested this metric so far. If you
> want, then you can open a JIRA issue and contribute it. The community would
> really appreciate that.
>
> Cheers,
> Till
>
>
> On Mon, Oct 17, 2016 at 5:26 AM, Govindarajan Srinivasaraghavan <
> govindragh...@gmail.com> wrote:
>
> > Hi,
> >
> > I am currently using the Flink 1.2 snapshot and instrumenting my pipeline
> > with Flink metrics. One small suggestion I have is that currently the Meter
> > interface only supports getRate(), which is always the one-minute rate.
> >
> > It would be great if all the rates (1 min, 5 min & 15 min) were exposed to
> > get a better picture in terms of performance.
> >
> > Also, is there any reason why timers are not part of the Flink metrics core?
> >
> > Regards,
> > Govind
> >
>
>
>
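
A sketch of the suggestion above: wrap a Dropwizard meter yourself and report
a different rate window than the one DropwizardMeterWrapper hard-codes. This
assumes the Meter interface of the 1.2-SNAPSHOT metrics API; the metric name
and registration spot are illustrative:

    import com.codahale.metrics.{Meter => DropwizardMeter}
    import org.apache.flink.metrics.Meter

    // A Meter that reports Dropwizard's 5-minute rate instead of the 1-minute rate.
    class FiveMinuteRateMeter(dropwizard: DropwizardMeter) extends Meter {
      override def markEvent(): Unit = dropwizard.mark()
      override def markEvent(n: Long): Unit = dropwizard.mark(n)
      override def getRate(): Double = dropwizard.getFiveMinuteRate
      override def getCount(): Long = dropwizard.getCount
    }

    // Registered like any other metric, e.g. inside a RichFunction's open():
    //   val meter = getRuntimeContext.getMetricGroup
    //     .meter("eventsPer5Min", new FiveMinuteRateMeter(new DropwizardMeter()))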


ContinuousFileMonitoringFunction - deleting file after processing

2016-10-18 Thread Maciek Próchniak

Hi,

we want to monitor an hdfs (or local) directory, read csv files that appear 
and after successful processing - delete them (mainly to not run out of 
disk space...)


I'm not quite sure how to achieve this with the current implementation. 
Previously, when we read binary data (unsplittable files), we made a small 
hack and deleted them


in our FileInputFormat - but now we want to use splits, and detecting 
which split is 'the last one' is no longer so obvious - of course it's 
also problematic when it comes to checkpointing...


So my question is - is there an idiomatic way of deleting processed files?


thanks,

maciek