Strata San Jose

2018-02-08 Thread ashish pok
Wondering if any of the core Flink team members are planning to be at the 
conference? It would be great to meet in person.
Thanks,

-- Ashish

RE: Kafka as source for batch job

2018-02-08 Thread Marchant, Hayden
Forgot to mention that my target Kafka version is 0.11.x, with the aim of upgrading 
to 1.0 when the 1.0.x fixpack is released.

From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org]
Sent: Thursday, February 08, 2018 8:05 PM
To: user@flink.apache.org; Marchant, Hayden [ICG-IT] 
Subject: RE: Kafka as source for batch job

Hi Marchant,

Yes, I agree. In general, the isEndOfStream method has very ill-defined 
semantics, with different behaviors across different Kafka connector 
versions.
This method will definitely need to be revisited in the future (we are thinking 
about a rework of the connector).

What is your target Kafka version? And do you know the ending offsets of _all_ 
partitions which you want to only consume a range of?
I can probably double check for you if your specific case is possible, given 
the above information.

Cheers,
Gordon


On 8 February 2018 at 3:22:24 PM, Marchant, Hayden 
(hayden.march...@citi.com) wrote:
Gordon,

Thanks for the pointer. I did some searches for usages of isEndOfStream and 
it’s a little confusing. I see that all implementors of DeserializationSchema 
must implement this method, but it’s not called from anywhere central in the 
Flink streaming engine; rather, each source can decide to use it in its own 
implementation – for example, the Kafka source stops processing the topic when 
isEndOfStream returns true. This is nice, but it localizes the treatment just to 
that operator, and even though it goes a long way in ensuring that I get just 
my bounded data, it still does not give me the ability to stop my job when I 
have finished consuming the elements.

Also, in my case I need to ensure that I have reached a certain offset for each 
of the Kafka partitions that are assigned to the instance of the source function. 
It seems from the code that I would need a different implementation of 
KafkaFetcher.runFetchLoop with slightly different logic for setting 
running to false.

What would you recommend in this case?
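For illustration, a minimal sketch of a schema along the lines Gordon suggested, which signals end-of-stream once a per-partition end offset is reached (this assumes the Flink 1.4-era KeyedDeserializationSchema interface; the class name, the map of end offsets, and the String payload are only illustrative):

import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema

// Sketch: mark the stream as ended once this subtask sees a record at or beyond
// the configured end offset of its partition. Note the limitation discussed in
// this thread: the first record for which isEndOfStream returns true stops the
// whole source subtask, not just that partition.
class BoundedStringSchema(endOffsets: Map[Int, Long])
    extends KeyedDeserializationSchema[String] {

  @volatile private var endReached = false

  override def deserialize(messageKey: Array[Byte], message: Array[Byte],
                           topic: String, partition: Int, offset: Long): String = {
    if (endOffsets.get(partition).exists(offset >= _)) {
      endReached = true
    }
    new String(message, "UTF-8")
  }

  override def isEndOfStream(nextElement: String): Boolean = endReached

  override def getProducedType: TypeInformation[String] = BasicTypeInfo.STRING_TYPE_INFO
}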

From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org]
Sent: Thursday, February 08, 2018 12:24 PM
To: user@flink.apache.org; Marchant, Hayden [ICG-IT]
Subject: Re: Kafka as source for batch job

Hi Hayden,

Have you tried looking into `KeyedDeserializationSchema#isEndOfStream` [1]?
I think that could be what you are looking for. It signals the end of the 
stream when consuming from Kafka.

Cheers,
Gordon


On 8 February 2018 at 10:44:59 AM, Marchant, Hayden 
(hayden.march...@citi.com) wrote:
I know that traditionally Kafka is used as a source for a streaming job. In our 
particular case, we are looking at extracting records from a Kafka topic from a 
particular well-defined offset range (per partition) - i.e. from offset X to 
offset Y. In this case, we'd somehow want the application to know that it has 
finished when it gets to offset Y. This basically changes the Kafka stream into 
bounded data, as opposed to the unbounded data of the usual streaming paradigm.

What would be the best approach to do this in Flink? I see a few options, 
though there might be more:

1. Use a regular streaming job, and have some external service that monitors 
the current offsets of the consumer group of the topic and manually stops job 
when the consumer group of the topic has finished
Pros - simple wrt Flink, Cons - hacky

2. Create a batch job, and a new InputFormat based on Kafka that reads the 
specified subset of Kafka topic into the source.
Pros - represent bounded data from Kafka topic as batch source, Cons - requires 
implementation of source.

3. Dump the subset of Kafka into a file and then trigger a more 'traditional' 
Flink batch job that reads from a file.
Pros - simple, cons - unnecessary I/O.

I personally prefer 1 and 3 for simplicity. Has anyone done anything like this 
before?

Thanks,
Hayden Marchant


Measure the memory consumption of a job at runtime

2018-02-08 Thread Wang, Chao
Hi,

I would like to measure the memory consumption of a job at runtime. I came 
across some discussion (here: 
https://stackoverflow.com/questions/35315522/flink-memory-usage ), and it seems 
that it was not possible two years ago. Is it possible with the current version, and 
if so, how do I do it?

Thank you,

Chao
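One thing worth noting: Flink's metrics system exposes JVM memory metrics per JobManager and TaskManager (the Status.JVM.Memory.* metric group), and the web dashboard's TaskManager view shows heap usage as well. A sketch of wiring up the JMX reporter in flink-conf.yaml, assuming a Flink 1.4-style metrics configuration (the port range is illustrative):

metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 8789-8799

Any JMX client (e.g. JConsole) attached to those ports can then read the memory metrics at runtime.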


RE: Kafka as source for batch job

2018-02-08 Thread Marchant, Hayden
Hi Gordon,

Actually, our use case is that we have a start/end timestamp, and we plan on 
calling KafkaConsumer.offsetsForTimes to get the offsets for each partition. So 
I guess our logic is different in that we need an ‘and’ predicate across the 
partitions (each partition must reach its offset), as opposed to the current ‘or’ 
predicate – i.e. any partition that fulfills the condition is enough to stop the job.

Either way, I’d still need to figure out when to stop the job.

Would it make more sense to implement an InputFormat that could wrap this 
‘bounded’ Kafka source, and use the DataSet / Batch Table API ?

Thanks
Hayden
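For what it's worth, a small sketch of resolving the per-partition offsets for a timestamp with the plain Kafka consumer API (offsetsForTimes needs brokers and clients at 0.10.1 or later; the helper name is made up, and props must contain bootstrap.servers plus byte-array deserializers):

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

// Map a timestamp to the earliest offset at or after it, for every partition of a topic.
def offsetsAtTimestamp(props: Properties, topic: String, ts: Long): Map[TopicPartition, Long] = {
  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  try {
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(p.topic, p.partition))
    val query = partitions.map(tp => tp -> java.lang.Long.valueOf(ts)).toMap.asJava
    consumer.offsetsForTimes(query).asScala.collect {
      // partitions with no record at/after ts come back as null and are dropped here
      case (tp, offsetAndTs) if offsetAndTs != null => tp -> offsetAndTs.offset()
    }.toMap
  } finally consumer.close()
}

The resulting end offsets could then be handed to a bounded deserialization schema (as sketched above) or to a custom InputFormat.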

From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org]
Sent: Thursday, February 08, 2018 8:05 PM
To: user@flink.apache.org; Marchant, Hayden [ICG-IT] 
Subject: RE: Kafka as source for batch job

Hi Marchant,

Yes, I agree. In general, the isEndOfStream method has very ill-defined 
semantics, with different behaviors across different Kafka connector 
versions.
This method will definitely need to be revisited in the future (we are thinking 
about a rework of the connector).

What is your target Kafka version? And do you know the ending offsets of _all_ 
partitions which you want to only consume a range of?
I can probably double check for you if your specific case is possible, given 
the above information.

Cheers,
Gordon


On 8 February 2018 at 3:22:24 PM, Marchant, Hayden 
(hayden.march...@citi.com) wrote:
Gordon,

Thanks for the pointer. I did some searches for usages of isEndOfStream and 
it’s a little confusing. I see that all implementors of DeserializationSchema 
must implement this method, but it’s not called from anywhere central in the 
Flink streaming engine; rather, each source can decide to use it in its own 
implementation – for example, the Kafka source stops processing the topic when 
isEndOfStream returns true. This is nice, but it localizes the treatment just to 
that operator, and even though it goes a long way in ensuring that I get just 
my bounded data, it still does not give me the ability to stop my job when I 
have finished consuming the elements.

Also, in my case I need to ensure that I have reached a certain offset for each 
of the Kafka partitions that are assigned to the instance of the source function. 
It seems from the code that I would need a different implementation of 
KafkaFetcher.runFetchLoop with slightly different logic for setting 
running to false.

What would you recommend in this case?

From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org]
Sent: Thursday, February 08, 2018 12:24 PM
To: user@flink.apache.org; Marchant, Hayden [ICG-IT]
Subject: Re: Kafka as source for batch job

Hi Hayden,

Have you tried looking into `KeyedDeserializationSchema#isEndOfStream` [1]?
I think that could be what you are looking for. It signals the end of the 
stream when consuming from Kafka.

Cheers,
Gordon


On 8 February 2018 at 10:44:59 AM, Marchant, Hayden 
(hayden.march...@citi.com) wrote:
I know that traditionally Kafka is used as a source for a streaming job. In our 
particular case, we are looking at extracting records from a Kafka topic from a 
particular well-defined offset range (per partition) - i.e. from offset X to 
offset Y. In this case, we'd somehow want the application to know that it has 
finished when it gets to offset Y. This basically changes the Kafka stream into 
bounded data, as opposed to the unbounded data of the usual streaming paradigm.

What would be the best approach to do this in Flink? I see a few options, 
though there might be more:

1. Use a regular streaming job, and have some external service that monitors 
the current offsets of the consumer group of the topic and manually stops job 
when the consumer group of the topic has finished
Pros - simple wrt Flink, Cons - hacky

2. Create a batch job, and a new InputFormat based on Kafka that reads the 
specified subset of Kafka topic into the source.
Pros - represent bounded data from Kafka topic as batch source, Cons - requires 
implementation of source.

3. Dump the subset of Kafka into a file and then trigger a more 'traditional' 
Flink batch job that reads from a file.
Pros - simple, cons - unnecessary I/O.

I personally prefer 1 and 3 for simplicity. Has anyone done anything like this 
before?

Thanks,
Hayden Marchant


dataset sort

2018-02-08 Thread david westwood
Hi:

I would like to sort historical data using the dataset api.

env.setParallelism(10)

val dataset = [(Long, String)] ..
.partitionByRange(_._1)
.sortPartition(_._1, Order.ASCENDING)
.writeAsCsv("mydata.csv").setParallelism(1)

the written data is out of global order (each partition is only locally sorted),
but
.print()
prints the data in the correct order. I have run a small toy sample multiple
times.

Is there a way to sort the entire dataset with parallelism > 1 and write it
to a single file in ascending order?
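One workaround that is sometimes suggested (a sketch, with the obvious caveat that it serializes the sort through a single task): run the sort itself with parallelism 1, so the single sink instance receives one globally sorted partition.

import org.apache.flink.api.common.operators.Order

// continuing from the dataset above: sort and write with a single task each
dataset
  .sortPartition(_._1, Order.ASCENDING).setParallelism(1)
  .writeAsCsv("mydata.csv").setParallelism(1)

With a parallel sort, the parallelism-1 sink consumes the upstream sorted partitions in no guaranteed order, which matches the locally-sorted-but-globally-unsorted output described above.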


RE: Kafka as source for batch job

2018-02-08 Thread Tzu-Li (Gordon) Tai
Hi Marchant,

Yes, I agree. In general, the isEndOfStream method has very ill-defined 
semantics, with different behaviors across different Kafka connector 
versions.
This method will definitely need to be revisited in the future (we are thinking 
about a rework of the connector).

What is your target Kafka version? And do you know the ending offsets of _all_ 
partitions which you want to only consume a range of?
I can probably double check for you if your specific case is possible, given 
the above information.

Cheers,
Gordon

On 8 February 2018 at 3:22:24 PM, Marchant, Hayden (hayden.march...@citi.com) 
wrote:

Gordon,

 

Thanks for the pointer. I did some searches for usages of isEndOfStream and 
it’s a little confusing. I see that all implementors of DeserializationSchema 
must implement this method, but it’s not called from anywhere central in the 
Flink streaming engine; rather, each source can decide to use it in its own 
implementation – for example, the Kafka source stops processing the topic when 
isEndOfStream returns true. This is nice, but it localizes the treatment just to 
that operator, and even though it goes a long way in ensuring that I get just 
my bounded data, it still does not give me the ability to stop my job when I 
have finished consuming the elements.

 

Also, in my case I need to ensure that I have reached a certain offset for each 
of the Kafka partitions that are assigned to the instance of the source function. 
It seems from the code that I would need a different implementation of 
KafkaFetcher.runFetchLoop with slightly different logic for setting 
running to false.

 

What would you recommend in this case?

 

From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org]  
Sent: Thursday, February 08, 2018 12:24 PM
To: user@flink.apache.org; Marchant, Hayden [ICG-IT] 
Subject: Re: Kafka as source for batch job

 

Hi Hayden,

 

Have you tried looking into `KeyedDeserializationSchema#isEndOfStream` [1]?

I think that could be what you are looking for. It signals the end of the 
stream when consuming from Kafka.

 

Cheers,

Gordon

 

On 8 February 2018 at 10:44:59 AM, Marchant, Hayden (hayden.march...@citi.com) 
wrote:

I know that traditionally Kafka is used as a source for a streaming job. In our 
particular case, we are looking at extracting records from a Kafka topic from a 
particular well-defined offset range (per partition) - i.e. from offset X to 
offset Y. In this case, we'd somehow want the application to know that it has 
finished when it gets to offset Y. This basically changes the Kafka stream into 
bounded data, as opposed to the unbounded data of the usual streaming paradigm.

What would be the best approach to do this in Flink? I see a few options, 
though there might be more:  

1. Use a regular streaming job, and have some external service that monitors 
the current offsets of the consumer group of the topic and manually stops job 
when the consumer group of the topic has finished  
Pros - simple wrt Flink, Cons - hacky  

2. Create a batch job, and a new InputFormat based on Kafka that reads the 
specified subset of Kafka topic into the source.  
Pros - represent bounded data from Kafka topic as batch source, Cons - requires 
implementation of source.  

3. Dump the subset of Kafka into a file and then trigger a more 'traditional' 
Flink batch job that reads from a file.  
Pros - simple, cons - unnecessary I/O.  

I personally prefer 1 and 3 for simplicity. Has anyone done anything like this 
before?  

Thanks,  
Hayden Marchant

Re: CEP for time series in csv-file

2018-02-08 Thread Timo Walther
You can also take a look at the Flink training from data Artisans and 
the code examples there. They also use CEP and basically read from 
a file as well:


http://training.data-artisans.com/exercises/CEP.html

Regards,
Timo


Am 2/8/18 um 6:09 PM schrieb Kostas Kloudas:

Hi Esa,

I think the best place to start is the documentation available at the 
flink website.


Some pointers are the following:

CEP documentation: 
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/libs/cep.html


Blog post with CEP example: 
https://data-artisans.com/blog/complex-event-processing-flink-cep-update


Cheers,
Kostas

On Feb 8, 2018, at 4:28 PM, Esa Heikkinen wrote:


Hi
I have csv-file(s) that contain an event in every row; the first 
column is the timestamp of the event, and the rest of the columns are the 
data and “attributes” of the event.
I’d want to write simple Scala code that: 1) reads the data from the csv-file, 
2) converts the data into a form compatible with CEP, 3) sets a pattern for 
CEP, 4) runs CEP, and 5) writes the results.

Do you have any hints or examples of how to do that?
By the way, what kind of timestamp should be in the csv-file?






Re: CEP for time series in csv-file

2018-02-08 Thread Kostas Kloudas
Hi Esa,

I think the best place to start is the documentation available at the flink 
website.

Some pointers are the following: 

CEP documentation: 
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/libs/cep.html 


Blog post with CEP example: 
https://data-artisans.com/blog/complex-event-processing-flink-cep-update 


Cheers,
Kostas

> On Feb 8, 2018, at 4:28 PM, Esa Heikkinen  
> wrote:
> 
> Hi
>  
> I have csv-file(s) that contain an event in every row; the first column is 
> the timestamp of the event, and the rest of the columns are the data and “attributes” of the event.
>  
> I’d want to write simple Scala code that: 1) reads the data from the csv-file, 2) 
> converts the data into a form compatible with CEP, 3) sets a pattern for CEP, 4) runs 
> CEP, and 5) writes the results.
>  
> Do you have any hints or examples of how to do that?
>  
> By the way, what kind of timestamp should be in the csv-file?



CEP for time series in csv-file

2018-02-08 Thread Esa Heikkinen
Hi

I have csv-file(s) that contain an event in every row; the first column is the 
timestamp of the event, and the rest of the columns are the data and "attributes" of the event.

I'd want to write simple Scala code that: 1) reads the data from the csv-file, 2) converts 
the data into a form compatible with CEP, 3) sets a pattern for CEP, 4) runs CEP, and 5) 
writes the results.

Do you have any hints or examples of how to do that?

By the way, what kind of timestamp should be in the csv-file?
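A minimal Scala sketch of that flow, based on the Flink 1.4 CEP Scala API (the Event class, the csv layout, and the pattern itself are only illustrative; for the timestamp, anything you can parse into a long in epoch milliseconds works well with event time):

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern

// illustrative event type: timestamp (epoch millis), an id, and one numeric attribute
case class Event(ts: Long, id: String, value: Double)

object CsvCepSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // 1) read the csv file and 2) convert each row into an event with an event-time timestamp
    val events = env.readTextFile("events.csv")
      .map { line =>
        val cols = line.split(",")
        Event(cols(0).toLong, cols(1), cols(2).toDouble)
      }
      .assignAscendingTimestamps(_.ts)

    // 3) set a pattern for CEP (here: two consecutive high values for the same id)
    val pattern = Pattern.begin[Event]("first").where(_.value > 10.0)
      .next("second").where(_.value > 10.0)

    // 4) run CEP and 5) write the results
    CEP.pattern(events.keyBy(_.id), pattern)
      .select(m => (m("first").head.id, m("first").head.ts, m("second").head.ts))
      .writeAsText("matches.txt")

    env.execute("csv cep sketch")
  }
}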




RE: Kafka as source for batch job

2018-02-08 Thread Marchant, Hayden
Gordon,

Thanks for the pointer. I did some searches for usages of isEndOfStream and 
it’s a little confusing. I see that all implementors of DeserializationSchema 
must implement this method, but it’s not called from anywhere central in the 
Flink streaming engine; rather, each source can decide to use it in its own 
implementation – for example, the Kafka source stops processing the topic when 
isEndOfStream returns true. This is nice, but it localizes the treatment just to 
that operator, and even though it goes a long way in ensuring that I get just 
my bounded data, it still does not give me the ability to stop my job when I 
have finished consuming the elements.

Also, in my case I need to ensure that I have reached a certain offset for each 
of the Kafka partitions that are assigned to the instance of the source function. 
It seems from the code that I would need a different implementation of 
KafkaFetcher.runFetchLoop with slightly different logic for setting 
running to false.

What would you recommend in this case?

From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org]
Sent: Thursday, February 08, 2018 12:24 PM
To: user@flink.apache.org; Marchant, Hayden [ICG-IT] 
Subject: Re: Kafka as source for batch job

Hi Hayden,

Have you tried looking into `KeyedDeserializationSchema#isEndOfStream` [1]?
I think that could be what you are looking for. It signals the end of the 
stream when consuming from Kafka.

Cheers,
Gordon


On 8 February 2018 at 10:44:59 AM, Marchant, Hayden 
(hayden.march...@citi.com) wrote:
I know that traditionally Kafka is used as a source for a streaming job. In our 
particular case, we are looking at extracting records from a Kafka topic from a 
particular well-defined offset range (per partition) - i.e. from offset X to 
offset Y. In this case, we'd somehow want the application to know that it has 
finished when it gets to offset Y. This basically changes the Kafka stream into 
bounded data, as opposed to the unbounded data of the usual streaming paradigm.

What would be the best approach to do this in Flink? I see a few options, 
though there might be more:

1. Use a regular streaming job, and have some external service that monitors 
the current offsets of the consumer group of the topic and manually stops job 
when the consumer group of the topic has finished
Pros - simple wrt Flink, Cons - hacky

2. Create a batch job, and a new InputFormat based on Kafka that reads the 
specified subset of Kafka topic into the source.
Pros - represent bounded data from Kafka topic as batch source, Cons - requires 
implementation of source.

3. Dump the subset of Kafka into a file and then trigger a more 'traditional' 
Flink batch job that reads from a file.
Pros - simple, cons - unnecessary I/O.

I personally prefer 1 and 3 for simplicity. Has anyone done anything like this 
before?

Thanks,
Hayden Marchant


Developing and running Flink applications in Linux through Windows editors or IDE's ?

2018-02-08 Thread Esa Heikkinen
Hello

I am newbie with Flink.

I'd like to develop my Flink Scala applications in a Windows IDE (for example 
IntelliJ IDEA) and run them in Linux (Ubuntu).
Is that a good or a bad idea? Or is some kind of remote use possible?

At the moment there is no graphical interface (GUI) in Linux. Or would it be 
better if there were one?

I have ssh-tunneled port 8081 to Windows and I can run (or monitor) my 
Flink applications in a browser (Firefox) on Windows.

Or is it better to use a simple editor like nano in Linux and forget all the 
"smart" GUI editors and IDEs on the Windows side?

Do you have recommendations how to do that ?

---







Re: Two issues when deploying Flink on DC/OS

2018-02-08 Thread Lasse Nedergaard
And We see the same too

Med venlig hilsen / Best regards
Lasse Nedergaard


> On 8 Feb 2018, at 11:58, Stavros Kontopoulos wrote:
> 
> We see the same issue here (2):
> 2018-02-08 10:55:11,447 ERROR 
> org.apache.flink.runtime.webmonitor.files.StaticFileServerHandler  - Caught 
> exception
> java.io.IOException: Connection reset by peer
> 
> Stavros
> 
>> On Sat, Jan 13, 2018 at 9:59 PM, Eron Wright  wrote:
>> Hello Dongwon,
>> 
>> Flink doesn't support a 'unique host' constraint at this time; it simply 
>> accepts adequate offers without any such consideration.   Flink does support 
>> a 'host attributes' constraint to filter certain hosts, but that's not 
>> applicable here.
>> 
>> Under the hood, Flink uses a library called Netflix Fenzo to optimize 
>> placement, and a uniqueness constraint could be added by more deeply 
>> leveraging Fenzo's constraint system.   You mentioned that you're trying to 
>> make good use of your GPU resources, which could also be achieved by 
>> treating GPU as a scalar resource (similar to how memory and cores are 
>> treated).   Mesos does support that, but Fenzo may require some enhancement. 
>>   So, these are two potential ways to enhance Flink to support your 
>> scenario.  I'm happy to help; reach out to me.
>> 
>> The obvious, ugly workaround is to configure your TMs to be large enough to 
>> consume the whole host.
>> 
>> Eron
>> 
>> 
>> 
>> 
>> 
>>> On Thu, Jan 11, 2018 at 7:18 AM, Gary Yao  wrote:
>>> Hi Dongwon,
>>> 
>>> I am not familiar with the deployment on DC/OS. However, Eron Wright and 
>>> Jörg
>>> Schad (cc'd), who have worked on the Mesos integration, might be able to 
>>> help
>>> you.
>>> 
>>> Best,
>>> Gary
>>> 
 On Tue, Jan 9, 2018 at 10:29 AM, Dongwon Kim  wrote:
 Hi,
 
 I've launched JobManager and TaskManager on DC/OS successfully.
 Now I have two new issues:
 
 1) All TaskManagers are scheduled on a single node. 
 - Is it intended to maximize data locality and minimize network 
 communication cost?
 - Is there an option in Flink to adjust the behavior of JobManager when it 
 considers multiple resource offers from different Mesos agents?
 - I want to schedule TaskManager processes on different GPU servers so 
 that each TaskManager process can use its own GPU cards exclusively.  
 - Below is a part of JobManager log that is occurring while JobManager is 
 negotiating resources with the Mesos master:
 2018-01-09 07:34:54,872 INFO  
 org.apache.flink.mesos.runtime.clusterframework.MesosJobManager  - 
 JobManager akka.tcp://flink@dnn-g08-233:18026/user/jobmanager was granted 
 leadership with leader session ID 
 Some(----).
 2018-01-09 07:34:55,889 INFO  
 org.apache.flink.mesos.scheduler.ConnectionMonitor- Connecting 
 to Mesos...
 2018-01-09 07:34:55,962 INFO  
 org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
 - Trying to associate with JobManager leader 
 akka.tcp://flink@dnn-g08-233:18026/user/jobmanager
 2018-01-09 07:34:55,977 INFO  
 org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
 - Resource Manager associating with leading JobManager 
 Actor[akka://flink/user/jobmanager#-1481183359] - leader session 
 ----
 2018-01-09 07:34:56,479 INFO  
 org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
 - Scheduling Mesos task taskmanager-1 with (10240.0 MB, 8.0 cpus).
 2018-01-09 07:34:56,481 INFO  
 org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
 - Scheduling Mesos task taskmanager-2 with (10240.0 MB, 8.0 cpus).
 2018-01-09 07:34:56,481 INFO  
 org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
 - Scheduling Mesos task taskmanager-3 with (10240.0 MB, 8.0 cpus).
 2018-01-09 07:34:56,481 INFO  
 org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
 - Scheduling Mesos task taskmanager-4 with (10240.0 MB, 8.0 cpus).
 2018-01-09 07:34:56,481 INFO  
 org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
 - Scheduling Mesos task taskmanager-5 with (10240.0 MB, 8.0 cpus).
 2018-01-09 07:34:56,483 INFO  
 org.apache.flink.mesos.scheduler.LaunchCoordinator- Now 
 gathering offers for at least 5 task(s).
 2018-01-09 07:34:56,484 INFO  
 org.apache.flink.mesos.scheduler.ConnectionMonitor- Connected 
 to Mesos as framework ID 59b85b42-a4a2-4632-9578-9e480585ecdc-0004.
 2018-01-09 07:34:56,690 INFO  
 org.apache.flink.mesos.scheduler.LaunchCoordinator- Received 
 offer(s) of 606170.0 MB, 234.2 cpus:
 2018-01-09 07:34:56,692 INFO  

Re: Two issues when deploying Flink on DC/OS

2018-02-08 Thread Stavros Kontopoulos
We see the same issue here (2):
2018-02-08 10:55:11,447 ERROR
org.apache.flink.runtime.webmonitor.files.StaticFileServerHandler  - Caught
exception
java.io.IOException: Connection reset by peer

Stavros

On Sat, Jan 13, 2018 at 9:59 PM, Eron Wright  wrote:

> Hello Dongwon,
>
> Flink doesn't support a 'unique host' constraint at this time; it simply
> accepts adequate offers without any such consideration.   Flink does
> support a 'host attributes' constraint to filter certain hosts, but that's
> not applicable here.
>
> Under the hood, Flink uses a library called Netflix Fenzo to optimize
> placement, and a uniqueness constraint could be added by more deeply
> leveraging Fenzo's constraint system.   You mentioned that you're trying to
> make good use of your GPU resources, which could also be achieved by
> treating GPU as a scalar resource (similar to how memory and cores are
> treated).   Mesos does support that, but Fenzo may require some
> enhancement.   So, these are two potential ways to enhance Flink to support
> your scenario.  I'm happy to help; reach out to me.
>
> The obvious, ugly workaround is to configure your TMs to be large enough
> to consume the whole host.
>
> Eron
>
>
>
>
>
> On Thu, Jan 11, 2018 at 7:18 AM, Gary Yao  wrote:
>
>> Hi Dongwon,
>>
>> I am not familiar with the deployment on DC/OS. However, Eron Wright and
>> Jörg
>> Schad (cc'd), who have worked on the Mesos integration, might be able to
>> help
>> you.
>>
>> Best,
>> Gary
>>
>> On Tue, Jan 9, 2018 at 10:29 AM, Dongwon Kim 
>> wrote:
>>
>>> Hi,
>>>
>>> I've launched JobManager and TaskManager on DC/OS successfully.
>>> Now I have two new issues:
>>>
>>> 1) All TaskManagers are scheduled on a single node.
>>> - Is it intended to maximize data locality and minimize network
>>> communication cost?
>>> - Is there an option in Flink to adjust the behavior of JobManager when
>>> it considers multiple resource offers from different Mesos agents?
>>> - I want to schedule TaskManager processes on different GPU servers so
>>> that each TaskManager process can use its own GPU cards exclusively.
>>> - Below is a part of JobManager log that is occurring while JobManager
>>> is negotiating resources with the Mesos master:
>>>
>>> 2018-01-09 07:34:54,872 INFO  
>>> org.apache.flink.mesos.runtime.clusterframework.MesosJobManager  - 
>>> JobManager akka.tcp://flink@dnn-g08-233:18026/user/jobmanager was granted 
>>> leadership with leader session ID 
>>> Some(----).
>>> 2018-01-09 07:34:55,889 INFO  
>>> org.apache.flink.mesos.scheduler.ConnectionMonitor- Connecting 
>>> to Mesos...
>>> 2018-01-09 07:34:55,962 INFO  
>>> org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
>>> - Trying to associate with JobManager leader 
>>> akka.tcp://flink@dnn-g08-233:18026/user/jobmanager
>>> 2018-01-09 07:34:55,977 INFO  
>>> org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
>>> - Resource Manager associating with leading JobManager 
>>> Actor[akka://flink/user/jobmanager#-1481183359] - leader session 
>>> ----
>>> 2018-01-09 07:34:56,479 INFO  
>>> org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
>>> - Scheduling Mesos task taskmanager-1 with (10240.0 MB, 8.0 cpus).
>>> 2018-01-09 07:34:56,481 INFO  
>>> org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
>>> - Scheduling Mesos task taskmanager-2 with (10240.0 MB, 8.0 cpus).
>>> 2018-01-09 07:34:56,481 INFO  
>>> org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
>>> - Scheduling Mesos task taskmanager-3 with (10240.0 MB, 8.0 cpus).
>>> 2018-01-09 07:34:56,481 INFO  
>>> org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
>>> - Scheduling Mesos task taskmanager-4 with (10240.0 MB, 8.0 cpus).
>>> 2018-01-09 07:34:56,481 INFO  
>>> org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  
>>> - Scheduling Mesos task taskmanager-5 with (10240.0 MB, 8.0 cpus).
>>> 2018-01-09 07:34:56,483 INFO  
>>> org.apache.flink.mesos.scheduler.LaunchCoordinator- Now 
>>> gathering offers for at least 5 task(s).
>>> 2018-01-09 07:34:56,484 INFO  
>>> org.apache.flink.mesos.scheduler.ConnectionMonitor- Connected 
>>> to Mesos as framework ID 59b85b42-a4a2-4632-9578-9e480585ecdc-0004.
>>> 2018-01-09 07:34:56,690 INFO  
>>> org.apache.flink.mesos.scheduler.LaunchCoordinator- Received 
>>> offer(s) of 606170.0 MB, 234.2 cpus:
>>> 2018-01-09 07:34:56,692 INFO  
>>> org.apache.flink.mesos.scheduler.LaunchCoordinator-   
>>> 59b85b42-a4a2-4632-9578-9e480585ecdc-O2174 from 50.1.100.233 of 86.0 
>>> MB, 45.9 cpus for [*]
>>> 2018-01-09 07:34:56,692 INFO  
>>> org.apache.flink.mesos.scheduler.LaunchCoordinator-   
>>> 
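For reference, Eron's "consume the whole host" workaround above would look roughly like this in flink-conf.yaml, using Flink's Mesos task sizing options (the numbers must match your agents and are only illustrative):

# size each Mesos TaskManager so that only one fits per agent
mesos.resourcemanager.tasks.mem: 61440
mesos.resourcemanager.tasks.cpus: 32.0
taskmanager.numberOfTaskSlots: 32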

Re: Kafka as source for batch job

2018-02-08 Thread Tzu-Li (Gordon) Tai
Hi Hayden,

Have you tried looking into `KeyedDeserializationSchema#isEndOfStream` [1]?
I think that could be what you are looking for. It signals the end of the 
stream when consuming from Kafka.

Cheers,
Gordon

On 8 February 2018 at 10:44:59 AM, Marchant, Hayden (hayden.march...@citi.com) 
wrote:

I know that traditionally Kafka is used as a source for a streaming job. In our 
particular case, we are looking at extracting records from a Kafka topic from a 
particular well-defined offset range (per partition) - i.e. from offset X to 
offset Y. In this case, we'd somehow want the application to know that it has 
finished when it gets to offset Y. This basically changes the Kafka stream into 
bounded data, as opposed to the unbounded data of the usual streaming paradigm. 

What would be the best approach to do this in Flink? I see a few options, 
though there might be more: 

1. Use a regular streaming job, and have some external service that monitors 
the current offsets of the consumer group of the topic and manually stops job 
when the consumer group of the topic has finished 
Pros - simple wrt Flink, Cons - hacky 

2. Create a batch job, and a new InputFormat based on Kafka that reads the 
specified subset of Kafka topic into the source. 
Pros - represent bounded data from Kafka topic as batch source, Cons - requires 
implementation of source. 

3. Dump the subset of Kafka into a file and then trigger a more 'traditional' 
Flink batch job that reads from a file. 
Pros - simple, cons - unnecessary I/O. 

I personally prefer 1 and 3 for simplicity. Has anyone done anything like this 
before? 

Thanks, 
Hayden Marchant 



Kafka as source for batch job

2018-02-08 Thread Marchant, Hayden
I know that traditionally Kafka is used as a source for a streaming job. In our 
particular case, we are looking at extracting records from a Kafka topic from a 
particular well-defined offset range (per partition) - i.e.  from offset X to 
offset Y. In this case, we'd somehow want the application to know that it has 
finished when it gets to offset Y. This basically changes the Kafka stream into 
bounded data, as opposed to the unbounded data of the usual streaming paradigm.

What would be the best approach to do this in Flink? I see a few options, 
though there might be more:

1. Use a regular streaming job, and have some external service that monitors 
the current offsets of the consumer group of the topic and manually stops job 
when the consumer group of the topic has finished 
   Pros - simple wrt Flink, Cons - hacky

2. Create a batch job, and a new InputFormat based on Kafka that reads the 
specified subset of Kafka topic into the source. 
   Pros - represent bounded data from Kafka topic as batch source, Cons  - 
requires implementation of source.

3. Dump the subset of Kafka into a file and then trigger a more 'traditional' 
Flink batch job that reads from a file.  
   Pros - simple, cons - unnecessary I/O.

I personally prefer 1 and 3 for simplicity. Has anyone done anything like this 
before?

Thanks,
Hayden Marchant



Re: Question about flink checkpoint

2018-02-08 Thread Fabian Hueske
Great, thank you!

Best, Fabian

2018-02-07 23:52 GMT+01:00 Chengzhi Zhao :

> Thanks, Fabian,
>
> I opened an JIRA ticket and I'd like to work on it if people think this
> would be a improvement:
> https://issues.apache.org/jira/browse/FLINK-8599
>
> Best,
> Chengzhi
>
> On Wed, Feb 7, 2018 at 4:17 AM, Fabian Hueske  wrote:
>
>> Hi Chengzhi Zhao,
>>
>> I think this is rather an issue with the ContinuousFileReaderOperator
>> than with the checkpointing algorithm in general.
>> A source can decide which information to store as state and also how to
>> handle failures such as file paths that have been put into state but have
>> been removed from the file system.
>>
>> It would be great if you could open a JIRA issue with a feature request
>> to improve the failure behavior of the ContinuousFileReaderOperator.
>> It could, for example, check if a path exists before trying to read a
>> file and ignore the input split instead of throwing an exception and
>> causing a failure.
>> If you want to, you can also work on a fix and contribute it back.
>>
>> Best, Fabian
>>
>> 2018-02-06 19:15 GMT+01:00 Chengzhi Zhao :
>>
>>> Hey, I am new to flink and I have a question and want to see if anyone
>>> can help here.
>>>
>>> So we have a s3 path that flink is monitoring that path to see new files
>>> available.
>>>
>>> val avroInputStream_activity = env.readFile(format, path,
>>> FileProcessingMode.PROCESS_CONTINUOUSLY, 1)
>>>
>>> I am doing both internal and external checkpointing. Let's say a bad file
>>> comes to the path and Flink does several retries; I want to skip those bad
>>> files and let the process continue. However, since the file path persists
>>> in the checkpoint, when I try to resume from the external checkpoint, it
>>> throws the following error because the file can no longer be found.
>>>
>>> java.io.IOException: Error opening the Input Split s3a://myfile [0,904]:
>>> No such file or directory: s3a://myfile
>>>
>>> Is there a way to skip this bad file and move on?
>>> Thanks in advance.
>>>
>>> Best,
>>> Chengzhi Zhao
>>>
>>>
>>
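Along the lines of Fabian's suggestion, a rough user-level sketch of a format that skips splits whose files have disappeared instead of failing the job (untested; it assumes the Flink 1.4 TextInputFormat API, and catching IOException this broadly in open() is deliberately crude):

import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.{FileInputSplit, Path}

// If opening a split fails (e.g. the file was removed after it was enumerated),
// mark the split as already exhausted so the reader simply moves on to the next one.
class SkipMissingFilesFormat(path: Path) extends TextInputFormat(path) {

  @transient private var skipSplit: Boolean = false

  override def open(split: FileInputSplit): Unit = {
    skipSplit = false
    try {
      super.open(split)
    } catch {
      case _: java.io.IOException => skipSplit = true // treat any open failure as "file gone"
    }
  }

  override def reachedEnd(): Boolean = skipSplit || super.reachedEnd()
}

An instance of this format could then be passed to env.readFile(...) in place of the original one, though a proper fix inside the ContinuousFileReaderOperator (as discussed in FLINK-8599) is the cleaner route.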
>