Re: Kafka ProducerFencedException after checkpointing

2019-08-12 Thread Matthias J. Sax

`transaction.timeout.ms` is a producer setting, thus you can increase
it accordingly.

Note that brokers bound this range via `transaction.max.timeout.ms`;
thus, you may need to increase this broker config, too.
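For illustration, a minimal sketch (broker address and the one-hour value are placeholders, not recommendations):

    import java.util.Properties;

    // Hedged sketch: raising the producer-side transaction timeout for a
    // Kafka producer used by Flink; one hour is an arbitrary example value.
    Properties producerProps = new Properties();
    producerProps.setProperty("bootstrap.servers", "broker:9092"); // placeholder address
    producerProps.setProperty("transaction.timeout.ms", String.valueOf(60 * 60 * 1000));
    // The broker side must allow this value, e.g. in the broker's server.properties:
    // transaction.max.timeout.ms=3600000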


-Matthias

On 8/12/19 2:43 AM, Piotr Nowojski wrote:
> Hi,
> 
> Ok, I see. You can try to rewrite your logic (or maybe the records'
> schema, by adding some ID fields) to manually deduplicate the
> records after processing them with at-least-once semantics. Such a
> setup is usually simpler, with slightly better throughput and
> significantly better latency (end-to-end exactly-once latency is
> limited by the checkpointing time).
> 
> Piotrek
> 
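A minimal sketch of such a deduplication, assuming the stream is keyed by an ID field (here f0 of a Tuple2); all names are illustrative:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Keyed state remembers whether a given ID was already emitted.
    // Usage sketch: stream.keyBy(0).flatMap(new DeduplicateById())
    public class DeduplicateById
            extends RichFlatMapFunction<Tuple2<String, String>, Tuple2<String, String>> {

        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("seen", Boolean.class));
        }

        @Override
        public void flatMap(Tuple2<String, String> value,
                            Collector<Tuple2<String, String>> out) throws Exception {
            if (seen.value() == null) {   // first record carrying this ID
                seen.update(true);
                out.collect(value);
            }
        }
    }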
>> On 12 Aug 2019, at 11:12, Tony Wei wrote:
>> 
>> Hi Piotr,
>> 
>> Thanks a lot. I need exactly once in my use case, but instead of
>> risking losing data, at least once is more acceptable when an
>> error occurs.
>> 
>> Best, Tony Wei
>> 
>> On Mon, 12 Aug 2019 at 3:27 PM, Piotr Nowojski wrote:
>> 
>> Hi,
>> 
>> Yes, if it’s due to transaction timeout you will lose the data.
>> 
>> Whether you can fall back to at least once depends on
>> Kafka, not on Flink, since it is Kafka that times out those
>> transactions, and I don't see anything in the documentation that
>> could override this [1]. You might try disabling the mechanism
>> via setting
>> `transaction.abort.timed.out.transaction.cleanup.interval.ms`
>> or `transaction.remove.expired.transaction.cleanup.interval.ms`,
>> but that's a question more for the Kafka guys. Maybe Becket could help
>> with this.
>> 
>> Also it MIGHT be that Kafka doesn’t remove records from the
>> topics when aborting the transaction and MAYBE you can still
>> access them via "READ_UNCOMMITTED" mode. But that's, again, a
>> question for Kafka.
>> 
>> Sorry that I can not help more.
>> 
>> If you do not care about exactly once, why don’t you just set
>> the connector to at least once mode?
>> 
>> Piotrek
>> 
>>> On 12 Aug 2019, at 06:29, Tony Wei wrote:
>>> 
>>> Hi,
>>> 
>>> I had the same exception recently. I want to confirm that if
>>> it is due to transaction timeout, then I will lose that data.
>>> Am I right? Can I make it fall back to at-least-once semantics
>>> in this situation?
>>> 
>>> Best, Tony Wei
>>> 
>>> On Wed, 21 Mar 2018 at 10:28 PM, Piotr Nowojski wrote:
>>> 
>>> Hi,
>>> 
>>> But that’s exactly the case: producer’s transaction timeout 
>>> starts when the external transaction starts - but 
>>> FlinkKafkaProducer011 keeps an active Kafka transaction for the
>>> whole period between checkpoints.
>>> 
>>> As I wrote in the previous message:
>>> 
 in case of failure, your timeout must also be able to cover
>>> the additional downtime required for the successful job 
>>> restart. Thus you should increase your timeout accordingly.
>>> 
>>> I think that a 15-minute timeout is way too small a value. If 
>>> your job fails because of some intermittent failure (for 
>>> example worker crash/restart), you will only have a couple of 
>>> minutes for a successful Flink job restart. Otherwise you will
>>> lose some data (because of the transaction timeouts).
>>> 
>>> Piotrek
>>> 
 On 21 Mar 2018, at 10:30, Dongwon Kim wrote:
 
 Hi Piotr,
 
 Now my streaming pipeline is working without retries. I
 decreased Flink's checkpoint interval from 15min to 10min as
 you suggested [see screenshot_10min_ckpt.png].
 
 I thought that the producer's transaction timeout starts when the
 external transaction starts. The truth is that the producer's
 transaction timeout starts after the last external checkpoint
 is committed. Now that I have 15 min for the producer's
 transaction timeout and 10 min for Flink's checkpoint
 interval, and every checkpoint takes less than 5 minutes,
 everything is working fine. Am I right?
 
 Anyway thank you very much for the detailed explanation!
 
 Best,
 
 Dongwon
 
 
 
 On Tue, Mar 20, 2018 at 8:10 PM, Piotr Nowojski <pi...@data-artisans.com> wrote:
 
 Hi,
 
 Please increase transaction.timeout.ms to a greater value or
 decrease Flink's checkpoint interval; I'm pretty sure the
 issue here is that those two values are overlapping. I think
 that's even visible on the screenshots. The first checkpoint
 started at 14:28:48 and ended at 14:30:43, while
 the second one started at 14:45:53 and ended at 14:49:16.
 That gives you a minimal transaction duration of 15 minutes and
 10 seconds, with a maximal transaction duration of 21 minutes.
 
 In HAPPY 

Re: Latest spark yahoo benchmark

2017-06-18 Thread Matthias J. Sax
From my understanding, the benchmark was done using Structured Streaming,
which is still based on micro-batching.

There are no throughput numbers for the new "Continuous Processing"
model Spark wants to introduce, only some latency numbers. Also note
that the new "Continuous Processing" will not give exactly-once
semantics but only at-least-once (at least initially). Thus, there is
a tradeoff to make when using "Continuous Processing" once it's available.


-Matthias


On 06/18/2017 03:51 PM, nragon wrote:
> databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html
> 
> Should Flink users be worried about this huge difference?
> End of microbatch with 65M benchmark.
> 
> 
> 
> 





Re: [ANNOUNCE] Welcome Stefan Richter as a new committer

2017-02-10 Thread Matthias J. Sax

Congrats!

On 2/10/17 2:00 AM, Ufuk Celebi wrote:
> Hey everyone,
> 
> I'm very happy to announce that the Flink PMC has accepted Stefan 
> Richter to become a committer of the Apache Flink project.
> 
> Stefan has been part of the community for almost a year now and worked
> on major features of the latest 1.2 release, most notably rescaling
> and backwards compatibility of program state.
> 
> Please join me in welcoming Stefan. :-)
> 
> – Ufuk
> 


Re: How to read from a Kafka topic from the beginning

2017-01-16 Thread Matthias J. Sax

I would put this differently: the "auto.offset.reset" policy is only used
if there are no valid committed offsets for a topic.

See here:
http://stackoverflow.com/documentation/apache-kafka/5449/consumer-groups-and-offset-management

(don't be confused about "earliest/latest" and "smallest/largest" --
the former is for Kafka 0.9+ and the latter for 0.8.2 -- but the
mechanism is the same)

But I thought Flink does not rely on consumer offset commits but does
"manual" offset management? So I am wondering whether this property is
passed into the Kafka source operator's Kafka consumer or not.
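For reference, a minimal sketch of how the property would be passed (topic, group id, and addresses are placeholders; whether it takes effect is exactly the question above):

    import java.util.Properties;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class ReadTopicFromBeginning {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker:9092");     // placeholder
            props.setProperty("zookeeper.connect", "zookeeper:2181");  // needed by the 0.8 consumer
            props.setProperty("group.id", "my-group");
            props.setProperty("auto.offset.reset", "smallest");        // "earliest" for the 0.9+ consumer

            DataStream<String> stream = env.addSource(
                    new FlinkKafkaConsumer08<>("my-topic", new SimpleStringSchema(), props));
            stream.print();

            env.execute("read topic from beginning");
        }
    }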


-Matthias


On 11/17/15 2:30 AM, Robert Metzger wrote:
> Hi Will,
> 
> In Kafka's consumer configuration [1] there is a configuration
> parameter called "auto.offset.reset". Setting it to "smallest" will
> tell the consumer to start reading a topic from the smallest
> available offset.
> 
> You can pass the configuration using the properties of the Kafka
> consumer.
> 
> 
> [1] http://kafka.apache.org/documentation.html#consumerconfigs
> 
> 
> On Tue, Nov 17, 2015 at 8:55 AM, Miaoyongqiang (Will) wrote:
> 
> Hi,
> 
> 
> How can I tell a “FlinkKafkaConsumer” that I want to read from a 
> topic from the beginning?
> 
> 
> Thanks,
> 
> Will
> 
> 


Re: [DISCUSS] "Who's hiring on Flink" monthly thread in the mailing lists?

2016-12-13 Thread Matthias J. Sax

I think it's worth announcing this via the news list. :)

On 12/13/16 7:32 AM, Robert Metzger wrote:
> The commun...@flink.apache.org has been created :)
> 
> On Tue, Dec 13, 2016 at 10:43 AM, Robert Metzger wrote:
> 
> +1. I've requested the community@ mailing list from infra.
> 
> On Tue, Dec 13, 2016 at 10:40 AM, Kostas Tzoumas wrote:
> 
> It seems that several folks are excited about the idea - but there
> is still a concern on whether this would be spam for the dev@ and
> user@ lists (which I share)
> 
> As a compromise, I propose to request a new mailing list ( 
> commun...@flink.apache.org ) 
> which we can use for this purpose, and also to post upcoming
> meetups, conferences, etc. In order to inform the community about
> this mailing list, we can cc the dev@ and user@ lists in the first 
> months until the new mailing list has ramped up.
> 
> On Fri, Dec 9, 2016 at 4:55 PM, Greg Hogan wrote:
> 
>> Google indexes the mailing list. Anyone can filter the
> messages to trash
>> in a few clicks.
>> 
>> This will also be a means for the community to better
> understand which and
>> how companies are using Flink.
>> 
>> On Fri, Dec 9, 2016 at 8:27 AM, Felix Neutatz wrote:
>> 
>>> Hi,
>>> 
>>> I wonder whether a mailing list is a good choice for that in
> general. If
>>> I am looking for a job I won't register for a mailing list or
> browse
>>> through the archive of one but rather search it via Google.
> So what about
>>> putting it on a dedicated site on the Web Page. This feels
> more intuitive
>>> to me and gives a better overview.
>>> 
>>> Best regards, Felix
>>> 
>>> On Dec 9, 2016 14:20, "Ufuk Celebi" wrote:
>>> 
>>> 
>>> 
>>> 
>>> On 9 December 2016 at 14:13:14, Robert Metzger (rmetz...@apache.org) wrote:
 I'm against using the news@ list for that. The promise of the
 news@ list is that it's low-traffic and only for news. If
 we now start having job offers (and potentially some questions on them
 etc.) it'll be a list with more than some announcements.
 That's also the reason why the news@ list is completely moderated.
>>> 
>>> I agree with Robert. I would consider that to be spam if posted to news@.
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 


Re: microsecond resolution

2016-12-04 Thread Matthias J. Sax

Oh. My bad... Did not read your question carefully enough.

Then the answer is no, it does not support microseconds (only
milliseconds).
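A minimal sketch of the usual workaround, assuming events carry a microsecond timestamp in field f0 of a Tuple2 (the 1-second watermark lag is arbitrary):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
    import org.apache.flink.streaming.api.watermark.Watermark;

    public class MicrosToMillis {
        public static DataStream<Tuple2<Long, String>> withEventTime(DataStream<Tuple2<Long, String>> events) {
            return events.assignTimestampsAndWatermarks(
                    new AssignerWithPeriodicWatermarks<Tuple2<Long, String>>() {
                        // avoid underflow before the first element is seen
                        private long currentMaxMillis = Long.MIN_VALUE + 1000;

                        @Override
                        public long extractTimestamp(Tuple2<Long, String> e, long previousTimestamp) {
                            long millis = e.f0 / 1000; // microseconds -> milliseconds
                            currentMaxMillis = Math.max(currentMaxMillis, millis);
                            return millis;
                        }

                        @Override
                        public Watermark getCurrentWatermark() {
                            return new Watermark(currentMaxMillis - 1000);
                        }
                    });
        }
    }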

-Matthias


On 12/4/16 2:22 PM, jeff jacobson wrote:
> Sorry if I'm missing something. That link mentions milliseconds,
> no? My question is whether or not I can specify microseconds, where
> 1000 microseconds = 1 millisecond. Thanks!
> 
> On Sun, Dec 4, 2016 at 5:05 PM, Matthias J. Sax <mj...@apache.org> wrote:
> 
> Yes. It does.
> 
> See:
> https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/event_timestamps_watermarks.html#assigning-timestamps
> 
> "Both timestamps and watermarks are specified as millliseconds
> since the Java epoch of 1970-01-01T00:00:00Z."
> 
> 
> 
> -Matthias
> 
> 
> On 12/04/2016 10:57 AM, jeff jacobson wrote:
>> I've sourced stackoverflow, the docs, and the web but I can't
>> figure out: does flink support microsecond timestamp resolution?
>> Thanks!
> 
> 


Re: [SUGGESTION] Stack Overflow Documentation for Apache Flink

2016-09-05 Thread Matthias J. Sax
I voted. It's live now.

The advantage of SO documentation is also that people not familiar with
Apache might do some documentation (but would never open a PR). Of
course, as a community, we should put the focus on the web page docs. But
having something additional can't hurt.

From my experience, it is also good if certain aspects are described by
different people and thus from different points of view. It often helps
users to understand better.

Also, recurring SO questions can be handled nicely by SO documentation.

-Matthias

On 09/05/2016 01:25 PM, Till Rohrmann wrote:
> I've understood the SO documentation approach similar to what Max has
> said. I see it as source of code examples which illustrate Flink
> concepts and which is maintained by the SO community.
> 
> On Mon, Sep 5, 2016 at 1:09 PM, Maximilian Michels wrote:
> 
> I thought it is not about outsourcing but about providing an
> example-based documentation on SO which can be easily edited by the SO
> community. The work can be fed back to the official Flink
> documentation which will always be on flink.apache.org.
> 
> On Mon, Sep 5, 2016 at 12:42 PM, Fabian Hueske wrote:
> > Thanks for the suggestion Vishnu!
> > Stackoverflow documentation looks great. I like the easy
> contribution and
> > versioning features.
> >
> > However, I am a bit skeptical. IMO, Flink's primary documentation
> must be
> > hosted by Apache. Out-sourcing such an important aspect of a
> project to an
> > external service is not an option for me.
> > This would mean, that documentation on SO would be an additional /
> secondary
> > documentation. I see two potential problems with that:
> >
> > - It is duplicate effort to keep two documentations up-to-date.
> Adding a new
> > feature of changing some behavior must be documented in two places.
> > - Efforts to improve documentation might split up, i.e., the primary
> > documentation might receive less improvements and contributions.
> >
> > Of course, this is just my opinion but I think it is worth to
> mention these
> > points.
> >
> > Thanks,
> > Fabian
> >
> > 2016-09-05 12:22 GMT+02:00 Ravikumar Hawaldar:
> >>
> >> Hi,
> >>
> >>
> >> I just committed to apache-flink documentation on SO, one more commit
> >> required. Nice idea to document on SO Vishnu.
> >>
> >>
> >>
> >> Regards,
> >>
> >> Ravikumar
> >>
> >> On 5 September 2016 at 14:22, Maximilian Michels wrote:
> >>>
> >>> Hi!
> >>>
> >>> This looks neat. Let's try it out. I just voted.
> >>>
> >>> Cheers,
> >>> Max
> >>>
> >>> On Sun, Sep 4, 2016 at 8:11 PM, Vishnu Viswanath wrote:
> >>> > Hi All,
> >>> >
> >>> > Why don't we make use of Stackoverflow's new documentation
> feature to
> >>> > do
> >>> > some documentation of Apache Flink.
> >>> >
> >>> > To start, at least 5 SO users should commit to document, who have at
> >>> > least 150 reputation and have at least 1 positively scored answer in
> >>> > the Flink tag.
> >>> >
> >>> > http://stackoverflow.com/documentation/apache-flink
> 
> >>> >
> >>> > Regards,
> >>> > Vishnu Viswanath
> >>
> >>
> >
> 
> 





Re: Join two streams using a count-based window

2016-06-10 Thread Matthias J. Sax
I just put an answer to SO.

About the other questions: Flink processes tuple-by-tuple and does some
internal buffering. You might be interested in
https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

-Matthias

On 06/09/2016 08:13 PM, Nikos R. Katsipoulakis wrote:
> Hello all,
> 
> At first, I have a question posted on
> http://stackoverflow.com/questions/37732978/join-two-streams-using-a-count-based-window
> . I am re-posting this on the mailing list in case some of you are not
> on SO.
> 
> In addition, I would like to know what is the difference between Flink
> and other Streaming engines on data-granularity transport and
> processing. To be more precise, I am aware that Storm sends tuples using
> Netty (by filling up queues) and a Bolt's logic is executed per tuple.
> Spark, employs micro-batches to simulate streaming and (I am not
> entirely certain) each task performs processing on a micro-batch. What
> about Flink? How are tuples transferred and processed. Any explanation
> and or article/blog-post/link is more than welcome.
> 
> Thanks
> 
> -- 
> Nikos R. Katsipoulakis, 
> Department of Computer Science 
> University of Pittsburgh





Re: Flink's WordCount at scale of 1BLN of unique words

2016-05-23 Thread Matthias J. Sax
Are you talking about a streaming or a batch job?

You are mentioning a "text stream" but also say you want to stream 100TB
-- indicating you have a finite data set using DataSet API.

-Matthias

On 05/22/2016 09:50 PM, Xtra Coder wrote:
> Hello, 
> 
> Question from newbie about how Flink's WordCount will actually work at
> scale. 
> 
> I've read/seen rather many high-level presentations and do not see
> more-or-less clear answers for following …
> 
> Use-case: 
> --
> there is huuuge text stream with very variable set of words – let's say
> 1BLN of unique words. Storing them just as raw text, without
> supplementary data, will take roughly 16TB of RAM. How Flink is
> approaching this internally. 
> 
> Here I'm more interested in following:
> 1.  How individual words are spread in cluster of Flink nodes? 
> Will each word appear exactly in one node and will be counted there or
> ... I'm not sure about the variants
> 
> 2.  As far as I understand – while job is running all its intermediate
> aggregation results are stored in-memory across cluster nodes (which may
> be partially written to local drive). 
> Wild guess - what size of cluster is required to run above mentioned
> tasks efficiently?
> 
> And two functional question on top of this  ...
> 
> 1. Since intermediate results are in memory – I guess it should be
> possible to get “current” counter for any word being processed. 
> Is this possible?
> 
> 2. After I've streamed 100TB of text – what will be the right way to
> save result to HDFS. For example I want to save list of words ordered by
> key with portions of 10mln per file compressed with bzip2. 
> What APIs I should use? 
> Since Flink uses intermediate snapshots for fault-tolerance - is it
> possible to save whole "current" state without stopping the stream?
> 
> Thanks.





Re: Barriers at work

2016-05-13 Thread Matthias J. Sax
I don't think barriers can "expire" as of now. Might be a nice idea
though -- I don't know if this might be a problem in production.

Furthermore, I want to point out that an "expiring checkpoint" would
not break exactly-once processing, as the latest successful checkpoint
can always be used to recover correctly. Only the recovery time would
increase, because if a "barrier expires" and no checkpoint can be
stored, more data has to be replayed using the "old" checkpoint.


-Matthias

On 05/12/2016 09:21 PM, Srikanth wrote:
> Hello,
> 
> I was reading about Flink's checkpoint and wanted to check if I
> correctly understood the usage of barriers for exactly once processing.
>  1) Operator does alignment by buffering records coming after a barrier
> until it receives barrier from all upstream operators instances.
>  2) Barrier is always preceded by a watermark to trigger processing all
> windows that are complete.
>  3) Records in windows that are not triggered are also saved as part of
> checkpoint. These windows are repopulated when restoring from checkpoints. 
> 
> In production setups, were there any cases where alignment during
> checkpointing caused unacceptable latency?
> If so, is there a way to indicate say wait for a MAX 100 ms? That way we
> have exactly-once in most situations but prefer at least once over
> higher latency in corner cases.
> 
> Srikanth





Re: synchronizing two streams

2016-05-12 Thread Matthias J. Sax
I see. But even if you had an operator (A,B)->(A,B), it would not
be possible to block A if B does not deliver any data, because of
Flink's internal design.

You will need to use a custom solution: something like a map (one
for each stream) that uses a side-communication channel (i.e., external to
Flink). The maps could send heartbeats to each other as long as there
is input data available. As long as heartbeats are received, data is
forwarded. If there are no heartbeats from the other map, it indicates
that the other stream lacks data and thus forwarding can block to
throttle its own stream.
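A rough sketch of that idea (the HeartbeatChannel is a hypothetical helper backed by some external service, not a Flink API; the values are arbitrary):

    import org.apache.flink.api.common.functions.RichMapFunction;

    // Side channel the two maps use to tell each other "my stream still has input".
    interface HeartbeatChannel extends java.io.Serializable {
        void sendHeartbeat();
        long millisSinceLastRemoteHeartbeat();
    }

    // Forwards records only while the other stream's map keeps sending heartbeats.
    public class ThrottlingMap<T> extends RichMapFunction<T, T> {

        private final HeartbeatChannel channel;
        private final long maxSilenceMillis;

        public ThrottlingMap(HeartbeatChannel channel, long maxSilenceMillis) {
            this.channel = channel;
            this.maxSilenceMillis = maxSilenceMillis;
        }

        @Override
        public T map(T value) throws Exception {
            channel.sendHeartbeat();
            // Block (throttle this stream) while the other stream looks stuck.
            while (channel.millisSinceLastRemoteHeartbeat() > maxSilenceMillis) {
                Thread.sleep(100);
            }
            return value;
        }
    }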

-Matthias


On 05/12/2016 03:36 PM, Alexander Gryzlov wrote:
> Yes, this is generally a viable design, and is actually something we
> started off with.
> 
> The problem in our case is, however, that either of the streams can
> occasionally (due to external producer's issues) get stuck for an
> arbitrary period of time, up to several hours. Buffering the other one
> during all this time would just blow the memory - streams' rates are
> dozens or even hundreds of Mb/sec. 
> 
> Alex
> 
> On Thu, May 12, 2016 at 4:00 PM, Matthias J. Sax <mj...@apache.org> wrote:
> 
> That is correct. But there is no reason to throttle an input stream.
> 
> If you implement an outer join you will have two in-memory buffers
> holding the records of each stream for your "time window". Each time you
> receive a watermark, you can remove all "expired" records from the
> buffer of the other stream. Furthermore, you need to track if a record
> got joined or not. For all records that did not get joined, emit a
> "record-null" (or "null-record") result tuple before removing them.
> 
> No need to block/sleep.
> 
> Does this make sense?
> 
> 
> -Matthias
> 
> 
> On 05/12/2016 02:51 PM, Alexander Gryzlov wrote:
> > Hmm, probably I don't really get how Flink's execution model works. As
> > far as I understand, the preferred way to throttle down stream
> > consumption is to simply have an operator with a conditional
> > Thread.sleep() inside. Wouldn't calling sleep() in either
> > of TwoInputStreamOperator's processWatermarkN() methods just freeze the
> > entire operator, stopping the consumption of both streams (as opposed to
> > just one)?
> >
> > Alex
> >
> > On Thu, May 12, 2016 at 2:31 PM, Matthias J. Sax <mj...@apache.org> wrote:
> >
> > I cannot follow completely. TwoInputStreamOperator defines two methods
> > to process watermarks for each stream.
> >
> > So you can sync both streams within the outer join operator you plan to
> > implement.
> >
> > -Matthias
> >
> > On 05/11/2016 05:00 PM, Alexander Gryzlov wrote:
> > > Hello,
> > >
> > > We're implementing a streaming outer join operator based on a
> > > TwoInputStreamOperator with an internal buffer. In our use-case
> > only the
> > > items whose timestamps are within a several-second interval
> of each
> > > other can join, so we need to synchronize the two input
> streams to
> > > ensure maximal yield. Our plan is to utilize the watermark
> > mechanism to
> > > implement some sort of a "throttling" operator, which would
> take two
> > > streams and stop passing through one of them based on the
> > watermarks in
> > > another. However, there doesn't seem to exist an operator of
> the shape
> > > (A,B)->(A,B) in Flink, where A and B can be received and emitted
> > > independently. What would be a resource-saving way to
> implement such
> > > (e.g., without spawning two more parallel
> TwoInputStreamOperators)?
> > >
> > > Alex
> >
> >
> 
> 





Re: synchronizing two streams

2016-05-12 Thread Matthias J. Sax
That is correct. But there is no reason to throttle an input stream.

If you implement an outer join you will have two in-memory buffers
holding the records of each stream for your "time window". Each time you
receive a watermark, you can remove all "expired" records from the
buffer of the other stream. Furthermore, you need to track if a record
got joined or not. For all records that did not get joined, emit a
"record-null" (or "null-record") result tuple before removing them.
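A minimal, operator-agnostic sketch of that buffer bookkeeping (names and the time-based expiry are illustrative; keying and state backends are ignored):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // One buffer per input stream; timestamps are event-time milliseconds.
    class OuterJoinBuffer<T> {

        private static class Entry<E> {
            final E record;
            final long timestamp;
            boolean joined = false;
            Entry(E record, long timestamp) { this.record = record; this.timestamp = timestamp; }
        }

        private final List<Entry<T>> buffer = new ArrayList<>();
        private final long joinWindowMillis;

        OuterJoinBuffer(long joinWindowMillis) { this.joinWindowMillis = joinWindowMillis; }

        void add(T record, long timestamp) { buffer.add(new Entry<>(record, timestamp)); }

        // Called for every record of the *other* stream: returns the join partners.
        List<T> matchAndMark(long otherTimestamp) {
            List<T> partners = new ArrayList<>();
            for (Entry<T> e : buffer) {
                if (Math.abs(e.timestamp - otherTimestamp) <= joinWindowMillis) {
                    e.joined = true;
                    partners.add(e.record);
                }
            }
            return partners;
        }

        // Called when a watermark of the *other* stream arrives: expired records are
        // removed; those that never joined are returned so the caller can emit the
        // "record-null" results.
        List<T> expire(long otherWatermark) {
            List<T> unjoined = new ArrayList<>();
            Iterator<Entry<T>> it = buffer.iterator();
            while (it.hasNext()) {
                Entry<T> e = it.next();
                if (e.timestamp < otherWatermark - joinWindowMillis) {
                    if (!e.joined) {
                        unjoined.add(e.record);
                    }
                    it.remove();
                }
            }
            return unjoined;
        }
    }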

No need to block/sleep.

Does this make sense?


-Matthias


On 05/12/2016 02:51 PM, Alexander Gryzlov wrote:
> Hmm, probably I don't really get how Flink's execution model works. As
> far as I understand, the preferred way to throttle down stream
> consumption is to simply have an operator with a conditional
> Thread.sleep() inside. Wouldn't calling sleep() in either
> of TwoInputStreamOperator's processWatermarkN() methods just freeze the
> entire operator, stopping the consumption of both streams (as opposed to
> just one)?
> 
> Alex
> 
> On Thu, May 12, 2016 at 2:31 PM, Matthias J. Sax <mj...@apache.org> wrote:
> 
> I cannot follow completely. TwoInputStreamOperator defines two methods
> to process watermarks for each stream.
> 
> So you can sync both streams within the outer join operator you plan to
> implement.
> 
> -Matthias
> 
> On 05/11/2016 05:00 PM, Alexander Gryzlov wrote:
> > Hello,
> >
> > We're implementing a streaming outer join operator based on a
> > TwoInputStreamOperator with an internal buffer. In our use-case
> only the
> > items whose timestamps are within a several-second interval of each
> > other can join, so we need to synchronize the two input streams to
> > ensure maximal yield. Our plan is to utilize the watermark
> mechanism to
> > implement some sort of a "throttling" operator, which would take two
> > streams and stop passing through one of them based on the
> watermarks in
> > another. However, there doesn't seem to exist an operator of the shape
> > (A,B)->(A,B) in Flink, where A and B can be received and emitted
> > independently. What would be a resource-saving way to implement such
> > (e.g., without spawning two more parallel TwoInputStreamOperators)?
> >
> > Alex
> 
> 





Re: synchronizing two streams

2016-05-12 Thread Matthias J. Sax
I cannot follow completely. TwoInputStreamOperator defines two methods
to process watermarks for each stream.

So you can sync both streams within the outer join operator you plan to
implement.

-Matthias

On 05/11/2016 05:00 PM, Alexander Gryzlov wrote:
> Hello,
> 
> We're implementing a streaming outer join operator based on a
> TwoInputStreamOperator with an internal buffer. In our use-case only the
> items whose timestamps are within a several-second interval of each
> other can join, so we need to synchronize the two input streams to
> ensure maximal yield. Our plan is to utilize the watermark mechanism to
> implement some sort of a "throttling" operator, which would take two
> streams and stop passing through one of them based on the watermarks in
> another. However, there doesn't seem to exist an operator of the shape
> (A,B)->(A,B) in Flink, where A and B can be received and emitted
> independently. What would be a resource-saving way to implement such
> (e.g., without spawning two more parallel TwoInputStreamOperators)?
> 
> Alex





Re: Gracefully stop long running streaming job

2016-04-18 Thread Matthias J. Sax
If all your sources implement the Stoppable interface, you can STOP a job:

./bin/flink stop JobID

STOP is, however, quite new, and it is ongoing work to make the available
sources stoppable (some already are). Not sure what kind of sources you
are using right now.
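For illustration, a minimal sketch of a stoppable source, assuming the StoppableFunction interface mentioned above (package names may differ between versions; the emitted data is just a dummy counter):

    import org.apache.flink.api.common.functions.StoppableFunction;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class StoppableCounterSource implements SourceFunction<Long>, StoppableFunction {

        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            long i = 0;
            while (running) {
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(i++);
                }
            }
        }

        @Override
        public void cancel() { running = false; }  // hard CANCEL

        @Override
        public void stop() { running = false; }    // graceful STOP
    }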

-Matthias


On 04/18/2016 10:50 PM, Robert Schmidtke wrote:
> Hi everyone,
> 
> I am running a streaming benchmark which involves a potentially
> infinitely running Flink Streaming Job. I run it blocking on YARN using
> ./bin/flink run ... and then send the command into background,
> remembering its PID to kill it later on. While this gets the work done,
> the job always ends up in the FAILED state. I imagine it would be the
> same if I used ./bin/flink cancel ... to cancel the job? It's not that
> pressing but it would be nice to shut down a streaming job properly.
> 
> Thanks
> 
> Robert
> 
> -- 
> My GPG Key ID: 336E2680





Re:

2016-04-17 Thread Matthias J. Sax
Can you be a little bit more precise? It fails when you try to do

  bin/start-local.sh

?? Or what do you mean by "try to start the web interface"? The web
interface is started automatically within the JobManager process.

What is the exact error message? Is there any stack trace? Any error in
the log files (in directory log/)?

-Matthias

On 04/17/2016 03:50 PM, Ahmed Nader wrote:
> Sorry the error is can't find the path specified*
> 
> On 17 April 2016 at 15:49, Ahmed Nader <ahmednader...@gmail.com> wrote:
> 
> Thanks, I followed the instructions and when i try to start the web
> interface i get an error can't find file specified. I tried to
> change the env.java.home variable to the path of Java JDK or Java
> JRE on my machine however still i get the same error.
> Any idea how to solve this? 
> 
> On 17 April 2016 at 12:48, Matthias J. Sax <mj...@apache.org> wrote:
> 
> You need to download Flink and install it. Follow this instructions:
> 
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/quickstart/setup_quickstart.html
> 
> -Matthias
> 
> On 04/16/2016 04:00 PM, Ahmed Nader wrote:
> > Hello,
> > I'm new to flink so this might seem a basic question. I added
> flink to
> > an existing project using maven and can run the program
> locally with
> > StreamExecutionEnvironment with no problems, however i want to
> know how
> > can I submit jobs for that project and be able to view these
> jobs from
> > flink's web interface and run these jobs, while i don't have the
> > flink/bin folder in my project structure as i only added the
> dependencies.
> > Thanks.
> 
> 
> 





Re: jar dependency in the cluster

2016-04-17 Thread Matthias J. Sax
Did you double check that your jar does contain the Kafka connector
classes? I would assume that the jar is not assembled correctly.

See here for some help on how to package jars correctly:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution


One way would be this (even if I think, this would not be required to
solve your problem):
https://stackoverflow.com/questions/31784051/how-to-reference-the-external-jar-in-flink

-Matthias


On 04/16/2016 08:31 PM, Radu Tudoran wrote:
> Hi,
> 
>  
> 
> Could anyone help me with the following problem:
> 
>  
> 
> I have a flink cluster of a couple of nodes (i am using the old version
> 0.10).
> 
> I am packaging a jar that needs to use kafka connector. When I create
> the jar in eclipse I am adding the flink connector dependency and set to
> be packed with the jar. Nevertheless, when I submitted it to be executed
> on the cluster I get an error that the jar connector is not visible for
> the class loader. Is there a way in which I can set flink to use a
> certain library path where to look for dependencies or maybe when I
> deploy either the flink cluster or submit the job to add extra
> dependencies.
> 
>  
> 
> Many thanks
> 
>  
> 
>  
> 
> Dr. Radu Tudoran
> 
> Research Engineer - Big Data Expert
> 
> IT R&D Division
> 
>  
> 
> 
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> 
> European Research Center
> 
> Riesstrasse 25, 80992 München
> 
>  
> 
> E-mail: radu.tudoran@huawei.com
> 
> Mobile: +49 15209084330
> 
> Telephone: +49 891588344173
> 
>  
> 
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
> 
> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
> Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
> Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN
> 
> 
>  
> 





Re:

2016-04-17 Thread Matthias J. Sax
You need to download Flink and install it. Follow this instructions:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/quickstart/setup_quickstart.html

-Matthias

On 04/16/2016 04:00 PM, Ahmed Nader wrote:
> Hello,
> I'm new to flink so this might seem a basic question. I added flink to
> an existing project using maven and can run the program locally with
> StreamExecutionEnvironment with no problems, however i want to know how
> can I submit jobs for that project and be able to view these jobs from
> flink's web interface and run these jobs, while i don't have the
> flink/bin folder in my project structure as i only added the dependencies.
> Thanks.





Re: CEP blog post

2016-04-06 Thread Matthias J. Sax
"Getting Started" in main page shows "Download 1.0" instead of 1.0.1

-Matthias

On 04/06/2016 02:03 PM, Ufuk Celebi wrote:
> The website has been updated for 1.0.1. :-)
> 
> @Till: If you don't mention it in the post, it makes sense to have a
> note at the end of the post saying that the code examples only work
> with 1.0.1.
> 
> On Mon, Apr 4, 2016 at 3:35 PM, Till Rohrmann  wrote:
>> Thanks a lot to all for the valuable feedback. I've incorporated your
>> suggestions and will publish the article, once Flink 1.0.1 has been released
>> (we need 1.0.1 to run the example code).
>>
>> Cheers,
>> Till
>>
>> On Mon, Apr 4, 2016 at 10:29 AM, gen tang  wrote:
>>>
>>> It is really a good article. Please put it on Flink Blog
>>>
>>> Cheers
>>> Gen
>>>
>>>
>>> On Fri, Apr 1, 2016 at 9:56 PM, Till Rohrmann 
>>> wrote:

 Hi Flink community,

 I've written a short blog [1] post about Flink's new CEP library which
 basically showcases its functionality using a monitoring example. I would
 like to publish the post on the flink.apache.org blog next week, if nobody
 objects. Feedback is highly appreciated :-)

 [1]
 https://docs.google.com/document/d/1rF2zVjitdTcooIwzJKNCIvAOi85j-wDXf1goXWXHHbk/edit?usp=sharing

 Cheers,
 Till
>>>
>>>
>>





Re: The way to itearte instances in AllWindowFunction in current Master branch

2016-02-25 Thread Matthias J. Sax
Just out of curiosity: Why was it changed like this? Specifying
"Iterable<...>" as the type in AllWindowFunction seems rather unintuitive...

-Matthias

On 02/25/2016 01:58 PM, Aljoscha Krettek wrote:
> Hi,
> yes that is true. The way you would now write such a function is this:
> 
> private static class MyIterableFunction implements
>     AllWindowFunction<Iterable<Tuple2<String, Integer>>, Tuple2<String, Integer>,
>     TimeWindow> {
>    private static final long serialVersionUID = 1L;
> 
>    @Override
>    public void apply(
>      TimeWindow window,
>      Iterable<Tuple2<String, Integer>> values,
>      Collector<Tuple2<String, Integer>> out) throws Exception {
> 
>    }
> }
> 
> (I used Tuple2 as an example input type here.)
> 
> and then you can use it with AllWindowedStream.apply(new 
> MyIterableFunction());
> 
> 
>> On 25 Feb 2016, at 13:29, HungChang  wrote:
>>
>> Thank you for your reply.
>>
>> The following in the current master looks like it is not iterable, because the
>> parameter is IN rather than Iterable<IN>.
>> So I still have a problem iterating...
>>
>> @Public
>> public interface AllWindowFunction<IN, OUT, W extends Window> extends
>> Function, Serializable {
>>
>>  /**
>>   * Evaluates the window and outputs none or several elements.
>>   *
>>   * @param window The window that is being evaluated.
>>   * @param values The elements in the window being evaluated.
>>   * @param out A collector for emitting elements.
>>   *
>>   * @throws Exception The function may throw exceptions to fail the 
>> program
>> and trigger recovery.
>>   */
>>  void apply(W window, IN values, Collector<OUT> out) throws Exception;
>> }
>>
>> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/windowing/AllWindowFunction.java
>>
>> Best,
>>
>> Hung
>>
>>
>>
> 





Re: How to use all available task managers

2016-02-24 Thread Matthias J. Sax
Could it be that you need to edit the client-local flink-conf.yaml
instead of the TaskManager config files? (In case you do not want to
specify the parallelism via env.setParallelism(int).)

-Matthias

On 02/24/2016 04:19 PM, Saiph Kappa wrote:
> Thanks! It worked now :-)
> 
> On Wed, Feb 24, 2016 at 2:48 PM, Ufuk Celebi wrote:
> 
> You can use the environment to set the job parallelism to 6, e.g.
> env.setParallelism(6).
> 
> Setting this will override the default behaviour. Maybe that's why the
> default parallelism is not working... you might have it set to 1
> already?
> 
> On Wed, Feb 24, 2016 at 3:41 PM, Saiph Kappa wrote:
> > I set "parallelism.default: 6" on flink-conf.yaml of all 6
> machines, and
> > still, my job only uses 1 task manager. Why?
> >
> >
> > On Wed, Feb 24, 2016 at 8:31 AM, Till Rohrmann wrote:
> >>
> >> Hi Saiph,
> >>
> >> I think the configuration value should be parallelism.default: 6.
> That
> >> will execute jobs which have not parallelism defined with a DOP of 6.
> >>
> >> Cheers,
> >> Till
> >>
> >>
> >> On Wed, Feb 24, 2016 at 1:43 AM, Saiph Kappa wrote:
> >>>
> >>> Hi,
> >>>
> >>> I  am running a flink stream application on a cluster with 6
> slaves/task
> >>> managers. I have set in flink-conf.yaml of every machine
> >>> "parallelization.degree.default: 6". However, when I run my
> application it
> >>> just uses one task slot and not all of them. Am I missing something?
> >>>
> >>> Thanks.
> >>
> >>
> >
> 
> 





Re: Master Thesis [Apache-flink paper references]

2016-02-12 Thread Matthias J. Sax
You might want to check out the Stratosphere project web site:
http://stratosphere.eu/project/publications/

-Matthias

On 02/12/2016 05:52 PM, subash basnet wrote:
> Hello all,
> 
> I am currently doing master's thesis on Apache-flink. It would be really
> helpful to know about the reference papers followed for the
> development/background of flink. 
> It would help me build a solid background knowledge to analyze flink. 
> 
> Currently I am reading all the related materials found in internet and
> flink/data-artisan materials provided. 
> Could you please suggest me. 
> 
> 
> 
> Best Regards,
> Subash Basnet





Re: Stream conversion

2016-02-04 Thread Matthias J. Sax
Hi Sane,

Currently, the DataSet and DataStream APIs are strictly separated. Thus, this
is not possible at the moment.

What kind of operation do you want to perform on the data of a window?
Why do you want to convert the data into a data set?

-Matthias

On 02/04/2016 10:11 AM, Sane Lee wrote:
> Dear all,
> 
> I want to convert the data from each window of stream to dataset. What
> is the best way to do that?  So, while streaming, at the end of each
> window I want to convert those data to dataset and possible apply
> dataset transformations to it.
> Any suggestions?
> 
> -best
> -sane





Re: Flink Stream: collect in an array all records within a window

2016-01-19 Thread Matthias J. Sax
What type is your DataStream? It must be DataStream[String] to work with
SimpleStringSchema.

If you have a different type, just implement a customized
SerializationSchema.

-Matthias


On 01/19/2016 11:26 AM, Saiph Kappa wrote:
> When I use SimpleStringSchema I get the error: Type mismatch, expected:
> SerializationSchema[String, Array[Byte]], actual: SimpleStringSchema. I
> think SimpleStringSchema extends SerializationSchema[String], and
> therefore cannot be used as argument of writeToSocket. Can you confirm
> this please?
> 
> s.writeToSocket(host, port.toInt, new SimpleStringSchema())
> 
> 
> Thanks.
> 
> On Tue, Jan 19, 2016 at 10:34 AM, Matthias J. Sax <mj...@apache.org> wrote:
> 
> There is SimpleStringSchema.
> 
> -Matthias
> 
> On 01/18/2016 11:21 PM, Saiph Kappa wrote:
> > Hi Matthias,
> >
> > Thanks for your response. The method .writeToSocket seems to be what I
> > was looking for. Can you tell me what kind of serialization schema
> > should I use assuming my socket server receives strings. I have
> > something like this in scala:
> >
> > val server = new ServerSocket()
> > while (true) {
> >   val s = server.accept()
> >   val in = new BufferedSource(s.getInputStream()).getLines()
> >   println(in.next())
> >   s.close()
> > }
> >
> > Thanks
> >
> >
> >
> > On Mon, Jan 18, 2016 at 8:46 PM, Matthias J. Sax <mj...@apache.org> wrote:
> >
> > Hi Saiph,
> >
> > you can use AllWindowFunction via .apply(...) to get an
> .collect method:
> >
> > From:
> > https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html
> >
> > > // applying an AllWindowFunction on non-keyed window stream
> > > allWindowedStream.apply (new
> > AllWindowFunction<Tuple2<String,Integer>, Integer, Window>() {
> > > public void apply (Window window,
> > > Iterable<Tuple2<String, Integer>> values,
> > > Collector<Integer> out) throws Exception {
> > > int sum = 0;
> > > for (Tuple2<String, Integer> t: values) {
> > > sum += t.f1;
> > > }
> > > out.collect (new Integer(sum));
> > > }
> > > });
> >
> > If you consume all those values via a sink, the sink will run on the
> > cluster. You can use .writeToSocket(...) as sink:
> > https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#data-sinks
> >
> > -Matthias
> >
> >
> > On 01/18/2016 06:30 PM, Saiph Kappa wrote:
> > > Hi,
> > >
> > > After performing a windowAll() on a DataStream[String], is
> there any
> > > method to collect and return an array with all Strings
> within a window
> > > (similar to .collect in Spark).
> > >
> > > I basically want to ship all strings in a window to a remote
> server
> > > through a socket, and want to use the same socket connection
> for all
> > > strings that I send. The method .addSink iterates over all
> > records, but
> > > does the provided function runs on the flink client or on
> the server?
> > >
> > > Thanks.
> >
> >
> 
> 





Re: Flink Stream: collect in an array all records within a window

2016-01-19 Thread Matthias J. Sax
It should work.

Your error message indicates that your DataStream is of type
[String, Array[Byte]] and not of type [String].

> Type mismatch, expected: SerializationSchema[String, Array[Byte]], actual: 
> SimpleStringSchema

Can you maybe share your code?

-Matthias

On 01/19/2016 01:57 PM, Saiph Kappa wrote:
> It's DataStream[String]. So it seems that SimpleStringSchema cannot be
> used in writeToSocket regardless of the type of the DataStream. Right?
> 
> On Tue, Jan 19, 2016 at 1:32 PM, Matthias J. Sax <mj...@apache.org> wrote:
> 
> What type is your DataStream? It must be DataStream[String] to work with
> SimpleStringSchema.
> 
> If you have a different type, just implement a customized
> SerializationSchema.
> 
> -Matthias
> 
> 
> On 01/19/2016 11:26 AM, Saiph Kappa wrote:
> > When I use SimpleStringSchema I get the error: Type mismatch, expected:
> > SerializationSchema[String, Array[Byte]], actual: SimpleStringSchema. I
> > think SimpleStringSchema extends SerializationSchema[String], and
> > therefore cannot be used as argument of writeToSocket. Can you confirm
> > this please?
> >
> > s.writeToSocket(host, port.toInt, new SimpleStringSchema())
> >
> >
> > Thanks.
> >
> > On Tue, Jan 19, 2016 at 10:34 AM, Matthias J. Sax <mj...@apache.org> wrote:
> >
> > There is SimpleStringSchema.
> >
> > -Matthias
> >
> > On 01/18/2016 11:21 PM, Saiph Kappa wrote:
> > > Hi Matthias,
> > >
> > > Thanks for your response. The method .writeToSocket seems to be 
> what I
> > > was looking for. Can you tell me what kind of serialization schema
> > > should I use assuming my socket server receives strings. I have
> > > something like this in scala:
> > >
> > > |val server =newServerSocket()while(true){val s 
> =server.accept()val
> > > 
> in=newBufferedSource(s.getInputStream()).getLines()println(in.next())s.close()}
> > >
> > > |
> > >
> > > Thanks|
> > > |
> > >
> > >
> > >
> > > On Mon, Jan 18, 2016 at 8:46 PM, Matthias J. Sax <mj...@apache.org> wrote:
> > >
> > > Hi Saiph,
> > >
> > > you can use AllWindowFunction via .apply(...) to get an
> > .collect method:
> > >
> > > From:
> > >
> > 
> 
> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html
> > >
> > > > // applying an AllWindowFunction on non-keyed window
> stream
> > > > allWindowedStream.apply (new
> > > AllWindowFunction<Tuple2<String,Integer>, Integer,
> Window>() {
> > > > public void apply (Window window,
> > > > Iterable<Tuple2<String, Integer>> values,
> > > > Collector out) throws Exception {
> > > > int sum = 0;
> > > > for (value t: values) {
> > > > sum += t.f1;
> > > > }
> > > > out.collect (new Integer(sum));
> > > > }
> > > > });
> > >
> > > If you consume all those value via an sink, the sink
> will run
> > an the
> > > cluster. You can use .writeToSocket(...) as sink:
> > >
> > 
> 
> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#data-sinks
> > >
> > > -Matthias
> > >
> > >
> > > On 01/18/2016 06:30 PM, Saiph Kappa wrote:
> > > > Hi,
> > > >
> > > > After performing a windowAll() on a DataStream[String], is
> > there any
> > > > method to collect and return an array with all Strings
> > within a window
> > > > (similar to .collect in Spark).
> > > >
> > > > I basically want to ship all strings in a window to a
> remote
> > server
> > > > through a socket, and want to use the same socket
> connection
> > for all
> > > > strings that I send. The method .addSink iterates over all
> > > records, but
> > > > does the provided function runs on the flink client or on
> > the server?
> > > >
> > > > Thanks.
> > >
> > >
> >
> >
> 
> 





Re: Flink Stream: collect in an array all records within a window

2016-01-19 Thread Matthias J. Sax
Seems you are right. It works on the current 1.0-Snapshot version which
has a different signature...

> def writeToSocket(
>   hostname: String,
>   port: Integer,
>   schema: SerializationSchema[T]): DataStreamSink[T] = {
> javaStream.writeToSocket(hostname, port, schema)
>   }

instead of 0.10.1:

> def writeToSocket(
>   hostname: String,
>   port: Integer,
>   schema: SerializationSchema[T, Array[Byte]]): DataStreamSink[T] = {
> javaStream.writeToSocket(hostname, port, schema)
>   }

I guess, you can still implement your own SerializationSchema for 0.10.1
to make it work.
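A minimal sketch of such a schema, assuming the 0.10.x interface has a single serialize(T) method returning the byte array (as the signature above suggests); it is written in Java but usable from the Scala API:

    import org.apache.flink.streaming.util.serialization.SerializationSchema;

    public class StringToBytesSchema implements SerializationSchema<String, byte[]> {

        @Override
        public byte[] serialize(String element) {
            return element.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        }
    }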


-Matthias


On 01/19/2016 04:27 PM, Saiph Kappa wrote:
> I think this is a bug in the scala API.
> 
> def writeToSocket(hostname : scala.Predef.String, port : java.lang.Integer, 
> schema : org.apache.flink.streaming.util.serialization.SerializationSchema[T, 
> scala.Array[scala.Byte]]) : 
> org.apache.flink.streaming.api.datastream.DataStreamSink[T] = { /* compiled 
> code */ }
> 
> 
> 
> On Tue, Jan 19, 2016 at 2:16 PM, Matthias J. Sax <mj...@apache.org> wrote:
> 
> It should work.
> 
> Your error message indicates, that your DataStream is of type
> [String,Array[Byte]] and not of type [String].
> 
> > Type mismatch, expected: SerializationSchema[String, Array[Byte]], 
> actual: SimpleStringSchema
> 
> Can you maybe share your code?
> 
> -Matthias
> 
> On 01/19/2016 01:57 PM, Saiph Kappa wrote:
> > It's DataStream[String]. So it seems that SimpleStringSchema cannot be
>     > used in writeToSocket regardless of the type of the DataStream. Right?
> >
> > On Tue, Jan 19, 2016 at 1:32 PM, Matthias J. Sax <mj...@apache.org> wrote:
> >
> > What type is your DataStream? It must be DataStream[String] to work 
> with
> > SimpleStringSchema.
> >
> > If you have a different type, just implement a customized
> > SerializationSchema.
> >
> > -Matthias
> >
> >
> > On 01/19/2016 11:26 AM, Saiph Kappa wrote:
> > > When I use SimpleStringSchema I get the error: Type mismatch, 
> expected:
> > > SerializationSchema[String, Array[Byte]], actual: 
> SimpleStringSchema. I
> > > think SimpleStringSchema extends SerializationSchema[String], and
> > > therefore cannot be used as argument of writeToSocket. Can you 
> confirm
> > > this please?
> > >
> > > s.writeToSocket(host, port.toInt, new SimpleStringSchema())
> > >
> > >
> > > Thanks.
> > >
> > > On Tue, Jan 19, 2016 at 10:34 AM, Matthias J. Sax <mj...@apache.org> wrote:
> > >
> > > There is SimpleStringSchema.
> > >
> > > -Matthias
> > >
> > > On 01/18/2016 11:21 PM, Saiph Kappa wrote:
> > > > Hi Matthias,
> > > >
> > > > Thanks for your response. The method .writeToSocket
> seems to be what I
> > > > was looking for. Can you tell me what kind of
> serialization schema
> > > > should I use assuming my socket server receives
> strings. I have
> > > > something like this in scala:
> > > >
> > > > |val server =newServerSocket()while(true){val s
> =server.accept()val
> > > >
> 
> in=newBufferedSource(s.getInputStream()).getLines()println(in.next())s.close()}
> > > >
> > > > |
> > > >
> > > > Thanks|
> > > > |
> > > >
> > > >
> > > >
> > > > On Mon, Jan 18, 2016 at 8:46 PM, Matthias J. Sax <mj...@apache.org>

Re: Flink Stream: How to ship results through socket server

2016-01-19 Thread Matthias J. Sax
Your "SocketWriter-Thread" code will run on your client. All code in
"main" runs on the client.

execute() itself runs on the client, too. Of course, it triggers the job
submission to the cluster. In this step, the assembled job from the
previous calls is translated into the JobGraph which is submitted to the
JobManager for execution.

You should start your SocketWriter-Thread manually on the cluster, i.e.,
if you use "localhost" in "env.socketTextStream", it must be the
TaskManager machine that executes this SocketStream-source task.

I guess, it would be better not to use "localhost", but start your
SocketWriter-Thread on a dedicated machine in the cluster, and connect
your SocketStream-source to this machine via its host name.

-Matthias



On 01/19/2016 03:57 PM, Saiph Kappa wrote:
> Hi,
> 
> This is a simple example that I found using Flink Stream. I changed it
> so the flink client can be executed on a remote cluster, and so that it
> can open a socket server to ship its results for any other consumer
> machine. It seems to me that the socket server is not being open in the
> remote cluster, but rather in my local machine (which I'm using to
> launch the app). How can I achieve that? I want to be able to ship
> results directly from the remote cluster, and through a socket server
> where clients can use as a tap.
> 
> Sorry about indentation:
> 
> def main(args: Array[String]) {
> 
>   val env = StreamExecutionEnvironment.createRemoteEnvironment("myhostname",
>     DefaultFlinkMasterPort, "myapp-assembly-0.1-SNAPSHOT.jar")
> 
>   //Read from a socket stream and map it to StockPrice objects
>   val socketStockStream = env.socketTextStream("localhost", ).map(x => {
>     val split = x.split(",")
>     StockPrice(split(0), split(1).toDouble)
>   })
> 
>   //Generate other stock streams
>   val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
>   val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)
>   val DJI_Stream = env.addSource(generateStock("DJI")(30) _)
>   val BUX_Stream = env.addSource(generateStock("BUX")(40) _)
> 
>   //Merge all stock streams together
>   val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream, DJI_Stream, BUX_Stream)
>   stockStream.print()
> 
> // WHERE IS THE FOLLOWING CODE RUN?
> 
> var out: PrintWriter = null
> new Thread {
> override def run(): Unit = {
> val serverSocket = new ServerSocket(12345)
> while (true) {
> val socket = serverSocket.accept()
> val hostname = socket.getInetAddress.getHostName.split('.').head
> println(s"Got a new connection from $hostname")
> out = new PrintWriter(socket.getOutputStream)
> }
> }
> }.start()
> 
> stockStream.addSink(record => {
> if(out != null) {
> out.write(record)
> out.flush()
> }
> })
> 
> env.execute("Stock stream")
> }
> 
> Thanks.





Re: Cancel Job

2016-01-18 Thread Matthias J. Sax
Hi,

currently, messages in flight will be dropped if a streaming job gets
canceled.

There is already WIP to add a STOP signal which allows for a clean
shutdown of a streaming job. This should get merged soon and will be
available in Flink 1.0.

You can follow the JIRA and PR here:
https://issues.apache.org/jira/browse/FLINK-2111
https://github.com/apache/flink/pull/750

-Matthias


On 01/18/2016 08:26 PM, Don Frascuchon wrote:
> Hi,
> 
> When a streaming job is manually canceled, what happens to the messages
> still in process? Does Flink's engine wait for the tasks to finish processing
> the messages they hold (somewhat like Apache Storm)? If not, is there a safe
> way to stop streaming jobs?
> 
> Thanks in advance!
> Best regards



signature.asc
Description: OpenPGP digital signature


Re: Working with storm compatibility layer

2016-01-14 Thread Matthias J. Sax
Hi,

I can submit the topology without any problems. Your code is fine.

If your program "exits silently" I would actually assume, that you
submitted the topology successfully. Can you see the topology in
JobManager WebFrontend? If not, do you see any errors in the log files?

-Matthias

On 01/14/2016 07:37 AM, Shinhyung Yang wrote:
> Dear Matthias,
> 
> Thank you for the reply! I am so sorry to respond late on the matter.
> 
>> I just double checked the Flink code and during translation from Storm
>> to Flink declareOutputFields() is called twice. You are right that it
>> does the same job twice, but that is actually not a problem. The Flink
>> code is cleaner this way, so I guess we will not change it.
> 
> Thank you for checking. I don't think it contributed any part of my
> current problem anyways. For my case though, it is called 3 times if
> the number is important at all.
> 
>> About lifecycle:
>> If you submit your code, during deployment, Spout.open() and
>> Bolt.prepare() should be called for each parallel instance on each
>> Spout/Bolt of your topology.
>>
>> About your submission (I guess this should solve your current problem):
>> If you use bin/start-local.sh, you should *not* use FlinkLocalCluster,
>> but FlinkSubmitter. You have to distinguish three cases:
>>
>>   - local/debug/IDE mode: use FlinkLocalCluster
>> => you do not need to start any Flink cluster before --
>> FlinkLocalCluster is started up in your current JVM
>> * the purpose is local debugging in an IDE (this allows to easily
>> set break points and debug code)
>>
>>   - pseudo-distributed mode: use FlinkSubmitter
>> => you start up a local Flink cluster via bin/start-local.sh
>> * this local Flink cluster runs in its own JVM and looks like a real
>> cluster to the Flink client, ie, "bin/flink run"
>> * thus, you just use FlinkSubmitter as for a real cluster (with
>> JobManager/Nimbus hostname "localhost")
>> * in contrast to FlinkLocalCluster, no "internal Flink Cluster" is
>> started in your current JVM, but your code is shipped to the local
>> cluster you started up beforehand via bin/start-local.sh and executed in
>> this JVM
>>
>>   - distributed mode: use FlinkSubmitter
>> => you start up Flink in a real cluster using bin/start-cluster.sh
>> * you use "bin/flink run" to submit your code to the real cluster
> 
> Thank you for the explanation, now I have a clearer understanding of
> clusters and submitters. However my problem is not fixed yet. Here's
> my code:
> 
> 
> // ./src/main/java/myexample/App.java
> 
> 
> package myexample;
> 
> import backtype.storm.Config;
> import backtype.storm.LocalCluster;
> import myexample.spout.StandaloneSpout;
> import backtype.storm.generated.StormTopology;
> import backtype.storm.topology.IRichSpout;
> import backtype.storm.topology.TopologyBuilder;
> import backtype.storm.topology.base.BaseBasicBolt;
> 
> import myexample.bolt.Node;
> import myexample.bolt.StandardBolt;
> 
> import java.util.Arrays;
> import java.util.List;
> 
> //
> import org.apache.flink.storm.api.FlinkTopology;
> //import org.apache.flink.storm.api.FlinkLocalCluster;
> import org.apache.flink.storm.api.FlinkSubmitter;
> //import org.apache.flink.storm.api.FlinkClient;
> import org.apache.flink.storm.api.FlinkTopologyBuilder;
> 
> public class App
> {
>     public static void main( String[] args ) throws Exception
>     {
>         int layer = 0;
>         StandaloneSpout spout = new StandaloneSpout();
>         Config conf = new Config();
>         conf.put(Config.TOPOLOGY_DEBUG, false);
>         //FlinkLocalCluster cluster = new FlinkLocalCluster();
>         //FlinkLocalCluster cluster = FlinkLocalCluster.getLocalCluster();
>         //LocalCluster cluster = new LocalCluster();
> 
>         layer = Integer.parseInt(args[0]);
>         //cluster.submitTopology("topology", conf, BinaryTopology(spout, layer));
>         FlinkSubmitter.submitTopology("topology", conf, BinaryTopology(spout, layer));
>         //Thread.sleep(5 * 1000);
>         //FlinkClient.getConfiguredClient(conf).killTopology("topology");
>         //cluster.killTopology("topology");
>         //cluster.shutdown();
>     }
> 
>     public static FlinkTopology BinaryTopology(IRichSpout input, int n) {
>     //public static StormTopology BinaryTopology(IRichSpout input, int n) {
>         return BinaryTopology(input, n, Arrays.asList((BaseBasicBolt) new StandardBolt()));
>     }
> 
>     public static FlinkTopology BinaryTopology(IRichSpout input, int n, List<BaseBasicBolt> boltList) {
>     //public static StormTopology BinaryTopology(IRichSpout input, int n, List<BaseBasicBolt> boltList) {
>         FlinkTopologyBuilder builder = new FlinkTopologyBuilder();
>         //TopologyBuilder builder = new TopologyBuilder();
>         String sourceId = "src";
> 

Re: DataStream jdbc sink

2016-01-13 Thread Matthias J. Sax
Hi,

use JDBCOutputFormatBuilder to set all required parameters:

> JDBCOutputFormatBuilder builder = JDBCOutputFormat.buildJDBCOutputFormat();
> builder.setDBUrl(...)
> // and more
> 
> var.write(builder.finish(), 0L);
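Putting it together for the streaming case, a rough sketch -- the table name,
the stream variable, and the 0L interval follow the snippets above; the
JDBCOutputFormat class comes from the flink-jdbc module, and its exact package
may differ between versions:

> import org.apache.flink.api.java.io.jdbc.JDBCOutputFormat; // flink-jdbc module
> 
> // 'var' is your DataStream of tuples matching the three '?' parameters
> var.write(
>     JDBCOutputFormat.buildJDBCOutputFormat()
>         .setDrivername("org.postgresql.Driver")
>         .setDBUrl("jdbc:postgresql://127.0.0.1:5432/test")
>         .setUsername("postgres")
>         .setPassword("")
>         .setQuery("insert into XXX values (?,?,?);")
>         .finish(),
>     0L);
> // the PostgreSQL driver jar must be on the classpath of the job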

-Matthias


On 01/13/2016 06:21 PM, Traku traku wrote:
> Hi everyone.
> 
> I'm trying to migrate some code to flink 0.10 and I'm having a problem.
> 
> I'm trying to create a custom sink to insert the data into a PostgreSQL
> database. My code was this:
> 
> var.output(
> // build and configure OutputFormat
> JDBCOutputFormat
> .buildJDBCOutputFormat()
> .setDrivername("org.postgresql.Driver")
> .setDBUrl("jdbc:postgresql://127.0.0.1:5432/test
> ")
> .setUsername("postgres")
> .setPassword("")
> .setQuery("insert into XXX  values  (?,?,?);") 
> .finish()
> );
> 
> Could you help me?
> 
> Best regards.
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: Working with storm compatibility layer

2016-01-09 Thread Matthias J. Sax
Hello Shinhyung,

that sounds weird and should not happen -- Spout.open() should get
called exactly once. I am not sure about multiple calls to
declareOutputFields though -- it might be called multiple times -- would
need to double check the code.

However, the call to declareOuputFields should be idempotent, so it
should actually not be a problem if it is called multiple times. Even if
Storm might call this method only once, there is no guarantee that it is
not called multiple times. If this is a problem for you, please let me
know. I think we could fix this and make sure the method is only called
once.

It would be helpful if you could share your code. What do you mean by
"exits silently"? No submission happens? Did you check the logs? As you
mentioned FlinkLocalCluster, I assume that you run within an IDE?

Btw: lately we fixed a couple of bugs. I would suggest that you use the
latest version from the Flink master branch. It should work with 0.10.1
without problems.

-Matthias



On 01/09/2016 01:27 AM, Shinhyung Yang wrote:
> Howdies to everyone,
> 
> I'm trying to use the storm compatibility layer on Flink 0.10.1. The
> original storm topology works fine on Storm 0.9.5 and I have
> incorporated FlinkLocalCluster, FlinkTopologyBuilder, and
> FlinkTopology classes according to the programming guide
> (https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/storm_compatibility.html).
> I'm running it on Oracle Java 8 (1.8.0_66-b17) on Centos 7 (7.2.1511).
> What happens is, it seems to be going all the way to submitTopology
> method without any problem, however it doesn't invoke open method of
> Spout class but declareOutputFields method is called for multiple
> times and the program exits silently. Do you guys have any idea what's
> going on here or have any suggestions? If needed, then please ask me
> for more information.
> 
> Thank you for reading.
> With best regards,
> Shinhyung Yang
> 



signature.asc
Description: OpenPGP digital signature


Re: Working with storm compatibility layer

2016-01-09 Thread Matthias J. Sax
Hi,

I just double checked the Flink code and during translation from Storm
to Flink declareOutputFields() is called twice. You are right that it
does the same job twice, but that is actually not a problem. The Flink
code is cleaner this way, so I guess we will not change it.

About lifecycle:
If you submit your code, during deployment, Spout.open() and
Bolt.prepare() should be called for each parallel instance on each
Spout/Bolt of your topology.

About your submission (I guess this should solve your current problem):
If you use bin/start-local.sh, you should *not* use FlinkLocalCluster,
but FlinkSubmitter. You have to distinguish three cases:

  - local/debug/IDE mode: use FlinkLocalCluster
=> you do not need to start any Flink cluster before --
FlinkLocalCluster is started up in your current JVM
* the purpose is local debugging in an IDE (this allows to easily
set break points and debug code)

  - pseudo-distributed mode: use FlinkSubmitter
=> you start up a local Flink cluster via bin/start-local.sh
* this local Flink cluster runs in its own JVM and looks like a real
cluster to the Flink client, ie, "bin/flink run"
* thus, you just use FlinkSubmitter as for a real cluster (with
JobManager/Nimbus hostname "localhost")
* in contrast to FlinkLocalCluster, no "internal Flink Cluster" is
started in your current JVM, but your code is shipped to the local
cluster you started up beforehand via bin/start-local.sh and executed in
this JVM

  - distributed mode: use FlinkSubmitter
=> you start up Flink in a real cluster using bin/start-cluster.sh
* you use "bin/flink run" to submit your code to the real cluster


About further debugging: you can increase the log level to get more
information:
https://ci.apache.org/projects/flink/flink-docs-release-0.10/internals/logging.html

Hope this helps!

-Matthias

On 01/09/2016 04:38 PM, Shinhyung Yang wrote:
> Dear Matthias,
> 
> Thank you for replying!
> 
> that sounds weird and should not happen -- Spout.open() should get
> called exactly once.
> 
> 
> That's what I thought too. I'm new to both Storm and Flink so it's quite
> complicated for me to handle both yet; would it help if I knew Storm's
> lifecycle and Flink's lifecycle? When submitTopology() is invoked, what
> should be called other than spout.open()?
> 
> I am not sure about multiple calls to
> 
> declareOutputFields though -- it might be called multiple times -- would
> need to double check the code.
> 
> 
> I'll check my code too.
>  
> 
> However, the call to declareOuputFields should be idempotent, so it
> should actually not be a problem if it is called multiple times. Even if
> Storm might call this method only once, there is no guarantee that it is
> not called multiple times. If this is a problem for you, please let me
> know. I think we could fix this and make sure the method is only called
> once.
> 
> 
> Actually it doesn't seem to be a problem for now. It just does the same
> job multiple times.
>  
> 
> It would be helpful if you could share your code. What do you mean by
> "exits silently"? No submission happens? Did you check the logs? As you
> mentioned FlinkLocalCluster, I assume that you run within an IDE?
> 
> 
> The topology doesn't seem to continue. There's a set of initialization
> code in the open method of the program's spout and it looks hopeless if
> it's not invoked. Is there any way to check the logs other than using
> println() calls? I'm running it on the commandline with having
> `bin/start_local.sh' running in the background and `bin/flink run'.
>  
> 
> Btw: lately we fixed a couple of bugs. I would suggest that you use the
> latest version from the Flink master branch. It should work with 0.10.1
> without problems.
> 
> 
> It was very tedious for me to deal with a pom.xml file and the .m2
> repository, so I preferred to use Maven Central. But I should try with
> the master branch if I have to.
> 
> I will quickly check if I could share some of the code.
> 
> Thank you again for the help!
> With best regards,
> Shinhyung Yang
>  
> 
> 
> -Matthias
> 
> 
> 
> On 01/09/2016 01:27 AM, Shinhyung Yang wrote:
> > Howdies to everyone,
> >
> > I'm trying to use the storm compatibility layer on Flink 0.10.1. The
> > original storm topology works fine on Storm 0.9.5 and I have
> > incorporated FlinkLocalCluster, FlinkTopologyBuilder, and
> > FlinkTopology classes according to the programming guide
> >
> 
> (https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/storm_compatibility.html).
> > I'm running it on Oracle Java 8 (1.8.0_66-b17) on Centos 7 (7.2.1511).
> > What happens is, it seems to be going all the way to submitTopology
> > method without any problem, however it doesn't invoke open method of
> > Spout class but declareOutputFields method is called for multiple
> > times and the program exits silently. Do you guys 

Re: Problem with passing arguments to Flink Web Submission Client

2015-12-21 Thread Matthias J. Sax
Thanks for opening a JIRA.

I used "info" which yield the error (forgot it in the mail):

> bin/flink info myJarFile.jar -f flink -i  -m 1

-Matthias

On 12/21/2015 10:36 AM, Filip Łęczycki wrote:
> Hi,
> 
> Regarding the CLI, I have been using 
>>bin/flink run myJarFile.jar -f flink -i  -m 1 
> and it is working perfectly fine. Is there a difference between these two
> ways of submitting a job ("bin/flink MyJar.jar" and "bin/flink run
> MyJar.jar")?
> 
> I will open a Jira.
> 
> Best Regards,
> Filip Łęczycki
> 
> Pozdrawiam,
> Filip Łęczycki
> 
> 2015-12-20 21:55 GMT+01:00 Matthias J. Sax <mj...@apache.org
> <mailto:mj...@apache.org>>:
> 
> The bug is actually in the CLI (it's not a WebClient related issue)
> 
> if you run
> 
> > bin/flink myJarFile.jar -f flink -i  -m 1
> 
> it also returns
> 
> > Unrecognized option: -f
> 
> 
> -Matthias
> 
> On 12/20/2015 09:37 PM, Matthias J. Sax wrote:
> > That is a bug. Can you open a JIRA for it?
> >
> > You can work around by not prefixing your flag with "-"
> >
> > -Matthias
> >
> > On 12/20/2015 12:59 PM, Filip Łęczycki wrote:
> >> Hi all,
> >>
> >> I would like get the pretty printed execution plan of my job, in
> order
> >> to achieve that I uploaded my jar to Flink Web Submission Client and
> >> tried to run it. However when I provide arguments for my app I
> receive
> >> following error:
> >>
> >> An unexpected error occurred:
> >> Unrecognized option: -f
> >> org.apache.flink.client.cli.CliArgsException: Unrecognized option: -f
> >> at
> >>
> 
> org.apache.flink.client.cli.CliFrontendParser.parseInfoCommand(CliFrontendParser.java:296)
> >> at org.apache.flink.client.CliFrontend.info(CliFrontend.java:376)
> >> at
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:983)
> >> at
> >>
> 
> org.apache.flink.client.web.JobSubmissionServlet.doGet(JobSubmissionServlet.java:171)
> >> at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
> >> at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
> >> at
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
> >> at
> >>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> >> at
> >>
> 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
> >> at
> >>
> 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
> >> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
> >> at
> >>
> 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
> >> at
> >>
> 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
> >> at
> >>
> 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
> >> at
> org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
> >> at
> >>
> 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
> >> at org.eclipse.jetty.server.Server.handle(Server.java:348)
> >> at
> >>
> 
> org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
> >> at
> >>
> 
> org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
> >> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
> >> at
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
> >> at
> org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
> >> at
> >>
> 
> org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
> >> at
> >>
> 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >> Here is what i write into the "Program Arguments" box:
> >> -f flink -i  -m 1
> >>
> >> Am I doing something wrong or is this a bug and web client tries to
> >> interpret my arguments as flink options?
> >>
> >> Regards/Pozdrawiam,
> >> Filip Łęczycki
> >
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: Problem with passing arguments to Flink Web Submission Client

2015-12-20 Thread Matthias J. Sax
The bug is actually in the CLI (it's not a WebClient related issue)

if you run

> bin/flink myJarFile.jar -f flink -i  -m 1

it also returns

> Unrecognized option: -f


-Matthias

On 12/20/2015 09:37 PM, Matthias J. Sax wrote:
> That is a bug. Can you open a JIRA for it?
> 
> You can work around by not prefixing your flag with "-"
> 
> -Matthias
> 
> On 12/20/2015 12:59 PM, Filip Łęczycki wrote:
>> Hi all,
>>
>> I would like get the pretty printed execution plan of my job, in order
>> to achieve that I uploaded my jar to Flink Web Submission Client and
>> tried to run it. However when I provide arguments for my app I receive
>> following error:
>>
>> An unexpected error occurred:
>> Unrecognized option: -f
>> org.apache.flink.client.cli.CliArgsException: Unrecognized option: -f
>> at
>> org.apache.flink.client.cli.CliFrontendParser.parseInfoCommand(CliFrontendParser.java:296)
>> at org.apache.flink.client.CliFrontend.info(CliFrontend.java:376)
>> at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:983)
>> at
>> org.apache.flink.client.web.JobSubmissionServlet.doGet(JobSubmissionServlet.java:171)
>> at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
>> at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
>> at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
>> at
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>> at
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
>> at
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
>> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
>> at
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
>> at
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
>> at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
>> at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
>> at
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
>> at org.eclipse.jetty.server.Server.handle(Server.java:348)
>> at
>> org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
>> at
>> org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
>> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
>> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
>> at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
>> at
>> org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
>> at
>> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Here is what i write into the "Program Arguments" box:
>> -f flink -i  -m 1
>>
>> Am I doing something wrong or is this a bug and web client tries to
>> interpret my arguments as flink options?
>>
>> Regards/Pozdrawiam,
>> Filip Łęczycki
> 



signature.asc
Description: OpenPGP digital signature


Re: Apache Flink Web Dashboard - Completed Job history

2015-12-16 Thread Matthias J. Sax
I guess it should be possible to manually save this information from the
corresponding directory and copy it back after restart? But I am not sure?

Please correct me if I am wrong.

-Matthias

On 12/16/2015 03:16 PM, Ufuk Celebi wrote:
> 
>> On 16 Dec 2015, at 15:00, Ovidiu-Cristian MARCU 
>>  wrote:
>>
>> Hi
>>
>> If I restart Flink, I no longer see the history of the completed jobs.
>> Is this a missing feature, or what should I do to see the completed job
>> list history?
> 
> Not possible at the moment.
> 
> Completed jobs are archived on the JobManager and lost after restarts.
> 
> – Ufuk
> 



signature.asc
Description: OpenPGP digital signature


Re: flink streaming documentation

2015-12-15 Thread Matthias J. Sax
Thanks for reporting!

Would you like to fix this and open a PR?


-Matthias

On 12/15/2015 04:43 AM, Radu Tudoran wrote:
> Hi,
> 
>  
> 
> I believe I found 2 small inconsistencies in the documentation for the
> description of Window Apply
> 
> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#windows-on-unkeyed-data-streams
> 
>  
> 
> 1) in the example provided I believe it should be corrected to remove
> the extra > and add ")":
> 
>  
> 
> (new WindowFunction<Tuple2<String,Integer>, Integer, Tuple, Window>() {
>     ...
> });
> 
>  
> 
> instead of
> 
>  
> 
> (new WindowFunction<Tuple2<String,Integer>, Integer>, Tuple, Window>() {
>     ...
> };
> 
>  
> 
>  
> 
> 2) for AllWindowedStream it seems there is a need for an
> AllWindowFunction, not a WindowFunction
> 
> I would propose to duplicate the existing example to also cover this
> case, particularly since it has a slightly different interface
> 
>  
> 
> allWindowedStream.apply(new AllWindowFunction<Tuple2<String,Integer>, Integer, TimeWindow>() {
>     public void apply(TimeWindow window,
>             Iterable<Tuple2<String, Integer>> values,
>             Collector<Integer> out) throws Exception {
>         int sum = 0;
>         for (value t: values) {
>             sum += t.f1;
>         }
>         out.collect(new Integer(sum));
>     }
> });
> 
>  
> 
> Regards,
> 
>  
> 
> Radu
> 
>  
> 



signature.asc
Description: OpenPGP digital signature


Re: Any role for volunteering

2015-12-05 Thread Matthias J. Sax
Hi Deepak,

the Flink community is always open to new people who want to contribute
to the project. Please subscribe to the user- and dev-mailing list as a
starting point: https://flink.apache.org/community.html#mailing-lists

Furthermore, please read the following docs:
https://flink.apache.org/how-to-contribute.html
https://flink.apache.org/contribute-code.html

It explains the process the Flink community follows and you have to
follow, too.

The best way to get started with coding is to look into open tickets:
https://issues.apache.org/jira/browse/FLINK

If you find anything you want to work on, let us know.


-Matthias


On 12/04/2015 06:58 PM, Deepak Sharma wrote:
> Hi All
> Sorry for spamming your inbox.
> I am really keen to work on a big data project full time (preferably
> remote, from India); if not, I am open to volunteering as well.
> Please do let me know if there is any such opportunity available.
> 
> -- 
> Thanks
> Deepak



signature.asc
Description: OpenPGP digital signature


Re: flink connectors

2015-11-27 Thread Matthias J. Sax
If I understand the question right, you just want to download the jar
manually?

Just go to the maven repository website and download the jar from there.


-Matthias

On 11/27/2015 02:49 PM, Robert Metzger wrote:
> Maybe there is a maven mirror you can access from your network?
> 
> This site contains a list of some mirrors
> http://stackoverflow.com/questions/5233610/what-are-the-official-mirrors-of-the-maven-central-repository
> You don't have to use the maven tool, you can also manually browse for
> the jars and download what you need.
> 
> 
> On Fri, Nov 27, 2015 at 2:46 PM, Fabian Hueske  > wrote:
> 
> You can always build Flink from source, but apart from that I am not
> aware of an alternative.
> 
> 2015-11-27 14:42 GMT+01:00 Radu Tudoran  >:
> 
> Hi,
> 
> 
> Is there any alternative to avoiding maven?
> 
> That is why I was curious if there is a binary distribution of
> this available for download directly
> 
> 
> Dr. Radu Tudoran
> 
> Research Engineer
> 
> IT R&D Division
> 
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> 
> European Research Center
> 
> Riesstrasse 25, 80992 München
> 
> 
> E-mail: _radu.tudo...@huawei.com
> _
> 
> Mobile: +49 15209084330 
> 
> Telephone: +49 891588344173 
> 
> 
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
> 
> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
> Managing Director: Jingwen TAO, Wanzhou MENG, Lifang CHEN
> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB
> 56063,
> Geschäftsführer: Jingwen TAO, Wanzhou MENG, Lifang CHEN
> 
> This e-mail and its attachments contain confidential information
> from HUAWEI, which is intended only for the person or entity
> whose address is listed above. Any use of the information
> contained herein in any way (including, but not limited to,
> total or partial disclosure, reproduction, or dissemination) by
> persons other than the intended recipient(s) is prohibited. If
> you receive this e-mail in error, please notify the sender by
> phone or email immediately and delete it!
> 
> 
> *From:*Fabian Hueske [mailto:fhue...@gmail.com
> ]
> *Sent:* Friday, November 27, 2015 2:41 PM
> *To:* user@flink.apache.org 
> *Subject:* Re: flink connectors
> 
> 
> Hi Radu,
> 
> the connectors are available in Maven Central.
> 
> Just add them as a dependency in your project and they will be
> fetched and included.
> 
> Best, Fabian
> 
> 
> 2015-11-27 14:38 GMT+01:00 Radu Tudoran  >:
> 
> Hi,
> 
>  
> 
> I was trying to use flink connectors. However, when I tried to
> import this
> 
>  
> 
> import org.apache.flink.streaming.connectors.*;
> 
>  
> 
> I saw that they are not present in the binary distribution as
> downloaded from website (flink-dist-0.10.0.jar). Is this
> intentionally? Is there also a binary distribution that contains
> these connectors?
> 
>  
> 
> Regards,
> 
>  
> 
> Dr. Radu Tudoran
> 
> Research Engineer
> 
> IT R&D Division
> 
>  
> 
> 
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> 
> European Research Center
> 
> Riesstrasse 25, 80992 München
> 
>  
> 
> E-mail: _radu.tudo...@huawei.com
> _
> 
> Mobile: +49 15209084330 
> 
> Telephone: +49 891588344173 
> 
>  
> 
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
> 
> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
> Managing Director: Jingwen TAO, Wanzhou MENG, Lifang CHEN
> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB
> 56063,
> Geschäftsführer: Jingwen TAO, Wanzhou 

Re: Doubt about window and count trigger

2015-11-26 Thread Matthias J. Sax
Hi,

a Trigger is an *additional* condition for intermediate (early)
evaluation of the window. Thus, it is not "or-ed" to the basic window
definition.

If you want to have an or-ed window condition, you can customize it by
specifying your own window definition.

> // MyOwnWindowAssigner extends WindowAssigner and contains your custom logic
> dataStream.window(new MyOwnWindowAssigner());

-Matthias


On 11/26/2015 11:40 PM, Anwar Rizal wrote:
> Hi all,
> 
> From the documentation:
> "The |Trigger| specifies when the function that comes after the window
> clause (e.g., |sum|, |count|) is evaluated (“fires”) for each window."
> 
> So, basically, if I specify:
> 
> |keyedStream
> .window(TumblingTimeWindows.of(Time.of(5, TimeUnit.SECONDS))
> .trigger(CountTrigger.of(100))|
> 
> |
> |
> 
> |The execution of the window function is triggered when the count reaches 100 
> in the time window of 5 seconds. If you have a system that never reaches 100 
> in 5 seconds, basically you will never have the window fired.|
> 
> |
> |
> 
> |My question is, what would be the best option to have behavior as follow:|
> 
> |The execution of the window function is triggered when 5 seconds is reached 
> or 100 events are received before 5 seconds.|
> 
> 
> I think of implementing my own trigger that looks like CountTrigger, but that 
> will fire also when the end of time window is reached (at the moment, it just 
> returns Continue, instead of Fired). But maybe there's a better way ? 
> 
> Is there a reason why CountTrigger is implemented as it is implemented today, 
> and not as I described above (5 seconds or 100 events reached, whichever 
> comes first).
> 
> 
> Thanks,
> 
> Anwar.
> 



signature.asc
Description: OpenPGP digital signature


Re: ReduceByKeyAndWindow in Flink

2015-11-23 Thread Matthias J. Sax
Hi,

Can't you use a second keyed window (with the same size) and apply
.max(...)?
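A rough sketch of that idea -- assuming the first window already emits
Tuple3<String, Long, Integer> records of (key, window end time, aggregate) and
a 10-second window; the names, types, and window size are assumptions:

> import org.apache.flink.api.java.tuple.Tuple3;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.api.windowing.time.Time;
> 
> // aggregated = result of keyBy(key).timeWindow(...).apply(...),
> // i.e. one (key, windowEndTime, aggregate) record per key and window
> DataStream<Tuple3<String, Long, Integer>> maxPerWindow = aggregated
>     .keyBy(1)                      // key by the window end timestamp
>     .timeWindow(Time.seconds(10))  // same size as the first window
>     .maxBy(2);                     // keep the record with the largest aggregate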

-Matthias

On 11/23/2015 11:00 AM, Konstantin Knauf wrote:
> Hi Fabian,
> 
> thanks for your answer. Yes, that's what I want.
> 
> The solution you suggest is what I am doing right now (see the last
> bullet point in my question).
> 
> But given your example. I would expect the following output:
> 
> (key: 1, w-time: 10, agg: 17)
> (key: 2, w-time: 10, agg: 20)
> (key: 1, w-time: 20, agg: 30)
> (key: 1, w-time: 20, agg: 30)
> (key: 1, w-time: 20, agg: 30)
> 
> Because the reduce function is evaluated for every incoming event (i.e.
> each key), right?
> 
> Cheers,
> 
> Konstantin
> 
> On 23.11.2015 10:47, Fabian Hueske wrote:
>> Hi Konstantin,
>>
>> let me first summarize to make sure I understood what you are looking for.
>> You computed an aggregate over a keyed event-time window and you are
>> looking for the maximum aggregate for each group of windows over the
>> same period of time.
>> So if you have
>> (key: 1, w-time: 10, agg: 17)
>> (key: 2, w-time: 10, agg: 20)
>> (key: 1, w-time: 20, agg: 30)
>> (key: 2, w-time: 20, agg: 28)
>> (key: 3, w-time: 20, agg: 5)
>>
>> you would like to get:
>> (key: 2, w-time: 10, agg: 20)
>> (key: 1, w-time: 20, agg: 30)
>>
>> If this is correct, you can do this as follows.
>> You can extract the window start and end time from the TimeWindow
>> parameter of the WindowFunction and key the stream either by start or
>> end time and apply a ReduceFunction on the keyed stream.
>>
>> Best, Fabian
>>
>> 2015-11-23 8:41 GMT+01:00 Konstantin Knauf > >:
>>
>> Hi everyone,
>>
>> me again :) Let's say you have a stream, and for every window and key
>> you compute some aggregate value, like this:
>>
>> DataStream.keyBy(..)
>>   .timeWindow(..)
>>   .apply(...)
>>
>>
>> Now I want to get the maximum aggregate value for every window over the
>> keys. This feels like a pretty natural use case. How can I achieve this
>> with Flink in the most compact way?
>>
>> The options I thought of so far are:
>>
>> * Use an allTimeWindow, obviously. Drawback is, that the WindowFunction
>> would not be distributed by keys anymore.
>>
>> * use a windowAll after the WindowFunction to create windows of the
>> aggregates, which originated from the same timeWindow. This could be
>> done either with a TimeWindow or with a GlobalWindow with DeltaTrigger.
>> Drawback: Seems unnecessarily complicated and doubles the latency (at
>> least in my naive implementation ;)).
>>
>> * Of course, you could also just keyBy the start time of the window
>> after the WindowFunction, but then you get more than one event for each
>> window.
>>
>> Is there some easy way I am missing? If not, is there a technical
>> reasons, why such an "reduceByKeyAndWindow"-operator is not available in
>> Flink?
>>
>> Cheers,
>>
>> Konstantin
>>
>>
> 



signature.asc
Description: OpenPGP digital signature


Re: Destroy StreamExecutionEnv

2015-10-05 Thread Matthias J. Sax
Hi,

you just need to terminate your source (ie, return from run() method if
you implement your own source function). This will finish the complete
program. For already available sources, just make sure you read finite
input.
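For example, a finite custom source could look roughly like this (just a
sketch; the element type and count are arbitrary):

> import org.apache.flink.streaming.api.functions.source.SourceFunction;
> 
> public class FiniteSource implements SourceFunction<String> {
>     private volatile boolean running = true;
> 
>     @Override
>     public void run(SourceContext<String> ctx) throws Exception {
>         // emit a fixed number of records, then return -> the job finishes
>         for (int i = 0; i < 100 && running; i++) {
>             ctx.collect("record-" + i);
>         }
>     }
> 
>     @Override
>     public void cancel() {
>         running = false;
>     }
> }
> 
> // usage: env.addSource(new FiniteSource())
> // env.execute(...) returns once run() has returned in all source tasks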

Hope this helps.

-Matthias

On 10/05/2015 12:15 AM, jay vyas wrote:
> Hi folks.
> 
> How do we end a stream execution environment? 
> 
> I have a unit test which runs a streaming job, and want the unit test to
> die after the first round of output is processed...
> 
> 
> DataStream<Tuple2<Map, Integer>> counts =
>     dataStream.map(
>         new MapFunction<String, Tuple2<Map, Integer>>() {
>           @Override
>           public Tuple2<Map, Integer> map(String s) throws Exception {
>             Map transaction = MAPPER.readValue(s, Map.class);
>             return new Tuple2<>(transaction, 1);
>           }
>         });
> counts.print();
> 
> 
> 
> -- 
> jay vyas



signature.asc
Description: OpenPGP digital signature


Re: Usage of Hadoop 2.2.0

2015-09-04 Thread Matthias J. Sax
+1 for dropping

On 09/04/2015 11:04 AM, Maximilian Michels wrote:
> +1 for dropping Hadoop 2.2.0 binary and source-compatibility. The
> release is hardly used and complicates the important high-availability
> changes in Flink.
> 
> On Fri, Sep 4, 2015 at 9:33 AM, Stephan Ewen  wrote:
>> I am good with that as well. Mind that we are not only dropping a binary
>> distribution for Hadoop 2.2.0, but also the source compatibility with 2.2.0.
>>
>>
>>
>> Lets also reconfigure Travis to test
>>
>>  - Hadoop1
>>  - Hadoop 2.3
>>  - Hadoop 2.4
>>  - Hadoop 2.6
>>  - Hadoop 2.7
>>
>>
>> On Fri, Sep 4, 2015 at 6:19 AM, Chiwan Park  wrote:
>>>
>>> +1 for dropping Hadoop 2.2.0
>>>
>>> Regards,
>>> Chiwan Park
>>>
 On Sep 4, 2015, at 5:58 AM, Ufuk Celebi  wrote:

 +1 to what Robert said.

 On Thursday, September 3, 2015, Robert Metzger 
 wrote:
 I think most cloud providers moved beyond Hadoop 2.2.0.
 Google's Click-To-Deploy is on 2.4.1
 AWS EMR is on 2.6.0

 The situation for the distributions seems to be the following:
 MapR 4 uses Hadoop 2.4.0 (current is MapR 5)
 CDH 5.0 uses 2.3.0 (the current CDH release is 5.4)

 HDP 2.0  (October 2013) is using 2.2.0
 HDP 2.1 (April 2014) uses 2.4.0 already

 So both vendors and cloud providers are multiple releases away from
 Hadoop 2.2.0.

 Spark does not offer a binary distribution lower than 2.3.0.

 In addition to that, I don't think that the HDFS client in 2.2.0 is
 really usable in production environments. Users were reporting
 ArrayIndexOutOfBounds exceptions for some jobs, I also had these exceptions
 sometimes.

 The easiest approach  to resolve this issue would be  (a) dropping the
 support for Hadoop 2.2.0
 An alternative approach (b) would be:
  - ship a binary version for Hadoop 2.3.0
  - make the source of Flink still compatible with 2.2.0, so that users
 can compile a Hadoop 2.2.0 version if needed.

 I would vote for approach (a).


 On Tue, Sep 1, 2015 at 5:01 PM, Till Rohrmann 
 wrote:
 While working on high availability (HA) for Flink's YARN execution I
 stumbled across some limitations with Hadoop 2.2.0. From version 2.2.0 to
 2.3.0, Hadoop introduced new functionality which is required for an
 efficient HA implementation. Therefore, I was wondering whether there is
 actually a need to support Hadoop 2.2.0. Is Hadoop 2.2.0 still actively 
 used
 by someone?

 Cheers,
 Till

>>>
>>>
>>>
>>>
>>>
>>



signature.asc
Description: OpenPGP digital signature


Re: How to create a stream of data batches

2015-09-04 Thread Matthias J. Sax
Hi Andres,

you could do this by using your own data type, for example

> public class MyBatch {
>   private ArrayList data = new ArrayList();
> }

In the DataSource, you need to specify your own InputFormat that reads
multiple tuples into a batch and emits the whole batch at once.

However, be aware, that this POJO type hides the batch nature from
Flink, ie, Flink does not know anything about the tuples in the batch.
To Flink a batch is a single tuple. If you want to perform key-based
operations, this might become a problem.
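If you instead go the mapPartition route from the question below, you can at
least control the batch size yourself, independent of the partition size. A
rough sketch (String elements and a batch size of 100 are arbitrary choices):

> import java.util.ArrayList;
> 
> import org.apache.flink.api.common.functions.MapPartitionFunction;
> import org.apache.flink.util.Collector;
> 
> public class Batcher implements MapPartitionFunction<String, ArrayList<String>> {
>     private final int batchSize = 100;
> 
>     @Override
>     public void mapPartition(Iterable<String> values, Collector<ArrayList<String>> out) {
>         ArrayList<String> batch = new ArrayList<String>(batchSize);
>         for (String value : values) {
>             batch.add(value);
>             if (batch.size() == batchSize) {
>                 out.collect(batch);               // emit a full batch
>                 batch = new ArrayList<String>(batchSize);
>             }
>         }
>         if (!batch.isEmpty()) {
>             out.collect(batch);                   // emit the (smaller) last batch
>         }
>     }
> }
> 
> // usage: dataSet.mapPartition(new Batcher())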

-Matthias

On 09/04/2015 01:00 PM, Andres R. Masegosa  wrote:
> Hi,
> 
> I'm trying to code some machine learning algorithms on top of flink, such
> as a variational Bayes learning algorithm. Instead of working at a data
> element level (i.e. using map transformations), it would be far more
> efficient to work at a "batch of elements" level (i.e. I get a batch of
> elements and I produce some output).
> 
> I could code that using the "mapPartition" function, but I cannot control
> the size of the partition, can I?
> 
> Is there any way to transform a stream (or DataSet) of elements into a
> stream (or DataSet) of data batches of the same size?
> 
> 
> Thanks for your support,
> Andres
> 



signature.asc
Description: OpenPGP digital signature


Re: question on flink-storm-examples

2015-09-03 Thread Matthias J. Sax
One more remark that just came to my mind. There is a storm-hdfs module
available: https://github.com/apache/storm/tree/master/external/storm-hdfs

Maybe you can use it. It would be great if you could give feedback if
this works for you.

-Matthias

On 09/02/2015 10:52 AM, Matthias J. Sax wrote:
> Hi,
> StormFileSpout uses a simple FileReader internally and cannot deal with
> HDFS. It would be a nice extension to have. I just opened a JIRA for it:
> https://issues.apache.org/jira/browse/FLINK-2606
> 
> Jerry, feel free to work on this feature and contribute code to Flink ;)
> 
> -Matthias
> 
> On 09/02/2015 07:52 AM, Aljoscha Krettek wrote:
>> Hi Jerry,
>> unfortunately, it seems that the StormFileSpout can only read files from
>> a local filesystem, not from HDFS. Maybe Matthias has something in the
>> works for that.
>>
>> Regards,
>> Aljoscha
>>
>> On Tue, 1 Sep 2015 at 23:33 Jerry Peng <jerry.boyang.p...@gmail.com
>> <mailto:jerry.boyang.p...@gmail.com>> wrote:
>>
>> Ya that what I did and everything seems execute fine but when I try
>> to run the WordCount-StormTopology with a file on hfs I get
>> a java.io.FileNotFoundException :
>>
>> java.lang.RuntimeException: java.io.FileNotFoundException:
>> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt (No such file or
>> directory)
>>
>> at
>> 
>> org.apache.flink.stormcompatibility.util.StormFileSpout.open(StormFileSpout.java:50)
>>
>> at
>> 
>> org.apache.flink.stormcompatibility.wrappers.AbstractStormSpoutWrapper.run(AbstractStormSpoutWrapper.java:102)
>>
>> at
>> 
>> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
>>
>> at
>> 
>> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:58)
>>
>> at
>> 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:172)
>>
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Caused by: java.io.FileNotFoundException:
>> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt (No such file or
>> directory)
>>
>> at java.io.FileInputStream.open(Native Method)
>>
>> at java.io.FileInputStream.<init>(FileInputStream.java:138)
>>
>> at java.io.FileInputStream.<init>(FileInputStream.java:93)
>>
>> at java.io.FileReader.<init>(FileReader.java:58)
>>
>> at
>> 
>> org.apache.flink.stormcompatibility.util.StormFileSpout.open(StormFileSpout.java:48)
>>
>>
>>
>> However I have that file on my hdfs namespace:
>>
>>
>> $ hadoop fs -ls -R /
>>
>> 15/09/01 21:25:11 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java
>> classes where applicable
>>
>> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40 /home
>>
>> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40
>> /home/jerrypeng
>>
>> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:41
>> /home/jerrypeng/hadoop
>>
>> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40
>> /home/jerrypeng/hadoop/dir
>>
>> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-24 16:06
>> /home/jerrypeng/hadoop/hadoop_dir
>>
>> drwxr-xr-x   - jerrypeng supergroup  0 2015-09-01 20:48
>> /home/jerrypeng/hadoop/hadoop_dir/data
>>
>> -rw-r--r--   3 jerrypeng supergroup  18552 2015-09-01 19:18
>> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt
>>
>>     -rw-r--r--   3 jerrypeng supergroup  0 2015-09-01 20:48
>> /home/jerrypeng/hadoop/hadoop_dir/data/result.txt
>>
>> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:41
>> /home/jerrypeng/hadoop/hadoop_dir/dir1
>>
>> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-24 15:59
>> /home/jerrypeng/hadoop/hadoop_dir/test
>>
>> -rw-r--r--   3 jerrypeng supergroup 32 2015-08-24 15:59
>> /home/jerrypeng/hadoop/hadoop_dir/test/filename.txt
>>
>>
>> Any idea what's going on?
>>
>>
>> On Tue, Sep 1, 2015 at 4:20 PM, Matthias J. Sax
>> <mj...@informatik.hu-berlin.de
>> <mailto:mj...@informatik.hu-berlin.de>> wrote:
>>
>> You can use "bin/flink cancel J

Re: question on flink-storm-examples

2015-09-02 Thread Matthias J. Sax
Hi,
StormFileSpout uses a simple FileReader internally and cannot deal with
HDFS. It would be a nice extension to have. I just opened a JIRA for it:
https://issues.apache.org/jira/browse/FLINK-2606

Jerry, feel free to work on this feature and contribute code to Flink ;)

-Matthias

On 09/02/2015 07:52 AM, Aljoscha Krettek wrote:
> Hi Jerry,
> unfortunately, it seems that the StormFileSpout can only read files from
> a local filesystem, not from HDFS. Maybe Matthias has something in the
> works for that.
> 
> Regards,
> Aljoscha
> 
> On Tue, 1 Sep 2015 at 23:33 Jerry Peng <jerry.boyang.p...@gmail.com
> <mailto:jerry.boyang.p...@gmail.com>> wrote:
> 
> Ya that what I did and everything seems execute fine but when I try
> to run the WordCount-StormTopology with a file on hfs I get
> a java.io.FileNotFoundException :
> 
> java.lang.RuntimeException: java.io.FileNotFoundException:
> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt (No such file or
> directory)
> 
> at
> 
> org.apache.flink.stormcompatibility.util.StormFileSpout.open(StormFileSpout.java:50)
> 
> at
> 
> org.apache.flink.stormcompatibility.wrappers.AbstractStormSpoutWrapper.run(AbstractStormSpoutWrapper.java:102)
> 
> at
> 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
> 
> at
> 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:58)
> 
> at
> 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:172)
> 
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
> 
> at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: java.io.FileNotFoundException:
> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt (No such file or
> directory)
> 
> at java.io.FileInputStream.open(Native Method)
> 
> at java.io.FileInputStream.<init>(FileInputStream.java:138)
> 
> at java.io.FileInputStream.<init>(FileInputStream.java:93)
> 
> at java.io.FileReader.<init>(FileReader.java:58)
> 
> at
> 
> org.apache.flink.stormcompatibility.util.StormFileSpout.open(StormFileSpout.java:48)
> 
> 
> 
> However I have that file on my hdfs namespace:
> 
> 
> $ hadoop fs -ls -R /
> 
> 15/09/01 21:25:11 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java
> classes where applicable
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40 /home
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40
> /home/jerrypeng
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:41
> /home/jerrypeng/hadoop
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40
> /home/jerrypeng/hadoop/dir
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-24 16:06
> /home/jerrypeng/hadoop/hadoop_dir
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-09-01 20:48
> /home/jerrypeng/hadoop/hadoop_dir/data
> 
> -rw-r--r--   3 jerrypeng supergroup  18552 2015-09-01 19:18
> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt
> 
> -rw-r--r--   3 jerrypeng supergroup  0 2015-09-01 20:48
> /home/jerrypeng/hadoop/hadoop_dir/data/result.txt
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:41
> /home/jerrypeng/hadoop/hadoop_dir/dir1
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-24 15:59
> /home/jerrypeng/hadoop/hadoop_dir/test
> 
> -rw-r--r--   3 jerrypeng supergroup 32 2015-08-24 15:59
> /home/jerrypeng/hadoop/hadoop_dir/test/filename.txt
> 
> 
> Any idea what's going on?
> 
> 
> On Tue, Sep 1, 2015 at 4:20 PM, Matthias J. Sax
> <mj...@informatik.hu-berlin.de
> <mailto:mj...@informatik.hu-berlin.de>> wrote:
> 
> You can use "bin/flink cancel JOBID" or JobManager WebUI to
>     cancel the
> running job.
> 
> The exception you see, occurs in
> FlinkSubmitter.killTopology(...) which
> is not used by "bin/flink cancel" or JobMaanger WebUI.
> 
> If you compile the example yourself, just remove the call to
> killTopology().
> 
> -Matthias
> 
> On 09/01/2015 11:16 PM, Matthias J. Sax wrote:
> > Oh yes. I forgot about this. I have already a fix for it in a
> pending
> > pull request... I hope that this PR is merged soon...
> >
> > If you want to observe the progress, look here:
> > https://issues.a

Re: How to force the parallelism on small streams?

2015-09-02 Thread Matthias J. Sax
Hi,

If I understand you correctly, you want to have 100 mappers. Thus you
need to apply the .setParallelism() after .map()

> addSource(myFileSource).rebalance().map(myFileMapper).setParallelism(100)

The order of commands you used sets the dop for the source to 100 (which
might be ignored if the provided source function "myFileSource" does
not implement the "ParallelSourceFunction" interface). The dop for the
mapper would then be the default value.

Using .rebalance() is absolutely correct. It distributes the emitted
tuples in a round robin fashion to all consumer tasks.

-Matthias

On 09/02/2015 05:41 PM, LINZ, Arnaud wrote:
> Hi,
> 
>  
> 
> I have a source that provides only a few items, since it emits file names
> for the mappers. The mapper opens the file and processes its records. As the
> files are huge, one input line (a filename) means substantial work for the
> next stage.
> 
> My topology looks like:
> 
> addSource(myFileSource).rebalance().setParallelism(100).map(myFileMapper)
> 
> If 100 mappers are created, about 85 end immediately and only a few
> process the files (for hours). I suspect an optimization such that a
> minimum number of lines must be passed to the next node, or it is
> "shut down"; but in my case I do want the lines to be evenly distributed
> across the mappers.
> 
> How can I enforce that?
> 
>  
> 
> Greetings,
> 
> Arnaud
> 
> 
> 
> 
> L'intégrité de ce message n'étant pas assurée sur internet, la société
> expéditrice ne peut être tenue responsable de son contenu ni de ses
> pièces jointes. Toute utilisation ou diffusion non autorisée est
> interdite. Si vous n'êtes pas destinataire de ce message, merci de le
> détruire et d'avertir l'expéditeur.
> 
> The integrity of this message cannot be guaranteed on the Internet. The
> company that sent this message cannot therefore be held liable for its
> content nor attachments. Any unauthorized use or dissemination is
> prohibited. If you are not the intended recipient of this message, then
> please delete it and notify the sender.



signature.asc
Description: OpenPGP digital signature


Re: question on flink-storm-examples

2015-09-01 Thread Matthias J. Sax
Hi Jerry,

WordCount-StormTopology uses a hard coded dop of 4. If you start up
Flink in local mode (bin/start-local-streaming.sh), you need to increase
the number of task slots to at least 4 in conf/flink-conf.yaml before
starting Flink -> taskmanager.numberOfTaskSlots
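For example, in conf/flink-conf.yaml:

> # conf/flink-conf.yaml
> taskmanager.numberOfTaskSlots: 4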

You should actually see the following exception in
log/flink-...-jobmanager-...log

> NoResourceAvailableException: Not enough free slots available to run the job. 
> You can decrease the operator parallelism or increase the number of slots per 
> TaskManager in the configuration.

WordCount-StormTopology does use StormWordCountRemoteBySubmitter
internally. So, you do use it already ;)

I am not sure what you mean by "get rid of KafkaSource"? It is still in
the code base. Which version to you use? In flink-0.10-SNAPSHOT it is
located in submodule "flink-connector-kafka" (which is submodule of
"flink-streaming-connector-parent" -- which is submodule of
"flink-streamping-parent").


-Matthias


On 09/01/2015 09:40 PM, Jerry Peng wrote:
> Hello,
> 
> I have some questions regarding how to run one of the
> flink-storm-examples, the WordCountTopology.  How should I run the job? 
> On github it says I should just execute
> bin/flink run example.jar but when I execute:
> 
> bin/flink run WordCount-StormTopology.jar 
> 
> nothing happens.  What am I doing wrong, and how can I run the
> WordCountTopology via StormWordCountRemoteBySubmitter? 
> 
> Also why did you guys get rid of the KafkaSource class?  What is the API
> now for subscribing to a kafka source?
> 
> Best,
> 
> Jerry



signature.asc
Description: OpenPGP digital signature


Re: question on flink-storm-examples

2015-09-01 Thread Matthias J. Sax
You can use "bin/flink cancel JOBID" or JobManager WebUI to cancel the
running job.

The exception you see, occurs in FlinkSubmitter.killTopology(...) which
is not used by "bin/flink cancel" or JobMaanger WebUI.

If you compile the example yourself, just remove the call to
killTopology().

-Matthias

On 09/01/2015 11:16 PM, Matthias J. Sax wrote:
> Oh yes. I forgot about this. I have already a fix for it in a pending
> pull request... I hope that this PR is merged soon...
> 
> If you want to observe the progress, look here:
> https://issues.apache.org/jira/browse/FLINK-2111
> and
> https://issues.apache.org/jira/browse/FLINK-2338
> 
> This PR, resolves both and fixed the problem you observed:
> https://github.com/apache/flink/pull/750
> 
> -Matthias
> 
> 
> On 09/01/2015 11:09 PM, Jerry Peng wrote:
>> Hello,
>>
>> I corrected the number of slots for each task manager but now when I try
>> to run the WordCount-StormTopology, the job manager daemon on my master
>> node crashes and I get this exception in the log:
>>
>> java.lang.Exception: Received a message
>> CancelJob(6a4b9aa01ec87db20060210e5b36065e) without a leader session ID,
>> even though the message requires a leader session ID.
>>
>> at
>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:41)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>
>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>
>> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>
>> at
>> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>
>> at
>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:104)
>>
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>
>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>
>> at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>
>> at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>
>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>
>> at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>>
>> It seems to have something to do with canceling of the topology after
>> the sleep.  Any ideas?
>>
>>
>> Best,
>>
>>
>> Jerry
>>
>>
>> On Tue, Sep 1, 2015 at 3:33 PM, Matthias J. Sax
>> <mj...@informatik.hu-berlin.de <mailto:mj...@informatik.hu-berlin.de>>
>> wrote:
>>
>> Yes. That is what I expected.
>>
>> JobManager cannot start the job, due to too few task slots. It logs the
>> exception NoResourceAvailableException (it is not shown in stdout; see
>> "log" folder). There is no feedback to Flink CLI that the job could not
>> be started.
>>
>> Furthermore, WordCount-StormTopology sleeps for 5 seconds and tries to
>> "kill" the job. However, because the job was never started, there is a
>> NotAliveException which is printed to stdout.
>>
>> -Matthias
>>
>>
>>
>> On 09/01/2015 10:26 PM, Jerry Peng wrote:
>> > When I run WordCount-StormTopology I get the following exception:
>> >
>> > ~/flink/bin/flink run WordCount-StormTopology.jar
>> > hdfs:///home/jerrypeng/hadoop/hadoop_dir/data/data.txt
>> > hdfs:///home/jerrypeng/hadoop/hadoop_dir/data/results.txt
>> >
>> > org.apache.flink.client.program.ProgramInvocationException: The main
>> > method caused an error.
>> >
>> > at
>> >
>> 
>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:452)
>> >
>> > at
>> >
>&

Re: question on flink-storm-examples

2015-09-01 Thread Matthias J. Sax
Oh yes. I forgot about this. I have already a fix for it in a pending
pull request... I hope that this PR is merged soon...

If you want to observe the progress, look here:
https://issues.apache.org/jira/browse/FLINK-2111
and
https://issues.apache.org/jira/browse/FLINK-2338

This PR, resolves both and fixed the problem you observed:
https://github.com/apache/flink/pull/750

-Matthias


On 09/01/2015 11:09 PM, Jerry Peng wrote:
> Hello,
> 
> I corrected the number of slots for each task manager but now when I try
> to run the WordCount-StormTopology, the job manager daemon on my master
> node crashes and I get this exception in the log:
> 
> java.lang.Exception: Received a message
> CancelJob(6a4b9aa01ec87db20060210e5b36065e) without a leader session ID,
> even though the message requires a leader session ID.
> 
> at
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:41)
> 
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> 
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> 
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> 
> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> 
> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> 
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> 
> at
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> 
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 
> at
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:104)
> 
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> 
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> 
> at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> 
> at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> 
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 
> 
> It seems to have something to do with canceling of the topology after
> the sleep.  Any ideas?
> 
> 
> Best,
> 
> 
> Jerry
> 
> 
> On Tue, Sep 1, 2015 at 3:33 PM, Matthias J. Sax
> <mj...@informatik.hu-berlin.de <mailto:mj...@informatik.hu-berlin.de>>
> wrote:
> 
> Yes. That is what I expected.
> 
> JobManager cannot start the job, due to too few task slots. It logs the
> exception NoResourceAvailableException (it is not shown in stdout; see
> "log" folder). There is no feedback to Flink CLI that the job could not
> be started.
> 
> Furthermore, WordCount-StormTopology sleeps for 5 seconds and tries to
> "kill" the job. However, because the job was never started, there is a
> NotAliveException which is printed to stdout.
> 
> -Matthias
> 
> 
> 
> On 09/01/2015 10:26 PM, Jerry Peng wrote:
> > When I run WordCount-StormTopology I get the following exception:
> >
> > ~/flink/bin/flink run WordCount-StormTopology.jar
> > hdfs:///home/jerrypeng/hadoop/hadoop_dir/data/data.txt
> > hdfs:///home/jerrypeng/hadoop/hadoop_dir/data/results.txt
> >
> > org.apache.flink.client.program.ProgramInvocationException: The main
> > method caused an error.
> >
> > at
> >
> 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:452)
> >
> > at
> >
> 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
> >
> > at org.apache.flink.client.program.Client.run(Client.java:278)
> >
> > at
> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:631)
> >
> > at org.apache.flink.client.CliFrontend.run(CliFrontend.java:319)
> >
> > at
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:954)
> >
> > at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1004)
> >
> > Caused by: NotAliveException(msg:null)
> >
> > at
> >
> 
> org.apache.flink.stormcompatibility.api.FlinkClient.killTopologyWithOpts(Flink

Re: question on flink-storm-examples

2015-09-01 Thread Matthias J. Sax
Yes. That is what I expected.

The JobManager cannot start the job due to too few task slots. It logs the
exception NoResourceAvailableException (it is not shown on stdout; see the
"log" folder). There is no feedback to the Flink CLI that the job could not
be started.

Furthermore, WordCount-StormTopology sleeps for 5 seconds and tries to
"kill" the job. However, because the job was never started, there is a
NotAliveException, which is printed to stdout.
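
For reference, the slot count is a plain entry in conf/flink-conf.yaml; a
minimal sketch of the relevant line (the value 4 matches the hard-coded dop
of the example, as mentioned in the quoted mail below; Flink reads this file
at startup, so it has to be restarted after the change):

  taskmanager.numberOfTaskSlots: 4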

-Matthias



On 09/01/2015 10:26 PM, Jerry Peng wrote:
> When I run WordCount-StormTopology I get the following exception:
> 
> ~/flink/bin/flink run WordCount-StormTopology.jar
> hdfs:///home/jerrypeng/hadoop/hadoop_dir/data/data.txt
> hdfs:///home/jerrypeng/hadoop/hadoop_dir/data/results.txt
> 
> org.apache.flink.client.program.ProgramInvocationException: The main
> method caused an error.
> 
> at
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:452)
> 
> at
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
> 
> at org.apache.flink.client.program.Client.run(Client.java:278)
> 
> at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:631)
> 
> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:319)
> 
> at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:954)
> 
> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1004)
> 
> Caused by: NotAliveException(msg:null)
> 
> at
> org.apache.flink.stormcompatibility.api.FlinkClient.killTopologyWithOpts(FlinkClient.java:209)
> 
> at
> org.apache.flink.stormcompatibility.api.FlinkClient.killTopology(FlinkClient.java:203)
> 
> at
> org.apache.flink.stormcompatibility.wordcount.StormWordCountRemoteBySubmitter.main(StormWordCountRemoteBySubmitter.java:80)
> 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 
> at java.lang.reflect.Method.invoke(Method.java:483)
> 
> at
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437)
> 
> ... 6 more
> 
> 
> The exception above occurred while trying to run your command.
> 
> 
> Any idea how to fix this?
> 
> On Tue, Sep 1, 2015 at 3:10 PM, Matthias J. Sax
> <mj...@informatik.hu-berlin.de <mailto:mj...@informatik.hu-berlin.de>>
> wrote:
> 
> Hi Jerry,
> 
> WordCount-StormTopology uses a hard coded dop of 4. If you start up
> Flink in local mode (bin/start-local-streaming.sh), you need to increase
> the number of task slots to at least 4 in conf/flink-conf.yaml before
> starting Flink -> taskmanager.numberOfTaskSlots
> 
> You should actually see the following exception in
> log/flink-...-jobmanager-...log
> 
> > NoResourceAvailableException: Not enough free slots available to
> run the job. You can decrease the operator parallelism or increase
> the number of slots per TaskManager in the configuration.
> 
> WordCount-StormTopology does use StormWordCountRemoteBySubmitter
> internally. So, you do use it already ;)
> 
> I am not sure what you mean by "get rid of KafkaSource"? It is still in
> the code base. Which version do you use? In flink-0.10-SNAPSHOT it is
> located in submodule "flink-connector-kafka" (which is a submodule of
> "flink-streaming-connector-parent" -- which is a submodule of
> "flink-streaming-parent").
> 
> 
> -Matthias
> 
> 
> On 09/01/2015 09:40 PM, Jerry Peng wrote:
> > Hello,
> >
> > I have some questions regarding how to run one of the
> > flink-storm-examples, the WordCountTopology.  How should I run the
> job?
> > On github it says I should just execute
> > bin/flink run example.jar but when I execute:
> >
> > bin/flink run WordCount-StormTopology.jar
> >
> > nothing happens.  What am I doing wrong, and how can I run the
> > WordCountTopology via StormWordCountRemoteBySubmitter?
> >
> > Also why did you guys get rid of the KafkaSource class?  What is
> the API
> > now for subscribing to a kafka source?
> >
> > Best,
> >
> > Jerry
> 
> 





Re: Problem with Windowing

2015-08-31 Thread Matthias J. Sax
Can you post your whole program (both versions if possible)?

Otherwise I have only a wild guess: A common mistake is not to assign
the stream variable properly:

DataStream ds = ...

ds = ds.APPLY_FUNCTIONS

ds.APPLY_MORE_FUNCTIONS

In your code example, the assignment is missing -- but maybe it is just
missing in your email.

-Matthias


On 08/31/2015 04:38 PM, Dipl.-Inf. Rico Bergmann wrote:
> Hi!
> 
> I have a problem that I cannot really track down. I'll try to describe
> the issue.
> 
> My streaming Flink program computes something. At the end I'm doing the
> following on my DataStream ds
> ds.window(2, TimeUnit.SECONDS)
> .groupBy(/*custom KeySelector converting input to a String
> representation*/)
> .mapWindow(/*TypeConversion*/)
> .flatten()
> 
> Then the result is written to a Kafka topic.
> 
> The purpose of this is output deduplication within a 2-second window...
> 
> Without the above the program works fine. But with the above I don't get
> any output and no error appears in the log. The program keeps running.
> Am I doing something wrong?
> 
> I would be happy for help!
> 
> Cheers, Rico.





Re: Event time in Flink streaming

2015-08-28 Thread Matthias J. Sax
Hi Martin,

you need to implement your own policy. However, this should not be
complicated. Have a look at TimeTriggerPolicy. You just need to
provide a Timestamp implementation that extracts your ts-attribute from
the tuples.
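
For illustration, a rough sketch of such a Timestamp implementation against
the 0.9-era windowing helpers (the package and method names are given from
memory, and the Tuple2<String, Long> layout is only an assumption -- treat
this as a sketch, not as the definitive API):

  import org.apache.flink.api.java.tuple.Tuple2;
  import org.apache.flink.streaming.api.windowing.helper.Timestamp;

  // extracts the event-time attribute from each record
  public class EventTimestamp implements Timestamp<Tuple2<String, Long>> {
      @Override
      public long getTimestamp(Tuple2<String, Long> value) {
          return value.f1;   // assumption: the timestamp lives in field f1
      }
  }

The extractor can then be handed to the time-based window helpers (or a
TimeTriggerPolicy) instead of relying on system time.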

-Matthias

On 08/28/2015 03:58 PM, Martin Neumann wrote:
 Hej,
 
 I have a stream of timestamped events I want to process in Flink streaming.
 Do I have to write my own policies to do so, or can I define time-based
 windows that use the timestamps instead of the system time?
 
 cheers Martin





Re: Source job parallelism

2015-08-25 Thread Matthias J. Sax
Hi Arnaud,

did you try:

 Env.setSource(mySource).setParallelism(1).map(mymapper).setParallelism(10)

If this does not work, it might be that Flink chains the mapper to the
source, which implies using the same parallelism (and the producer
dictates this dop value).

Using a rebalance() in between should break the chaining:

 Env.setSource(mySource).setParallelism(1).rebalance().map(mymapper).setParallelism(10)
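
For completeness, the same chain written against the DataStream API roughly
looks as follows; MySource and MyMapper are placeholders, not code from this
thread:

  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  env.addSource(new MySource())   // your RichParallelSourceFunction (placeholder)
     .setParallelism(1)
     .rebalance()                 // breaks the source -> mapper chain
     .map(new MyMapper())         // your MapFunction (placeholder)
     .setParallelism(10);
  env.execute();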



-Matthias

On 08/25/2015 07:08 PM, LINZ, Arnaud wrote:
 Hi,
 
  
 
 I have a streaming source that extends RichParallelSourceFunction, but
 for some reason I don’t want parallelism at the source level, so I use :
 
 Env.setSource(mySource).setParallelism(1).map(mymapper)
 
  
 
 I do want parallelism at the mapper level, because it’s a long task, and
 I would like the source to dispatch data to several mappers.
 
  
 
 It seems that I don’t get parallelism on the mapper, as if the
 setParallelism(1) did not apply only to the source.
 
 Is that right? If yes, how can I mix my parallelism levels?
 
  
 
 Best regards,
 
 Arnaud
 
 
 
 





Re: java.lang.ClassNotFoundException when deploying streaming jar locally

2015-08-06 Thread Matthias J. Sax
If I see it correctly your jar contains

 com/davengo/rfidcloud/flink/DaoJoin$1.class

But your error message says

 ClassNotFoundException: com.otter.ist.flink.DaoJoin$1

Those are two different packages. Your jar does not seem to be packaged correctly.


-Matthias

On 08/06/2015 12:46 PM, Michael Huelfenhaus wrote:
 I am back at work next Tuesday; any further ideas would be great until
 then. For now I am continuing inside Eclipse.
 
 Am 06.08.2015 um 11:27 schrieb Michael Huelfenhaus
 m.huelfenh...@davengo.com mailto:m.huelfenh...@davengo.com:
 
 hi,

 how did you build the jar file?

 mvn clean install -Pbuild-jar

 Have you checked whether your classes are in the jar file?

 yes, this seems alright for me

  jar tf target/flink-test-0.1.jar 
 META-INF/MANIFEST.MF
 META-INF/
 com/
 com/davengo/
 com/davengo/rfidcloud/
 com/davengo/rfidcloud/flink/
 com/davengo/rfidcloud/flink/DaoJoin$1.class
 com/davengo/rfidcloud/flink/DaoJoin.class
 com/davengo/rfidcloud/flink/streampojos/
 com/davengo/rfidcloud/flink/streampojos/EpcTuple.class
 log4j.properties
 META-INF/maven/
 META-INF/maven/com.davengo.rfidcloud.flink/
 META-INF/maven/com.davengo.rfidcloud.flink/flink-test/
 META-INF/maven/com.davengo.rfidcloud.flink/flink-test/pom.xml
 META-INF/maven/com.davengo.rfidcloud.flink/flink-test/pom.properties

 Am 06.08.2015 um 11:21 schrieb Robert Metzger rmetz...@apache.org
 mailto:rmetz...@apache.org:

 Hi,

 how did you build the jar file?
 Have you checked whether your classes are in the jar file?

 On Thu, Aug 6, 2015 at 11:08 AM, Michael Huelfenhaus
 m.huelfenh...@davengo.com mailto:m.huelfenh...@davengo.com wrote:

 Hello everybody

 I am trying to build a very simple streaming application with the
 nightly build of Flink 0.10; my code runs fine in Eclipse.

 But when I build and deploy the jar locally I always get
 java.lang.ClassNotFoundException: com.otter.ist.flink.DaoJoin$1

 There is also no plan visible in the web interface.

 I start the local flink 0.10 with start-local-streaming.sh  after
 building it from the git code

 Below you find the complete error, my code and the pom.xml any
 help is appreciated.

 Cheers Michael


 error log from web interface:
 An error occurred while invoking the program:

 The main method caused an error.


 org.apache.flink.runtime.client.JobExecutionException: Job
 execution failed.
 at
 
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:364)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at
 
 org.apache.flink.runtime.LeaderSessionMessages$$anonfun$receive$1.applyOrElse(LeaderSessionMessages.scala:40)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
 at
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
 at
 scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at
 
 org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at
 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:101)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: java.lang.Exception: Call to registerInputOutput() of
 invokable failed
 at
 org.apache.flink.runtime.taskmanager.Task.run(Task.java:526)
 at java.lang.Thread.run(Thread.java:745)
 Caused by:
 org.apache.flink.streaming.runtime.tasks.StreamTaskException:
 Cannot instantiate user function.
 at
 
 

Re: java.lang.ClassNotFoundException when deploying streaming jar locally

2015-08-06 Thread Matthias J. Sax
Never mind. I just saw that this is not the problem...

Sounds weird to me. Maybe you can try to name the class. Anonymous
classes should not be a problem, but it is worth a try.
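
For illustration only (the actual interface behind DaoJoin$1 is not visible
in this thread), "naming the class" means replacing the anonymous inner
class with a named, ideally static, class -- e.g. for a hypothetical
MapFunction:

  import org.apache.flink.api.common.functions.MapFunction;

  // hypothetical example; use whatever function interface DaoJoin$1 actually implements
  public static class MyMapper implements MapFunction<String, String> {
      @Override
      public String map(String value) {
          return value;
      }
  }

  // and then: stream.map(new MyMapper());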

-Matthias



On 08/06/2015 01:51 PM, Matthias J. Sax wrote:
 If I see it correctly your jar contains
 
 com/davengo/rfidcloud/flink/DaoJoin$1.class
 
 But your error message says
 
 ClassNotFoundException: com.otter.ist.flink.DaoJoin$1
 
 Those are two different packages. Your jar does not seem to be packaged correctly.
 
 
 -Matthias
 
 On 08/06/2015 12:46 PM, Michael Huelfenhaus wrote:
 I am back at work next Tuesday; any further ideas would be great until
 then. For now I am continuing inside Eclipse.

 Am 06.08.2015 um 11:27 schrieb Michael Huelfenhaus
 m.huelfenh...@davengo.com mailto:m.huelfenh...@davengo.com:

 hi,

 how did you build the jar file?

 mvn clean install -Pbuild-jar

 Have you checked whether your classes are in the jar file?

 yes, this seems alright for me

 jar tf target/flink-test-0.1.jar 
 META-INF/MANIFEST.MF
 META-INF/
 com/
 com/davengo/
 com/davengo/rfidcloud/
 com/davengo/rfidcloud/flink/
 com/davengo/rfidcloud/flink/DaoJoin$1.class
 com/davengo/rfidcloud/flink/DaoJoin.class
 com/davengo/rfidcloud/flink/streampojos/
 com/davengo/rfidcloud/flink/streampojos/EpcTuple.class
 log4j.properties
 META-INF/maven/
 META-INF/maven/com.davengo.rfidcloud.flink/
 META-INF/maven/com.davengo.rfidcloud.flink/flink-test/
 META-INF/maven/com.davengo.rfidcloud.flink/flink-test/pom.xml
 META-INF/maven/com.davengo.rfidcloud.flink/flink-test/pom.properties

 Am 06.08.2015 um 11:21 schrieb Robert Metzger rmetz...@apache.org
 mailto:rmetz...@apache.org:

 Hi,

 how did you build the jar file?
 Have you checked whether your classes are in the jar file?

 On Thu, Aug 6, 2015 at 11:08 AM, Michael Huelfenhaus
 m.huelfenh...@davengo.com mailto:m.huelfenh...@davengo.com wrote:

 Hello everybody

 I am trying to build a very simple streaming application with the
 nightly build of Flink 0.10; my code runs fine in Eclipse.

 But when I build and deploy the jar locally I always get
 java.lang.ClassNotFoundException: com.otter.ist.flink.DaoJoin$1

 There is also no plan visible in the web interface.

 I start the local flink 0.10 with start-local-streaming.sh  after
 building it from the git code

 Below you find the complete error, my code and the pom.xml any
 help is appreciated.

 Cheers Michael


 error log from web interface:
 An error occurred while invoking the program:

 The main method caused an error.


 org.apache.flink.runtime.client.JobExecutionException: Job
 execution failed.
 at
 
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:364)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at
 
 org.apache.flink.runtime.LeaderSessionMessages$$anonfun$receive$1.applyOrElse(LeaderSessionMessages.scala:40)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at
 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at
 
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
 at
 
 org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
 at
 scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at
 
 org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at
 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:101)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: java.lang.Exception: Call to registerInputOutput() of
 invokable failed

Re: Cluster execution -jar files-

2015-07-16 Thread Matthias J. Sax
As the JavaDoc explains:

* @param jarFiles The JAR files with code that needs to be shipped to
*                 the cluster. If the program uses user-defined functions,
*                 user-defined input formats, or any libraries, those must
*                 be provided in the JAR files.

- external libraries, yes
- your program code, no
  - except your UDFs, those yes
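
A minimal sketch of how the jar is handed over (host, port, and path are
placeholders, not values from this thread):

  import org.apache.flink.api.java.ExecutionEnvironment;

  ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
          "jobmanager-host", 6123, "/path/to/my-job.jar");
  // build and execute the program on `env` as usual; your UDF classes and
  // any external libraries must be contained in my-job.jar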

-Matthias


On 07/16/2015 04:06 PM, Juan Fumero wrote:
 Missing reference:
 
 [1]
 https://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/cluster_execution.html
 
 On Don, 2015-07-16 at 16:04 +0200, Juan Fumero wrote:
 Hi, 
   I would like to use the createRemoteEnvironment to run the application
 in a cluster and I have some questions. Following the documentation in
 [1], it is not clear to me how to use it.

 What should be the content of the jar file? All the external libraries
 that I use, or do I need to include the program's map/reduce code to be
 distributed as well? In the latter case, why should I redefine all the
 operations again in the main source? Shouldn't they be included in the jar files?

 Many thanks
 Juan

 
 





Re: why when use orders.aggregate(Aggregations.MAX, 2) not return one value but return more value

2015-07-08 Thread Matthias J. Sax
This is your code (it applies the print before the aggregation result is used):

 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 DataSet<Orders> orders = (DataSet<Orders>)
     env.readCsvFile("/home/hadoop/Desktop/Dataset/orders.csv")
        .fieldDelimiter('|')
        .includeFields(mask).ignoreFirstLine()
        .tupleType(get_Order().getClass());
 orders.aggregate(Aggregations.MAX, 2);

 orders.print();

You need to put the print directly after the aggregate() or use a new
variable:

  orders.aggregate(Aggregations.MAX, 2).print();

or

  DataSet<Orders> aggedOrders = orders.aggregate(Aggregations.MAX, 2);
  aggedOrders.print();


-Matthias

On 07/08/2015 10:30 PM, hagersaleh wrote:
 I did not understand what you mean
 
 
 
 





Re: Job Statistics

2015-06-18 Thread Matthias J. Sax
Hi,

the CLI cannot show any job statistics. However, you can use the
JobManager web interface that is accessible at port 8081 from a browser.

-Matthias


On 06/17/2015 10:13 PM, Jean Bez wrote:
 Hello,
 
 Is it possible to view job statistics after a job has finished executing,
 directly from the command line? If so, could you please explain how? I
 could not find any mention of this in the docs. I also tried to set
 the logs to debug mode, but no other information was presented.
 
 Thank you!
 
 Regards,
 Jean





Re: Memory in local setting

2015-06-17 Thread Matthias J. Sax
Hi,

look at slide 35 for more details about memory configuration:
http://www.slideshare.net/robertmetzger1/apache-flink-hands-on

-Matthias


On 06/17/2015 09:29 AM, Chiwan Park wrote:
 Hi.
 
 You can increase the memory given to Flink by increasing the JVM heap
 size when running locally.
 If you are using Eclipse as your IDE, add a “-Xmx<heap size>” option to the
 run configuration [1].
 If you are using IntelliJ IDEA, you can increase the JVM heap in the
 same way [2].
 
 [1] 
 http://help.eclipse.org/luna/index.jsp?topic=%2Forg.eclipse.jdt.doc.user%2Ftasks%2Ftasks-java-local-configuration.htm
 [2] 
 https://www.jetbrains.com/idea/help/creating-and-editing-run-debug-configurations.html
 
 Regards,
 Chiwan Park
 
 On Jun 17, 2015, at 2:01 PM, Sebastian s...@apache.org wrote:

 Hi,

 Flink has memory problems when I run an algorithm from my local IDE on a 2GB 
 graph. Is there any way that I can increase the memory given to Flink?

 Best,
 Sebastian

 Caused by: java.lang.RuntimeException: Memory ran out. numPartitions: 32 
 minPartition: 4 maxPartition: 4 number of overflow segments: 151 bucketSize: 
 146 Overall memory: 14024704 Partition memory: 4194304
  at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.getNextBuffer(CompactingHashTable.java:784)
  at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertBucketEntryFromSearch(CompactingHashTable.java:668)
  at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:538)
  at 
 org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
  at 
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
  at 
 org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
  at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
  at java.lang.Thread.run(Thread.java:745)
 
 
 
 
 
 
 





Re: Flink Streaming State Management

2015-06-17 Thread Matthias J. Sax
Hi Hilmi,

currently, this is not supported. However, state management is already
work in progress and should be available soon. See
https://github.com/apache/flink/pull/747

-Matthias

On 06/17/2015 09:36 AM, Hilmi Yildirim wrote:
 Hi,
 does Flink Streaming support state management? For example, I have a
 state which will be used inside the streaming operations but the state
 can be updated.
 
 For example:
 stream.map( use state for operation).updateState(update state).
 
 
 Best Regards,
 Hilmi
 
 
 
 





Re: Random Shuffling

2015-06-15 Thread Matthias J. Sax
I think you need to implement your own Partitioner and hand it in via
DataSet.partitionCustom(partitioner, field).

(Just specify any field you like; as you don't want to group by key, it
doesn't matter.)

When implementing the partitioner, you can ignore the key parameter and
compute the output channel randomly.

This is kind of a work-around, but it should work.
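
A minimal sketch of such a partitioner (the Long key type and the field
index used below are arbitrary choices, as noted above):

  import java.util.Random;
  import org.apache.flink.api.common.functions.Partitioner;

  public class RandomPartitioner implements Partitioner<Long> {
      private final Random random = new Random();

      @Override
      public int partition(Long key, int numPartitions) {
          return random.nextInt(numPartitions);   // ignore the key, pick a channel at random
      }
  }

  // usage: dataSet.partitionCustom(new RandomPartitioner(), 0);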


-Matthias

On 06/15/2015 01:49 PM, Maximilian Alber wrote:
 Hi Flinksters,
 
 I would like to shuffle my elements in the data set and then split it in
 two according to some ratio. Each element in the data set has an unique
 id. Is there a nice way to do it with the flink api?
 (It would be nice to have guaranteed random shuffling.)
 Thanks!
 
 Cheers,
 Max





Re: I want run flink program in ubuntu x64 Mult Node Cluster what is configuration?

2015-06-08 Thread Matthias J. Sax
Can you please share the whole console output? From this short message,
it is unclear what the problem might be.


On 06/08/2015 10:24 AM, hagersaleh wrote:
 when I run the program
 
 it displays
 error: null
 
 
 
 





Re: I want run flink program in ubuntu x64 Mult Node Cluster what is configuration?

2015-06-08 Thread Matthias J. Sax
Are you sure that the TaskManager registered with the JobManager
correctly? You can check on the master machine in your browser:

 localhost:8081

Additionally, you might need to increase the number of slots for the
TaskManager. Increase 'taskmanager.numberOfTaskSlots' within
conf/flink-conf.yaml.


-Matthias


On 06/08/2015 10:52 AM, hagersaleh wrote:
 I changed the IP to a good IP,
 but when I run the program:
 Error: The program execution failed:
 org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
 Not enough free slots available to run the job. You can decrease the
 operator parallelism or increase the number of slots per TaskManager in the
 configuration. Task to schedule:  Attempt #0 (CHAIN DataSource (at
 main(TPCHQuery3.java:53) (org.apache.flink.api.java.io.CsvInputFormat)) -
 Filter (Filter at main(TPCHQuery3.java:57)) (1/1)) @ (unassigned) -
 [SCHEDULED]  with groupID  ee0fdf61005cacb7748513264adf86f1  in sharing
 group  SlotSharingGroup [5cb8ec8990d20836f80f8db43c61e0b0,
 58e498ffd5b50c985489b6c2ca9391b0, ee0fdf61005cacb7748513264adf86f1] .
 Resources available to scheduler: Number of instances=0, total number of
 slots=0, available slots=0
 
 
 
 --
 View this message in context: 
 http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/I-want-run-flink-program-in-ubuntu-x64-Mult-Node-Cluster-what-is-configuration-tp1444p1537.html
 Sent from the Apache Flink User Mailing List archive. mailing list archive at 
 Nabble.com.
 





Re: Best way to join with inequalities (historical data)

2015-05-04 Thread Matthias J. Sax
Hi,

there is no other system support for expressing this join (beyond the cross product).

However, you could perform some hand-wired optimization by
partitioning your input data into distinct intervals. It might be tricky,
though, especially if the time ranges in your range-key dataset are
overlapping everywhere (-> data replication is necessary for the overlapping
parts).

But it might be worth the effort if you can't get the job done using
cross-product. How large are your data sets? What hardware are you using?
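
To make the interval idea a bit more concrete, here is a rough sketch that
reuses the Tuple2/Tuple3 layout from the quoted question below. The one-hour
bucket width is an arbitrary assumption, and overlapping validity ranges are
handled by replicating each historical record into every bucket it touches:

  import org.apache.flink.api.common.functions.FilterFunction;
  import org.apache.flink.api.common.functions.FlatMapFunction;
  import org.apache.flink.api.common.functions.MapFunction;
  import org.apache.flink.api.java.DataSet;
  import org.apache.flink.api.java.tuple.Tuple2;
  import org.apache.flink.api.java.tuple.Tuple3;
  import org.apache.flink.api.java.tuple.Tuple4;
  import org.apache.flink.util.Collector;

  final long bucketMs = 60 * 60 * 1000L;   // bucket width: 1 hour (assumption)

  // (bucket, eventName, eventTs)
  DataSet<Tuple3<Long, String, Long>> bucketedEvents = events.map(
      new MapFunction<Tuple2<String, Long>, Tuple3<Long, String, Long>>() {
          @Override
          public Tuple3<Long, String, Long> map(Tuple2<String, Long> e) {
              return new Tuple3<Long, String, Long>(e.f1 / bucketMs, e.f0, e.f1);
          }
      });

  // (bucket, validFrom, validTo, label), replicated into every bucket the range touches
  DataSet<Tuple4<Long, Long, Long, String>> bucketedHist = historical.flatMap(
      new FlatMapFunction<Tuple3<Long, Long, String>, Tuple4<Long, Long, Long, String>>() {
          @Override
          public void flatMap(Tuple3<Long, Long, String> h,
                              Collector<Tuple4<Long, Long, Long, String>> out) {
              for (long b = h.f0 / bucketMs; b <= h.f1 / bucketMs; b++) {
                  out.collect(new Tuple4<Long, Long, Long, String>(b, h.f0, h.f1, h.f2));
              }
          }
      });

  // equi-join on the bucket, then keep only pairs where the event falls into the validity range
  bucketedEvents.join(bucketedHist).where(0).equalTo(0)
      .filter(new FilterFunction<Tuple2<Tuple3<Long, String, Long>, Tuple4<Long, Long, Long, String>>>() {
          @Override
          public boolean filter(Tuple2<Tuple3<Long, String, Long>, Tuple4<Long, Long, Long, String>> p) {
              return p.f0.f2 >= p.f1.f1 && p.f0.f2 <= p.f1.f2;
          }
      });

Whether this beats the plain cross + filter depends on how selective the
buckets are; very long validity ranges blow up the replication, which is the
trade-off mentioned above.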


-Matthias


On 05/04/2015 10:47 AM, LINZ, Arnaud wrote:
 Hello,
 
  
 
 I was wondering how to join large data sets on inequalities.
 
  
 
 Let's say I have a data set whose “keys” are two timestamps (start time and
 end time of validity) and whose value is a label:
 
 final DataSet<Tuple3<Long, Long, String>> historical = …;
 
  
 
 I also have events, with an event name and a timestamp :
 
 final DataSet<Tuple2<String, Long>> events = …;
 
  
 
 I want to join my events with my historical data to get the “active”
 label for the time of the event.
 
 The simple way is to use a cross product + a filter :
 
  
 
 events.cross(historical).filter((crossedRow) -> {
 
     return (crossedRow.f0.f1 >= crossedRow.f1.f0)
         && (crossedRow.f0.f1 <= crossedRow.f1.f1);
 
 })
 
  
 
 But that’s not efficient with 2 big data sets…
 
  
 
 How would you code that?
 
  
 
 Greetings,
 
 Arnaud
 
  
 
  
 
  
 
  
 
 
 
 


