[jira] [Closed] (FLINK-7002) Partitioning broken if enum is used in compound key specified using field expression

2019-10-04 Thread Nico Kruber (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-7002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-7002.
--
Resolution: Won't Fix

Actually, this is not a Flink issue, but an issue with enums in Java: their 
{{hashCode}} implementation relies on the enum instance's identity (effectively 
its memory address) and may therefore differ in each JVM.

You could instead use the enum's ordinal or its name in the key selector 
implementation.
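
For illustration only (not part of the original report), a minimal key-selector 
sketch along those lines; the {{Event}} type with {{getId()}} and an enum-typed 
{{getType()}} is merely assumed here, mirroring the attached jobs:

{code:java}
// Hypothetical sketch: key by the enum's name instead of the enum value itself,
// so the key's hashCode is stable across JVMs and partitioning stays consistent.
public class StableKeySelector implements org.apache.flink.api.java.functions.KeySelector<Event, String> {
    @Override
    public String getKey(Event value) {
        // String#hashCode is deterministic, unlike the enum's identity-based hashCode
        return value.getId() + "#" + value.getType().name();
    }
}
{code}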

Please also refer to this for some more info:
https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange

> Partitioning broken if enum is used in compound key specified using field 
> expression
> 
>
> Key: FLINK-7002
> URL: https://issues.apache.org/jira/browse/FLINK-7002
> Project: Flink
>  Issue Type: Bug
>  Components: API / Type Serialization System
>Affects Versions: 1.2.0, 1.3.1
>Reporter: Sebastian Klemke
>Priority: Major
> Attachments: TestJob.java, WorkingTestJob.java, testdata.avro
>
>
> When groupBy() or keyBy() is used with multiple field expressions, at least 
> one of them being an enum type serialized using EnumTypeInfo, partitioning 
> seems random, resulting in incorrectly grouped/keyed output 
> datasets/datastreams.
> The attached Flink DataSet API jobs and the test dataset detail the issue: 
> Both jobs count (id, type) occurrences, TestJob uses field expressions to 
> group, WorkingTestJob uses a KeySelector function.
> Expected output for both is 6 records, with a frequency value of 100_000 each. If 
> you run in a LocalEnvironment, the results are in fact equivalent. But when run on 
> a cluster with 5 TaskManagers, only the KeySelector function with a String key 
> produces correct results, whereas field expressions produce random, 
> non-repeatable, wrong results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-5334) outdated scala SBT quickstart example

2019-09-23 Thread Nico Kruber (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-5334:
---
Component/s: Documentation

> outdated scala SBT quickstart example
> -
>
> Key: FLINK-5334
> URL: https://issues.apache.org/jira/browse/FLINK-5334
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation, Quickstarts
>Affects Versions: 1.7.0
>Reporter: Nico Kruber
>Priority: Major
>
> The scala quickstart set up via sbt-quickstart.sh or from the repository at 
> https://github.com/tillrohrmann/flink-project seems outdated compared to what 
> is set up with the maven archetype, e.g. Job.scala vs. BatchJob.scala and 
> StreamingJob.scala. This should probably be updated; also, the hard-coded 
> example in sbt-quickstart.sh on the web page could be removed in favour of 
> downloading the newest version, as the mvn command does.
> see 
> https://ci.apache.org/projects/flink/flink-docs-release-1.2/quickstart/scala_api_quickstart.html
>  for these two paths (SBT vs. Maven)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-5334) outdated scala SBT quickstart example

2019-09-23 Thread Nico Kruber (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-5334:
---
Affects Version/s: 1.8.2
   1.9.0

> outdated scala SBT quickstart example
> -
>
> Key: FLINK-5334
> URL: https://issues.apache.org/jira/browse/FLINK-5334
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation, Quickstarts
>Affects Versions: 1.7.0, 1.8.2, 1.9.0
>Reporter: Nico Kruber
>Priority: Major
>
> The scala quickstart set up via sbt-quickstart.sh or from the repository at 
> https://github.com/tillrohrmann/flink-project seems outdated compared to what 
> is set up with the maven archetype, e.g. Job.scala vs. BatchJob.scala and 
> StreamingJob.scala. This should probably be updated; also, the hard-coded 
> example in sbt-quickstart.sh on the web page could be removed in favour of 
> downloading the newest version, as the mvn command does.
> see 
> https://ci.apache.org/projects/flink/flink-docs-release-1.2/quickstart/scala_api_quickstart.html
>  for these two paths (SBT vs. Maven)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-5334) outdated scala SBT quickstart example

2019-09-23 Thread Nico Kruber (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936183#comment-16936183
 ] 

Nico Kruber commented on FLINK-5334:


Actually, the script asks for a Flink and Scala version but then does not take 
them into account when creating the example project.

> outdated scala SBT quickstart example
> -
>
> Key: FLINK-5334
> URL: https://issues.apache.org/jira/browse/FLINK-5334
> Project: Flink
>  Issue Type: Bug
>  Components: Quickstarts
>Affects Versions: 1.7.0
>Reporter: Nico Kruber
>Priority: Major
>
> The scala quickstart set up via sbt-quickstart.sh or from the repository at 
> https://github.com/tillrohrmann/flink-project seems outdated compared to what 
> is set up with the maven archetype, e.g. Job.scala vs. BatchJob.scala and 
> StreamingJob.scala. This should probably be updated; also, the hard-coded 
> example in sbt-quickstart.sh on the web page could be removed in favour of 
> downloading the newest version, as the mvn command does.
> see 
> https://ci.apache.org/projects/flink/flink-docs-release-1.2/quickstart/scala_api_quickstart.html
>  for these two paths (SBT vs. Maven)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14104) Bump Jackson to 2.9.9.3

2019-09-17 Thread Nico Kruber (Jira)
Nico Kruber created FLINK-14104:
---

 Summary: Bump Jackson to 2.9.9.3
 Key: FLINK-14104
 URL: https://issues.apache.org/jira/browse/FLINK-14104
 Project: Flink
  Issue Type: Bug
  Components: BuildSystem / Shaded
Affects Versions: shaded-8.0, shaded-7.0
Reporter: Nico Kruber
Assignee: Nico Kruber


Our current Jackson version (2.9.8) is vulnerable to at least this CVE:
https://nvd.nist.gov/vuln/detail/CVE-2019-14379

Bumping to 2.9.9.3 should solve it.
See https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.9



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-13771) Support kqueue Netty transports (MacOS)

2019-09-05 Thread Nico Kruber (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923797#comment-16923797
 ] 

Nico Kruber commented on FLINK-13771:
-

[~aitozi] I'm not working on this, and I also do not know how much it is worth 
since Mac servers (running Flink, in particular) are not really widespread, 
afaik. However, the actual implementation overhead should be low.

I'll assign you to the issue and can have a look at the PR when you are done.
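
For reference (illustration only, not from this issue), Netty's kqueue transport 
mirrors the epoll one, so the wiring behind {{taskmanager.network.netty.transport}} 
would roughly amount to something like:

{code:java}
// Minimal sketch of Netty's macOS/BSD native transport (classes from
// io.netty.channel.kqueue, shipped in netty-transport-native-kqueue); Flink
// would pick these instead of the NIO/epoll counterparts when "kqueue" is
// configured via taskmanager.network.netty.transport.
KQueueEventLoopGroup bossGroup = new KQueueEventLoopGroup(1);
KQueueEventLoopGroup workerGroup = new KQueueEventLoopGroup();
ServerBootstrap bootstrap = new ServerBootstrap()
        .group(bossGroup, workerGroup)
        .channel(KQueueServerSocketChannel.class);
{code}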

> Support kqueue Netty transports (MacOS)
> ---
>
> Key: FLINK-13771
> URL: https://issues.apache.org/jira/browse/FLINK-13771
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Nico Kruber
>Priority: Major
>
> It seems like Netty is now also supporting MacOS's native transport 
> {{kqueue}}:
> https://netty.io/wiki/native-transports.html#using-the-macosbsd-native-transport
> We should allow this via {{taskmanager.network.netty.transport}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (FLINK-13771) Support kqueue Netty transports (MacOS)

2019-09-05 Thread Nico Kruber (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reassigned FLINK-13771:
---

Assignee: Aitozi

> Support kqueue Netty transports (MacOS)
> ---
>
> Key: FLINK-13771
> URL: https://issues.apache.org/jira/browse/FLINK-13771
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Nico Kruber
>Assignee: Aitozi
>Priority: Major
>
> It seems like Netty is now also supporting MacOS's native transport 
> {{kqueue}}:
> https://netty.io/wiki/native-transports.html#using-the-macosbsd-native-transport
> We should allow this via {{taskmanager.network.netty.transport}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-12122) Spread out tasks evenly across all available registered TaskManagers

2019-09-04 Thread Nico Kruber (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922710#comment-16922710
 ] 

Nico Kruber commented on FLINK-12122:
-

[~anton_ryabtsev] True, memory could become an issue in certain scenarios. 
However, I don't get the GC part: native memory for threads shouldn't be subject 
to GC, network buffers are pre-allocated and off-heap, and the tasks' load is, 
in sum, the same, just spread more widely.

> Spread out tasks evenly across all available registered TaskManagers
> 
>
> Key: FLINK-12122
> URL: https://issues.apache.org/jira/browse/FLINK-12122
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.6.4, 1.7.2, 1.8.0
>Reporter: Till Rohrmann
>Priority: Major
> Attachments: image-2019-05-21-12-28-29-538.png, 
> image-2019-05-21-13-02-50-251.png
>
>
> With Flip-6, we changed the default behaviour how slots are assigned to 
> {{TaskManages}}. Instead of evenly spreading it out over all registered 
> {{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a 
> tendency to first fill up a TM before using another one. This is a regression 
> wrt the pre Flip-6 code.
> I suggest to change the behaviour so that we try to evenly distribute slots 
> across all available {{TaskManagers}} by considering how many of their slots 
> are already allocated.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-10177) Use network transport type AUTO by default

2019-08-22 Thread Nico Kruber (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913127#comment-16913127
 ] 

Nico Kruber commented on FLINK-10177:
-

One thing to consider/test: if, for whatever reason, one TM ended up with 
NIO and another with epoll, they should theoretically work together, but this 
should be verified. After all, this is just the local channel-listening 
implementation.

On the other hand, most deployments should be homogeneous and therefore not end 
up in that scenario.
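
(For completeness, and only as an illustration: until the default changes, users 
can already opt in explicitly, e.g. programmatically as sketched below or via the 
equivalent flink-conf.yaml entry.)

{code:java}
import org.apache.flink.configuration.Configuration;

// Sketch: explicitly request the AUTO transport, which picks epoll where the
// native library is available and falls back to NIO otherwise.
Configuration config = new Configuration();
config.setString("taskmanager.network.netty.transport", "auto");
{code}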

> Use network transport type AUTO by default
> --
>
> Key: FLINK-10177
> URL: https://issues.apache.org/jira/browse/FLINK-10177
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration, Runtime / Network
>Affects Versions: 1.6.0, 1.7.0
>Reporter: Nico Kruber
>Assignee: boshu Zheng
>Priority: Major
>
> Now that the shading issue with the native library is fixed (FLINK-9463), 
> EPOLL should be available on (all?) Linux distributions and provide some 
> efficiency gain (if enabled). Therefore, 
> {{taskmanager.network.netty.transport}} should be set to {{auto}} by default. 
> If EPOLL is not available, it will automatically fall back to NIO which 
> currently is the default.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-13770) Bump Netty to 4.1.39.Final

2019-08-22 Thread Nico Kruber (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913124#comment-16913124
 ] 

Nico Kruber commented on FLINK-13770:
-

see https://issues.apache.org/jira/browse/FLINK-10177

> Bump Netty to 4.1.39.Final
> --
>
> Key: FLINK-13770
> URL: https://issues.apache.org/jira/browse/FLINK-13770
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I quickly went through all the changelogs from Netty 4.1.32 (which we
> currently use) to the latest Netty 4.1.39.Final. Below, you will find a
> list of bug fixes and performance improvements that may affect us. These are
> nice changes we could benefit from, also for the Java > 8 efforts. The most
> important ones, fixing leaks etc., are #8921, #9167, #9274, #9394, and the
> various {{CompositeByteBuf}} fixes. The rest are mostly performance
> improvements.
> Since we are still early in the dev cycle for Flink 1.10, it would be
> nice to update now and verify that the new version works correctly.
> {code}
> Netty 4.1.33.Final
> - Fix ClassCastException and native crash when using kqueue transport
> (#8665)
> - Provide a way to cache the internal nioBuffer of the PooledByteBuffer
> to reduce GC (#8603)
> Netty 4.1.34.Final
> - Do not use GetPrimitiveArrayCritical(...) due to multiple not-fixed bugs
> related to GCLocker (#8921)
> - Correctly monkey-patch id also when os / arch is used within the library
> name (#8913)
> - Further reduce ensureAccessible() overhead (#8895)
> - Support using an Executor to offload blocking / long-running tasks
> when processing TLS / SSL via the SslHandler (#8847)
> - Minimize memory footprint for AbstractChannelHandlerContext for
> handlers that execute in the EventExecutor (#8786)
> - Fix three bugs in CompositeByteBuf (#8773)
> Netty 4.1.35.Final
> - Fix possible ByteBuf leak when CompositeByteBuf is resized (#8946)
> - Correctly produce ssl alert when certificate validation fails on the
> client-side when using native SSL implementation (#8949)
> Netty 4.1.37.Final
> - Don't filter out TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (#9274)
> - Try to mark child channel writable again once the parent channel
> becomes writable (#9254)
> - Properly debounce wakeups (#9191)
> - Don't read from timerfd and eventfd on each EventLoop tick (#9192)
> - Correctly detect that KeyManagerFactory is not supported when using
> OpenSSL 1.1.0+ (#9170)
> - Fix possible unsafe sharing of internal NIO buffer in CompositeByteBuf
> (#9169)
> - KQueueEventLoop won't unregister active channels reusing a file
> descriptor (#9149)
> - Prefer direct io buffers if direct buffers pooled (#9167)
> Netty 4.1.38.Final
> - Prevent ByteToMessageDecoder from overreading when !isAutoRead (#9252)
> - Correctly take length of ByteBufInputStream into account for
> readLine() / readByte() (#9310)
> - availableSharedCapacity will be slowly exhausted (#9394)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (FLINK-12983) Replace descriptive histogram's storage back-end

2019-08-21 Thread Nico Kruber (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12983.
---
Fix Version/s: 1.10.0
   Resolution: Fixed

merged to master via f57a615
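
(For context, an illustration of the circular-array idea benchmarked in the 
description below; the names are made up and this is not the merged code.)

{code:java}
// Fixed-size circular buffer for histogram samples: new values overwrite the
// oldest ones by wrapping the write position around, so no array resizing or
// copying is needed.
final class CircularDoubleArray {
    private final double[] values;
    private int nextPos;  // next write position
    private int count;    // number of valid samples, at most values.length

    CircularDoubleArray(int windowSize) {
        this.values = new double[windowSize];
    }

    void addValue(double value) {
        values[nextPos] = value;
        nextPos = (nextPos + 1) % values.length;  // wrap around
        if (count < values.length) {
            count++;
        }
    }
}
{code}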

> Replace descriptive histogram's storage back-end
> 
>
> Key: FLINK-12983
> URL: https://issues.apache.org/jira/browse/FLINK-12983
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{DescriptiveStatistics}} relies on commons-math's {{ResizableDoubleArray}} for 
> storing the double values of its histograms. However, this constantly resizes an 
> internal array and seems to have quite some overhead.
> Additionally, we're not using {{SynchronizedDescriptiveStatistics}} which, 
> according to its docs, we should. Currently, we seem to be somewhat safe 
> because {{ResizableDoubleArray}} has some synchronized parts, but these are 
> scheduled to go away with commons-math version 4.
> Internal tests with the current implementation, one based on a linear array 
> of twice the histogram size (moving values back to the start once the 
> window reaches the end), and one using a circular array (wrapping around with 
> a flexible start position) have shown these numbers using the optimised code 
> from FLINK-10236, FLINK-12981, and FLINK-12982:
> # only adding values to the histogram
> {code}
> Benchmark                                       Mode  Cnt       Score      Error   Units
> HistogramBenchmarks.dropwizardHistogramAdd     thrpt   30   47985.359 ±   25.847  ops/ms
> HistogramBenchmarks.descriptiveHistogramAdd    thrpt   30   70158.792 ±  276.858  ops/ms
> --- with FLINK-10236, FLINK-12981, and FLINK-12982 ---
> HistogramBenchmarks.descriptiveHistogramAdd    thrpt   30   75303.040 ±  475.355  ops/ms
> HistogramBenchmarks.descrHistogramCircularAdd  thrpt   30  200906.902 ±  384.483  ops/ms
> HistogramBenchmarks.descrHistogramLinearAdd    thrpt   30  189788.728 ±  233.283  ops/ms
> {code}
> # after adding each value, also retrieving a common set of metrics:
> {code}
> Benchmark                                    Mode  Cnt    Score   Error   Units
> HistogramBenchmarks.dropwizardHistogram     thrpt   30  400.274 ± 4.930  ops/ms
> HistogramBenchmarks.descriptiveHistogram    thrpt   30  124.533 ± 1.060  ops/ms
> --- with FLINK-10236, FLINK-12981, and FLINK-12982 ---
> HistogramBenchmarks.descriptiveHistogram    thrpt   30  251.895 ± 1.809  ops/ms
> HistogramBenchmarks.descrHistogramCircular  thrpt   30  301.068 ± 2.077  ops/ms
> HistogramBenchmarks.descrHistogramLinear    thrpt   30  234.050 ± 5.485  ops/ms
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (FLINK-12982) Make DescriptiveStatisticsHistogramStatistics a true point-in-time snapshot

2019-08-21 Thread Nico Kruber (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12982.
---
Fix Version/s: 1.10.0
   Resolution: Fixed

merged in master via 4452be3
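
(Illustration only, not the merged code: the "one go" computation mentioned in 
the description below boils down to a single pass over the snapshot's values.)

{code:java}
// Single pass computing min, max, mean and (population) standard deviation,
// instead of iterating over the values array four times.
static double[] minMaxMeanStd(double[] samples) {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    double sumOfSquares = 0.0;
    for (double v : samples) {
        min = Math.min(min, v);
        max = Math.max(max, v);
        sum += v;
        sumOfSquares += v * v;
    }
    double mean = sum / samples.length;
    double variance = Math.max(0.0, sumOfSquares / samples.length - mean * mean);
    return new double[] {min, max, mean, Math.sqrt(variance)};
}
{code}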

> Make DescriptiveStatisticsHistogramStatistics a true point-in-time snapshot
> ---
>
> Key: FLINK-12982
> URL: https://issues.apache.org/jira/browse/FLINK-12982
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Instead of redirecting {{DescriptiveStatisticsHistogramStatistics}} calls to 
> {{DescriptiveStatistics}}, it takes a point-in-time snapshot using its own
> {{UnivariateStatistic}} implementation that
>  * calculates min, max, mean, and standard deviation in one go (as opposed to 
> four iterations over the values array!)
>  * caches pivots for the percentile calculation to speed up retrieval of 
> multiple percentiles/quartiles
> This is also similar to the semantics of our implementation using codahale's 
> {{DropWizard}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (FLINK-12981) Ignore NaN values in histogram's percentile implementation

2019-08-21 Thread Nico Kruber (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12981.
---
Fix Version/s: 1.10.0
   Resolution: Fixed

merged into master via  e59b9d2

> Ignore NaN values in histogram's percentile implementation
> --
>
> Key: FLINK-12981
> URL: https://issues.apache.org/jira/browse/FLINK-12981
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Histogram metrics use {{long}} values; therefore, there is no {{Double.NaN}} 
> in {{DescriptiveStatistics}}' data and no need to cleanse it while 
> working with it.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13793) Build different language docs in parallel

2019-08-20 Thread Nico Kruber (Jira)
Nico Kruber created FLINK-13793:
---

 Summary: Build different language docs in parallel
 Key: FLINK-13793
 URL: https://issues.apache.org/jira/browse/FLINK-13793
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Reporter: Nico Kruber
Assignee: Nico Kruber


Unfortunately, jekyll lacks parallel builds and thus leaves available resources 
unused. In the special case of building the documentation without serving it, we 
could build each language (en, zh) in a separate sub-process and thus get some 
parallelization.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-13791) Speed up sidenav by using group_by

2019-08-19 Thread Nico Kruber (Jira)
Nico Kruber created FLINK-13791:
---

 Summary: Speed up sidenav by using group_by
 Key: FLINK-13791
 URL: https://issues.apache.org/jira/browse/FLINK-13791
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Reporter: Nico Kruber
Assignee: Nico Kruber


{{_includes/sidenav.html}} iterates over {{pages_by_language}} again and again 
to find children when building the (recursive) side navigation. We 
could do this once with a {{group_by}} instead.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (FLINK-12984) Only call Histogram#getStatistics() once per set of retrieved statistics

2019-08-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12984.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

fixed on master via d9f012746f5b8b36ebb416f70e9f5bac93538d5d
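
(As an illustration of the usage pattern this enables on the reporter/consumer 
side; hypothetical snippet, not taken from the change itself:)

{code:java}
// Fetch the statistics snapshot once and read all values from it, instead of
// calling histogram.getStatistics() once per retrieved metric.
org.apache.flink.metrics.HistogramStatistics stats = histogram.getStatistics();
long min = stats.getMin();
long max = stats.getMax();
double mean = stats.getMean();
double p99 = stats.getQuantile(0.99);
{code}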

> Only call Histogram#getStatistics() once per set of retrieved statistics
> 
>
> Key: FLINK-12984
> URL: https://issues.apache.org/jira/browse/FLINK-12984
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> On some occasions, {{Histogram#getStatistics()}} was called multiple times to 
> retrieve different statistics. However, at least the Dropwizard 
> implementation has some constant overhead per call, and we should rather 
> interpret this method as returning a point-in-time snapshot of the histogram 
> in order to get consistent values when querying them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-12987) DescriptiveStatisticsHistogram#getCount does not return the number of elements seen

2019-08-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12987.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

fixed on master via fd9ef60cc8448a5f4d1915973e168aad073d8e8d
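
(A sketch of the unified behaviour described below, with assumed field names; 
not the actual fix:)

{code:java}
// Track the total number of elements ever added in a separate counter so that
// getCount() no longer depends on the size of the sliding sample window.
private long elementsSeen;

public void update(long value) {
    elementsSeen++;                          // total over the histogram's lifetime
    descriptiveStatistics.addValue(value);   // bounded window backing the statistics
}

public long getCount() {
    return elementsSeen;
}
{code}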

> DescriptiveStatisticsHistogram#getCount does not return the number of 
> elements seen
> ---
>
> Key: FLINK-12987
> URL: https://issues.apache.org/jira/browse/FLINK-12987
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.6.4, 1.7.2, 1.8.0, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{DescriptiveStatisticsHistogram#getCount()}} returns the number of elements 
> in the current window and not the total number of elements seen over time. In 
> contrast, {{DropwizardHistogramWrapper}} does this correctly.
> We should unify the behaviour and add a unit test for it (there is no generic 
> histogram test yet).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-12987) DescriptiveStatisticsHistogram#getCount does not return the number of elements seen

2019-08-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12987:

Affects Version/s: 1.9.0

> DescriptiveStatisticsHistogram#getCount does not return the number of 
> elements seen
> ---
>
> Key: FLINK-12987
> URL: https://issues.apache.org/jira/browse/FLINK-12987
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.6.4, 1.7.2, 1.8.0, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {{DescriptiveStatisticsHistogram#getCount()}} returns the number of elements 
> in the current window and not the total number of elements seen over time. In 
> contrast, {{DropwizardHistogramWrapper}} does this correctly.
> We should unify the behaviour and add a unit test for it (there is no generic 
> histogram test yet).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-13020) UT Failure: ChainLengthDecreaseTest

2019-08-15 Thread Nico Kruber (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908534#comment-16908534
 ] 

Nico Kruber edited comment on FLINK-13020 at 8/15/19 11:00 PM:
---

Actually, I just encountered this error in a branch of mine which is based on 
[latest master|https://github.com/apache/flink/commit/428ce1b938813fba287a51bf86e6c52ef54453cb]. 
So either there has been a regression, or the fix does not work in all cases, 
or it is no duplicate after all:
{code}
17:30:18.083 [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time 
elapsed: 14.113 s <<< FAILURE! - in 
org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest
17:30:18.083 [ERROR] testMigrationAndRestore[Migrate Savepoint: 
1.8](org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest)
  Time elapsed: 0.268 s  <<< ERROR!
java.util.concurrent.ExecutionException: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
cancellation from one of its inputs
Caused by: java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
cancellation from one of its inputs
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
received cancellation from one of its inputs
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
received cancellation from one of its inputs
{code}

https://api.travis-ci.com/v3/job/225588484/log.txt

{code}
17:30:17,408 INFO  org.apache.flink.streaming.runtime.tasks.StreamTask  
 - Configuring application-defined state backend with job/cluster config
17:30:17,409 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Source: Custom Source (2/4) (ffb5e756d6acddab9cab76e2a0a32904) switched from 
DEPLOYING to RUNNING.
17:30:17,409 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Map (4/4) (79fcf333d4d11eae297b65e52e397658) switched from DEPLOYING to 
RUNNING.
17:30:17,409 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Map (2/4) (aedaa4a61e74a3b766fafbef46e6aea6) switched from DEPLOYING to 
RUNNING.
17:30:17,409 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Source: Custom Source (4/4) (a1f07e2714e73b2533291a322961ea67) switched from 
DEPLOYING to RUNNING.
17:30:17,409 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Source: Custom Source (3/4) (6073be38d7be0ee571558f1dc865837a) switched from 
DEPLOYING to RUNNING.
17:30:17,409 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Map (1/4) (e4bc84d8137769b513d1a5107027500d) switched from DEPLOYING to 
RUNNING.
17:30:17,409 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Map (3/4) (6834950d9742da9c6a784ecc5ee892df) switched from DEPLOYING to 
RUNNING.
17:30:17,409 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/4) of job 
075cea7da1d0690f96c879ae07b058c0 is not in state RUNNING but DEPLOYING instead. 
Aborting checkpoint.
17:30:17,413 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/4) of job 
075cea7da1d0690f96c879ae07b058c0 is not in state RUNNING but DEPLOYING instead. 
Aborting checkpoint.
17:30:17,414 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/4) of job 
075cea7da1d0690f96c879ae07b058c0 is not in state RUNNING but DEPLOYING instead. 
Aborting checkpoint.
17:30:17,416 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/4) of job 
075cea7da1d0690f96c879ae07b058c0 is not in state RUNNING but DEPLOYING instead. 
Aborting checkpoint.
17:30:17,417 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator
 - Checkpoint triggering task Source: Custom Source (1/4) of job 
075cea7da1d0690f96c879ae07b058c0 is not in state RUNNING but DEPLOYING instead. 
Aborting checkpoint.
17:30:17,423 INFO  org.apache.flink.runtime.taskmanager.Task
 - Source: Custom Source (1/4) (8b302fefb0c10b7fd0b66f4fdb253632) switched from 
DEPLOYING to RUNNING.
17:30:17,423 INFO  org.apache.flink.streaming.runtime.tasks.StreamTask  
 - Using application-defined state backend: MemoryStateBackend (data in heap 
memory / checkpoints to JobManager) (checkpoints: 'null', savepoints: 'null', 
asynchronous: UNDEFINED, maxStateSize: 5242880)
17:30:17,423 INFO  org.apache.flink.streaming.runtime.tasks.StreamTask  
 - Configuring application-defined state backend with job/cluster config
17:30:17,424 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph   
 - Source: Custom Source (1/4) 

[jira] [Updated] (FLINK-13020) UT Failure: ChainLengthDecreaseTest

2019-08-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13020:

Affects Version/s: 1.10.0

> UT Failure: ChainLengthDecreaseTest
> ---
>
> Key: FLINK-13020
> URL: https://issues.apache.org/jira/browse/FLINK-13020
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.10.0
>Reporter: Bowen Li
>Priority: Major
>
> {code:java}
> 05:47:24.893 [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time 
> elapsed: 19.836 s <<< FAILURE! - in 
> org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest
> 05:47:24.895 [ERROR] testMigrationAndRestore[Migrate Savepoint: 
> 1.3](org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest)
>   Time elapsed: 1.501 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
> cancellation from one of its inputs
> Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
> cancellation from one of its inputs
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
> received cancellation from one of its inputs
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
> received cancellation from one of its inputs
> ...
> 05:48:27.736 [ERROR] Errors: 
> 05:48:27.736 [ERROR]   
> ChainLengthDecreaseTest>AbstractOperatorRestoreTestBase.testMigrationAndRestore:102->AbstractOperatorRestoreTestBase.migrateJob:138
>  » Execution
> 05:48:27.736 [INFO] 
> {code}
> https://travis-ci.org/apache/flink/jobs/551053821



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13020) UT Failure: ChainLengthDecreaseTest

2019-08-15 Thread Nico Kruber (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908534#comment-16908534
 ] 

Nico Kruber commented on FLINK-13020:
-

Actually, I just encountered this error in a branch of mine which is based on 
[latest master|https://github.com/apache/flink/commit/428ce1b938813fba287a51bf86e6c52ef54453cb]. 
So either there has been a regression, or the fix does not work in all cases, 
or it is no duplicate after all:
{code}
17:30:18.083 [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time 
elapsed: 14.113 s <<< FAILURE! - in 
org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest
17:30:18.083 [ERROR] testMigrationAndRestore[Migrate Savepoint: 
1.8](org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest)
  Time elapsed: 0.268 s  <<< ERROR!
java.util.concurrent.ExecutionException: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
cancellation from one of its inputs
Caused by: java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
cancellation from one of its inputs
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
received cancellation from one of its inputs
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
received cancellation from one of its inputs
{code}

https://api.travis-ci.com/v3/job/225588484/log.txt

> UT Failure: ChainLengthDecreaseTest
> ---
>
> Key: FLINK-13020
> URL: https://issues.apache.org/jira/browse/FLINK-13020
> Project: Flink
>  Issue Type: Improvement
>Reporter: Bowen Li
>Priority: Major
>
> {code:java}
> 05:47:24.893 [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time 
> elapsed: 19.836 s <<< FAILURE! - in 
> org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest
> 05:47:24.895 [ERROR] testMigrationAndRestore[Migrate Savepoint: 
> 1.3](org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest)
>   Time elapsed: 1.501 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
> cancellation from one of its inputs
> Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
> cancellation from one of its inputs
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
> received cancellation from one of its inputs
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
> received cancellation from one of its inputs
> ...
> 05:48:27.736 [ERROR] Errors: 
> 05:48:27.736 [ERROR]   
> ChainLengthDecreaseTest>AbstractOperatorRestoreTestBase.testMigrationAndRestore:102->AbstractOperatorRestoreTestBase.migrateJob:138
>  » Execution
> 05:48:27.736 [INFO] 
> {code}
> https://travis-ci.org/apache/flink/jobs/551053821



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (FLINK-13020) UT Failure: ChainLengthDecreaseTest

2019-08-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reopened FLINK-13020:
-

> UT Failure: ChainLengthDecreaseTest
> ---
>
> Key: FLINK-13020
> URL: https://issues.apache.org/jira/browse/FLINK-13020
> Project: Flink
>  Issue Type: Improvement
>Reporter: Bowen Li
>Priority: Major
>
> {code:java}
> 05:47:24.893 [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time 
> elapsed: 19.836 s <<< FAILURE! - in 
> org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest
> 05:47:24.895 [ERROR] testMigrationAndRestore[Migrate Savepoint: 
> 1.3](org.apache.flink.test.state.operator.restore.unkeyed.ChainLengthDecreaseTest)
>   Time elapsed: 1.501 s  <<< ERROR!
> java.util.concurrent.ExecutionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
> cancellation from one of its inputs
> Caused by: java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: Task received 
> cancellation from one of its inputs
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
> received cancellation from one of its inputs
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
> received cancellation from one of its inputs
> ...
> 05:48:27.736 [ERROR] Errors: 
> 05:48:27.736 [ERROR]   
> ChainLengthDecreaseTest>AbstractOperatorRestoreTestBase.testMigrationAndRestore:102->AbstractOperatorRestoreTestBase.migrateJob:138
>  » Execution
> 05:48:27.736 [INFO] 
> {code}
> https://travis-ci.org/apache/flink/jobs/551053821



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-13727) Build docs with jekyll 4.0.0 (final)

2019-08-14 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13727:

Description: 
When Jekyll 4.0.0 is out, we should upgrade to this final version and 
discontinue using the beta.

When we make this final, we could also follow these official recommendations:
{quote}
-
This version of Jekyll comes with some major changes.

Most notably:
  * Our `link` tag now comes with the `relative_url` filter incorporated into 
it.
You should no longer prepend `{{ site.baseurl }}` to `{% link foo.md %}`
For further details: https://github.com/jekyll/jekyll/pull/6727

  * Our `post_url` tag now comes with the `relative_url` filter incorporated 
into it.
You shouldn't prepend `{{ site.baseurl }}` to `{% post_url 2019-03-27-hello 
%}`
For further details: https://github.com/jekyll/jekyll/pull/7589

  * Support for deprecated configuration options has been removed. We will no 
longer
output a warning and gracefully assign their values to the newer 
counterparts
internally.
-
{quote}

  was:When Jekyll 4.0.0 is out, we should upgrade to this final version and 
discontinue using the beta.


> Build docs with jekyll 4.0.0 (final)
> 
>
> Key: FLINK-13727
> URL: https://issues.apache.org/jira/browse/FLINK-13727
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Priority: Major
>
> When Jekyll 4.0.0 is out, we should upgrade to this final version and 
> discontinue using the beta.
> When we make this final, we could also follow these official recommendations:
> {quote}
> -
> This version of Jekyll comes with some major changes.
> Most notably:
>   * Our `link` tag now comes with the `relative_url` filter incorporated into 
> it.
> You should no longer prepend `{{ site.baseurl }}` to `{% link foo.md %}`
> For further details: https://github.com/jekyll/jekyll/pull/6727
>   * Our `post_url` tag now comes with the `relative_url` filter incorporated 
> into it.
> You shouldn't prepend `{{ site.baseurl }}` to `{% post_url 
> 2019-03-27-hello %}`
> For further details: https://github.com/jekyll/jekyll/pull/7589
>   * Support for deprecated configuration options has been removed. We will no 
> longer
> output a warning and gracefully assign their values to the newer 
> counterparts
> internally.
> -
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13729) Update website generation dependencies

2019-08-14 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13729:
---

 Summary: Update website generation dependencies
 Key: FLINK-13729
 URL: https://issues.apache.org/jira/browse/FLINK-13729
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Nico Kruber
Assignee: Nico Kruber


The website generation dependencies are quite old. By upgrading some of them we 
get improvements like a much nicer code highlighting and prepare for the jekyll 
update of FLINK-13726 and FLINK-13727.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13728) Fix wrong closing tag order in sidenav

2019-08-14 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13728:
---

 Summary: Fix wrong closing tag order in sidenav
 Key: FLINK-13728
 URL: https://issues.apache.org/jira/browse/FLINK-13728
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.1, 1.9.0, 1.10.0
Reporter: Nico Kruber
Assignee: Nico Kruber


The order of closing HTML tags in the sidenav is wrong: instead of 
{{}} it should be {{}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-13724) Remove unnecessary whitespace from the docs' sidenav

2019-08-14 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13724:

Description: The side navigation generates quite some white space that will 
end up in every HTML page. Removing this reduces final page sizes and also 
improves site generation speed.  (was: The site navigation generates quite some 
white space that will end up in every HTML page. Removing this reduces final 
page sizes and also improved site generation speed.)

> Remove unnecessary whitespace from the docs' sidenav
> 
>
> Key: FLINK-13724
> URL: https://issues.apache.org/jira/browse/FLINK-13724
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> The side navigation generates quite some white space that will end up in 
> every HTML page. Removing this reduces final page sizes and also improves 
> site generation speed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-13724) Remove unnecessary whitespace from the docs' sidenav

2019-08-14 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13724:

Summary: Remove unnecessary whitespace from the docs' sidenav  (was: Remove 
unnecessary whitespace from the docs' sitenav)

> Remove unnecessary whitespace from the docs' sidenav
> 
>
> Key: FLINK-13724
> URL: https://issues.apache.org/jira/browse/FLINK-13724
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> The site navigation generates quite some white space that will end up in 
> every HTML page. Removing this reduces final page sizes and also improves 
> site generation speed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-13726) Build docs with jekyll 4.0.0.pre.beta1

2019-08-14 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13726:

Description: Jekyll 4 is way faster in generating the docs than jekyll 3 - 
probably due to the newly introduced cache. Site generation time goes down by 
roughly a factor of 2.5 even with the current beta version!  (was: Jekyll 4 is 
way faster in generating the docs than jekyll 3 - probably due to the newly 
introduced cache. Site generation time goes down by roughly a factor of 2.5!)

> Build docs with jekyll 4.0.0.pre.beta1
> --
>
> Key: FLINK-13726
> URL: https://issues.apache.org/jira/browse/FLINK-13726
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Jekyll 4 is way faster in generating the docs than jekyll 3 - probably due to 
> the newly introduced cache. Site generation time goes down by roughly a 
> factor of 2.5 even with the current beta version!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13727) Build docs with jekyll 4.0.0 (final)

2019-08-14 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13727:
---

 Summary: Build docs with jekyll 4.0.0 (final)
 Key: FLINK-13727
 URL: https://issues.apache.org/jira/browse/FLINK-13727
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Nico Kruber


When Jekyll 4.0.0 is out, we should upgrade to this final version and 
discontinue using the beta.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13726) Build docs with jekyll 4.0.0.pre.beta1

2019-08-14 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13726:
---

 Summary: Build docs with jekyll 4.0.0.pre.beta1
 Key: FLINK-13726
 URL: https://issues.apache.org/jira/browse/FLINK-13726
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Nico Kruber
Assignee: Nico Kruber


Jekyll 4 is way faster in generating the docs than jekyll 3 - probably due to 
the newly introduced cache. Site generation time goes down by roughly a factor 
of 2.5!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13725) Use sassc for faster doc generation

2019-08-14 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13725:
---

 Summary: Use sassc for faster doc generation
 Key: FLINK-13725
 URL: https://issues.apache.org/jira/browse/FLINK-13725
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Nico Kruber
Assignee: Nico Kruber


Jekyll requires {{sass}} but can optionally also use a C-based implementation 
provided by {{sassc}}. Although we do not use sass directly, there may be some 
indirect use inside jekyll. It doesn't seem to hurt to upgrade here.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13724) Remove unnecessary whitespace from the docs' sitenav

2019-08-14 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13724:
---

 Summary: Remove unnecessary whitespace from the docs' sitenav
 Key: FLINK-13724
 URL: https://issues.apache.org/jira/browse/FLINK-13724
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Nico Kruber
Assignee: Nico Kruber


The site navigation generates quite some white space that will end up in every 
HTML page. Removing this reduces final page sizes and also improves site 
generation speed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13723) Use liquid-c for faster doc generation

2019-08-14 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13723:
---

 Summary: Use liquid-c for faster doc generation
 Key: FLINK-13723
 URL: https://issues.apache.org/jira/browse/FLINK-13723
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Nico Kruber
Assignee: Nico Kruber


Jekyll requires {{liquid}} and only optionally uses {{liquid-c}} if available. 
The latter uses natively-compiled code and reduces generation time by ~5% for 
me.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13722) Speed up documentation generation

2019-08-14 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13722:
---

 Summary: Speed up documentation generation
 Key: FLINK-13722
 URL: https://issues.apache.org/jira/browse/FLINK-13722
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.10.0
Reporter: Nico Kruber
Assignee: Nico Kruber


Creating the documentation via {{build_docs.sh}} currently takes about 150s!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-13537) Changing Kafka producer pool size and scaling out may create overlapping transaction IDs

2019-08-07 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13537:

Description: 
The Kafka producer's transaction IDs are only generated once when there was no 
previous state for that operator. In the case where we restore and increase 
parallelism (scale-out), some operators may not have previous state and create 
new IDs. Now, if we also reduce the {{poolSize}}, these new IDs may overlap 
with the old ones which should never happen! Similarly, a scale-in + increasing 
{{poolSize}} could lead the the same thing.

An easy "fix" for this would be to forbid changing the {{poolSize}}. We could 
potentially be a bit better by only forbidding changes that can lead to 
transaction ID overlaps which we can identify from the formulae that 
{{TransactionalIdsGenerator}} uses. This should probably be the first step 
which can also be back-ported to older Flink versions just in case.


On a side note, the current scheme also relies on the fact that the operator's 
list state distributes previous states during scale-out in a fashion that only 
the operators with the highest subtask indices do not get a previous state. 
This is somewhat "guaranteed" by {{OperatorStateStore#getListState()}} but I'm 
not sure whether we should actually rely on that there.
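
To make the overlap concrete (a hedged example, assuming for simplicity that 
each subtask is handed a consecutive block of {{poolSize}} IDs): with 
parallelism 2 and {{poolSize}} 5, subtask 0 used IDs 0-4 and subtask 1 used 
IDs 5-9. After scaling out to parallelism 3 with {{poolSize}} reduced to 3, 
subtasks 0 and 1 keep their recovered IDs, but the stateless subtask 2 freshly 
generates IDs 2*3 .. 2*3+2 = 6-8, which collide with subtask 1's still-relevant 
IDs 5-9.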

  was:
The Kafka producer's transaction IDs are only generated once when there was no 
previous state for that operator. In the case where we restore and increase 
parallelism (scale-out), some operators may not have previous state and create 
new IDs. Now, if we also reduce the poolSize, these new IDs may overlap with 
the old ones which should never happen!

On a side note, the current scheme also relies on the fact, that the operator's 
list state distributes previous states during scale-out in a fashion that only 
the operators with the highest subtask indices do not get a previous state. 
This is somewhat "guaranteed" by {{OperatorStateStore#getListState()}} but I'm 
not sure whether we should actually rely on that there.


> Changing Kafka producer pool size and scaling out may create overlapping 
> transaction IDs
> 
>
> Key: FLINK-13537
> URL: https://issues.apache.org/jira/browse/FLINK-13537
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Priority: Major
>
> The Kafka producer's transaction IDs are only generated once when there was 
> no previous state for that operator. In the case where we restore and 
> increase parallelism (scale-out), some operators may not have previous state 
> and create new IDs. Now, if we also reduce the {{poolSize}}, these new IDs 
> may overlap with the old ones, which should never happen! Similarly, a 
> scale-in + an increased {{poolSize}} could lead to the same thing.
> An easy "fix" for this would be to forbid changing the {{poolSize}}. We could 
> potentially do a bit better by only forbidding changes that can lead to 
> transaction ID overlaps, which we can identify from the formulae that 
> {{TransactionalIdsGenerator}} uses. This should probably be the first step, 
> which can also be back-ported to older Flink versions just in case.
> 
> On a side note, the current scheme also relies on the fact that the 
> operator's list state distributes previous states during scale-out in a 
> fashion that only the operators with the highest subtask indices do not get a 
> previous state. This is somewhat "guaranteed" by 
> {{OperatorStateStore#getListState()}} but I'm not sure whether we should 
> actually rely on that there.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-13498) Reduce Kafka producer startup time by aborting transactions in parallel

2019-08-06 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-13498.
---
   Resolution: Fixed
Fix Version/s: 1.10.0

fixed on master via d774fea
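
(Not the merged code, only a sketch of the parallelization idea, assuming a 
hypothetical {{createProducerFor(transactionalId)}} helper that returns a 
producer configured with that transactional.id:)

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;

// Abort lingering transactions for all candidate transactional IDs in parallel
// instead of sequentially; initTransactions() fences and aborts any open
// transaction registered under the producer's transactional.id.
static void abortTransactionsInParallel(Collection<String> transactionalIds) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try {
        List<Future<?>> futures = new ArrayList<>();
        for (String transactionalId : transactionalIds) {
            futures.add(pool.submit(() -> {
                try (KafkaProducer<byte[], byte[]> producer = createProducerFor(transactionalId)) {
                    producer.initTransactions();
                }
            }));
        }
        for (Future<?> future : futures) {
            future.get(); // propagate failures
        }
    } finally {
        pool.shutdown();
    }
}
{code}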

> Reduce Kafka producer startup time by aborting transactions in parallel
> ---
>
> Key: FLINK-13498
> URL: https://issues.apache.org/jira/browse/FLINK-13498
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a Flink job with a Kafka producer starts up without previous state, it 
> currently starts 5 * kafkaPoolSize Kafka producers (per sink 
> instance) to abort potentially existing transactions from a first run without 
> a completed snapshot.
> Apparently, this is quite slow and it is also done sequentially. Until there 
> is a better way of aborting these transactions with Kafka, we could do this 
> in parallel quite easily and at least make use of otherwise idle CPU resources.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-13535) Do not abort transactions twice during KafkaProducer startup

2019-08-01 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13535:

Description: 
During startup of a transactional Kafka producer from previous state, we 
recover in two steps:
# in {{TwoPhaseCommitSinkFunction}}, we commit pending commit-transactions and 
abort pending transactions and then call into {{finishRecoveringContext()}}
# in {{FlinkKafkaProducer#finishRecoveringContext()}} we iterate over all 
recovered transaction IDs and abort them.

This may lead to some transactions being worked on twice. Since this is quite 
an expensive operation, we unnecessarily slow down the job startup; instead, we 
could easily give {{finishRecoveringContext()}} the set of transactions that 
{{TwoPhaseCommitSinkFunction}} has already covered.

  was:
During startup of a transactional Kafka producer from previous state, we 
recover in two steps:
# in {{TwoPhaseCommitSinkFunction}}, we commit pending commit-transactions and 
abort pending transactions and then call into {{finishRecoveringContext()}}
# in {{FlinkKafkaProducer#finishRecoveringContext()}} we iterate over all 
recovered transaction IDs and abort them
This may lead to some transactions being worked on twice. Since this is quite 
some expensive operation, we unnecessarily slow down the job startup but could 
easily give {{finishRecoveringContext()}} a set of transactions that 
{{TwoPhaseCommitSinkFunction}} already covered instead.


> Do not abort transactions twice during KafkaProducer startup
> 
>
> Key: FLINK-13535
> URL: https://issues.apache.org/jira/browse/FLINK-13535
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Kafka
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> During startup of a transactional Kafka producer from previous state, we 
> recover in two steps:
> # in {{TwoPhaseCommitSinkFunction}}, we commit pending commit-transactions 
> and abort pending transactions and then call into 
> {{finishRecoveringContext()}}
> # in {{FlinkKafkaProducer#finishRecoveringContext()}} we iterate over all 
> recovered transaction IDs and abort them.
> This may lead to some transactions being worked on twice. Since this is quite 
> an expensive operation, we unnecessarily slow down the job startup; instead, 
> we could easily give {{finishRecoveringContext()}} the set of transactions 
> that {{TwoPhaseCommitSinkFunction}} has already covered.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-13537) Changing Kafka producer pool size and scaling out may create overlapping transaction IDs

2019-08-01 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13537:

Description: 
The Kafka producer's transaction IDs are only generated once when there was no 
previous state for that operator. In the case where we restore and increase 
parallelism (scale-out), some operators may not have previous state and create 
new IDs. Now, if we also reduce the poolSize, these new IDs may overlap with 
the old ones which should never happen!

On a side note, the current scheme also relies on the fact that the operator's 
list state distributes previous states during scale-out in a fashion that only 
the operators with the highest subtask indices do not get a previous state. 
This is somewhat "guaranteed" by {{OperatorStateStore#getListState()}} but I'm 
not sure whether we should actually rely on that there.

  was:The Kafka producer's transaction IDs are only generated once when there 
was no previous state for that operator. In the case where we restore and 
increase parallelism (scale-out), some operators may not have previous state 
and create new IDs. Now, if we also reduce the poolSize, these new IDs may 
overlap with the old ones which should never happen!


> Changing Kafka producer pool size and scaling out may create overlapping 
> transaction IDs
> 
>
> Key: FLINK-13537
> URL: https://issues.apache.org/jira/browse/FLINK-13537
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Priority: Major
>
> The Kafka producer's transaction IDs are only generated once when there was 
> no previous state for that operator. In the case where we restore and 
> increase parallelism (scale-out), some operators may not have previous state 
> and create new IDs. Now, if we also reduce the poolSize, these new IDs may 
> overlap with the old ones, which should never happen!
> On a side note, the current scheme also relies on the fact that the 
> operator's list state distributes previous states during scale-out in a 
> fashion where only the operators with the highest subtask indices do not get a 
> previous state. This is somewhat "guaranteed" by 
> {{OperatorStateStore#getListState()}}, but I'm not sure whether we should 
> actually rely on that.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13537) Changing Kafka producer pool size and scaling out may create overlapping transaction IDs

2019-08-01 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13537:
---

 Summary: Changing Kafka producer pool size and scaling out may 
create overlapping transaction IDs
 Key: FLINK-13537
 URL: https://issues.apache.org/jira/browse/FLINK-13537
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.8.1, 1.9.0
Reporter: Nico Kruber


The Kafka producer's transaction IDs are only generated once when there was no 
previous state for that operator. In the case where we restore and increase 
parallelism (scale-out), some operators may not have previous state and create 
new IDs. Now, if we also reduce the poolSize, these new IDs may overlap with 
the old ones, which should never happen!
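To make the failure mode concrete, here is a small, self-contained sketch. It is 
not Flink's actual ID-generation code; it only assumes, for illustration, that 
each subtask owns a contiguous ID range of size poolSize starting at 
subtaskIndex * poolSize, and all class names and numbers below are made up.

{code}
import java.util.ArrayList;
import java.util.List;

// Simplified model: subtask i owns the transaction-ID range
// [i * poolSize, (i + 1) * poolSize). Restored subtasks keep the ranges from
// their previous state, while subtasks without state compute fresh ranges with
// the *new* pool size -- which can collide with the restored ranges.
public class TransactionIdOverlapSketch {

    static List<Integer> idRange(int subtaskIndex, int poolSize) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < poolSize; i++) {
            ids.add(subtaskIndex * poolSize + i);
        }
        return ids;
    }

    public static void main(String[] args) {
        // First run: parallelism 2, poolSize 5 -> subtask 1 restores IDs 5..9.
        List<Integer> restoredSubtask1 = idRange(1, 5);
        // Restart with parallelism 3 and poolSize 3: subtask 2 has no previous
        // state and generates the range 6..8, overlapping the restored IDs.
        List<Integer> freshSubtask2 = idRange(2, 3);
        System.out.println("restored subtask 1: " + restoredSubtask1); // [5, 6, 7, 8, 9]
        System.out.println("fresh subtask 2:    " + freshSubtask2);    // [6, 7, 8]
    }
}
{code}

In this toy model, shrinking the pool size shrinks the stride used by the new 
subtasks, which is exactly what lets their ranges collide with ranges still held 
in restored state.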



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13535) Do not abort transactions twice during KafkaProducer startup

2019-08-01 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13535:
---

 Summary: Do not abort transactions twice during KafkaProducer 
startup
 Key: FLINK-13535
 URL: https://issues.apache.org/jira/browse/FLINK-13535
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / Kafka
Affects Versions: 1.8.1, 1.9.0
Reporter: Nico Kruber
Assignee: Nico Kruber


During startup of a transactional Kafka producer from previous state, we 
recover in two steps:
# in {{TwoPhaseCommitSinkFunction}}, we commit pending commit-transactions and 
abort pending transactions and then call into {{finishRecoveringContext()}}
# in {{FlinkKafkaProducer#finishRecoveringContext()}} we iterate over all 
recovered transaction IDs and abort them
This may lead to some transactions being worked on twice. Since this is quite 
an expensive operation, we unnecessarily slow down the job startup; instead, we 
could easily give {{finishRecoveringContext()}} the set of transactions that 
{{TwoPhaseCommitSinkFunction}} already covered.
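A rough, self-contained sketch of that idea (the method signature and all types 
below are invented for illustration and are not the actual Flink API):

{code}
import java.util.Set;

// Illustration only: skip transactional IDs that the TwoPhaseCommitSinkFunction
// already committed or aborted during recovery, so each transaction is touched
// at most once on startup.
public class RecoveryAbortSketch {

    /** Stand-in for the expensive per-transaction abort round-trip to Kafka. */
    static void abortTransaction(String transactionalId) {
        System.out.println("aborting " + transactionalId);
    }

    /** Hypothetical variant of finishRecoveringContext() that knows what was already covered. */
    static void finishRecoveringContext(Set<String> recoveredIds, Set<String> alreadyHandled) {
        for (String id : recoveredIds) {
            if (alreadyHandled.contains(id)) {
                continue; // handled by TwoPhaseCommitSinkFunction, do not abort again
            }
            abortTransaction(id);
        }
    }

    public static void main(String[] args) {
        Set<String> recovered = Set.of("sink-0-1", "sink-0-2", "sink-0-3");
        Set<String> handled = Set.of("sink-0-1");
        finishRecoveringContext(recovered, handled); // aborts only sink-0-2 and sink-0-3
    }
}
{code}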



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (FLINK-13517) Restructure Hive Catalog documentation

2019-07-31 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reassigned FLINK-13517:
---

Assignee: Seth Wiesman

> Restructure Hive Catalog documentation
> --
>
> Key: FLINK-13517
> URL: https://issues.apache.org/jira/browse/FLINK-13517
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hive, Documentation
>Reporter: Seth Wiesman
>Assignee: Seth Wiesman
>Priority: Major
>
> Hive documentation is currently spread across a number of pages and 
> fragmented. In particular: 
> 1) An example was added to getting-started/examples; however, this section is 
> being removed
> 2) There is a dedicated page on hive integration but also a lot of hive 
> specific information is on the catalog page
> We should
> 1) Inline the example into the hive integration page
> 2) Move the hive specific information on catalogs.md to hive_integration.md
> 3) Make catalogs.md be just about catalogs in general and link to the hive 
> integration. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13498) Reduce Kafka producer startup time by aborting transactions in parallel

2019-07-30 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13498:
---

 Summary: Reduce Kafka producer startup time by aborting 
transactions in parallel
 Key: FLINK-13498
 URL: https://issues.apache.org/jira/browse/FLINK-13498
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.8.1, 1.9.0
Reporter: Nico Kruber
Assignee: Nico Kruber


When a Flink job with a Kafka producer starts up without previous state, it 
currently starts 5 * kafkaPoolSize Kafka producers (per sink instance) to abort 
potentially existing transactions from a previous run that did not complete a 
snapshot.
Apparently, this is quite slow, and it is also done sequentially. Until there is 
a better way of aborting these transactions with Kafka, we could easily do this 
in parallel and at least make use of otherwise idle CPU resources.
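A minimal sketch of the parallelization idea, assuming each per-ID abort can 
safely run in its own short-lived task (the method and ID names below are 
placeholders, not the actual FlinkKafkaProducer code):

{code}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Abort the candidate transactional IDs concurrently instead of one after the
// other; abortTransaction() stands in for the slow per-ID Kafka round-trip.
public class ParallelAbortSketch {

    static void abortTransaction(String transactionalId) throws InterruptedException {
        Thread.sleep(100); // pretend this is the expensive Kafka round-trip
        System.out.println("aborted " + transactionalId);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> idsToAbort = List.of("sink-0-0", "sink-0-1", "sink-0-2", "sink-0-3");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String id : idsToAbort) {
            pool.submit(() -> {
                try {
                    abortTransaction(id);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
{code}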



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-12747) Getting Started - Table API Example Walkthrough

2019-07-30 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12747.
---

> Getting Started - Table API Example Walkthrough
> ---
>
> Key: FLINK-12747
> URL: https://issues.apache.org/jira/browse/FLINK-12747
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Konstantin Knauf
>Assignee: Seth Wiesman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The planned structure for the new Getting Started Guide is 
> *  Flink Overview (~ two pages)
> * Project Setup
> ** Java
> ** Scala
> ** Python
> * Quickstarts
> ** Example Walkthrough - Table API / SQL
> ** Example Walkthrough - DataStream API
> * Docker Playgrounds
> ** Flink Cluster Playground
> ** Flink Interactive SQL Playground
> This ticket adds the Example Walkthrough for the Table API, which should 
> follow the same structure as the DataStream Example (FLINK-12746), which 
> needs to be completed first.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-12747) Getting Started - Table API Example Walkthrough

2019-07-30 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12747:

Fix Version/s: (was: 1.10)
   1.10.0

> Getting Started - Table API Example Walkthrough
> ---
>
> Key: FLINK-12747
> URL: https://issues.apache.org/jira/browse/FLINK-12747
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Konstantin Knauf
>Assignee: Seth Wiesman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The planned structure for the new Getting Started Guide is 
> *  Flink Overview (~ two pages)
> * Project Setup
> ** Java
> ** Scala
> ** Python
> * Quickstarts
> ** Example Walkthrough - Table API / SQL
> ** Example Walkthrough - DataStream API
> * Docker Playgrounds
> ** Flink Cluster Playground
> ** Flink Interactive SQL Playground
> This ticket adds the Example Walkthrough for the Table API, which should 
> follow the same structure as the DataStream Example (FLINK-12746), which 
> needs to be completed first.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-12171) The network buffer memory size should not be checked against the heap size on the TM side

2019-07-30 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12171:

Fix Version/s: (was: 1.10)
   1.10.0

> The network buffer memory size should not be checked against the heap size on 
> the TM side
> -
>
> Key: FLINK-12171
> URL: https://issues.apache.org/jira/browse/FLINK-12171
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.7.2, 1.8.0, 1.9.0
> Environment: Flink-1.7.2, and Flink-1.8 seems have not modified the 
> logic here.
>  
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, when computing the network buffer memory size on the TM side in 
> _TaskManagerService#calculateNetworkBufferMemory_ (version 1.8 or 1.7) or 
> _NetworkEnvironmentConfiguration#calculateNewNetworkBufferMemory_ (master), 
> the computed network buffer memory size is checked to be less than 
> `maxJvmHeapMemory`. However, on the TM side, _maxJvmHeapMemory_ stores the 
> maximum heap memory (namely -Xmx).
> 
> With the above process, when the TM starts, -Xmx is computed in the RM or in 
> _taskmanager.sh_ as (container memory - network buffer memory - managed 
> memory); thus, the above check implies that the heap memory of the TM must 
> be larger than the network memory, which does not seem necessary.
> 
> This may cause the TM to use more memory than expected. For example, for a job 
> with a large network throughput, users may configure the network memory to 2G. 
> However, if users want to assign 1G to the heap memory, the TM will fail to 
> start, and users have to allocate at least 2G of heap memory (in other words, 
> 4G in total for the TM instead of 3G) to make the TM runnable. This may cause 
> resource inefficiency.
> 
> Therefore, I think the network buffer memory size also needs to be checked 
> against the total memory instead of the heap memory on the TM side:
>  # Check that networkBufFraction < 1.0.
>  # Compute the total memory as jvmHeapNoNet / (1 - networkBufFraction).
>  # Compare the network buffer memory with the total memory.
> This check is also consistent with the similar one done on the RM side; see 
> the sketch below.
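A small, self-contained sketch of the proposed check (the method and variable 
names are made up, this is not the actual NetworkEnvironmentConfiguration code, 
and the 2/3 fraction is chosen only so that the numbers from the description 
work out):

{code}
// Derive the total memory from the heap size that excludes network buffers and
// validate the configured network buffer memory against that total, rather than
// against the -Xmx value as before.
public class NetworkBufferCheckSketch {

    static void checkNetworkBufferSize(long networkBufBytes, long jvmHeapNoNetBytes,
                                       double networkBufFraction) {
        if (networkBufFraction >= 1.0) {
            throw new IllegalArgumentException("networkBufFraction must be < 1.0");
        }
        // total = jvmHeapNoNet / (1 - networkBufFraction)
        long totalBytes = (long) (jvmHeapNoNetBytes / (1.0 - networkBufFraction));
        if (networkBufBytes > totalBytes) {
            throw new IllegalArgumentException("network buffer memory " + networkBufBytes
                    + " exceeds derived total memory " + totalBytes);
        }
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        // Example from the description: 2 GiB of network buffers and only 1 GiB heap.
        // With a network fraction of 2/3, the derived total is 3 GiB, so this check
        // passes although the old heap-based check (2 GiB > 1 GiB) would have failed.
        checkNetworkBufferSize(2 * gib, 1 * gib, 2.0 / 3.0);
        System.out.println("check passed");
    }
}
{code}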



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-12171) The network buffer memory size should not be checked against the heap size on the TM side

2019-07-30 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12171.
---
   Resolution: Fixed
Fix Version/s: 1.10

fixed via 8dec21f

> The network buffer memory size should not be checked against the heap size on 
> the TM side
> -
>
> Key: FLINK-12171
> URL: https://issues.apache.org/jira/browse/FLINK-12171
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.7.2, 1.8.0, 1.9.0
> Environment: Flink-1.7.2, and Flink-1.8 seems have not modified the 
> logic here.
>  
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, when computing the network buffer memory size on the TM side in 
> _TaskManagerService#calculateNetworkBufferMemory_ (version 1.8 or 1.7) or 
> _NetworkEnvironmentConfiguration#calculateNewNetworkBufferMemory_ (master), 
> the computed network buffer memory size is checked to be less than 
> `maxJvmHeapMemory`. However, on the TM side, _maxJvmHeapMemory_ stores the 
> maximum heap memory (namely -Xmx).
> 
> With the above process, when the TM starts, -Xmx is computed in the RM or in 
> _taskmanager.sh_ as (container memory - network buffer memory - managed 
> memory); thus, the above check implies that the heap memory of the TM must 
> be larger than the network memory, which does not seem necessary.
> 
> This may cause the TM to use more memory than expected. For example, for a job 
> with a large network throughput, users may configure the network memory to 2G. 
> However, if users want to assign 1G to the heap memory, the TM will fail to 
> start, and users have to allocate at least 2G of heap memory (in other words, 
> 4G in total for the TM instead of 3G) to make the TM runnable. This may cause 
> resource inefficiency.
> 
> Therefore, I think the network buffer memory size also needs to be checked 
> against the total memory instead of the heap memory on the TM side:
>  # Check that networkBufFraction < 1.0.
>  # Compute the total memory as jvmHeapNoNet / (1 - networkBufFraction).
>  # Compare the network buffer memory with the total memory.
> This check is also consistent with the similar one done on the RM side.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (FLINK-12747) Getting Started - Table API Example Walkthrough

2019-07-29 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber resolved FLINK-12747.
-
   Resolution: Fixed
Fix Version/s: 1.10

fixed via f4943dd

> Getting Started - Table API Example Walkthrough
> ---
>
> Key: FLINK-12747
> URL: https://issues.apache.org/jira/browse/FLINK-12747
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Konstantin Knauf
>Assignee: Seth Wiesman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The planned structure for the new Getting Started Guide is 
> *  Flink Overview (~ two pages)
> * Project Setup
> ** Java
> ** Scala
> ** Python
> * Quickstarts
> ** Example Walkthrough - Table API / SQL
> ** Example Walkthrough - DataStream API
> * Docker Playgrounds
> ** Flink Cluster Playground
> ** Flink Interactive SQL Playground
> This ticket adds the Example Walkthrough for the Table API, which should 
> follow the same structure as the DataStream Example (FLINK-12746), which 
> needs to be completed first.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13417) Bump Zookeeper to 3.5.5

2019-07-29 Thread Nico Kruber (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895170#comment-16895170
 ] 

Nico Kruber commented on FLINK-13417:
-

FYI: since 3.5.5 is the first stable version in the 3.5.x series [1], we should 
actually take this one, not any older 3.5.x release.

[1] https://zookeeper.apache.org/releases.html

> Bump Zookeeper to 3.5.5
> ---
>
> Key: FLINK-13417
> URL: https://issues.apache.org/jira/browse/FLINK-13417
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0
>Reporter: Konstantin Knauf
>Priority: Major
>
> User might want to secure their Zookeeper connection via SSL.
> This requires a Zookeeper version >= 3.5.1. We might as well try to bump it 
> to 3.5.5, which is the latest version. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-12741) Update docs about Kafka producer fault tolerance guarantees

2019-07-23 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12741.
---
   Resolution: Fixed
Fix Version/s: 1.8.2
   1.7.3

merged for
- 1.7: 56c3e7cd653e4cb2ad0a76ca317aa9fa1d564dc2
- 1.8: 91d036f794cfd96a3c1da445d5172690054aee2f

> Update docs about Kafka producer fault tolerance guarantees
> ---
>
> Key: FLINK-12741
> URL: https://issues.apache.org/jira/browse/FLINK-12741
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.9.0
>Reporter: Paul Lin
>Assignee: Paul Lin
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.7.3, 1.8.2, 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since Flink 1.4.0, we provide exactly-once semantics on Kafka 0.11+, but the 
> documentation has still not been updated.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (FLINK-12741) Update docs about Kafka producer fault tolerance guarantees

2019-07-23 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reopened FLINK-12741:
-

> Update docs about Kafka producer fault tolerance guarantees
> ---
>
> Key: FLINK-12741
> URL: https://issues.apache.org/jira/browse/FLINK-12741
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.9.0
>Reporter: Paul Lin
>Assignee: Paul Lin
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since Flink 1.4.0, we provide exactly-once semantics on Kafka 0.11+, but the 
> documentation has still not been updated.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-13245) Network stack is leaking files

2019-07-19 Thread Nico Kruber (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16888982#comment-16888982
 ] 

Nico Kruber commented on FLINK-13245:
-

I agree with [~zjwang] - changing the semantics should be tackled separately, 
not necessarily as part of this bug fix. I'll see when I have time to look at 
the PR so we can get this merged

> Network stack is leaking files
> --
>
> Key: FLINK-13245
> URL: https://issues.apache.org/jira/browse/FLINK-13245
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.9.0
>Reporter: Chesnay Schepler
>Assignee: zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There's a file leak in the network stack / shuffle service.
> When running the {{SlotCountExceedingParallelismTest}} on Windows, a large 
> number of {{.channel}} files continue to reside in a 
> {{flink-netty-shuffle-XXX}} directory.
> From what I've gathered so far, these files are still being used by a 
> {{BoundedBlockingSubpartition}}. The cleanup logic in this class uses 
> ref-counting to ensure we don't release data while a reader is still present. 
> However, at the end of the job this count has not reached 0, and thus nothing 
> is being released.
> The same issue is also present on the {{ResultPartition}} level; the 
> {{ReleaseOnConsumptionResultPartition}} also are being released while the 
> ref-count is greater than 0.
> Overall it appears like there's some issue with the notifications for 
> partitions being consumed.
> It is feasible that this issue has recently caused issues on Travis where the 
> builds were failing due to a lack of disk space.
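As a generic illustration of the ref-counting scheme described above (this is 
not the actual BoundedBlockingSubpartition code; all method names are invented), 
the backing file may only be deleted once the count reaches zero, so a missing 
consumption notification keeps it alive forever:

{code}
import java.util.concurrent.atomic.AtomicInteger;

// The partition itself holds one reference; every reader adds one. The backing
// .channel file is only deleted when the count drops to zero, so if the
// "consumed" notification never arrives, the file leaks.
public class RefCountedPartitionSketch {

    private final AtomicInteger refCount = new AtomicInteger(1); // the partition's own reference

    void onReaderCreated() {
        refCount.incrementAndGet();
    }

    void onReaderConsumed() {
        releaseIfUnused(refCount.decrementAndGet());
    }

    void onPartitionReleased() {
        releaseIfUnused(refCount.decrementAndGet());
    }

    private void releaseIfUnused(int count) {
        if (count == 0) {
            System.out.println("deleting .channel file");
        }
    }

    public static void main(String[] args) {
        RefCountedPartitionSketch partition = new RefCountedPartitionSketch();
        partition.onReaderCreated();
        partition.onPartitionReleased(); // job is done, partition released
        // Without the following notification the count stays at 1 and the file
        // is never deleted -- the leak described in this ticket:
        partition.onReaderConsumed();
    }
}
{code}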



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-8801) S3's eventual consistent read-after-write may fail yarn deployment of resources to S3

2019-07-17 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-8801.
--
Resolution: Fixed

> S3's eventual consistent read-after-write may fail yarn deployment of 
> resources to S3
> -
>
> Key: FLINK-8801
> URL: https://issues.apache.org/jira/browse/FLINK-8801
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, FileSystems, Runtime / Coordination
>Affects Versions: 1.4.0, 1.5.0, 1.6.4, 1.7.2, 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.0, 1.4.3, 1.5.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> According to 
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
> {quote}
> Amazon S3 provides read-after-write consistency for PUTS of new objects in 
> your S3 bucket in all regions with one caveat. The caveat is that if you make 
> a HEAD or GET request to the key name (to find if the object exists) before 
> creating the object, Amazon S3 provides eventual consistency for 
> read-after-write.
> {quote}
> Some S3 file system implementations may actually execute such a request for 
> the about-to-write object and thus the read-after-write is only eventually 
> consistent. {{org.apache.flink.yarn.Utils#setupLocalResource()}} currently 
> relies on a consistent read-after-write since it accesses the remote resource 
> to get file size and modification timestamp. Since there we have access to 
> the local resource, we can use the data from there instead and circumvent the 
> problem.
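A minimal sketch of the described workaround, using plain java.io.File instead 
of the real YARN/Hadoop types (the path and the "registration" step are 
placeholders): take size and modification time from the local copy that was 
just uploaded rather than issuing a HEAD/GET against S3.

{code}
import java.io.File;

// Avoid the eventually consistent read-after-write: do not stat the remote S3
// object after uploading it; use the metadata of the local file instead.
public class LocalResourceMetadataSketch {

    public static void main(String[] args) {
        File localFile = new File("/tmp/flink-dist.jar"); // hypothetical local copy
        long size = localFile.length();                   // instead of querying the remote file size
        long modificationTime = localFile.lastModified(); // instead of the remote modification timestamp
        System.out.println("register local resource with size=" + size
                + ", timestamp=" + modificationTime);
    }
}
{code}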



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-8801) S3's eventual consistent read-after-write may fail yarn deployment of resources to S3

2019-07-17 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-8801:
---
Fix Version/s: (was: 1.10.0)

> S3's eventual consistent read-after-write may fail yarn deployment of 
> resources to S3
> -
>
> Key: FLINK-8801
> URL: https://issues.apache.org/jira/browse/FLINK-8801
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, FileSystems, Runtime / Coordination
>Affects Versions: 1.4.0, 1.5.0, 1.6.4, 1.7.2, 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.4.3, 1.5.0, 1.9.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> According to 
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
> {quote}
> Amazon S3 provides read-after-write consistency for PUTS of new objects in 
> your S3 bucket in all regions with one caveat. The caveat is that if you make 
> a HEAD or GET request to the key name (to find if the object exists) before 
> creating the object, Amazon S3 provides eventual consistency for 
> read-after-write.
> {quote}
> Some S3 file system implementations may actually execute such a request for 
> the about-to-write object and thus the read-after-write is only eventually 
> consistent. {{org.apache.flink.yarn.Utils#setupLocalResource()}} currently 
> relies on a consistent read-after-write since it accesses the remote resource 
> to get file size and modification timestamp. Since there we have access to 
> the local resource, we can use the data from there instead and circumvent the 
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (FLINK-8801) S3's eventual consistent read-after-write may fail yarn deployment of resources to S3

2019-07-17 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reopened FLINK-8801:


> S3's eventual consistent read-after-write may fail yarn deployment of 
> resources to S3
> -
>
> Key: FLINK-8801
> URL: https://issues.apache.org/jira/browse/FLINK-8801
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, FileSystems, Runtime / Coordination
>Affects Versions: 1.4.0, 1.5.0, 1.6.4, 1.7.2, 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.4.3, 1.5.0, 1.9.0, 1.10.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> According to 
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
> {quote}
> Amazon S3 provides read-after-write consistency for PUTS of new objects in 
> your S3 bucket in all regions with one caveat. The caveat is that if you make 
> a HEAD or GET request to the key name (to find if the object exists) before 
> creating the object, Amazon S3 provides eventual consistency for 
> read-after-write.
> {quote}
> Some S3 file system implementations may actually execute such a request for 
> the about-to-write object and thus the read-after-write is only eventually 
> consistent. {{org.apache.flink.yarn.Utils#setupLocalResource()}} currently 
> relies on a consistent read-after-write since it accesses the remote resource 
> to get file size and modification timestamp. Since there we have access to 
> the local resource, we can use the data from there instead and circumvent the 
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-8801) S3's eventual consistent read-after-write may fail yarn deployment of resources to S3

2019-07-16 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-8801.
--
   Resolution: Fixed
Fix Version/s: 1.9.0

also merged to release-1.9 via b56234ce4e

> S3's eventual consistent read-after-write may fail yarn deployment of 
> resources to S3
> -
>
> Key: FLINK-8801
> URL: https://issues.apache.org/jira/browse/FLINK-8801
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, FileSystems, Runtime / Coordination
>Affects Versions: 1.4.0, 1.5.0, 1.6.4, 1.7.2, 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.0, 1.10.0, 1.4.3, 1.5.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> According to 
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
> {quote}
> Amazon S3 provides read-after-write consistency for PUTS of new objects in 
> your S3 bucket in all regions with one caveat. The caveat is that if you make 
> a HEAD or GET request to the key name (to find if the object exists) before 
> creating the object, Amazon S3 provides eventual consistency for 
> read-after-write.
> {quote}
> Some S3 file system implementations may actually execute such a request for 
> the about-to-write object and thus the read-after-write is only eventually 
> consistent. {{org.apache.flink.yarn.Utils#setupLocalResource()}} currently 
> relies on a consistent read-after-write since it accesses the remote resource 
> to get file size and modification timestamp. Since there we have access to 
> the local resource, we can use the data from there instead and circumvent the 
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (FLINK-8801) S3's eventual consistent read-after-write may fail yarn deployment of resources to S3

2019-07-16 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reopened FLINK-8801:


> S3's eventual consistent read-after-write may fail yarn deployment of 
> resources to S3
> -
>
> Key: FLINK-8801
> URL: https://issues.apache.org/jira/browse/FLINK-8801
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, FileSystems, Runtime / Coordination
>Affects Versions: 1.4.0, 1.5.0, 1.6.4, 1.7.2, 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.4.3, 1.5.0, 1.10.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> According to 
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
> {quote}
> Amazon S3 provides read-after-write consistency for PUTS of new objects in 
> your S3 bucket in all regions with one caveat. The caveat is that if you make 
> a HEAD or GET request to the key name (to find if the object exists) before 
> creating the object, Amazon S3 provides eventual consistency for 
> read-after-write.
> {quote}
> Some S3 file system implementations may actually execute such a request for 
> the about-to-write object and thus the read-after-write is only eventually 
> consistent. {{org.apache.flink.yarn.Utils#setupLocalResource()}} currently 
> relies on a consistent read-after-write since it accesses the remote resource 
> to get file size and modification timestamp. Since there we have access to 
> the local resource, we can use the data from there instead and circumvent the 
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (FLINK-8801) S3's eventual consistent read-after-write may fail yarn deployment of resources to S3

2019-07-16 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber resolved FLINK-8801.

   Resolution: Fixed
Fix Version/s: 1.10.0

merged to master via 770a404

> S3's eventual consistent read-after-write may fail yarn deployment of 
> resources to S3
> -
>
> Key: FLINK-8801
> URL: https://issues.apache.org/jira/browse/FLINK-8801
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, FileSystems, Runtime / Coordination
>Affects Versions: 1.4.0, 1.5.0, 1.6.4, 1.7.2, 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.4.3, 1.5.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> According to 
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
> {quote}
> Amazon S3 provides read-after-write consistency for PUTS of new objects in 
> your S3 bucket in all regions with one caveat. The caveat is that if you make 
> a HEAD or GET request to the key name (to find if the object exists) before 
> creating the object, Amazon S3 provides eventual consistency for 
> read-after-write.
> {quote}
> Some S3 file system implementations may actually execute such a request for 
> the about-to-write object and thus the read-after-write is only eventually 
> consistent. {{org.apache.flink.yarn.Utils#setupLocalResource()}} currently 
> relies on a consistent read-after-write since it accesses the remote resource 
> to get file size and modification timestamp. Since there we have access to 
> the local resource, we can use the data from there instead and circumvent the 
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Reopened] (FLINK-8801) S3's eventual consistent read-after-write may fail yarn deployment of resources to S3

2019-07-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reopened FLINK-8801:


> S3's eventual consistent read-after-write may fail yarn deployment of 
> resources to S3
> -
>
> Key: FLINK-8801
> URL: https://issues.apache.org/jira/browse/FLINK-8801
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, FileSystems, Runtime / Coordination
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.4.3, 1.5.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> According to 
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
> {quote}
> Amazon S3 provides read-after-write consistency for PUTS of new objects in 
> your S3 bucket in all regions with one caveat. The caveat is that if you make 
> a HEAD or GET request to the key name (to find if the object exists) before 
> creating the object, Amazon S3 provides eventual consistency for 
> read-after-write.
> {quote}
> Some S3 file system implementations may actually execute such a request for 
> the about-to-write object and thus the read-after-write is only eventually 
> consistent. {{org.apache.flink.yarn.Utils#setupLocalResource()}} currently 
> relies on a consistent read-after-write since it accesses the remote resource 
> to get file size and modification timestamp. Since there we have access to 
> the local resource, we can use the data from there instead and circumvent the 
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-8801) S3's eventual consistent read-after-write may fail yarn deployment of resources to S3

2019-07-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-8801:
---
Affects Version/s: 1.9.0
   1.6.4
   1.7.2
   1.8.1

> S3's eventual consistent read-after-write may fail yarn deployment of 
> resources to S3
> -
>
> Key: FLINK-8801
> URL: https://issues.apache.org/jira/browse/FLINK-8801
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, FileSystems, Runtime / Coordination
>Affects Versions: 1.4.0, 1.5.0, 1.6.4, 1.7.2, 1.8.1, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.4.3, 1.5.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> According to 
> https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel:
> {quote}
> Amazon S3 provides read-after-write consistency for PUTS of new objects in 
> your S3 bucket in all regions with one caveat. The caveat is that if you make 
> a HEAD or GET request to the key name (to find if the object exists) before 
> creating the object, Amazon S3 provides eventual consistency for 
> read-after-write.
> {quote}
> Some S3 file system implementations may actually execute such a request for 
> the about-to-write object and thus the read-after-write is only eventually 
> consistent. {{org.apache.flink.yarn.Utils#setupLocalResource()}} currently 
> relies on a consistent read-after-write since it accesses the remote resource 
> to get file size and modification timestamp. Since there we have access to 
> the local resource, we can use the data from there instead and circumvent the 
> problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-13173) Only run openSSL tests if desired

2019-07-10 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-13173.
---
Resolution: Fixed

Fixed in master via 6d79968f04d549d37b3bcda086a1484e78f61ac3

> Only run openSSL tests if desired
> -
>
> Key: FLINK-13173
> URL: https://issues.apache.org/jira/browse/FLINK-13173
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network, Tests, Travis
>Affects Versions: 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Rename {{flink.tests.force-openssl}} to {{flink.tests.with-openssl}} and only 
> run openSSL-based unit tests if this is set. This way, we avoid systems where 
> the bundled dynamic libraries do not work. Travis seems to run fine and will 
> have this property set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-13172) JVM crash with dynamic netty-tcnative wrapper to openSSL on some OS

2019-07-09 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13172:

Description: 
The dynamically-linked wrapper library in 
{{flink-shaded-netty-tcnative-dynamic}} may not work on all systems, depending 
on how the system-provided openSSL library is built.
As a result, when trying to run Flink with {{security.ssl.provider: OPENSSL}} 
or just running a test based on {{SSLUtilsTest}} (which checks for openSSL 
availability, and that alone is enough to trigger the error below), the JVM 
will crash, e.g. with
- on SUSE-based systems:
{code}
/usr/lib64/jvm/java-openjdk/bin/java: relocation error: 
/tmp/liborg_apache_flink_shaded_netty4_netty_tcnative_linux_x86_644115489043239307863.so:
 symbol TLSv1_2_server_method version OPENSSL_1.0.1 not defined in file 
libssl.so.1.0.0 with link time reference
{code}
- on Arch Linux:
{code}
/usr/lib/jvm/default/bin/java: relocation error: 
/tmp/liborg_apache_flink_shaded_netty4_netty_tcnative_linux_x86_648476498532937980008.so:
 symbol SSLv3_method version OPENSSL_1.0.0 not defined in file libssl.so.1.0.0 
with link time reference
{code}

Possible solutions:
# build your own OS-dependent dynamically-linked {{netty-tcnative}} library and 
shade it in your own build of {{flink-shaded-netty-tcnative-dynamic}}, or
# use {{flink-shaded-netty-tcnative-static}}:
{code}
git clone https://github.com/apache/flink-shaded.git
cd flink-shaded
mvn clean package -Pinclude-netty-tcnative-static -pl 
flink-shaded-netty-tcnative-static
{code}
# get your OS-dependent build into netty-tcnative as a special branch similar 
to what they currently do with Fedora-based systems

  was:
The dynamically-linked wrapper library in 
{{flink-shaded-netty-tcnative-dynamic}} may not work on all systems, depending 
on how the system-provided openSSL library is built.
As a result, when trying to run Flink with {{security.ssl.provider: OPENSSL}} 
or running a test based on {{SSLUtilsTest}} (which checks for openSSL 
availability), the JVM will crash, e.g. with
- on SUSE-based systems:
{code}
/usr/lib64/jvm/java-openjdk/bin/java: relocation error: 
/tmp/liborg_apache_flink_shaded_netty4_netty_tcnative_linux_x86_644115489043239307863.so:
 symbol TLSv1_2_server_method version OPENSSL_1.0.1 not defined in file 
libssl.so.1.0.0 with link time reference
{code}
- on Arch Linux:
{code}
/usr/lib/jvm/default/bin/java: relocation error: 
/tmp/liborg_apache_flink_shaded_netty4_netty_tcnative_linux_x86_648476498532937980008.so:
 symbol SSLv3_method version OPENSSL_1.0.0 not defined in file libssl.so.1.0.0 
with link time reference
{code}

Possible solutions:
# build your own OS-dependent dynamically-linked {{netty-tcnative}} library and 
shade it in your own build of {{flink-shaded-netty-tcnative-dynamic}}, or
# use {{flink-shaded-netty-tcnative-static}}:
{code}
git clone https://github.com/apache/flink-shaded.git
cd flink-shaded
mvn clean package -Pinclude-netty-tcnative-static -pl 
flink-shaded-netty-tcnative-static
{code}
# get your OS-dependent build into netty-tcnative as a special branch similar 
to what they currently do with Fedora-based systems


> JVM crash with dynamic netty-tcnative wrapper to openSSL on some OS
> ---
>
> Key: FLINK-13172
> URL: https://issues.apache.org/jira/browse/FLINK-13172
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Tests
>Affects Versions: 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> The dynamically-linked wrapper library in 
> {{flink-shaded-netty-tcnative-dynamic}} may not work on all systems, 
> depending on how the system-provided openSSL library is built.
> As a result, when trying to run Flink with {{security.ssl.provider: OPENSSL}} 
> or just running a test based on {{SSLUtilsTest}} (which checks for openSSL 
> availability, and that alone is enough to trigger the error below), the JVM 
> will crash, e.g. with
> - on SUSE-based systems:
> {code}
> /usr/lib64/jvm/java-openjdk/bin/java: relocation error: 
> /tmp/liborg_apache_flink_shaded_netty4_netty_tcnative_linux_x86_644115489043239307863.so:
>  symbol TLSv1_2_server_method version OPENSSL_1.0.1 not defined in file 
> libssl.so.1.0.0 with link time reference
> {code}
> - on Arch Linux:
> {code}
> /usr/lib/jvm/default/bin/java: relocation error: 
> /tmp/liborg_apache_flink_shaded_netty4_netty_tcnative_linux_x86_648476498532937980008.so:
>  symbol SSLv3_method version OPENSSL_1.0.0 not defined in file 
> libssl.so.1.0.0 with link time reference
> {code}
> Possible solutions:
> # build your own OS-dependent dynamically-linked {{netty-tcnative}} library 
> and shade it in your own build of {{flink-shaded-netty-tcnative-dynamic}}, or
> # use {{flink-shaded-netty-tcnative-static}}:
> {code}
> git clone 

[jira] [Updated] (FLINK-13173) Only run openSSL tests if desired

2019-07-09 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13173:

Priority: Critical  (was: Major)

> Only run openSSL tests if desired
> -
>
> Key: FLINK-13173
> URL: https://issues.apache.org/jira/browse/FLINK-13173
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network, Tests, Travis
>Affects Versions: 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Critical
> Fix For: 1.9.0
>
>
> Rename {{flink.tests.force-openssl}} to {{flink.tests.with-openssl}} and only 
> run openSSL-based unit tests if this is set. This way, we avoid systems where 
> the bundled dynamic libraries do not work. Travis seems to run fine and will 
> have this property set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13173) Only run openSSL tests if desired

2019-07-09 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13173:
---

 Summary: Only run openSSL tests if desired
 Key: FLINK-13173
 URL: https://issues.apache.org/jira/browse/FLINK-13173
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Network, Tests, Travis
Affects Versions: 1.9.0
Reporter: Nico Kruber
Assignee: Nico Kruber
 Fix For: 1.9.0


Rename {{flink.tests.force-openssl}} to {{flink.tests.with-openssl}} and only 
run openSSL-based unit tests if this is set. This way, we avoid systems where 
the bundled dynamic libraries do not work. Travis seems to run fine and will 
have this property set.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13172) JVM crash with dynamic netty-tcnative wrapper to openSSL on some OS

2019-07-09 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13172:
---

 Summary: JVM crash with dynamic netty-tcnative wrapper to openSSL 
on some OS
 Key: FLINK-13172
 URL: https://issues.apache.org/jira/browse/FLINK-13172
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network, Tests
Affects Versions: 1.9.0
Reporter: Nico Kruber
Assignee: Nico Kruber


The dynamically-linked wrapper library in 
{{flink-shaded-netty-tcnative-dynamic}} may not work on all systems, depending 
on how the system-provided openSSL library is built.
As a result, when trying to run Flink with {{security.ssl.provider: OPENSSL}} 
or running a test based on {{SSLUtilsTest}} (which checks for openSSL 
availability), the JVM will crash, e.g. with
- on SUSE-based systems:
{code}
/usr/lib64/jvm/java-openjdk/bin/java: relocation error: 
/tmp/liborg_apache_flink_shaded_netty4_netty_tcnative_linux_x86_644115489043239307863.so:
 symbol TLSv1_2_server_method version OPENSSL_1.0.1 not defined in file 
libssl.so.1.0.0 with link time reference
{code}
- on Arch Linux:
{code}
/usr/lib/jvm/default/bin/java: relocation error: 
/tmp/liborg_apache_flink_shaded_netty4_netty_tcnative_linux_x86_648476498532937980008.so:
 symbol SSLv3_method version OPENSSL_1.0.0 not defined in file libssl.so.1.0.0 
with link time reference
{code}

Possible solutions:
# build your own OS-dependent dynamically-linked {{netty-tcnative}} library and 
shade it in your own build of {{flink-shaded-netty-tcnative-dynamic}}, or
# use {{flink-shaded-netty-tcnative-static}}:
{code}
git clone https://github.com/apache/flink-shaded.git
cd flink-shaded
mvn clean package -Pinclude-netty-tcnative-static -pl 
flink-shaded-netty-tcnative-static
{code}
# get your OS-dependent build into netty-tcnative as a special branch similar 
to what they currently do with Fedora-based systems



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11082) Fix the calculation of backlog in PipelinedSubpartition

2019-07-08 Thread Nico Kruber (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880476#comment-16880476
 ] 

Nico Kruber commented on FLINK-11082:
-

Yes, [~zjwang] - nonetheless, the floating buffers sit idle until an exclusive 
buffer comes back (even if the backlog is 0), which I was worried about at first.
However, [~pnowojski] convinced me that this may not be a big issue after all: 
we are kind of making the assumption that {{#exclusive_buffers + 
#floating_buffers}} is enough to achieve full throughput, but even if some 
floating buffers sit around being idle, we will in that case have filled 
exclusive buffers to work on: in total at least {{#exclusive_buffers + 
#floating_buffers}}, which is fine.

We just have to keep this in mind when reasoning about the use of the floating 
buffers: e.g. a high-load channel should have the floating buffers available, 
but as soon as any other channel has at least 1 buffer, it will also compete for 
a floating buffer even if it does not use it at the moment. Because of the 
reasoning above, this should still be ok.

> Fix the calculation of backlog in PipelinedSubpartition
> ---
>
> Key: FLINK-11082
> URL: https://issues.apache.org/jira/browse/FLINK-11082
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.5.6, 1.7.1
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The backlog of subpartition should indicate how many buffers are consumable, 
> then the consumer could feedback the corresponding credits for transporting 
> these buffers. But in current PipelinedSubpartitionimplementation, the 
> backlog is increased by 1 when a BufferConsumer is added into 
> PipelinedSubpartition, and decreased by 1 when a BufferConsumer is removed 
> from PipelinedSubpartition. So the backlog only reflects how many buffers are 
> retained in PipelinedSubpartition, which is not always equivalent to the 
> number of consumable buffers.
> The backlog inconsistency might result in floating buffers misdistribution on 
> consumer side, because the consumer would request floating buffers based on 
> backlog value, then one floating buffer might not be used in 
> RemoteInputChannel long time after requesting.
> Considering the solution, the last buffer in PipelinedSubpartition could only 
> be consumable in the case of flush triggered or partition finished. So we 
> could calculate the backlog precisely based on partition flushed/finished 
> conditions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-13018) Serving docs locally with jekyll fails with inotify limit

2019-06-27 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-13018.
---
Resolution: Works for Me

Looks like this was solved by rebooting and thus resetting any rogue inotify 
instances.

I also verified that starting the build and stopping it (with the current code) 
does not leak inotify instances by looking at
{code}
find /proc/*/fd/* -type l -lname 'anon_inode:inotify' 2>/dev/null | wc -l
{code}

> Serving docs locally with jekyll fails with inotify limit
> -
>
> Key: FLINK-13018
> URL: https://issues.apache.org/jira/browse/FLINK-13018
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Both {{build-docs.sh -i}} and {{build-docs.sh -p}} currently fail (also in 
> the dockerized builds in {{docs/docker}}):
> {code}
> $ ./build_docs.sh -p
> Fetching gem metadata from https://rubygems.org/..
> ...
> Bundle complete! 8 Gemfile dependencies, 36 gems now installed.
> Bundled gems are installed into `./.rubydeps`
> Configuration file: /home/nico/Projects/flink/docs/_config.yml
> Source: /home/nico/Projects/flink/docs
>Destination: /home/nico/Projects/flink/docs/content
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> done in 167.943 seconds.
> jekyll 3.7.2 | Error:  Too many open files - Failed to initialize inotify: 
> the user limit on the total number of inotify instances has been reached.
> {code}
> I wouldn't suggest working around this by setting a higher inotify limit, but 
> upgrading jekyll did not solve it either, and so far there are three options:
> # disable watching files via {{--no-watch}}
> # use polling instead of `inotify` via `--force_polling`
> # try to reduce the set of files by adding excludes for (expected) static 
> files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-13018) Serving docs locally with jekyll fails with inotify limit

2019-06-27 Thread Nico Kruber (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874172#comment-16874172
 ] 

Nico Kruber commented on FLINK-13018:
-

There was some (very old) discussion around this issue at rb-inotify, which 
contained an interesting quote:
{quote}
16:11  now i can't even test anymore, rb-inotifier does not 
automatically close inotify handles, so if you dont manually do it, such as 
when the app doesn't exit cleanly, the handles are left open on the kernel 
forever...
16:11  so now its maxed out my user limit for inotify
16:11  Exception: Too many open files - Failed to initialize inotify: 
the user limit on the total number of inotify instances has been reached.
16:13  so between EventMachine crashing ruby and rb-inotifier, it's 
really screwed me over
{quote}

Now, it seems like 0.10.0 uses a different way of working with IO resources and 
cleaning them up - upgrading this dependency may be a (far-fetched) solution. 
Also, it could be anything on your system starting inotify instances which 
could cause this failure.

> Serving docs locally with jekyll fails with inotify limit
> -
>
> Key: FLINK-13018
> URL: https://issues.apache.org/jira/browse/FLINK-13018
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Both {{build-docs.sh -i}} and {{build-docs.sh -p}} currently fail (also in 
> the dockerized builds in {{docs/docker}}):
> {code}
> $ ./build_docs.sh -p
> Fetching gem metadata from https://rubygems.org/..
> ...
> Bundle complete! 8 Gemfile dependencies, 36 gems now installed.
> Bundled gems are installed into `./.rubydeps`
> Configuration file: /home/nico/Projects/flink/docs/_config.yml
> Source: /home/nico/Projects/flink/docs
>Destination: /home/nico/Projects/flink/docs/content
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> done in 167.943 seconds.
> jekyll 3.7.2 | Error:  Too many open files - Failed to initialize inotify: 
> the user limit on the total number of inotify instances has been reached.
> {code}
> I wouldn't suggest working around this by setting a higher inotify limit, but 
> upgrading jekyll did not solve it either, and so far there are three options:
> # disable watching files via {{--no-watch}}
> # use polling instead of `inotify` via `--force_polling`
> # try to reduce the set of files by adding excludes for (expected) static 
> files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-13018) Serving docs locally with jekyll fails with inotify limit

2019-06-27 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13018:

Description: 
Both {{build-docs.sh -i}} and {{build-docs.sh -p}} currently fail (also in the 
dockerized builds in {{docs/docker}}):

{code}
$ ./build_docs.sh -p
Fetching gem metadata from https://rubygems.org/..
...
Bundle complete! 8 Gemfile dependencies, 36 gems now installed.
Bundled gems are installed into `./.rubydeps`
Configuration file: /home/nico/Projects/flink/docs/_config.yml
Source: /home/nico/Projects/flink/docs
   Destination: /home/nico/Projects/flink/docs/content
 Incremental build: disabled. Enable with --incremental
  Generating... 
done in 167.943 seconds.
jekyll 3.7.2 | Error:  Too many open files - Failed to initialize inotify: the 
user limit on the total number of inotify instances has been reached.
{code}

I wouldn't suggest working around this by setting a higher inotify limit, but 
upgrading jekyll did not solve it either, and so far there are three options:
# disable watching files via {{--no-watch}}
# use polling instead of `inotify` via `--force_polling`
# try to reduce the set of files by adding excludes for (expected) static files


  was:
Both {{build-docs.sh -i}} and {{build-docs.sh -p}} currently fail (also in the 
dockerized builds in {{docs/docker}}):

{code}
$ ./build_docs.sh -p
Fetching gem metadata from https://rubygems.org/..
...
Bundle complete! 8 Gemfile dependencies, 36 gems now installed.
Bundled gems are installed into `./.rubydeps`
Configuration file: /home/nico/Projects/flink/docs/_config.yml
Source: /home/nico/Projects/flink/docs
   Destination: /home/nico/Projects/flink/docs/content
 Incremental build: disabled. Enable with --incremental
  Generating... 
done in 167.943 seconds.
jekyll 3.7.2 | Error:  Too many open files - Failed to initialize inotify: the 
user limit on the total number of inotify instances has been reached.
{code}

Probably, {{inotify}} is used in a way to monitor single files and not just 
directories but I don't know that and couldn't find a way to change how jekyll 
is using inotify.

I wouldn't suggest working around this by setting a higher inotify limit, but 
upgrading jekyll did not solve it either, and so far there are three options:
# disable watching files via {{--no-watch}}
# use polling instead of `inotify` via `--force_polling`
# try to reduce the set of files by adding excludes for (expected) static files



> Serving docs locally with jekyll fails with inotify limit
> -
>
> Key: FLINK-13018
> URL: https://issues.apache.org/jira/browse/FLINK-13018
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Both {{build-docs.sh -i}} and {{build-docs.sh -p}} currently fail (also in 
> the dockerized builds in {{docs/docker}}):
> {code}
> $ ./build_docs.sh -p
> Fetching gem metadata from https://rubygems.org/..
> ...
> Bundle complete! 8 Gemfile dependencies, 36 gems now installed.
> Bundled gems are installed into `./.rubydeps`
> Configuration file: /home/nico/Projects/flink/docs/_config.yml
> Source: /home/nico/Projects/flink/docs
>Destination: /home/nico/Projects/flink/docs/content
>  Incremental build: disabled. Enable with --incremental
>   Generating... 
> done in 167.943 seconds.
> jekyll 3.7.2 | Error:  Too many open files - Failed to initialize inotify: 
> the user limit on the total number of inotify instances has been reached.
> {code}
> I wouldn't suggest working around this by setting a higher inotify limit, but 
> upgrading jekyll did not solve it either, and so far there are three options:
> # disable watching files via {{--no-watch}}
> # use polling instead of `inotify` via `--force_polling`
> # try to reduce the set of files by adding excludes for (expected) static 
> files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13018) Serving docs locally with jekyll fails with inotify limit

2019-06-27 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13018:
---

 Summary: Serving docs locally with jekyll fails with inotify limit
 Key: FLINK-13018
 URL: https://issues.apache.org/jira/browse/FLINK-13018
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.9.0
Reporter: Nico Kruber
Assignee: Nico Kruber


Both {{build-docs.sh -i}} and {{build-docs.sh -p}} currently fail (also in the 
dockerized builds in {{docs/docker}}):

{code}
$ ./build_docs.sh -p
Fetching gem metadata from https://rubygems.org/..
...
Bundle complete! 8 Gemfile dependencies, 36 gems now installed.
Bundled gems are installed into `./.rubydeps`
Configuration file: /home/nico/Projects/flink/docs/_config.yml
Source: /home/nico/Projects/flink/docs
   Destination: /home/nico/Projects/flink/docs/content
 Incremental build: disabled. Enable with --incremental
  Generating... 
done in 167.943 seconds.
jekyll 3.7.2 | Error:  Too many open files - Failed to initialize inotify: the 
user limit on the total number of inotify instances has been reached.
{code}

Probably, {{inotify}} is used to monitor individual files and not just 
directories, but I don't know that for sure and couldn't find a way to change 
how jekyll uses inotify.

I wouldn't suggest working around this by setting a higher inotify limit, but 
upgrading jekyll did not solve it either, and so far there are three options:
# disable watching files via {{--no-watch}}
# use polling instead of `inotify` via `--force_polling`
# try to reduce the set of files by adding excludes for (expected) static files




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-13017) Broken and irreproducible dockerized docs build

2019-06-27 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-13017:

Description: 
The build tools around {{docs/docker}} seem broken and (on my machine) give 
errors like the following while it is working on a colleague's machine:
{code}
bash: /etc/bash_completion.d/git-prompt.sh: No such file or directory
bash: __git_ps1: command not found
{code}

{code}
/usr/bin/env: 'ruby.ruby2.5': No such file or directory
bash: __git_ps1: command not found
{code}

The reason seems to be that the user's whole $HOME is mounted (writable!) into 
the docker container. We should mount only the docs directory in order to
# get builds which are independent from the host system (making them reproducible)
# not have the commands in the container affect the host(!)

  was:
The build tools around {{docs/docker}} seem broken and (on my machine) give 
errors like the following while it is working on a colleague's machine:
{code}
bash: /etc/bash_completion.d/git-prompt.sh: No such file or directory
bash: __git_ps1: command not found
{code}

{code}
```/usr/bin/env: 'ruby.ruby2.5': No such file or directory
bash: __git_ps1: command not found```
{code}

Reason seems to be that your whole user's $HOME is mounted (writable!) into the 
docker container. We should just mount the docs directory to get
# builds which are independent from the host system (making them reproducible)
# not have the commands in the container affect the host(!)


> Broken and irreproducible dockerized docs build
> ---
>
> Key: FLINK-13017
> URL: https://issues.apache.org/jira/browse/FLINK-13017
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.6.4, 1.7.2, 1.8.0, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Critical
>
> The build tools around {{docs/docker}} seem broken and (on my machine) give 
> errors like the following while it is working on a colleague's machine:
> {code}
> bash: /etc/bash_completion.d/git-prompt.sh: No such file or directory
> bash: __git_ps1: command not found
> {code}
> {code}
> /usr/bin/env: 'ruby.ruby2.5': No such file or directory
> bash: __git_ps1: command not found
> {code}
> The reason seems to be that the user's whole $HOME is mounted (writable!) into 
> the docker container. We should mount only the docs directory in order to
> # get builds which are independent from the host system (making them reproducible)
> # not have the commands in the container affect the host(!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (FLINK-9728) Wrong ruby version for building flink docs in docker container

2019-06-27 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber resolved FLINK-9728.

Resolution: Duplicate

> Wrong ruby version for building flink docs in docker container
> --
>
> Key: FLINK-9728
> URL: https://issues.apache.org/jira/browse/FLINK-9728
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Florian Schmidt
>Priority: Minor
>
> When trying to use the dockerized jekyll build I get the following error
>  
> {code:java}
> $ ./build_docs.sh -p
> Your Ruby version is 2.0.0, but your Gemfile specified >= 2.1.0{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13017) Broken and irreproducible dockerized docs build

2019-06-27 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-13017:
---

 Summary: Broken and irreproducible dockerized docs build
 Key: FLINK-13017
 URL: https://issues.apache.org/jira/browse/FLINK-13017
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.0, 1.7.2, 1.6.4, 1.9.0
Reporter: Nico Kruber
Assignee: Nico Kruber


The build tools around {{docs/docker}} seem broken and (on my machine) give 
errors like the following while it is working on a colleague's machine:
{code}
bash: /etc/bash_completion.d/git-prompt.sh: No such file or directory
bash: __git_ps1: command not found
{code}

{code}
```/usr/bin/env: 'ruby.ruby2.5': No such file or directory
bash: __git_ps1: command not found```
{code}

The reason seems to be that the user's whole $HOME is mounted (writable!) into 
the docker container. We should mount only the docs directory in order to
# get builds which are independent from the host system (making them reproducible)
# not have the commands in the container affect the host(!)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (FLINK-12980) Getting Started - Add Top-Level Section to Existing Documentation

2019-06-27 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber resolved FLINK-12980.
-
   Resolution: Fixed
Fix Version/s: 1.9.0

Merged in master via e377bf669cb8a12fe9eec241236421698198ea1e (unfortunately 
with the wrong jira ticket tag)

> Getting Started - Add Top-Level Section to Existing Documentation
> -
>
> Key: FLINK-12980
> URL: https://issues.apache.org/jira/browse/FLINK-12980
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Konstantin Knauf
>Assignee: Konstantin Knauf
>Priority: Major
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (FLINK-12784) Support retention policy for InfluxDB metrics reporter

2019-06-27 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber resolved FLINK-12784.
-
   Resolution: Fixed
Fix Version/s: 1.9.0

Merged in master via 0004056b494c42f5aa2b2cab5876eed7cfc20875

> Support retention policy for InfluxDB metrics reporter
> --
>
> Key: FLINK-12784
> URL: https://issues.apache.org/jira/browse/FLINK-12784
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Metrics
>Affects Versions: 1.8.0
>Reporter: Mans Singh
>Assignee: Mans Singh
>Priority: Minor
>  Labels: influxdb, metrics, pull-request-available
> Fix For: 1.9.0
>
>   Original Estimate: 12h
>  Time Spent: 20m
>  Remaining Estimate: 11h 40m
>
> The InfluxDB metrics reporter uses the default retention policy for saving 
> metrics to InfluxDB. This enhancement will allow the user to specify a 
> retention policy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12957) Fix thrift and protobuf dependency examples in documentation

2019-06-26 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12957.
---

> Fix thrift and protobuf dependency examples in documentation
> 
>
> Key: FLINK-12957
> URL: https://issues.apache.org/jira/browse/FLINK-12957
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.7.2, 1.8.0, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.3, 1.8.1, 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The examples in the docs are not up-to-date anymore and should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (FLINK-12957) Fix thrift and protobuf dependency examples in documentation

2019-06-26 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber resolved FLINK-12957.
-
   Resolution: Fixed
Fix Version/s: 1.8.1
   1.7.3

merged into:
- 1.8: 9ad7cda7a145537e2968416a361e0d22d85828e6
- 1.7: 5b8154c3c6c60ca8a850d82856db84d4334bc327

> Fix thrift and protobuf dependency examples in documentation
> 
>
> Key: FLINK-12957
> URL: https://issues.apache.org/jira/browse/FLINK-12957
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.7.2, 1.8.0, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.7.3, 1.8.1, 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The examples in the docs are not up-to-date anymore and should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (FLINK-12957) Fix thrift and protobuf dependency examples in documentation

2019-06-26 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber reopened FLINK-12957:
-

> Fix thrift and protobuf dependency examples in documentation
> 
>
> Key: FLINK-12957
> URL: https://issues.apache.org/jira/browse/FLINK-12957
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.7.2, 1.8.0, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The examples in the docs are not up-to-date anymore and should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12987) DescriptiveStatisticsHistogram#getCount does not return the number of elements seen

2019-06-25 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-12987:
---

 Summary: DescriptiveStatisticsHistogram#getCount does not return 
the number of elements seen
 Key: FLINK-12987
 URL: https://issues.apache.org/jira/browse/FLINK-12987
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.8.0, 1.7.2, 1.6.4
Reporter: Nico Kruber
Assignee: Nico Kruber


{{DescriptiveStatisticsHistogram#getCount()}} returns the number of elements in 
the current window and not the number of total elements seen over time. In 
contrast, {{DropwizardHistogramWrapper}} does this correctly.

We should unify the behaviour and add a unit test for it (there is no generic 
histogram test yet).
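
For illustration, a minimal sketch (not Flink's actual
{{DescriptiveStatisticsHistogram}}, just an assumed windowed histogram on top of
commons-math) that keeps a separate total counter so that {{getCount()}}
reflects all elements seen over time, independently of the window size:

{code:java}
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

/** Sketch of a windowed histogram whose getCount() counts all updates ever seen. */
public class WindowedHistogramSketch {

    private final DescriptiveStatistics statistics;
    private long totalCount; // elements seen over time, not just in the window

    public WindowedHistogramSketch(int windowSize) {
        this.statistics = new DescriptiveStatistics(windowSize);
    }

    public void update(long value) {
        totalCount++;               // grows forever
        statistics.addValue(value); // only keeps the last 'windowSize' values
    }

    public long getCount() {
        return totalCount; // NOT statistics.getN(), which is capped at the window size
    }

    public double getQuantile(double quantile) {
        return statistics.getPercentile(quantile * 100.0);
    }
}
{code}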



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12981) Ignore NaN values in histogram's percentile implementation

2019-06-25 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12981:

Description: Histogram metrics use "long" values and therefore, there is no 
{{Double.NaN}} in {{DescriptiveStatistics}}' data and there is no need to 
cleanse it while working with it.  (was: Histrogram metrics use "long" values 
and therefore, there is no {{Double.NaN}} in {{DescriptiveStatistics}}' data  
and there is no need to cleanse it while working with it.)

> Ignore NaN values in histogram's percentile implementation
> --
>
> Key: FLINK-12981
> URL: https://issues.apache.org/jira/browse/FLINK-12981
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Histogram metrics use "long" values and therefore, there is no {{Double.NaN}} 
> in {{DescriptiveStatistics}}' data and there is no need to cleanse it while 
> working with it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12983) Replace descriptive histogram's storage back-end

2019-06-25 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12983:

Description: 
{{DescriptiveStatistics}} relies on their {{ResizableDoubleArray}} for storing 
double values for their histograms. However, this is constantly resizing an 
internal array and seems to have quite some overhead.

Additionally, we're not using {{SynchronizedDescriptiveStatistics}} which, 
according to its docs, we should. Currently, we seem to be somewhat safe 
because {{ResizableDoubleArray}} has some synchronized parts but these are 
scheduled to go away with commons.math version 4.

Internal tests with the current implementation, one based on a linear array of 
twice the histogram size (and moving values back to the start once the window 
reaches the end), and one using a circular array (wrapping around with flexible 
start position) have shown these numbers using the optimised code from 
FLINK-10236, FLINK-12981, and FLINK-12982:
# only adding values to the histogram
{code}
Benchmark                                       Mode  Cnt       Score     Error   Units
HistogramBenchmarks.dropwizardHistogramAdd     thrpt   30   47985.359 ±  25.847  ops/ms
HistogramBenchmarks.descriptiveHistogramAdd    thrpt   30   70158.792 ± 276.858  ops/ms
--- with FLINK-10236, FLINK-12981, and FLINK-12982 ---
HistogramBenchmarks.descriptiveHistogramAdd    thrpt   30   75303.040 ± 475.355  ops/ms
HistogramBenchmarks.descrHistogramCircularAdd  thrpt   30  200906.902 ± 384.483  ops/ms
HistogramBenchmarks.descrHistogramLinearAdd    thrpt   30  189788.728 ± 233.283  ops/ms
{code}
# after adding each value, also retrieving a common set of metrics:
{code}
Benchmark                                    Mode  Cnt    Score   Error   Units
HistogramBenchmarks.dropwizardHistogram     thrpt   30  400.274 ± 4.930  ops/ms
HistogramBenchmarks.descriptiveHistogram    thrpt   30  124.533 ± 1.060  ops/ms
--- with FLINK-10236, FLINK-12981, and FLINK-12982 ---
HistogramBenchmarks.descriptiveHistogram    thrpt   30  251.895 ± 1.809  ops/ms
HistogramBenchmarks.descrHistogramCircular  thrpt   30  301.068 ± 2.077  ops/ms
HistogramBenchmarks.descrHistogramLinear    thrpt   30  234.050 ± 5.485  ops/ms
{code}

  was:
{{DescriptiveStatistics}} relies on their {{ResizableDoubleArray}} for storing 
double values for their histograms. However, this is constantly resizing an 
internal array and seems to have quite some overhead.

Additionally, we're not using {{SynchronizedDescriptiveStatistics}} which, 
according to its docs, we should. Currently, we seem to be somewhat safe 
because {{ResizableDoubleArray}} has some synchronized parts but these are 
scheduled to go away with commons.math version 4.

Internal tests with the current implementation, one based on a linear array of 
twice the histogram size (and moving values back to the start once the window 
reaches the end), and one using a circular array (wrapping around with flexible 
start position) has shown these numbers using the optimised code from 
FLINK-10236, FLINK-12981, and FLINK-12982:
# only adding values to the histogram
{code}
Benchmark                                       Mode  Cnt       Score      Error   Units
HistogramBenchmarks.dropwizardHistogramAdd     thrpt   30   47985.359 ±    25.847  ops/ms
HistogramBenchmarks.descriptiveHistogramAdd    thrpt   30   70158.792 ±   276.858  ops/ms
--- with FLINK-10236, FLINK-12981, and FLINK-12982 ---
HistogramBenchmarks.descriptiveHistogramAdd    thrpt   30   75303.040 ±   475.355  ops/ms
HistogramBenchmarks.histogramCircularArrayAdd  thrpt   30  790123.475 ± 48420.672  ops/ms
HistogramBenchmarks.histogramLinearArrayAdd    thrpt   30  385126.074 ±  3038.773  ops/ms
{code}
# after adding each value, also retrieving a common set of metrics:
{code}
Benchmark                                    Mode  Cnt    Score    Error   Units
HistogramBenchmarks.dropwizardHistogram     thrpt   30  400.274 ±  4.930  ops/ms
HistogramBenchmarks.descriptiveHistogram    thrpt   30  124.533 ±  1.060  ops/ms
--- with FLINK-10236, FLINK-12981, and FLINK-12982 ---
HistogramBenchmarks.descriptiveHistogram    thrpt   30  251.895 ±  1.809  ops/ms
HistogramBenchmarks.histogramCircularArray  thrpt   30  298.881 ± 10.027  ops/ms
HistogramBenchmarks.histogramLinearArray    thrpt   30  234.380 ±  5.014  ops/ms
{code}


> Replace descriptive histogram's storage back-end
> 
>
> Key: FLINK-12983
> URL: https://issues.apache.org/jira/browse/FLINK-12983
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> 

[jira] [Created] (FLINK-12984) Only call Histogram#getStatistics() once per set of retrieved statistics

2019-06-25 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-12984:
---

 Summary: Only call Histogram#getStatistics() once per set of 
retrieved statistics
 Key: FLINK-12984
 URL: https://issues.apache.org/jira/browse/FLINK-12984
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Metrics
Reporter: Nico Kruber
Assignee: Nico Kruber


On some occasions, {{Histogram#getStatistics()}} was called multiple times to 
retrieve different statistics. However, at least the Dropwizard implementation 
has some constant overhead per call, and we should rather interpret this method 
as returning a point-in-time snapshot of the histogram in order to get 
consistent values when querying them.
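
A hedged sketch of the intended reporting pattern (assuming Flink's
{{Histogram}} / {{HistogramStatistics}} metric interfaces): the snapshot is
obtained once and then queried for all values.

{code:java}
import org.apache.flink.metrics.Histogram;
import org.apache.flink.metrics.HistogramStatistics;

public class HistogramReportingSketch {

    /** Query a single point-in-time snapshot instead of calling getStatistics() per value. */
    public static void report(Histogram histogram) {
        HistogramStatistics stats = histogram.getStatistics(); // one snapshot ...
        long count = histogram.getCount();
        double min = stats.getMin();  // ... reused for all subsequent reads
        double max = stats.getMax();
        double mean = stats.getMean();
        double p50 = stats.getQuantile(0.5);
        double p99 = stats.getQuantile(0.99);

        System.out.printf("count=%d min=%.1f max=%.1f mean=%.3f p50=%.3f p99=%.3f%n",
                count, min, max, mean, p50, p99);
    }
}
{code}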



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-10236) Reduce histogram percentile/quantile retrieval overhead

2019-06-25 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-10236.
---
Resolution: Duplicate

> Reduce histogram percentile/quantile retrieval overhead
> ---
>
> Key: FLINK-10236
> URL: https://issues.apache.org/jira/browse/FLINK-10236
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.5.3, 1.6.0, 1.7.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Most of our metrics reporters for histograms always report multiple 
> quantiles: 0.5, 0.75, 0.90, 0.95, 0.98, 0.99, and 0.999.
> This is retrieved from 
> {{HistogramStatistics}}/{{DescriptiveStatisticsHistogramStatistics}}, but we 
> do not have any optimisation for retrieving this many percentiles, and the 
> plain use of {{DescriptiveStatistics#getPercentile}} has some constant 
> overhead that could be avoided over multiple executions by using 
> {{Percentile#setData(double[])}} to cache the current data set of the 
> snapshot.
> In addition, min, max, mean, and standard deviation each iterate over the 
> array, which could be done in a single pass.
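
A hedged sketch of that optimisation, using commons-math's {{Percentile}} with
{{setData()}} so that the snapshot's values are only set up once for all
reported quantiles (class and method names below are illustrative only):

{code:java}
import org.apache.commons.math3.stat.descriptive.rank.Percentile;

public class PercentileBatchSketch {

    private static final double[] QUANTILES = {0.5, 0.75, 0.90, 0.95, 0.98, 0.99, 0.999};

    /** Evaluate all quantiles against one stored copy of the snapshot's values. */
    public static double[] quantilesOf(double[] snapshotValues) {
        Percentile percentile = new Percentile();
        percentile.setData(snapshotValues); // set the data once instead of passing it per call

        double[] result = new double[QUANTILES.length];
        for (int i = 0; i < QUANTILES.length; i++) {
            result[i] = percentile.evaluate(QUANTILES[i] * 100.0); // expects 0 < p <= 100
        }
        return result;
    }
}
{code}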



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12982) Make DescriptiveStatisticsHistogramStatistics a true point-in-time snapshot

2019-06-25 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12982:

Description: 
Instead of redirecting {{DescriptiveStatisticsHistogramStatistics}} calls to 
{{DescriptiveStatistics}}, it takes a point-in-time snapshot using its own
{{UnivariateStatistic}} implementation that
 * calculates min, max, mean, and standard deviation in one go (as opposed to 
four iterations over the values array!)
 * caches pivots for the percentile calculation to speed up retrieval of 
multiple percentiles/quartiles

This is also similar to the semantics of our implementation using codahale's 
{{DropWizard}}.

  was:
Instead of redirecting {{DescriptiveStatisticsHistogramStatistics}} calls to 
{{DescriptiveStatistics}}, it takes a point-in-time snapshot using an own
UnivariateStatistic implementation that
* calculates min, max, mean, and standard deviation in one go (as opposed to 
four iterations over the values array!)
* caches pivots for the percentile calculation to speed up retrieval of 
multiple percentiles/quartiles

This is similar to the semantics of our implementation using codahale's 
{{DropWizard}}.


> Make DescriptiveStatisticsHistogramStatistics a true point-in-time snapshot
> ---
>
> Key: FLINK-12982
> URL: https://issues.apache.org/jira/browse/FLINK-12982
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Metrics
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Instead of redirecting {{DescriptiveStatisticsHistogramStatistics}} calls to 
> {{DescriptiveStatistics}}, it takes a point-in-time snapshot using its own
> {{UnivariateStatistic}} implementation that
>  * calculates min, max, mean, and standard deviation in one go (as opposed to 
> four iterations over the values array!)
>  * caches pivots for the percentile calculation to speed up retrieval of 
> multiple percentiles/quartiles
> This is also similar to the semantics of our implementation using codahale's 
> {{DropWizard}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12983) Replace descriptive histogram's storage back-end

2019-06-25 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-12983:
---

 Summary: Replace descriptive histogram's storage back-end
 Key: FLINK-12983
 URL: https://issues.apache.org/jira/browse/FLINK-12983
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Metrics
Reporter: Nico Kruber
Assignee: Nico Kruber


{{DescriptiveStatistics}} relies on their {{ResizableDoubleArray}} for storing 
double values for their histograms. However, this is constantly resizing an 
internal array and seems to have quite some overhead.

Additionally, we're not using {{SynchronizedDescriptiveStatistics}} which, 
according to its docs, we should. Currently, we seem to be somewhat safe 
because {{ResizableDoubleArray}} has some synchronized parts but these are 
scheduled to go away with commons.math version 4.

Internal tests with the current implementation, one based on a linear array of 
twice the histogram size (and moving values back to the start once the window 
reaches the end), and one using a circular array (wrapping around with flexible 
start position) have shown these numbers using the optimised code from 
FLINK-10236, FLINK-12981, and FLINK-12982:
# only adding values to the histogram
{code}
Benchmark                                       Mode  Cnt       Score      Error   Units
HistogramBenchmarks.dropwizardHistogramAdd     thrpt   30   47985.359 ±    25.847  ops/ms
HistogramBenchmarks.descriptiveHistogramAdd    thrpt   30   70158.792 ±   276.858  ops/ms
--- with FLINK-10236, FLINK-12981, and FLINK-12982 ---
HistogramBenchmarks.descriptiveHistogramAdd    thrpt   30   75303.040 ±   475.355  ops/ms
HistogramBenchmarks.histogramCircularArrayAdd  thrpt   30  790123.475 ± 48420.672  ops/ms
HistogramBenchmarks.histogramLinearArrayAdd    thrpt   30  385126.074 ±  3038.773  ops/ms
{code}
# after adding each value, also retrieving a common set of metrics:
{code}
Benchmark                                    Mode  Cnt    Score    Error   Units
HistogramBenchmarks.dropwizardHistogram     thrpt   30  400.274 ±  4.930  ops/ms
HistogramBenchmarks.descriptiveHistogram    thrpt   30  124.533 ±  1.060  ops/ms
--- with FLINK-10236, FLINK-12981, and FLINK-12982 ---
HistogramBenchmarks.descriptiveHistogram    thrpt   30  251.895 ±  1.809  ops/ms
HistogramBenchmarks.histogramCircularArray  thrpt   30  298.881 ± 10.027  ops/ms
HistogramBenchmarks.histogramLinearArray    thrpt   30  234.380 ±  5.014  ops/ms
{code}
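
For illustration, a minimal sketch of the circular-array variant mentioned above
(assumptions: fixed window size, single-threaded access), which overwrites the
oldest value instead of resizing an internal array:

{code:java}
/** Sketch of a fixed-size circular buffer for histogram samples. */
public class CircularDoubleArraySketch {

    private final double[] values;
    private int nextIndex; // position the next value will overwrite
    private long count;    // total number of values ever added

    public CircularDoubleArraySketch(int windowSize) {
        this.values = new double[windowSize];
    }

    public void addValue(double value) {
        values[nextIndex] = value; // overwrite the oldest entry once the window is full
        nextIndex = (nextIndex + 1) % values.length;
        count++;
    }

    /** Copy of the current window contents, oldest to newest. */
    public double[] snapshot() {
        int size = (int) Math.min(count, values.length);
        double[] copy = new double[size];
        int start = count < values.length ? 0 : nextIndex; // oldest element once full
        for (int i = 0; i < size; i++) {
            copy[i] = values[(start + i) % values.length];
        }
        return copy;
    }
}
{code}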



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12982) Make DescriptiveStatisticsHistogramStatistics a true point-in-time snapshot

2019-06-25 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-12982:
---

 Summary: Make DescriptiveStatisticsHistogramStatistics a true 
point-in-time snapshot
 Key: FLINK-12982
 URL: https://issues.apache.org/jira/browse/FLINK-12982
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Metrics
Reporter: Nico Kruber
Assignee: Nico Kruber


Instead of redirecting {{DescriptiveStatisticsHistogramStatistics}} calls to 
{{DescriptiveStatistics}}, it takes a point-in-time snapshot using its own
UnivariateStatistic implementation that
* calculates min, max, mean, and standard deviation in one go (as opposed to 
four iterations over the values array!)
* caches pivots for the percentile calculation to speed up retrieval of 
multiple percentiles/quartiles

This is similar to the semantics of our implementation using codahale's 
{{DropWizard}}.
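
A simplified, hedged sketch of the single-pass idea (not the actual
{{DescriptiveStatisticsHistogramStatistics}} code): min, max, mean, and sample
standard deviation computed in one loop over the snapshot's values.

{code:java}
/** Statistics computed in a single pass over a snapshot (assumes values.length > 0). */
public final class OnePassStatsSketch {

    public final double min;
    public final double max;
    public final double mean;
    public final double stdDev;

    private OnePassStatsSketch(double min, double max, double mean, double stdDev) {
        this.min = min;
        this.max = max;
        this.mean = mean;
        this.stdDev = stdDev;
    }

    public static OnePassStatsSketch of(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (double v : values) { // one iteration instead of four
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
            sumOfSquares += v * v;
        }
        int n = values.length;
        double mean = sum / n;
        // naive sample variance, good enough for a sketch
        double variance = n > 1 ? (sumOfSquares - n * mean * mean) / (n - 1) : 0.0;
        return new OnePassStatsSketch(min, max, mean, Math.sqrt(Math.max(variance, 0.0)));
    }
}
{code}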



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12981) Ignore NaN values in histogram's percentile implementation

2019-06-25 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-12981:
---

 Summary: Ignore NaN values in histogram's percentile implementation
 Key: FLINK-12981
 URL: https://issues.apache.org/jira/browse/FLINK-12981
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Metrics
Reporter: Nico Kruber
Assignee: Nico Kruber


Histrogram metrics use "long" values and therefore, there is no {{Double.NaN}} 
in {{DescriptiveStatistics}}' data  and there is no need to cleanse it while 
working with it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12909) add try catch when find a unique file name for the spilling channel

2019-06-25 Thread Nico Kruber (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872286#comment-16872286
 ] 

Nico Kruber commented on FLINK-12909:
-

Thanks [~zjwang] for clarification - I also was not aware of the existing PR 
which I now linked to this ticket ([~xymaqingxiang] it seems that simply 
changing the title is not enough).

I think, with the past two messages, I now understand the reason for this 
change: if multiple {{tempDirs}} / {{io.tmp.dirs}} are set up and there is a 
failure while creating a spilling channel on one of them (due to a disk 
failure), we could still continue with some other directory. With only one 
temporary directory, the change doesn't provide much benefit.

I'll leave some comments on the PR as well.
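
To make that concrete, a hedged sketch (illustrative only, not the actual
{{createSpillingChannel()}} implementation) of falling back to another
configured temporary directory when opening the spill file fails:

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

public class SpillingChannelSketch {

    /** Try each temp dir in turn; only give up once all of them have failed. */
    public static FileChannel createSpillingChannel(File[] tempDirs) throws IOException {
        IOException lastFailure = null;
        for (File dir : tempDirs) {
            // hypothetical file name scheme, just for illustration
            File spillFile = new File(dir, UUID.randomUUID() + ".spill");
            try {
                return FileChannel.open(spillFile.toPath(),
                        StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            } catch (IOException e) {
                lastFailure = e; // e.g. disk lost or directory unwritable: try the next dir
            }
        }
        throw new IOException("Could not create a spilling channel in any temp directory",
                lastFailure);
    }
}
{code}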

> add try catch when find a unique file name for the spilling channel
> ---
>
> Key: FLINK-12909
> URL: https://issues.apache.org/jira/browse/FLINK-12909
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: xymaqingxiang
>Priority: Major
>  Labels: pull-request-available
>
> h2. What is the purpose of the change
> Catch exceptions thrown due to disk loss, try to find a unique file name for 
> the spilling channel again.
> Modify the createSpillingChannel() method of the 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer
>  class to solve this problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12909) add try catch when find a unique file name for the spilling channel

2019-06-25 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12909:

Labels: pull-request-available  (was: )

> add try catch when find a unique file name for the spilling channel
> ---
>
> Key: FLINK-12909
> URL: https://issues.apache.org/jira/browse/FLINK-12909
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: xymaqingxiang
>Priority: Major
>  Labels: pull-request-available
>
> h2. What is the purpose of the change
> Catch exceptions thrown due to disk loss, try to find a unique file name for 
> the spilling channel again.
> Modify the createSpillingChannel() method of the 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer
>  class to solve this problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12909) add try catch when find a unique file name for the spilling channel

2019-06-24 Thread Nico Kruber (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871199#comment-16871199
 ] 

Nico Kruber commented on FLINK-12909:
-

[~xymaqingxiang] I'm not sure I understand the problem yet:
 # What concrete problem are you facing and where?
 # When is the disk lost?
 ** If you mean during the 10 attempts in {{createSpillingChannel()}}, then 
there is only a very small time window for a disk to be lost and to recover 
again during these attempts (we don't wait between attempts).

> add try catch when find a unique file name for the spilling channel
> ---
>
> Key: FLINK-12909
> URL: https://issues.apache.org/jira/browse/FLINK-12909
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: xymaqingxiang
>Priority: Major
>
> h2. What is the purpose of the change
> Catch exceptions thrown due to disk loss, try to find a unique file name for 
> the spilling channel again.
> Modify the createSpillingChannel() method of the 
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer
>  class to solve this problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12957) Fix thrift and protobuf dependency examples in documentation

2019-06-24 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-12957:
---

 Summary: Fix thrift and protobuf dependency examples in 
documentation
 Key: FLINK-12957
 URL: https://issues.apache.org/jira/browse/FLINK-12957
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.0, 1.7.2, 1.9.0
Reporter: Nico Kruber
Assignee: Nico Kruber


The examples in the docs are not up-to-date anymore and should be updated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12871) Wrong SSL setup examples in docs

2019-06-17 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12871.
---
   Resolution: Fixed
Fix Version/s: 1.8.1
   1.9.0
   1.7.3

Fixed via:
 * master: 59b9a938862124050de56970a4b618a82c831943
 * release-1.8: 0b2525734ea920a5347f9dee915b684aee98a702
 * release-1.7: 272fafe66830a99e99dffdc42ea27c00a6cc8a5e

> Wrong SSL setup examples in docs
> 
>
> Key: FLINK-12871
> URL: https://issues.apache.org/jira/browse/FLINK-12871
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.7.2, 1.8.0, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.3, 1.9.0, 1.8.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The SSL setup examples [1] were updated to rely on PKCS12 format (instead of 
> the old JKS keystore) but PKCS12 does not support separate passwords for the 
> key store and the key itself.
> {code}
> > Warning:  Different store and key passwords not supported for PKCS12 
> > KeyStores. Ignoring user-specified -keypass value.
> {code}
> Also, some of the examples still rely on the old JKS keystore and are not 
> using PKCS12 yet.
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/security-ssl.html#example-ssl-setup-standalone-and-kubernetes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12860) Document openSSL algorithm restrictions

2019-06-17 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12860:

Description: Netty's openSSL engine works slightly differently with respect 
to supported cipher algorithms (and potentially more things that are 
different). This should be documented before the 1.9 release.  (was: Netty's 
openSSL engine works slightly differently with respect to supported cipher 
algorithms and certificate storage formats. This should be documented before 
the 1.9 release.)

> Document openSSL algorithm restrictions
> ---
>
> Key: FLINK-12860
> URL: https://issues.apache.org/jira/browse/FLINK-12860
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation, Runtime / Network
>Affects Versions: 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Netty's openSSL engine works slightly differently with respect to supported 
> cipher algorithms (and potentially more things that are different). This 
> should be documented before the 1.9 release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12860) Document openSSL algorithm restrictions

2019-06-17 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12860:

Summary: Document openSSL algorithm restrictions  (was: Document openSSL 
certificate and algorithm restrictions)

> Document openSSL algorithm restrictions
> ---
>
> Key: FLINK-12860
> URL: https://issues.apache.org/jira/browse/FLINK-12860
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation, Runtime / Network
>Affects Versions: 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> Netty's openSSL engine works slightly differently with respect to supported 
> cipher algorithms and certificate storage formats. This should be documented 
> before the 1.9 release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12871) Wrong SSL setup examples in docs

2019-06-17 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12871:

Affects Version/s: 1.9.0
   1.7.2

> Wrong SSL setup examples in docs
> 
>
> Key: FLINK-12871
> URL: https://issues.apache.org/jira/browse/FLINK-12871
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.7.2, 1.8.0, 1.9.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> The SSL setup examples [1] were updated to rely on PKCS12 format (instead of 
> the old JKS keystore) but PKCS12 does not support separate passwords for the 
> key store and the key itself.
> {code}
> > Warning:  Different store and key passwords not supported for PKCS12 
> > KeyStores. Ignoring user-specified -keypass value.
> {code}
> Also, some of the examples still rely on the old JKS keystore and are not 
> using PKCS12 yet.
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/security-ssl.html#example-ssl-setup-standalone-and-kubernetes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12871) Wrong SSL setup examples in docs

2019-06-17 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber updated FLINK-12871:

Description: 
The SSL setup examples [1] were updated to rely on PKCS12 format (instead of 
the old JKS keystore) but PKCS12 does not support separate passwords for the 
key store and the key itself.

{code}
> Warning:  Different store and key passwords not supported for PKCS12 
> KeyStores. Ignoring user-specified -keypass value.
{code}


Also, some of the examples still rely on the old JKS keystore and are not using 
PKCS12 yet.


[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/security-ssl.html#example-ssl-setup-standalone-and-kubernetes

  was:
The SSL setup examples [1] were updated to rely on PKCS12 format (instead of 
the old JKS keystore) but PKCS12 does not support separate passwords for the 
key store and the key itself.
Also, some of the examples still rely on the old JKS keystore and are not using 
PKCS12 yet.


[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/security-ssl.html#example-ssl-setup-standalone-and-kubernetes


> Wrong SSL setup examples in docs
> 
>
> Key: FLINK-12871
> URL: https://issues.apache.org/jira/browse/FLINK-12871
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.8.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>
> The SSL setup examples [1] were updated to rely on PKCS12 format (instead of 
> the old JKS keystore) but PKCS12 does not support separate passwords for the 
> key store and the key itself.
> {code}
> > Warning:  Different store and key passwords not supported for PKCS12 
> > KeyStores. Ignoring user-specified -keypass value.
> {code}
> Also, some of the examples still rely on the old JKS keystore and are not 
> using PKCS12 yet.
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/security-ssl.html#example-ssl-setup-standalone-and-kubernetes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12871) Wrong SSL setup examples in docs

2019-06-17 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-12871:
---

 Summary: Wrong SSL setup examples in docs
 Key: FLINK-12871
 URL: https://issues.apache.org/jira/browse/FLINK-12871
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.0
Reporter: Nico Kruber
Assignee: Nico Kruber


The SSL setup examples [1] were updated to rely on PKCS12 format (instead of 
the old JKS keystore) but PKCS12 does not support separate passwords for the 
key store and the key itself.
Also, some of the examples still rely on the old JKS keystore and are not using 
PKCS12 yet.


[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/security-ssl.html#example-ssl-setup-standalone-and-kubernetes
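
For illustration, a hedged sketch (hypothetical file name, alias, and password)
of why separate passwords are misleading here: when loading a keytool-generated
PKCS12 store from Java, the key entry is unlocked with the keystore password.

{code:java}
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.Key;
import java.security.KeyStore;

public class Pkcs12LoadSketch {

    public static Key loadKey() throws Exception {
        char[] storePassword = "internal_store_password".toCharArray(); // hypothetical
        KeyStore keyStore = KeyStore.getInstance("PKCS12");
        try (InputStream in = Files.newInputStream(Paths.get("internal.keystore"))) { // hypothetical
            keyStore.load(in, storePassword);
        }
        // With a keytool-generated PKCS12 store, the same password protects the store
        // and the key, so the store password is also used as the key password here.
        return keyStore.getKey("flink.internal", storePassword); // hypothetical alias
    }
}
{code}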



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12860) Document openSSL certificate and algorithm restrictions

2019-06-15 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-12860:
---

 Summary: Document openSSL certificate and algorithm restrictions
 Key: FLINK-12860
 URL: https://issues.apache.org/jira/browse/FLINK-12860
 Project: Flink
  Issue Type: Sub-task
  Components: Documentation, Runtime / Network
Affects Versions: 1.9.0
Reporter: Nico Kruber
Assignee: Nico Kruber


Netty's openSSL engine works slightly differently with respect to supported 
cipher algorithms and certificate storage formats. This should be documented 
before the 1.9 release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12517) Run network tests with dynamically-linked openSSL

2019-06-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12517.
---

> Run network tests with dynamically-linked openSSL
> -
>
> Key: FLINK-12517
> URL: https://issues.apache.org/jira/browse/FLINK-12517
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
> Fix For: 1.9.0
>
>
> FLINK-9816 adds the ability to work with Netty's wrapper around native 
> openSSL implementations. We should set up unit tests that verify the 
> artifacts we provide, i.e. the dynamically-linked openSSL one in flink-shaded.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12839) Package flink-shaded-netty-tcnative-dynamic into opt/

2019-06-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12839.
---

> Package flink-shaded-netty-tcnative-dynamic into opt/
> -
>
> Key: FLINK-12839
> URL: https://issues.apache.org/jira/browse/FLINK-12839
> Project: Flink
>  Issue Type: Sub-task
>  Components: Build System, Runtime / Network
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
> Fix For: 1.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12518) Run e2e tests with openSSL

2019-06-15 Thread Nico Kruber (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nico Kruber closed FLINK-12518.
---

> Run e2e tests with openSSL
> --
>
> Key: FLINK-12518
> URL: https://issues.apache.org/jira/browse/FLINK-12518
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network, Tests
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
> Fix For: 1.9.0
>
>
> We should modify one end-to-end test each to run with:
>  * Java-based SSL
>  * dynamically linked openSSL
>  * statically linked openSSL



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

