[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-14 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

Mark Thomas  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #12 from Mark Thomas  ---
Fixed in:
- master for 9.0.18 onwards
- 8.5.x for 8.5.40 onwards
- 7.0.x for 7.0.94 onwards

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-08 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #11 from Mark Thomas  ---
I've put together an alternative proposal:
https://github.com/apache/tomcat/pull/146

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #10 from Phillip Webb  ---
@Mark

I've updated the PR based on your comments. There is now a cache for concurrent
access as well as protection against DoS attacks. I've also tried to add some
tests. The list of common charsets and the DoS threshold will no doubt need
refining.

If you want me to update it again with an option so that this is opt-in rather
than always on please let me know (and point me to an example of how you do
this kind of thing).

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #9 from Phillip Webb  ---

> Can you provide some context around why 31ms is significant for the use 
> case you are considering? It doesn't seem significant for the typical use 
> cases we see.

Spring Boot uses Embedded Tomcat as it's default servlet container and startup
time has been quite a focus in recent months. As much as possible we want
developers to be able to restart their applications quickly during development
time. We're also seeing a lot of interest in using Spring Boot in FaaS
environments. These are quite different to traditional deployments as
applications are only started on demand, and are closed very quickly.


> I do plan to look at the first option I described in comment #2. There are a 
> few things 
> on my TODO list so if you wanted to re-work your PR in that direction it 
> would certainly 
> help progress things.

I'll take another look. I made one change but I was missing the context of
#51400. Looking at the
Java 8 source code [1], I'd say that's still an issue.


[1]
https://github.com/openjdk-mirror/jdk/blame/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/sun/nio/cs/FastCharsetProvider.java#L131-L135

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #8 from Mark Thomas  ---
Just to note that the performance issue with unknown charsets wasn't memory
related. It was lock contention. Bug 51400 has the details.

Can you provide some context around why 31ms is significant for the use case
you are considering? It doesn't seem significant for the typical use cases we
see.

I do plan to look at the first option I described in comment #2. There are a
few things on my TODO list so if you wanted to re-work your PR in that
direction it would certainly help progress things.

Another thing worth doing is testing with a more recent JVM to see if the
original issue still exists. If it doesn't, we have a very simple solution
available - remove the cache entirely.

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #7 from Phillip Webb  ---
> I don't see the benefit as there's a clear downside and the 
> complexity skyrockets. IMO you should focus on the String 
> optimization since it doesn't have these issues.

I'll continue with the String optimizations as well, but they only help with
memory and don't give any startup performance improvements.

> On my desktop computer, I measured the "all charset" loading code using
> System.nanoTime as taking 31105409ns, so that would be 31 ms.

For me, saving 31 ms on each restart is a very worthwhile gain and something
I'd love to be able to opt in to. I even think that when running on the server
this kind of optimization is something I would find useful in many situations.
For example, if I have many internal services that I know will all be hit with
only UTF-8, why wouldn't I want to enable the option?

I'd love to be able to find a solution that helps this use-case without adding
too much complexity. From the comments on #57808 and #58859 it doesn't appear
that I'm alone.

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #6 from Remy Maucherat  ---
I don't see the benefit as there's a clear downside and the complexity
skyrockets. IMO you should focus on the String optimization since it doesn't
have these issues.

On my desktop computer, I measured the "all charset" loading code using
System.nanoTime as taking 31105409ns, so that would be 31 ms. IMO, even as an
optin and when using something where it is harmless like an IDE, I think users
would be best served not wasting their time configuring such an option. On a
server, I think these are CPU cycles reasonably well spent given the
as-good-as-possible runtime performance and robustness.

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #5 from Phillip Webb  ---
@Mark Thanks for pointing me at those previous issues, I should have thought to
search first!

It seems like I may have misunderstood the reason for the cache existing in the
first place. My original assumption was that it was necessary for
case-insensitive name lookups. It appears, on OpenJDK at least, calling
`Charset.forName` works with any case combination works. The following will run
without error:

Set names = new LinkedHashSet<>();
Charset.availableCharsets().forEach((key, value) -> {
names.add(value.name());
names.addAll(value.aliases());
});
names.forEach(name -> {
System.out.println(Charset.forName(name.toLowerCase()));
});

So the main reasons for the cache, as I understand it, are:

1) Improve lookup performance for initial requests by having a hot cache
2) Prevent DoS attacks by limiting the impact of a user requesting an invalid
charset

Looking at each of these in turn:


1) Improve lookup performance for initial requests by having a hot cache

> populate the cache values at start-up for 'popular' charsets (ISO-8859-1, 
> UTF-8 plus any others we think are likely to be used)

I like this suggestion and I think this would be a good match for most users.
Popular charsets would be fast, and initial requests to others would take a
slight hit. The `Charset` class itself includes a cache, so this could be as
simple as calling `Charset.forName` a few times.


2) Prevent DoS attacks by limiting the impact of a user requesting an invalid
charset

Looking at the existing JDK `Charset` and `AbstractCharsetProvider`[1] class,
it appears that asking for charsets that don't exist does not have an impact on
memory. When I run the following code, I don't see a large change:

for (int i = 0; i < 1000; i++) {
try {
Charset.forName("missing-" + i);
}
catch (Exception ex) {
}
if (i % 1 == 0) {
System.out.println(Runtime.getRuntime().totalMemory());
System.out.println(Runtime.getRuntime().maxMemory());
System.out.println(Runtime.getRuntime().freeMemory());
System.out.println("--");
}
}

Of course, this might be JVM specific so it may well be worth still protecting
against that. One simple option might be to count lookup misses and after a
certain threshold switch back to a set limited by availableCharsets. Something
like:


public static Charset getCharset(String enc) throws
UnsupportedEncodingException {
Map cache = encodingToCharsetCache;
if (cache != null) {
Charset charset = cache.get(enc.toLowerCase());
if (charset == null) {
throw new UnsupportedEncodingException(
sm.getString("b2cConverter.unknownEncoding", enc));
}
}
try {
return Charset.forName(enc);
} catch (Exception ex) {
int count = charsetMissCount.incrementAndGet();
if (count > THRESHOLD) {
encodingToCharsetCache = populateEncodingCache();
}
throw ex;
}
}


I'm happy to rework the PR if that's a sensible suggestion? Even if it were
opt-int, I think it would very nice to not have to pay the startup cost that
the existing code incurs, especially if you use an embedded Tomcat and restart
frequently during application development.

[1]
http://cr.openjdk.java.net/~sherman/6898310/webrev/src/share/classes/sun/nio/cs/AbstractCharsetProvider.java.html

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #4 from Mark Thomas  ---
Of course. That makes option 2 a non-starter.

The other option isn't that far from the current code. I think it is worth
looking at what it does to memory footprint and start-up time so we can see if
the added complexity is worth the benefit.

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #3 from Remy Maucherat  ---
I think this adds complexity and possible problems for no real benefit. For
example option 2 could have a very big cache (all you need to do is request
random charset values) and could need a lot more sync than a static cache
(which is basically free).

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

Mark Thomas  changed:

   What|Removed |Added

   Severity|normal  |enhancement

--- Comment #2 from Mark Thomas  ---
Lazy loading simply defers the performance hit so a user experiences a delay
during a request rather than it happening at startup. The balance of opinion in
the past has been that it is better to take such a hit on startup rather than
during a request.

This has been discussed previously. For the history see these bugs and the bugs
linked from them:
- bug 57808
- bug 58859

If we take some ideas from each of the various solutions that have been
proposed at various points then I think it might be possible to come up with
something that addresses all of the various concerns:
- retain a cache of charset name to Charset
- populate the keys only from a list of known charsets generated off-line by
  looking at the various JREs (OpenJDK, Oracle, IBM to start with) at system
start
- add a unit test to monitor for new Charsets in new JRE versions
- populate the cache values at start-up for 'popular' charsets (ISO-8859-1,
UTF-8
  plus any others we think are likely to be used)
- lazily populate any cache values for other Charsets
- if a charset is requested, the key is present but the JVM can't create it,
remove the
  key from the cache to speed up any future requests (saves trying to load
non-existent
  Charset again)

As I typed the above I though of perhaps a simpler variation:
- retain a cache of charset name to Charset
- populate the cache values at start-up for 'popular' charsets (ISO-8859-1,
UTF-8
  plus any others we think are likely to be used)
- lazily populate any cache values for other Charsets
- if a requested Charset does not exist, insert a special "NoSuchCharset"
object to speed
  up future requests


The more I think about it, the more I like that second option.

Thoughts?

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-06 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

Andy Wilkinson  changed:

   What|Removed |Added

 CC||awilkin...@pivotal.io

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org



[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

2019-03-05 Thread bugzilla
https://bz.apache.org/bugzilla/show_bug.cgi?id=63235

--- Comment #1 from Phillip Webb  ---
Pull request for consideration: https://github.com/apache/tomcat/pull/142

-- 
You are receiving this mail because:
You are the assignee for the bug.
-
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org