Sorry for the long delay. The Jasig infrastructure migration was taking all my time and I'm just catching up.

You're right and those caches won't make a difference for new users.

At this point I'd recommend sticking a profiler like YourKit on it. With a difference that dramatic, you should be able to fairly easily come up with some candidates between 2.6 and 3.2.

Sorry I don't have more to go on, but I can't think of much that has changed with regard to creating new users in 3.x. Portlet Entity objects have to be created for the user, but that should be pretty fast.

-Eric

On 06/08/2010 05:36 PM, Alex Bragg wrote:
I agree.  If we go to production with 3.1.1, we'll definitely need to increase 
those cache sizes.  I didn't increase all of them mainly because I was just 
looking for any improvement I could find.  I would have continued if I had 
found some.

Unfortunately, for these tests, I don't think those settings are going to help 
much.  I'll still increase them though, if you think they will.  These are all 
brand new users that have never logged into the portal before.  Our worst-case 
scenario is new-user registration, and the testing I'm doing is specifically 
designed to compare how well uPortal 3.1.1 handles that versus 2.6.1.  I'm told 
we need to be able to support up to 45,000 new logins per hour to accommodate 
peak load.
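
To put that target in per-second terms, here is a minimal Python sketch of the arithmetic. The 45,000 logins per hour figure is from the paragraph above. Treating every login as one pass through the 7-request test click path (login page, login, four tabs, logout) is an assumption, and the function name is made up for illustration.

```python
SECONDS_PER_HOUR = 3600

def required_rate(logins_per_hour, requests_per_login):
    """Return (logins per second, requests per second) for an hourly login target."""
    logins_per_sec = logins_per_hour / SECONDS_PER_HOUR
    return logins_per_sec, logins_per_sec * requests_per_login

# 45,000 logins/hour is the stated target; 7 requests per login is an assumption
# based on the JMeter click path used in these tests.
logins_per_sec, requests_per_sec = required_rate(45_000, 7)
print(f"{logins_per_sec:.2f} logins/s, {requests_per_sec:.2f} requests/s")
```

That works out to 12.5 logins per second, or 87.5 requests per second if each new user walked the full click path.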

Here is a 50-user test with 4 tomcats behind the LB with all users that have 
logged in before, and the caches are populated (even as small as they are).  
Clearly, 3.1.1 is very competitive with 2.6.1 in that scenario.  Since 2.6.1 
has more stuff in static caches, I'd be willing to bet that 3.1.1 would be even 
faster with additional ehcache settings as you note.

3.1.1
--------
Label       Samples  Average  Median  90%  Min  Max    Error %  Throughput  KB/sec
Login Page  4000     456      436     711  30   1424   0        11.14       139.91
Login       4000     512      505     778  64   1546   0        11.14       236.15
Tab 1       4000     291      248     528  33   1415   0        11.14       263.41
Tab 2       4000     397      381     664  35   1451   0        11.14       290.95
Tab 3       4000     420      411     683  49   1436   0        11.14       290.92
Tab 4       4000     390      383     634  42   1429   0        11.14       277.58
Logout      4000     405      394     634  34   1484   0        11.14       139.91
TOTAL       28000    410      396     675  30   1546   0        77.85       1636.64

2.6.1
--------
Label       Samples  Average  Median  90%  Min  Max    Error %  Throughput  KB/sec
Login Page  4000     231      215     391  10   945    0        19.16       37.71
Login       4000     440      414     690  27   14258  0        19.16       53.5
Tab 1       4000     179      154     327  11   1363   0        19.15       112.94
Tab 2       4000     210      185     376  19   1703   0        19.15       170.56
Tab 3       4000     225      199     395  17   994    0        19.16       170.39
Tab 4       4000     219      197     374  16   2077   0        19.16       141.43
Logout      4000     295      280     491  15   1215   0        19.16       37.7
TOTAL       28000    257      221     468  10   14258  0        133.91      723.21


Due to that bug with 2.6.1 I mentioned in my original post, I'm only able to gather 
results for users that have previously logged into 2.6.1. It isn't an apples-to-apples 
comparison.  If I could get that fixed, and re-run the tests for 2.6.1, I'm sure I'd be 
closer to saying, "We can replace 2.6.1 with 3.1.1 without having to add additional 
hardware."  For other reasons, we're not able to add new hardware at this time.

Do you have any recommendations or intuitions about what we might tune/analyze 
for new-user logins to make them equal?

Thanks,
Alex



----- Original Message -----
From: "Eric Dalquist"<[email protected]>
To: [email protected]
Sent: Tuesday, June 8, 2010 2:16:59 PM GMT -07:00 U.S. Mountain Time (Arizona)
Subject: Re: [uportal-dev] Fwd: uPortal Peformance

So in the 3.2.1 ehcache.xml file:
https://source.jasig.org/uPortal/tags/rel-3-2-1-GA/uportal-impl/src/main/resources/properties/ehcache.xml

There are comments above each cache describing how it is used. I'm
curious why you didn't increase cache sizes for all of the caches that
pertain to user specific data?

For example, the "org.jasig.portal.portlet.dao.jpa.PortletEntityImpl"
cache has a comment above it that states "1 x subscribed portlet x
user". We have that cache set to 15000 entries per server here at UW,
and we only see peaks of around 100k logins per day. Now that we're
getting new hardware and slowly moving from our old 8 server cluster to
a 4 server cluster, I'll probably increase it even more.

I'd highly recommend going through ehcache.xml and, for every cache
entry that has a "x user" component, making sure it is sized large
enough to actually hold all the data your users need. The fact
that your PortletEntityImpl cache has 0 hits, 13000 misses and 1000
entries makes me think it is way too small. Our current stats from one
machine, covering only 8080 logins over the last ~10 hours, show 203784
hits, 109671 misses and 3547 entries in our PortletEntityImpl cache.

If memory isn't an immediate concern, I'd even say set these cache sizes
to 100x what you think they need to be, then run the test.
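
The "1 x subscribed portlet x user" comment translates directly into a sizing estimate. A rough Python sketch follows. The function name and the headroom factor are hypothetical, and the figure of 13 portlet entities per user is only inferred from the 13000 misses reported for the 1000-user test run, not a measured value.

```python
import math

def portlet_entity_cache_size(users, portlets_per_user, headroom=1.0):
    """Estimate entries needed for a cache sized '1 x subscribed portlet x user'."""
    return math.ceil(users * portlets_per_user * headroom)

# 1000 entries were observed in the PortletEntityImpl cache; assuming that is
# also its configured maximum.
configured = 1000
# 13 entities per user is an inference (13000 misses / 1000 new users).
needed = portlet_entity_cache_size(users=1000, portlets_per_user=13)
print(f"needed={needed}, configured={configured}, short by {needed - configured}")
```

Under those assumptions the cache would need about 13000 entries just to hold the test users, which is consistent with the 0-hit result at 1000 entries.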

-Eric




On 06/08/2010 03:40 PM, Alex Bragg wrote:
OK.  I did some tweaking on ehcache.xml.  First, I ran a baseline with just a thousand 
new users and recorded all the cache numbers.  Next, I doubled the maximum allowed 
elements for the caches listed below that are marked with a "*".  I didn't see 
a reason to change the others.  The first number is hits, the second number is 
misses, and the last is the number of elements in the cache.  The first row is the baseline 
and the second row is after updates.  Long story short, I saw no real difference.  The 
hit ratios were identical.  If anything it got slower.  At this short interval, I believe 
the TTLs have no real effect.

                Page Response   Elapsed
                Time (s)        Run Time
Baseline        2.014           3m10s
After Changes   2.063           3m17s

          Hits    Misses  Objects in cache
PortalStats.org.hibernate.cache.StandardQueryCache
          0       0       0
          0       0       0
PortalStats.org.hibernate.cache.UpdateTimestampsCache
          0       0       0
          0       0       0
PortalStats.org.jasig.portal.events.EventType
          0       0       0
          0       0       0
*org.hibernate.cache.StandardQueryCache
          13000   13007   250
          13000   13007   500
org.hibernate.cache.UpdateTimestampsCache
          0       13000   3
          0       13000   3
org.jasig.portal.ChannelDefinition
          49037   34      16
          49037   34      16
org.jasig.portal.channels.CONTENT_CACHE
          0       0       0
          0       0       0
org.jasig.portal.groups.CompositeEntityIdentifier.NAME_PARSE_CACHE
          554915  20      10
          552151  20      10
org.jasig.portal.groups.IEntity
          71277   2004    2
          71002   2004    2
org.jasig.portal.groups.IEntityGroup
          278460  8       4
          277078  8       4
org.jasig.portal.layout.dlm.Evaluator
          0       0       0
          0       0       0
org.jasig.portal.layout.dlm.LAYOUT_CACHE
          208092  6049    1
          208085  6001    1
org.jasig.portal.portlet.dao.jpa.PortletDefinitionImpl
          8000    0       5
          8000    0       5
org.jasig.portal.portlet.dao.jpa.PortletEntityImpl
          0       13000   1000
          0       13000   1000
org.jasig.portal.portlet.dao.jpa.PortletPreferenceImpl
          24000   0       9
          24000   0       9
org.jasig.portal.portlet.dao.jpa.PortletPreferenceImpl.values
          24000   0       9
          24000   0       9
org.jasig.portal.portlet.dao.jpa.PortletPreferencesImpl
          8002    5       1100
          8002    5       1100
org.jasig.portal.portlet.dao.jpa.PortletPreferencesImpl.portletPreferences
          8002    13000   1100
          8002    13000   1100
*org.jasig.portal.security.IPermissionSet
          219009  3019    1000
          218037  3020    1005
*org.jasig.portal.security.provider.AuthorizationImpl.AUTH_PRINCIPAL_CACHE
          349622  2010    150
          348058  2010    300
*org.jasig.portal.utils.ResourceLoader.RESOURCE_URL_CACHE
          67869   52032   0
          67901   52045   22
*org.jasig.portal.utils.ResourceLoader.RESOURCE_URL_NOT_FOUND_CACHE
          51964   67      0
          51970   86      16
org.jasig.portal.utils.cache.ConfigurablePageCachingFilter.PAGE_CACHE
          0       0       0
          0       0       0
*org.jasig.services.persondir.USER_INFO.merged
          3003    4003    0
          3003    4003    1
*org.jasig.services.persondir.USER_INFO.up_person_dir
          1001    4003    0
          1001    4003    1
*org.jasig.services.persondir.USER_INFO.up_user
          1001    4003    0
          1001    4003    1
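
For reading a table like this, the hit ratio per cache is hits / (hits + misses). A small Python sketch, using a few of the "after changes" rows above as sample data. The function name and the 0.5 threshold are arbitrary choices for illustration, not uPortal or Ehcache settings.

```python
def hit_ratio(hits, misses):
    """Fraction of lookups served from the cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0

# (hits, misses, objects in cache), copied from the second row of each entry
stats = {
    "org.hibernate.cache.StandardQueryCache": (13000, 13007, 500),
    "org.jasig.portal.ChannelDefinition": (49037, 34, 16),
    "org.jasig.portal.portlet.dao.jpa.PortletEntityImpl": (0, 13000, 1000),
    "org.jasig.portal.security.IPermissionSet": (218037, 3020, 1005),
}

LOW_RATIO = 0.5  # arbitrary cut-off for "worth a closer look"
for name, (hits, misses, size) in stats.items():
    ratio = hit_ratio(hits, misses)
    flag = "  <-- low" if ratio < LOW_RATIO else ""
    print(f"{ratio:6.1%}  size={size:<5} {name}{flag}")
```

With brand new users, a low ratio on a per-user cache is expected on first login, so the ratio is mainly useful for comparing runs rather than as an absolute target.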

----- Original Message -----
From: "Eric Dalquist"<[email protected]>
To: [email protected]
Sent: Tuesday, June 8, 2010 7:35:23 AM GMT -07:00 U.S. Mountain Time (Arizona)
Subject: Re: [uportal-dev] Fwd: uPortal Peformance

I had replied with this on the uportal-user list:

For the performance my first pointer would be to
uportal-impl/src/main/resources/properties/ehcache.xml

In each release of uPortal we've been moving more and more data out of
static caches and the user session into Ehcache. I'm not sure it's out in
a released version yet but I recently did some review of the default
cache config and tuning here at UW and checked in an updated config file
that at least has comments describing how each cache is used.

Also all of the cache statistics are available via JMX. I'd recommend
that you monitor those as you're doing your load testing and see which
caches are filling up and which have poor hit rates. Tuning the size and
TTLs of the caches should do a lot to reduce database IO and load times.

So I guess I'd be very interested to have you do some basic tuning in
ehcache then re-run the tests and watch the caches to see if they are
both large enough and have appropriate TTLs for your usage patterns.

-Eric

On 06/08/2010 12:13 AM, Alex Bragg wrote:

Hello,

I'm doing some performance testing, and I could use some hints on a couple of 
issues.  First, I'm looking for some hints on things I can tweak in 3.1.1/3.2.1 
to improve performance under heavy load.  Second, I'm hitting a bug in 2.6.1 
that is preventing me from gathering solid baseline performance numbers, and 
perhaps someone else has seen it.  Let me explain in further detail.

We have been preparing for an upgrade of our production systems from uPortal 
2.6.1 to uPortal 3.x.  Currently, we're looking at two 3.x versions, 3.1.1 and 
3.2.1.  In my development environment, I have installed 2.6.1, 3.1.1, and 
3.2.1.  My 2.6.1 install is running out of a 5.5.28 Tomcat, and my 3.x versions 
are running in a 6.0.24 Tomcat.  All versions are running under Java version 
1.6.0_12-b04, 64-bit, and I have an Oracle 11gR2 database backing them.

The layout in each instance is a simple 5-tab layout, with nothing on the 
default tab.  I have a custom testing portlet that simply executes a SQL query 
5, 10, or 15 times and renders a 3-line text output.  On the remaining four 
tabs, I have mixtures of two or more of these testing portlets.  I run tests 
with JMeter, and the click path is get login page, login, click tab 2, click 
tab 3, click tab 4, click tab 5, and logout.  JMeter verifies each page renders 
properly.  The tests I run execute this click path 4000 times spread across 1, 
4, 50, and 200 threads, and there are no waits built into the scripts.

Here are results from the tests I have run so far.  The values are the 90th percentile 
page-response time in seconds.  Please note that the number for 2.6.1 in the 200-thread 
column isn't valid.  At the 200-thread level most of the 200 threads complete their 20 
iterations before JMeter starts additional threads during ramp-up.  I end up with no more 
than 4 or 5 threads running concurrently.  Another thing that skews these numbers is that 
I can only get valid results using users that have successfully logged in before.  
Anything above 2 threads with users that have not previously logged in results in 
channels failing to render (with the message "You are not authorized to view this 
channel").

version  1     4     50    200     50-lb2  200-lb2  50-lb4  200-lb4
2.6.1    0.07  0.08  0.7   *0.08*  0.69    4.56
3.1.1    0.09  0.09  1.96  7.81    1.18    6.02     1.12    5.49
3.2.1    0.17  0.18  7.04  26.43   6.17    20.22

The "lb2" and "lb4" designators signify that I have started multiple Tomcats on 
the server, 2 for lb2 and 4 for lb4, and I'm balancing load with HAProxy.  I see much better 
utilization on the server, and page-response times and elapsed test run times (below) both 
improve significantly even though I have not added any hardware.

This table shows the elapsed time in seconds to complete the above tests.

version  1      4    50     200    50-lb2   200-lb2  50-lb4  200-lb4
2.6.1    934    454  216    212    209.09   263.43
3.1.1    1,537  462  495    813    386.92   660.39   421.41  414.32
3.2.1    3,299  862  1,999  3,958  1259.99  2636.8

Basically, what I see here is that at low concurrency 2.6.1 and 3.1.1 are 
fairly comparable, and 3.2.1 is noticeably slower.  At 50 threads and above, I 
see that 2.6.1 is much faster than 3.x.  I also see that at very high loads, 
3.x seems to have a point where it just falls over the edge of a cliff.
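
The size of the gap can be read off the 90th-percentile table as a ratio. A minimal Python sketch using only the columns where both 2.6.1 and 3.1.1 have a value, and leaving out the 200-thread column because the 2.6.1 number there is flagged as invalid. The variable and function names are illustrative.

```python
# 90th-percentile page-response times in seconds, from the table above
p90 = {
    "2.6.1": {"1": 0.07, "4": 0.08, "50": 0.7, "50-lb2": 0.69, "200-lb2": 4.56},
    "3.1.1": {"1": 0.09, "4": 0.09, "50": 1.96, "50-lb2": 1.18, "200-lb2": 6.02},
}

def slowdown(baseline, candidate):
    """Ratio candidate/baseline per column present in both; > 1 means slower."""
    return {col: candidate[col] / baseline[col] for col in baseline if col in candidate}

for col, ratio in slowdown(p90["2.6.1"], p90["3.1.1"]).items():
    print(f"{col:>8}: 3.1.1 is {ratio:.2f}x the 2.6.1 time")
```

Keep in mind that these 2.6.1 numbers come from users who had logged in before, so the ratios mix a version difference with a new-user versus returning-user difference.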

Part of that I'm sure is the change in page sizes.  Here are the page sizes 
JMeter reports (this does not include embedded resources).

                2.6.1       3.1.1       3.2.1
                Avg. Bytes  Avg. Bytes  Avg. Bytes
Login Page      2014.93     12865       23963
Login           2958.61     21716       21909
Tab 1           5950.05     24221.21    24656
Tab 2           8840.34     26755.38    27430
Tab 3           8835.95     26753.3     27428
Tab 4           7380.03     25525.27    26068
Logout          2014.94     12865       23963
TOTAL           5427.84     21528.74    25059.57

So, back to my two questions.

1. What has changed in 3.1.1 that might explain a significant (at least 2x) 
slowdown under load?  To me it feels like 2.6.1 is caching rendered elements 
to a much greater degree than 3.1.1.  What can I tweak to improve this?

2. Is anyone aware of something I can change to fix the behavior with new 
logins in 2.6.1 to prevent this issue with channels not authorized?

Thanks,
Alex Bragg
Unicon, Inc.





