Hi!

My view of what was happening may not be accurate, as this was the
first time I was touching this part of the cluster ;-)

1. ms1 had just 128 web threads configured, each of which could be
occupied by looking up a file, serving it (and blocking if Squid
doesn't consume it fast enough), or blocking on fastcgi handlers
2. even though ms1's I/O was loaded, it wasn't loaded enough to justify
20s waits on empty-file creation via NFS
3. read operations via NFS were relatively fast compared with write
operations
4. pybal was quite aggressive in depooling scalers
5. scalers were in one of two states, blocked on NFS writes or
depooled; a stampede of requests would hit the one server still
pooled while all the others sat depooled
6. if a scaler was depooled, not only would it not get requests, it
also couldn't write its output back (my assumption, based on their
frozen states, not a verified fact)
7. due to 4-6, ms1 HTTP threads would be stuck in the 404-handler
fastcgi (waiting for scalers to respond), thus also blocking the way
out for existing files
8. ms1 was spending lots of CPU in the zfs`metaslab_alloc() tree (the
full tree, if anyone is interested, is at http://flack.defau.lt/ms1.svg
-- use FF3.5, then you can search for metaslab_alloc in it; a sketch of
the sampling one-liner follows this list)
9. some digging around the internals (it is amazing how I forgot
pretty much everything I learned in 6h of ZFS sessions last
November ;-) showed that the costs could have been inflated by the
number of snapshots we keep.
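
For the curious, kernel CPU profiles like the one in (8) -- and the
stacks in the P.S. -- come from DTrace stack sampling. A one-liner
along these lines produces them (this is the generic recipe, not
necessarily the literal command that was run on ms1):

  dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
             tick-60s { trunc(@, 20); exit(0); }'

It samples on-CPU kernel stacks ~997 times a second per CPU for a
minute, then prints the 20 hottest stacks with their sample counts.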

What was done:

1. Increased the number of worker threads on ms1 (why the heck was it
that small anyway?)
2. Made the balancer way less eager to depool servers (thanks, sir Mark)
3. Disabled the ZIL (it didn't have much of the expected effect though,
as the problem was elsewhere)
4. Dropped a few of the oldest snapshots, targeting the
metaslab_alloc() issue (see the sketch after this list).
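
For the record, a rough sketch of what these look like at the
command/config level. Every name and number here is illustrative --
I'm reconstructing from standard tooling, not quoting ms1's actual
configs:

  # 1. raise the Apache worker-MPM thread ceiling (stock Apache 2.x
  #    directives; ms1's real httpd.conf may differ):
  #      ServerLimit      16
  #      ThreadsPerChild  64
  #      MaxClients       1024

  # 2. pybal: raise depool-threshold, the minimum fraction of servers
  #    it must keep pooled (option name from memory of pybal's config)
  #      depool-threshold = .8

  # 3. disable the ZIL via the old Solaris zil_disable tunable; this
  #    only applies to filesystems mounted after the change, and needs
  #    a matching entry in /etc/system to survive a reboot
  echo 'zil_disable/W 1' | mdb -kw

  # 4. list snapshots oldest-first, then destroy the oldest few
  #    (dataset and snapshot names are made up)
  zfs list -t snapshot -o name,creation -s creation | head
  zfs destroy export/upload@2008-11-01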

Cheers,
Domas

P.S. For anyone Solaris-savvy (I am not, despite where I work), you
know what this means:

               unix`mutex_delay_default+0x7
               unix`mutex_vector_enter+0x99
               genunix`cv_wait+0x70
               zfs`space_map_load_wait+0x20
               zfs`space_map_load+0x36
               zfs`metaslab_activate+0x6f
               zfs`metaslab_group_alloc+0x18d
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
              3816

               unix`mutex_delay_default+0xa
               unix`mutex_vector_enter+0x99
               genunix`cv_wait+0x70
               zfs`space_map_load_wait+0x20
               zfs`space_map_load+0x36
               zfs`metaslab_activate+0x6f
               zfs`metaslab_group_alloc+0x18d
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
              4068

               unix`mutex_delay_default+0xa
               unix`mutex_vector_enter+0x99
               zfs`metaslab_group_alloc+0x136
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_write_allocate_gang_members+0x171
               zfs`zio_dva_allocate+0xcc
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
              4615

              7500

               unix`mutex_delay_default+0x7
               unix`mutex_vector_enter+0x99
               zfs`metaslab_group_alloc+0x136
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
              7785

               unix`mutex_delay_default+0xa
               unix`mutex_vector_enter+0x99
               zfs`metaslab_group_alloc+0x136
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
             10487

               genunix`avl_walk+0x39
               zfs`space_map_alloc+0x21
               zfs`metaslab_group_alloc+0x1a2
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_write_allocate_gang_members+0x171
               zfs`zio_dva_allocate+0xcc
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
             16297

               genunix`avl_walk+0x39
               zfs`space_map_alloc+0x21
               zfs`metaslab_group_alloc+0x1a2
               zfs`metaslab_alloc_dva+0xdb
               zfs`metaslab_alloc+0x6d
               zfs`zio_dva_allocate+0x62
               zfs`zio_execute+0x60
               genunix`taskq_thread+0xbc
               unix`thread_start+0x8
             26149
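
Reading the stacks: the hot paths are either spinning on a mutex under
metaslab_group_alloc() or walking the space-map AVL tree in
space_map_alloc() -- the allocator hunting for free segments. The
zio_write_allocate_gang_members() frames suggest the pool was
fragmented enough to resort to gang blocks. To get a latency
distribution instead of raw stack counts, something along these lines
should work (written from memory, not tested on ms1):

  dtrace -n '
    fbt:zfs:metaslab_alloc:entry  { self->ts = timestamp; }
    fbt:zfs:metaslab_alloc:return /self->ts/ {
      @["metaslab_alloc (ns)"] = quantize(timestamp - self->ts);
      self->ts = 0;
    }'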


