Re: [dtrace-discuss] How to dig deeper
The buffer cache was already huge, so I decided not to increase it. There is a KEEP pool of 5G which is hardly used; if needed I will sacrifice this cache and add it to the DEFAULT cache. Until now it looks promising: the average log file sync wait time has dropped from about 70 ms to 7 ms. But we will see how things develop. Thanks so far to everybody who helped.

Regards,
Hans-Peter

-- This message posted from opensolaris.org
___
dtrace-discuss mailing list
dtrace-discuss@opensolaris.org
Re: [dtrace-discuss] How to dig deeper
On Fri, Dec 05, 2008 at 05:40:19AM -0800, Hans-Peter wrote:
> Ok, thanks a lot so far. We have planned downtime for remounting the
> filesystems this weekend. We will see what happens.

You can remount online:

  mount -o remount,forcedirectio /mount_point
  mount -o remount,noforcedirectio /mount_point

Regards,
Przemyslaw Bak (przemol)
http://przemol.blogspot.com/
Re: [dtrace-discuss] How to dig deeper
Do you have directio enabled on UFS? Especially for the redo logs?

With directio enabled, UFS writes to the log do not serialize on the RW lock for the log file(s). directio will also bypass the memory cache, so you need to increase the Oracle db_block_buffers when enabling UFS directio.

Thanks,
/jim

Hans-Peter wrote:
> Hi Mauro,
> Yes, I understand why sysinfo is not the best way to measure IO. But I just
> wanted to see where in the whole trace the actual physical write was being
> done. So it seems to me that, because the sysinfo:::pwrite is right at the
> end, the performance bottleneck is more due to the locking etc.
> The database files are on UFS filesystems. Is ZFS better?
> Regards,
> Hans-Peter
Re: [dtrace-discuss] How to dig deeper
On Thu, Dec 04, 2008 at 04:55:04AM -0800, Hans-Peter wrote:
>> Do you have directio enabled on UFS? Especially for the redo logs?
>
> That is the strange thing: filesystemio_options has been set to setall,
> which enables both async IO and directio. But when I dtrace the fbt calls
> I see directio calls only for the database writer, not for the log writer.
> Perhaps we should mount the filesystems with forcedirectio.

forcedirectio forces direct IO on _all_ files, while 'setall' lets the database decide which files should be accessed with direct IO. setall means the database retains the ability to use buffered access for some files, e.g. log files (which is better). forcedirectio removes that choice and the database is forced to use direct access for all files. So if you have Oracle, use setall. If you have a non-Oracle database without a setall-like option, use forcedirectio.

> Would noatime also help?

Usually setall/forcedirectio gives you enough.

Regards,
Przemyslaw Bak (przemol)
http://przemol.blogspot.com/
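[One way to check this empirically is to count which processes actually enter the UFS directio write path. The sketch below is untested and assumes the kernel function is named ufs_directio_write on your build; confirm first with `dtrace -l -f ufs_directio_write`.]

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: count calls into the UFS directio write path, per process,
 * for 10 seconds. If the DB writer shows up here but the log writer
 * never does, setall is not giving the redo logs direct IO.
 */
fbt::ufs_directio_write:entry
{
        @dio[execname, pid] = count();
}

tick-10s
{
        printa("%-16s %8d %@8u\n", @dio);
        exit(0);
}
```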
Re: [dtrace-discuss] How to dig deeper
Hi Hans-Peter,

There is an Oracle bug, not fixed until 10.2.0.4 (if I remember right), where filesystemio_options is broken and does not do the correct thing with directio and UFS. The locks you are seeing are classic single-writer-lock issues and indicate that you aren't getting directio. The Oracle bug number is Oracle Bug ID 5752399 (Oracle setall w/ UFS). You might check to be sure you have the fix for that. The only workaround is to use forcedirectio in the mount options, which, as has been pointed out, has other issues for some activities, so getting the bug fixed would be a better idea.

Regards,
Dave Miller

Hans-Peter wrote, on 12/04/08 06:55:
>> Do you have directio enabled on UFS? Especially for the redo logs?
>
> That is the strange thing: filesystemio_options has been set to setall,
> which enables both async IO and directio. But when I dtrace the fbt calls
> I see directio calls only for the database writer, not for the log writer.
> Perhaps we should mount the filesystems with forcedirectio. Would noatime
> also help?
Re: [dtrace-discuss] How to dig deeper
For the record, my friend Phil Harman reminded me that it's not the single-writer lock break-up we care about for the redo logs. We care about directio for the redo logs to avoid read-modify-write, which happens when a write is not memory-page aligned. Sorry about that. Nonetheless, you should still pursue getting UFS directio enabled and working on your system; dealing with the single-writer lock problem is still important, it's just not directly applicable to the redo logs.

Thanks,
/jim
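[Phil's alignment point can be checked directly. The sketch below is untested; it assumes an 8K page size and that you pass the log writer's pid as the script argument, and it uses the documented pwrite(2) arguments (arg2 = size, arg3 = file offset).]

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: classify the given pid's pwrite() calls by whether the
 * write size and file offset are 8K-page aligned, and show the
 * size distribution. Per the discussion above, non-page-aligned
 * writes are the ones that incur read-modify-write on buffered UFS.
 */
syscall::pwrite:entry
/pid == $1/
{
        @align[arg2 % 8192 == 0 ? "aligned size" : "unaligned size",
               arg3 % 8192 == 0 ? "aligned offset" : "unaligned offset"] = count();
        @sizes = quantize(arg2);
}
```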
Re: [dtrace-discuss] How to dig deeper
%busy is meaningless unless you're looking at a single disk that can only have one outstanding IO in its active queue, which is to say %busy is a useless metric for any disk designed and built in the last decade. Ignore %busy. Focus on queue depths and queue service times, both of which are reported by iostat. And again, averages can be misleading. Use the iostat data as a starting point, but if you really think you have some disk IO issues, use DTrace and aggregations to get more precise data on disk IO times.

Thanks,
/jim

Hans-Peter wrote:
> Hello Jim,
> The iostat averages are not too bad, but %busy is 100% most of the time.
> Regards,
> Hans-Peter
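[A minimal sketch of the kind of aggregation Jim describes, along the lines of the /usr/demo/dtrace scripts; it uses the io provider's documented probe arguments (args[1] is the devinfo_t).]

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: distribution of physical IO completion times per device,
 * in nanoseconds. Unlike iostat's averages, quantize() exposes the
 * nasty peaks.
 */
io:::start
{
        start[arg0] = timestamp;
}

io:::done
/start[arg0]/
{
        @times[args[1]->dev_statname] = quantize(timestamp - start[arg0]);
        start[arg0] = 0;
}
```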
Re: [dtrace-discuss] How to dig deeper
It seems VOP_RWLOCK takes a long time. ufs_rwlock starts at line 12, but ends at line 15611. Is there another application that keeps reading/writing on your system?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hans-Peter
Sent: December 3, 2008, 21:41
To: dtrace-discuss@opensolaris.org
Subject: Re: [dtrace-discuss] How to dig deeper

Hello all,

I added a clause to my script:

sysinfo:::
/self->traceme == 1 && pid == $1/
{
        trace(execname);
        printf(" sysinfo: timestamp : %d ", timestamp);
}

A subsequent trace created a file of about 19000 lines. I loaded it in Excel to be able to subtract timestamps etc. The longest jump in timestamp is between the first pair of savectx and restorectx at line 70. I count 50 savectx/restorectx calls in my trace. But the actual physical write, as indicated by sysinfo, is almost at the end of the file, directly after the ufs_write and fop_write calls end (line 18746):

24  - tsd_agent_get             oracle timestamp 1795469946839100
24  - ufs_lockfs_top_vop_return oracle timestamp 1795469946841500
24  - ufs_lockfs_top_vop_return oracle timestamp 1795469946844300
24  - ufs_lockfs_end            oracle timestamp 1795469946846700
24  - ufs_write                 oracle timestamp 1795469946849600
24  - fop_write                 oracle timestamp 1795469946853500  57,365,700
24  | pwrite syswrite  oracle sysinfo timestamp 1795469946856800
24  | pwrite writech   oracle sysinfo timestamp 1795469946860200

So it seems that the actual write is not the bottleneck. I attached a zip file with the Excel document. Am I right in thinking that this is more an OS issue than a storage issue?

Regards,
Hans-Peter
Re: [dtrace-discuss] How to dig deeper
The sysinfo provider isn't the best choice for measuring disk IO times. Run:

# dtrace -s /usr/demo/dtrace/iotime.d

/jim

Hans-Peter wrote:
> So it seems that the actual write is not the bottleneck [...] Am I right
> in thinking that this is more an OS issue than a storage issue?
Re: [dtrace-discuss] How to dig deeper
Also, since this is Oracle: are the Oracle files on a file system, or on raw devices? If a file system, which one?

/jim

Jim Mauro wrote:
> The sysinfo provider isn't the best choice for measuring disk IO times.
> Run: # dtrace -s /usr/demo/dtrace/iotime.d
Re: [dtrace-discuss] How to dig deeper
Hi Lejun,

No, there is only one Oracle database running on this system.

Regards,
Hans-Peter

> It seems VOP_RWLOCK takes a long time. ufs_rwlock starts at line 12, but
> ends at line 15611. Is there another application that keeps reading/writing
> on your system?
Re: [dtrace-discuss] How to dig deeper
Hi Mauro,

Yes, I understand why sysinfo is not the best way to measure IO. But I just wanted to see where in the whole trace the actual physical write was being done. So it seems to me that, because the sysinfo:::pwrite is right at the end, the performance bottleneck is more due to the locking etc.

The database files are on UFS filesystems. Is ZFS better?

Regards,
Hans-Peter
Re: [dtrace-discuss] How to dig deeper
Start with iostat. It's simple, and provides an average of service times for disk IOs (iostat -xnz 1; the asvc_t column is average service time in milliseconds). As Jim Litchfield pointed out in a previous thread, keep in mind it is an average, so you won't see nasty peaks, but if the average is bad, it's time to talk to the SAN/storage folks.

Use DTrace with the IO provider to get more precise measurements. You can borrow from existing scripts in /usr/demo/dtrace to add time stamps, and use aggregations to capture IO times. If you need assistance with such a script, please let us know and we'll be glad to post one (it looks like you're already pretty comfortable with DTrace...).

Thanks,
/jim

Hans-Peter wrote:

In order to get more information on IO performance problems I created the script below:

#!/usr/sbin/dtrace -s
#pragma D option flowindent

syscall::*write*:entry
/pid == $1 && guard++ == 0/
{
        self->ts = timestamp;
        self->traceme = 1;
        printf("fd: %d", arg0);
}

fbt:::
/self->traceme/
{
        /*
        elapsed = timestamp - self->ts;
        printf(" elapsed : %d ", elapsed);
        */
        printf(" timestamp : %d ", timestamp);
}

syscall::*write*:return
/self->traceme/
{
        self->traceme = 0;
        elapsed = timestamp - self->ts;
        printf(" timestamp : %d ", timestamp);
        printf("\telapsed : %d ", elapsed);
        exit(0);
}

It gives me the timestamp for every fbt call during a write system call. A snippet is below:

8  - schedctl_save      timestamp : 1627201334052600
8  - savectx            timestamp : 1627201334053000
0  - restorectx         timestamp : 1627202741110300   difference = 1,407,057,300
0  - schedctl_restore   timestamp : 1627202741115100
0  - schedctl_restore   timestamp : 1627202741116100

Visible is that the thread is off-CPU for 1.4 s. Storage is on a SAN, with fibre between the system and the storage. Is it possible to dig deeper with DTrace to see how the HBAs are performing? Other suggestions are welcome too.

Regards,
Hans-Peter
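[A follow-up sketch in the same spirit as the script above: instead of printing every fbt probe and post-processing 19,000 lines in Excel, let an aggregation sum the time spent in each kernel function during the traced writes. Untested; enabling every fbt entry/return probe is expensive, so run it only briefly.]

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: sum time per kernel function called while the target pid
 * is inside a write syscall. The aggregation prints sorted on exit,
 * so the biggest time sinks (e.g. the rwlock path) appear last.
 */
syscall::*write*:entry
/pid == $1/
{
        self->in_write = 1;
}

fbt:::entry
/self->in_write/
{
        self->fts[probefunc] = timestamp;
}

fbt:::return
/self->fts[probefunc]/
{
        @ns[probefunc] = sum(timestamp - self->fts[probefunc]);
        self->fts[probefunc] = 0;
}

syscall::*write*:return
/self->in_write/
{
        self->in_write = 0;
}
```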