[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig changed:

           What    |Removed |Added
             Status|WAITING |RESOLVED
         Resolution|---     |FIXED

--- Comment #42 from Thomas Koenig ---
Resolved, closing.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #41 from Thomas Koenig ---
Author: tkoenig
Date: Tue Jul 23 08:57:45 2019
New Revision: 273727

URL: https://gcc.gnu.org/viewcvs?rev=273727&root=gcc&view=rev
Log:
2019-07-23  Thomas König

        Backport from trunk
        PR libfortran/91030
        * gfortran.texi (GFORTRAN_FORMATTED_BUFFER_SIZE): Document.
        (GFORTRAN_UNFORMATTED_BUFFER_SIZE): Likewise.

2019-07-23  Thomas König

        Backport from trunk
        PR libfortran/91030
        * io/unix.c (BUFFER_SIZE): Delete.
        (BUFFER_FORMATTED_SIZE_DEFAULT): New variable.
        (BUFFER_UNFORMATTED_SIZE_DEFAULT): New variable.
        (unix_stream): Add buffer_size.
        (buf_read): Use s->buffer_size instead of BUFFER_SIZE.
        (buf_write): Likewise.
        (buf_init): Add argument unformatted.  Handle block sizes
        for unformatted vs. formatted, using defaults if provided.
        (fd_to_stream): Add argument unformatted in call to buf_init.
        * libgfortran.h (options_t): Add buffer_size_formatted and
        buffer_size_unformatted.
        * runtime/environ.c (variable_table): Add
        GFORTRAN_UNFORMATTED_BUFFER_SIZE and
        GFORTRAN_FORMATTED_BUFFER_SIZE.

Modified:
    branches/gcc-9-branch/gcc/fortran/ChangeLog
    branches/gcc-9-branch/gcc/fortran/gfortran.texi
    branches/gcc-9-branch/libgfortran/ChangeLog
    branches/gcc-9-branch/libgfortran/io/unix.c
    branches/gcc-9-branch/libgfortran/libgfortran.h
    branches/gcc-9-branch/libgfortran/runtime/environ.c
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #40 from Thomas Koenig ---
Author: tkoenig
Date: Sun Jul 21 15:55:49 2019
New Revision: 273643

URL: https://gcc.gnu.org/viewcvs?rev=273643&root=gcc&view=rev
Log:
2019-07-21  Thomas König

        PR libfortran/91030
        * gfortran.texi (GFORTRAN_FORMATTED_BUFFER_SIZE): Document.
        (GFORTRAN_UNFORMATTED_BUFFER_SIZE): Likewise.

2019-07-21  Thomas König

        PR libfortran/91030
        * io/unix.c (BUFFER_SIZE): Delete.
        (BUFFER_FORMATTED_SIZE_DEFAULT): New variable.
        (BUFFER_UNFORMATTED_SIZE_DEFAULT): New variable.
        (unix_stream): Add buffer_size.
        (buf_read): Use s->buffer_size instead of BUFFER_SIZE.
        (buf_write): Likewise.
        (buf_init): Add argument unformatted.  Handle block sizes
        for unformatted vs. formatted, using defaults if provided.
        (fd_to_stream): Add argument unformatted in call to buf_init.
        * libgfortran.h (options_t): Add buffer_size_formatted and
        buffer_size_unformatted.
        * runtime/environ.c (variable_table): Add
        GFORTRAN_UNFORMATTED_BUFFER_SIZE and
        GFORTRAN_FORMATTED_BUFFER_SIZE.

Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/gfortran.texi
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/io/unix.c
    trunk/libgfortran/libgfortran.h
    trunk/libgfortran/runtime/environ.c
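
To illustrate what the two new environment variables amount to, here is a minimal sketch of reading a buffer size from the environment with a fallback. It is not the code in runtime/environ.c; the function name buffer_size_from_env and the fallback argument are hypothetical and chosen only for this example.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not libgfortran code: return the positive decimal
   value of environment variable NAME, or FALLBACK if the variable is
   unset, empty, malformed or not positive.  */
static long
buffer_size_from_env (const char *name, long fallback)
{
  const char *text = getenv (name);
  char *end;
  long value;

  if (text == NULL || *text == '\0')
    return fallback;

  errno = 0;
  value = strtol (text, &end, 10);
  if (errno != 0 || *end != '\0' || value <= 0)
    return fallback;

  return value;
}
```

A caller would use it along the lines of buffer_size_from_env ("GFORTRAN_UNFORMATTED_BUFFER_SIZE", 8192), where 8192 merely stands in for whatever default the library applies.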
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #39 from Janne Blomqvist ---
Now, with the fixed benchmark in the previous comment, on a Lustre (version 2.5) system I get:

Test using 25000 bytes
Block size of file system: 4096
bs =     1024, 53.27 MiB/s
bs =     2048, 73.99 MiB/s
bs =     4096, 222.41 MiB/s
bs =     8192, 351.38 MiB/s
bs =    16384, 483.86 MiB/s
bs =    32768, 583.76 MiB/s
bs =    65536, 677.11 MiB/s
bs =   131072, 748.60 MiB/s
bs =   262144, 700.69 MiB/s
bs =   524288, 811.76 MiB/s
bs =  1048576, 1032.99 MiB/s
bs =  2097152, 1034.03 MiB/s
bs =  4194304, 1063.74 MiB/s
bs =  8388608, 1030.15 MiB/s
bs = 16777216, 1084.82 MiB/s
bs = 33554432, 1067.05 MiB/s
bs = 67108864, 1063.79 MiB/s

On the same system, on an NFS filesystem connected with Infiniband I get:

Test using 25000 bytes
Block size of file system: 1048576
bs =     1024, 301.41 MiB/s
bs =     2048, 351.51 MiB/s
bs =     4096, 471.39 MiB/s
bs =     8192, 444.61 MiB/s
bs =    16384, 510.88 MiB/s
bs =    32768, 527.99 MiB/s
bs =    65536, 516.57 MiB/s
bs =   131072, 481.38 MiB/s
bs =   262144, 514.29 MiB/s
bs =   524288, 462.06 MiB/s
bs =  1048576, 528.30 MiB/s
bs =  2097152, 526.76 MiB/s
bs =  4194304, 501.09 MiB/s
bs =  8388608, 493.61 MiB/s
bs = 16777216, 550.24 MiB/s
bs = 33554432, 532.20 MiB/s
bs = 67108864, 532.82 MiB/s

So for Lustre, a buffer size bigger than the current 8 kB at least seems justified. While Lustre sees improvements all the way to a 1 MB buffer size, making buffers that large the default seems a bit excessive.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
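https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #38 from Janne Blomqvist ---
First, I think there's a bug in the benchmark in comment #20. It writes blocksize * sizeof(double), but then advances only blocksize for each iteration of the loop. Fixed version writing just bytes below:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/statvfs.h>

double walltime (void)
{
  struct timeval TV;
  double elapsed;
  gettimeofday(&TV, NULL);
  elapsed = (double) TV.tv_sec + 1.0e-6*((double) TV.tv_usec);
  return elapsed;
}

#define NAME "out.dat"
#define N 25000

int main()
{
  int fd;
  unsigned char *p, *w;
  long i, size, blocksize, left, to_write;
  int bits;
  double t1, t2;
  struct statvfs buf;

  printf ("Test using %ld bytes\n", (long) N);
  statvfs (".", &buf);
  printf ("Block size of file system: %ld\n", buf.f_bsize);

  p = malloc(N * sizeof (*p));
  /* The archive lost the header names above and the text between
     "for (i=0; i" and "0)".  The includes, the fill loop, the
     block-size loop (1024 up to 67108864, as in the reported output)
     and the open() call are reconstructed, not original.  */
  for (i=0; i<N; i++)
    p[i] = i;

  for (bits = 10; bits < 27; bits++)
    {
      blocksize = 1L << bits;
      printf ("bs = %8ld, ", blocksize);
      t1 = walltime ();
      fd = open (NAME, O_WRONLY | O_CREAT | O_TRUNC, 0666);
      left = N;
      w = p;
      while (left > 0)
        {
          if (left >= blocksize)
            to_write = blocksize;
          else
            to_write = left;

          /* Write only what is left, so the last write stays inside
             the buffer.  */
          write (fd, w, to_write);
          w += to_write;
          left -= to_write;
        }
      close (fd);
      t2 = walltime ();
      printf ("%.2f MiB/s\n", N / (t2-t1) / 1048576);
    }
  free (p);
  unlink (NAME);

  return 0;
}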
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #37 from Janne Blomqvist --- One thing we could do would be to switch to pread and pwrite instead of using lseek. That would avoid a few syscalls when updating the record length marker. Though I guess the issue with GPFS isn't directly related to the number of syscalls?
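
As a sketch of the pwrite idea, assuming a 4-byte record-length marker at a known file offset: the marker is patched in place without moving the file position, so no lseek before and after is needed. The function name patch_record_marker is hypothetical and this is not libgfortran code; byte order of the marker is left as the host's.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite the 4-byte record-length marker at
   MARKER_OFFSET with LENGTH.  pwrite() takes the offset as an argument
   and leaves the current file position of FD unchanged.
   Returns 0 on success, -1 on error or short write.  */
static int
patch_record_marker (int fd, off_t marker_offset, uint32_t length)
{
  ssize_t n = pwrite (fd, &length, sizeof length, marker_offset);
  return n == (ssize_t) sizeof length ? 0 : -1;
}
```

Compared with the lseek/write/lseek/write sequence visible in the strace output of comment #11, this needs one system call per marker.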
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Janne Blomqvist changed:

           What    |Removed |Added
                 CC|        |jb at gcc dot gnu.org

--- Comment #36 from Janne Blomqvist ---
I have access to a system with Lustre, which is another parallel file system popular in HPC. Unfortunately I don't have gcc trunk set up there, but I can easily test the C benchmark; give me a day or two.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #35 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #34)
> There is another point to consider.
>
> I suppose not very many people use big-endian data formats
> these days. Little-endian dominates these days, and people
> who require that conversion on a regular basis (why does
> HPC need that, by the way?) are probably few and far between.
>
> Another question is if people who do serious HPC work do
> a lot of stuff (without conversion) like
>
> write(10) x(1::2)
>
> which would actually use the buffers, instead of
>
> write (10) x
>
> where the whole buffering discussion does not apply.
>
> Jerry, if you use strides in writing, without conversion,
> what result would you get for different block sizes?
>

Disregard my previous data. If I run the tests manually outside of the script you provided I get consistent results:

$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1024 ./a.out
2.7986080646514893
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=4096 ./a.out
2.5836510658264160
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=8192 ./a.out
2.5744562149047852
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=16384 ./a.out
2.4813480377197266
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=32768 ./a.out
2.5214788913726807
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
2.4661610126495361
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=131072 ./a.out
2.4065649509429932
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=262144 ./a.out
2.4941890239715576
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=524288 ./a.out
2.3842790126800537
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1048576 ./a.out
2.4531490802764893
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=2097152 ./a.out
2.5236811637878418

So there is a sweet spot at the 131072 point on this particular machine, so I agree we should be able to go higher (that inconsistency I reported earlier was bugging me enough to experiment and I discovered this, Ryzen 2500U).

Strides without conversion:

$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
1.8322470188140869
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=65536 ./a.out
1.8337209224700928
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=131072 ./a.out
1.8346250057220459
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=262144 ./a.out
1.8497080802917480
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=524288 ./a.out
1.8243398666381836
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=1048576 ./a.out
1.7886412143707275
$ GFORTRAN_BUFFER_SIZE_UNFORMATTED=2097152 ./a.out
1.8285851478576660

All things considered I would say go for the higher value and the users can set the environment variable lower if they need to.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #34 from Thomas Koenig ---
There is another point to consider.

I suppose not very many people use big-endian data formats these days. Little-endian dominates these days, and people who require that conversion on a regular basis (why does HPC need that, by the way?) are probably few and far between.

Another question is if people who do serious HPC work do a lot of stuff (without conversion) like

write(10) x(1::2)

which would actually use the buffers, instead of

write (10) x

where the whole buffering discussion does not apply.

Jerry, if you use strides in writing, without conversion, what result would you get for different block sizes?

If that is reasonably fast, then I am now leaning towards making the default buffer much larger for unformatted. Formatted default can stay as it is (adjustable via environment variable), making the buffers larger there would just be a waste of memory because of the large CPU load in converting floating point numbers (unless somebody can show a reasonable benchmark demonstrating otherwise).
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #33 from Jerry DeLisle ---
Well, I am not opposed to it. What we do not want is to pessimize older, smaller machines where it does matter a lot. However, if Thomas's strategy above is adjusted from 32768 to 65536, then out of the box it will work for your system, which is the very first one like this we have encountered (it appears unique from my perspective).

We are simply trying to strike the balance across a population for which we have a microscopic sample size shown in this PR. We came up with the 8192 before, also from a small sample size. I have another machine here where it makes no difference either way and another where it does really well most of the time at 1024 (believe it or not).

Thomas's approach is an attempt at the heuristic. Now, your idea of a page size angle I need to explore a bit here and see what this thing is doing.

I doubt the HPC users are the majority in number, but they are certainly highly important. I know many users around here where I am that use gfortran on their office workstations for preliminary testing and development before they go to the big iron to finalize.

With the above said, I think your specific needs at 65536 can be satisfied, and we do appreciate the data and testing from you.

I do wonder if we need to make "Optimizing I/O" a blatantly obvious topic right at the TOP of all our documentation on the web page as well as in the docs.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #32 from David Edelsohn ---
If the performance measured by Jerry is hitting limits of the 4 x 32KiB L1 D-Cache of the Ryzen 2500U, then the system has bigger problems than FORTRAN I/O buffer size.

What is the target audience / market for GNU FORTRAN? FORTRAN primarily is used for numerically intensive computing and HPC. This issue was discovered through an experiment by an organization that performs huge HPC simulations and inquired about the performance of GNU FORTRAN. I suggest that GNU FORTRAN implement defaults appropriate for HPC systems if it wants to increase adoption in large-scale commercial environments.

If we can find some heuristics that allow GNU FORTRAN to distinguish between consumer and commercial systems, that would be ideal.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #31 from David Edelsohn --- What is the PAGESIZE on the Ryzen system? On the POWER systems, the PAGESIZE is 64K. Maybe the optimal buffer size (write size) allows the filesystem to perform double-buffering at the PAGESIZE.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #30 from Thomas Koenig ---
> Why are you opposed to the larger 65536 or 131072 as a default?

Please look at Jerry's numbers from comment #24. They show a severe regression (for his system) for blocksizes > 32768.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #29 from David Edelsohn ---
> For formatted files, chose the value that the user supplied
> via an environment variable. If the user supplied nothing, then
>
> - query the recommended block size via calling fstat and evaluating
> st_blksize.
> - If st_blksize is less than 8192, use 8192 (current behavior)
> - if st_blksize is more than 32768, use 32768
> - otherwise use st_blksize

I assume that you meant UNformatted files.

Why are you opposed to the larger 65536 or 131072 as a default? The benefit at that level is reproducible, _even for filesystems with smaller block size_.

Why propose another default value that restricts GNU FORTRAN performance when given the opportunity to fix this and make GNU FORTRAN performance look very good "out of the box"? Few people will bother to read the documentation to look for environment variables or even realize that unformatted I/O performance is the bottleneck.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #28 from Thomas Koenig ---
(In reply to Jerry DeLisle from comment #27)
> (In reply to Thomas Koenig from comment #26)
> > Jerry, you are working on a Linux box, right? What does
> >
> > stat -f -c %b .
> >
> > tell you?
>
> 13429330

So we cannot really take the values from the file system for the buffer size at face value, at least not for determining the buffer size for this particular case.

Last question (grasping at straws here): Are the values from comment #24 reproducible, do you always get this big jump for block size 65536?

If they are indeed reproducible, I would suggest using an approach slightly modified from the attached patch:

For formatted files, choose the value that the user supplied via an environment variable, or 8192 otherwise. (Formatted is so slow that we might as well save the memory).

For formatted files, choose the value that the user supplied via an environment variable. If the user supplied nothing, then

- query the recommended block size via calling fstat and evaluating st_blksize.
- If st_blksize is less than 8192, use 8192 (current behavior)
- if st_blksize is more than 32768, use 32768
- otherwise use st_blksize

How does that sound?
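
The selection rule proposed above can be sketched in C as follows. This is only an illustration of the proposal as written (comment #29 reads the second rule as applying to unformatted files), not what was eventually committed; the names clamp_block_size and choose_buffer_size are hypothetical, as is the convention that a user value of 0 or less means "not set".

```c
#include <stddef.h>
#include <sys/stat.h>

/* Bounds taken from the proposal in this comment.  */
enum { MIN_BUFFER_SIZE = 8192, MAX_BUFFER_SIZE = 32768 };

/* Clamp a recommended block size to [MIN_BUFFER_SIZE, MAX_BUFFER_SIZE].  */
static size_t
clamp_block_size (long blksize)
{
  if (blksize < MIN_BUFFER_SIZE)
    return MIN_BUFFER_SIZE;
  if (blksize > MAX_BUFFER_SIZE)
    return MAX_BUFFER_SIZE;
  return (size_t) blksize;
}

/* Hypothetical helper: a user-supplied value wins if positive; otherwise
   ask fstat() for st_blksize and clamp it.  If fstat() fails, fall back
   to the minimum.  */
static size_t
choose_buffer_size (int fd, long user_value)
{
  struct stat st;

  if (user_value > 0)
    return (size_t) user_value;
  if (fstat (fd, &st) != 0)
    return MIN_BUFFER_SIZE;
  return clamp_block_size ((long) st.st_blksize);
}
```

With these bounds, a file system reporting a 4096-byte block size yields 8192, and one reporting 16 MiB (as GPFS does in comment #22) yields 32768.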
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #27 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #26)
> Jerry, you are working on a Linux box, right? What does
>
> stat -f -c %b .
>
> tell you?

13429330

Ryzen 2500U with M.2 SSD
Fedora 30, Kernel 5.1.15-300.fc30.x86_64
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #26 from Thomas Koenig --- Jerry, you are working on a Linux box, right? What does stat -f -c %b . tell you?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #25 from Thomas Koenig ---
(In reply to Jerry DeLisle from comment #24)
> On a different Ryzen machine:
>
> $ ./run.sh
> 1024 3.2604169845581055
> 2048 2.7804551124572754
> 4096 2.6416599750518799
> 8192 2.5986809730529785
> 16384 2.5525100231170654
> 32768 2.5145640373229980
> 65536 9.2993371486663818
> 131072 9.0313489437103271

Oops. That increase for 65536 might be an L1 cache effect.

Note: We are measuring only transfer speed to cache here. Transfer to actual hard disks will be much slower. It is still relevant though, especially for the usual cycle of repeatedly calculating and writing data. The OS can then sync the data to disk at its leisure while the next calculation is running.

So, what would be a good strategy to select a block size?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #24 from Jerry DeLisle ---
On a different Ryzen machine:

$ ./run.sh
1024 3.2604169845581055
2048 2.7804551124572754
4096 2.6416599750518799
8192 2.5986809730529785
16384 2.5525100231170654
32768 2.5145640373229980
65536 9.2993371486663818
131072 9.0313489437103271
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #23 from Thomas Koenig ---
Some numbers for the provisional patch, varying the size for the buffers. With the patch, the original benchmark (minus some output, only the elapsed time is shown) and the script

for a in 1024 2048 4096 8192 16384 32768 65536 131072
do
  rm -f out.dat
  sync ; sync; sync
  sleep 1
  echo -n $a
  GFORTRAN_BUFFER_SIZE_UNFORMATTED=$a ./a.out
done
rm -f out.dat
sync

I get on my home Ryzen box with ext4

1024 2.959884643555
2048 2.4514980316162109
4096 2.2090110778808594
8192 1.9955158233642578
16384 2.0065548419952393
32768 1.9320869445800781
65536 1.9494299888610840
131072 1.8885779380798340

On gcc135 (POWER9) I get

1024 6.2069039344787598
2048 3.5782949924468994
4096 2.2184860706329346
8192 1.4914679527282715
16384 1.1247980594635010
32768 0.95092821121215820
65536 0.85877490043640137
131072 0.82407808303833008

and on gcc115 (aarch64):

1024 10.543070077896118
2048 7.3426060676574707
4096 5.7169480323791504
8192 4.7394258975982666
16384 4.2912349700927734
32768 4.0224111080169678
65536 3.8719530105590820
131072 3.8628818988800049

so 64 k looks like a good choice, except for the Ryzen machine, where 8k would be sufficient.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #22 from David Edelsohn ---
The following are unofficial results on an unspecified system running GPFS. These should not be considered official anything and should not be referenced for benchmarking.

Test using 2.50e+08 doubles
Block size of file system: 16777216
bs =     1024, 126.53 MiB/s
bs =     2048, 218.69 MiB/s
bs =     4096, 335.00 MiB/s
bs =     8192, 436.25 MiB/s
bs =    16384, 774.91 MiB/s
bs =    32768, 619.28 MiB/s
bs =    65536, 1018.89 MiB/s
bs =   131072, 659.44 MiB/s
bs =   262144, 629.90 MiB/s
bs =   524288, .63 MiB/s
bs =  1048576, 678.90 MiB/s
bs =  2097152, 1029.28 MiB/s
bs =  4194304, 668.27 MiB/s
bs =  8388608, 662.53 MiB/s
bs = 16777216, .37 MiB/s
bs = 33554432, 694.28 MiB/s
bs = 67108864, 1091.94 MiB/s
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #21 from Thomas Koenig --- Created attachment 46537 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46537&action=edit Something to benchmark.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #20 from Thomas Koenig ---
(In reply to David Edelsohn from comment #18)
> For GPFS, the striping unit is 16M. The 8K buffer size chosen by GFortran
> is a huge performance sink. We have confirmed this with testing.

Could you share some benchmarks on this? I'd really like it if the gfortran maintainers could form their own judgment on this, based on numbers.

Here's a benchmark program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/statvfs.h>

double walltime (void)
{
  struct timeval TV;
  double elapsed;
  gettimeofday(&TV, NULL);
  elapsed = (double) TV.tv_sec + 1.0e-6*((double) TV.tv_usec);
  return elapsed;
}

#define NAME "out.dat"
#define N 25000

int main()
{
  int fd;
  double *p, *w;
  long i, size, blocksize, left, to_write;
  int bits;
  double t1, t2;
  struct statvfs buf;

  printf ("Test using %e doubles\n", N * 1.0);
  statvfs (".", &buf);
  printf ("Block size of file system: %ld\n", buf.f_bsize);

  p = malloc(N * sizeof (*p));
  /* The archive lost the header names above and the text between
     "for (i=0; i" and "0)".  The includes, the fill loop, the
     block-size loop (1024 up to 67108864, as in the output below)
     and the open() call are reconstructed, not original.  */
  for (i=0; i<N; i++)
    p[i] = i;

  for (bits = 10; bits < 27; bits++)
    {
      blocksize = 1L << bits;
      printf ("bs = %8ld, ", blocksize);
      t1 = walltime ();
      fd = open (NAME, O_WRONLY | O_CREAT | O_TRUNC, 0666);
      left = N;
      w = p;
      while (left > 0)
        {
          if (left >= blocksize)
            to_write = blocksize;
          else
            to_write = left;

          write (fd, w, blocksize * sizeof (double));
          w += to_write;
          left -= to_write;
        }
      close (fd);
      t2 = walltime ();
      printf ("%.2f MiB/s\n", N / (t2-t1) / 1048576);
    }
  free (p);
  unlink (NAME);

  return 0;
}

And here is some output on my home system (ext4):

Test using 2.50e+08 doubles
Block size of file system: 4096
bs =     1024, 175.81 MiB/s
bs =     2048, 244.40 MiB/s
bs =     4096, 247.27 MiB/s
bs =     8192, 227.46 MiB/s
bs =    16384, 195.55 MiB/s
bs =    32768, 223.14 MiB/s
bs =    65536, 168.95 MiB/s
bs =   131072, 240.70 MiB/s
bs =   262144, 260.39 MiB/s
bs =   524288, 265.38 MiB/s
bs =  1048576, 261.67 MiB/s
bs =  2097152, 259.94 MiB/s
bs =  4194304, 258.71 MiB/s
bs =  8388608, 262.19 MiB/s
bs = 16777216, 260.19 MiB/s
bs = 33554432, 263.37 MiB/s
bs = 67108864, 264.47 MiB/s

And here is something on gcc135 (POWER9), also ext4:

Test using 2.50e+08 doubles
Block size of file system: 4096
bs =     1024, 206.76 MiB/s
bs =     2048, 293.66 MiB/s
bs =     4096, 347.13 MiB/s
bs =     8192, 298.23 MiB/s
bs =    16384, 397.51 MiB/s
bs =    32768, 401.86 MiB/s
bs =    65536, 431.83 MiB/s
bs =   131072, 475.88 MiB/s
bs =   262144, 470.09 MiB/s
bs =   524288, 478.84 MiB/s
bs =  1048576, 485.68 MiB/s
bs =  2097152, 485.33 MiB/s
bs =  4194304, 483.96 MiB/s
bs =  8388608, 482.88 MiB/s
bs = 16777216, 485.04 MiB/s
bs = 33554432, 483.92 MiB/s
bs = 67108864, 485.55 MiB/s

So, write throughput sort of seems to level out at ~ 131072 block size, 2**17. For Fortran, this is only really relevant for unformatted files.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #19 from David Edelsohn ---
IBM XLF provides an XLFRTEOPTS environment variable, which includes control over buffer size. The documentation makes it clear that XLF uses the block size of the device by default:

buffer_size=size
    Specifies the size of I/O buffers in bytes instead of using the block size of devices. size must be either -1 or an integer value that is greater than or equal to 4096. The default, -1, uses the block size of the device where the file resides.

    Using this option can reduce the amount of memory used for I/O buffers when an application runs out of memory because the block size of devices is very large and the application opens many files at the same time.

    Note the following when using this runtime option:

    Preconnected units remain unaffected by this option. Their buffer size is the same as the block size of the device where they reside except when the block size is larger than 64KB, in which case the buffer size is set to 64KB.

    This runtime option does not apply to files on a tape device or logical volume.

    Specifying the buffer size with the SETRTEOPTS procedure overrides any value previously set by the XLFRTEOPTS environment variable or SETRTEOPTS procedure. The resetting of this option does not affect units that have already been opened.

https://www.ibm.com/support/knowledgecenter/SSAT4T_16.1.1/com.ibm.xlf1611.lelinux.doc/compiler_ref/rteopts.html
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #18 from David Edelsohn --- For GPFS, the striping unit is 16M. The 8K buffer size chosen by GFortran is a huge performance sink. We have confirmed this with testing. The recommendation from GPFS is that one should query the filesystem with fstat() and write in chunks of the block size. Instead of arbitrarily choosing a uniform buffer size of 8K, GFortran would achieve better I/O performance in general by dynamically querying the filesystem characteristics and choosing a buffer size tuned to the filesystem. Presumably one must find some balance of memory consumption if the application opens a huge number of files. Or maybe some environment variable to override the buffer size. IBM XL FORTRAN achieves better performance, even for EXT4, by adapting I/O to the filesystem block size.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #17 from Thomas Koenig ---
(In reply to David Edelsohn from comment #16)
> libgfortran unix.c:raw_write() will access the WRITE system call with up to
> 2GB of data, which the testcase is using for the native format.
>
> Should libgfortran I/O buffer at least use sysconf(_SC_PAGESIZE) instead of
> hard coding 8192?

Depends.

I would try not to blow away too much cache for such an operation. So far, this problem appears to be limited to POWER, and more specifically to file systems which are typically used in HPC.

Could you (generic you, people who have access to such systems) show us some benchmarks which show performance as a function of block write size?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #16 from David Edelsohn --- libgfortran unix.c:raw_write() will access the WRITE system call with up to 2GB of data, which the testcase is using for the native format. Should libgfortran I/O buffer at least use sysconf(_SC_PAGESIZE) instead of hard coding 8192?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #15 from Thomas Koenig ---
(In reply to David Edelsohn from comment #13)
> Why should -fconvert affect the strategy for writing?

If we get passed a contiguous block of memory (like in your test case) we can do this in a single write.

If we want to swap bytes, this needs to be done on a basis of each data item. It would be wasteful to write out each data item by itself, so do this by copying to a buffer until it is full, and then writing out the buffer.

The effect on speed could be tested simply enough. Just write two test programs, one of them mimicking the current behavior of libgfortran (writing out 8192 byte blocks, possibly starting with a smaller size) and the other one with one huge write. Use the write() system call directly. Benchmark both.
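
The buffering strategy described here (swap each item into a buffer, write the buffer out when it is full) can be sketched in C. This is a simplified illustration, not libgfortran's buf_write; the type swap_buffer and both function names are hypothetical, and the sketch assumes each item is no larger than the buffer.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical write buffer: DATA has room for SIZE bytes, USED of
   which are filled and not yet written to FD.  */
typedef struct
{
  int fd;
  size_t size;
  size_t used;
  unsigned char *data;
} swap_buffer;

/* Write out everything buffered so far.  Returns 0 on success.  */
static int
swap_buffer_flush (swap_buffer *b)
{
  size_t off = 0;

  while (off < b->used)
    {
      ssize_t n = write (b->fd, b->data + off, b->used - off);
      if (n < 0)
        return -1;
      off += (size_t) n;
    }
  b->used = 0;
  return 0;
}

/* Append COUNT items of ITEM_SIZE bytes each from SRC, reversing the
   byte order of every item.  The buffer is flushed whenever the next
   item would not fit, so write() is called once per full buffer
   rather than once per item.  */
static int
swap_buffer_put (swap_buffer *b, const void *src, size_t item_size,
                 size_t count)
{
  const unsigned char *p = src;
  size_t i, k;

  for (i = 0; i < count; i++, p += item_size)
    {
      if (b->used + item_size > b->size && swap_buffer_flush (b) != 0)
        return -1;
      for (k = 0; k < item_size; k++)
        b->data[b->used + k] = p[item_size - 1 - k];
      b->used += item_size;
    }
  return 0;
}
```

The size field plays the role of the buffer size discussed in this thread: the larger it is, the fewer write() calls a converted transfer needs.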
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #14 from Jerry DeLisle ---
(In reply to David Edelsohn from comment #13)
> Why should -fconvert affect the strategy for writing?

Hi David, very interesting bug report and a good question. I would like to investigate further if I know what platform this is on. Since GPFS is a parallel file system that is potentially going across a network, the result could depend on the test environment. Also, there are a few cases where code paths in libgfortran can depend on OS features.

So in your results here:

> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
>
> Is libgfortran doing something unusual with the creation of files?

Are all your results here under identical OS and are the physical drives local to the test machine hardware?

If we can reproduce this on a gcc compile farm machine or maybe at OSU Open Software lab which I can access, we ought to be able to do better.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #13 from David Edelsohn --- Why should -fconvert affect the strategy for writing?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #12 from Andrew Pinski ---
(In reply to David Edelsohn from comment #10)
> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
>
> Is libgfortran doing something unusual with the creation of files?

So it looks like native is just doing one write system call while opposite endian is doing 8k chunks write system calls. This seems like an issue with the file system if it cannot handle 8k chunks. Maybe increasing the chunk size to 64k inside libgfortran will help. Something like:

diff --git a/libgfortran/io/unix.c b/libgfortran/io/unix.c
index c2fc674..5d24ac4 100644
--- a/libgfortran/io/unix.c
+++ b/libgfortran/io/unix.c
@@ -193,7 +193,7 @@ fallback_access (const char *path, int mode)
 /* Unix and internal stream I/O module */

-static const int BUFFER_SIZE = 8192;
+static const int BUFFER_SIZE = 64*1024;

 typedef struct
 {
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #11 from Thomas Koenig ---
(In reply to David Edelsohn from comment #10)
> With EXT4: difference is 2x
> With SHM: difference is 4.5x
> With GPFS: difference is 10x
>
> Is libgfortran doing something unusual with the creation of files?

Not really, but there is one difference.

Stracing with -fconvert=native gives

open("out.dat", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=28, ...}) = 0
write(3, "\0\0\0\0", 4) = 4
write(3, "\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?\0\0\0\0\0\0\360?"..., 20) = 20
lseek(3, 0, SEEK_SET) = 0
write(3, "\0\2245w", 4) = 4
lseek(3, 24, SEEK_SET) = 24
write(3, "\0\2245w", 4) = 4
ftruncate(3, 28) = 0
close(3) = 0

and using -fconvert=swap

open("out.dat", O_RDWR|O_CREAT|O_CLOEXEC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=28, ...}) = 0
write(3, "\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0"..., 7684) = 7684
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 8192) = 8192
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 8192) = 8192
[...]
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 8192) = 8192
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 8192) = 8192
write(3, "?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0?\360\0\0\0\0\0\0"..., 5632) = 5632
lseek(3, 0, SEEK_SET) = 0
write(3, "w5\224\0", 4) = 4
lseek(3, 24, SEEK_SET) = 24
write(3, "w5\224\0", 4) = 4
ftruncate(3, 28) = 0
close(3) = 0

Would this make such a large difference?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #10 from David Edelsohn ---
With EXT4: difference is 2x
With SHM: difference is 4.5x
With GPFS: difference is 10x

Is libgfortran doing something unusual with the creation of files?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

--- Comment #9 from Thomas Koenig ---
On powerpc64le-unknown-linux-gnu:

write time(sec) = 0.48150300979614258
done

real 0m0.889s
user 0m0.279s
sys 0m0.608s

vs.

write time(sec) = 1.4788339138031006
done

real 0m1.880s
user 0m0.669s
sys 0m1.208s

Less good, but not as bad as you're reporting.

On aarch64-unknown-linux-gnu:

write time(sec) = 3.3060228824615479
done

real 0m4.739s
user 0m0.300s
sys 0m4.420s

vs.

write time(sec) = 4.7578129768371582
done

real 0m6.091s
user 0m1.080s
sys 0m5.000s

The factor is also bearable.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #8 from Andrew Pinski ---
(In reply to Thomas Koenig from comment #7)
> Also, which version of gfortran did you use?
>
> If it was before r195413, I can very well believe those
> numbers.

Note that revision made it into GCC 4.8.0.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig changed:

    What     | Removed | Added
    Status   | NEW     | WAITING

--- Comment #7 from Thomas Koenig ---
Also, which version of gfortran did you use?

If it was before r195413, I can very well believe those numbers.
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #6 from Thomas Koenig ---
I cannot reproduce this on an AMD Ryzen 7 1700X (little-endian):

$ gfortran -fconvert=native wr.f90 walltime.c
cc1: Warnung: command-line option »-fconvert=native« is valid for Fortran but not for C
$ rm -f out.dat ; time ./a.out ; rm -f out.dat
 write time(sec) = 1.0676949024200439
 done

real    0m1.399s
user    0m0.112s
sys     0m1.083s

$ gfortran -fconvert=big-endian wr.f90 walltime.c
cc1: Warnung: command-line option »-fconvert=big-endian« is valid for Fortran but not for C
$ rm -f out.dat ; time ./a.out ; rm -f out.dat
 write time(sec) = 1.4781639575958252
 done

real    0m1.773s
user    0m0.397s
sys     0m1.196s

which looks reasonable.

Platform specific? Which OS/processor combination did you test this on?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #5 from David Edelsohn ---
XL Fortran with -qufmt=be : 0.75 sec
XL Fortran native         : 0.30 sec
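The absolute timings quoted in comments #5, #6 and #9 can be turned into slowdown factors with a small calculation. The Python sketch below does that. The helper name `slowdown` and the table layout are illustrative, and for comment #9 it is an assumption that the first timing of each pair is the run without conversion and the second the run with it, following the order used in comment #6.

```python
# (label, seconds without conversion, seconds with big-endian conversion)
# Values are copied from the comments; the pairing for comment #9 is assumed.
TIMINGS = [
    ("gfortran, AMD Ryzen 7 1700X (comment #6)", 1.0676949024200439, 1.4781639575958252),
    ("gfortran, powerpc64le (comment #9)", 0.48150300979614258, 1.4788339138031006),
    ("gfortran, aarch64 (comment #9)", 3.3060228824615479, 4.7578129768371582),
    ("XL Fortran (comment #5)", 0.30, 0.75),
]


def slowdown(native_seconds: float, converted_seconds: float) -> float:
    """Return converted time divided by native time (1.0 means no slowdown)."""
    if native_seconds <= 0:
        raise ValueError("native time must be positive")
    return converted_seconds / native_seconds


if __name__ == "__main__":
    for label, native, converted in TIMINGS:
        print(f"{label}: {slowdown(native, converted):.2f}x")
```

With these inputs the factors come out at roughly 1.4x, 3.1x, 1.4x and 2.5x respectively, all well below the tenfold slowdown reported in comment #3.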
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030 --- Comment #4 from Thomas Koenig ---
(In reply to David Edelsohn from comment #3)
> Conversion carries an overhead, but the overhead need not be worse than
> necessary. The conversion overhead for libgfortran is significantly worse
> than for competing, proprietary compilers.
>
> -fconvert=big-endian relative to no conversion
>
> Compiler        Slowdown
>
> GFortran        1000%
> IBM XLF         200%
> Intel Fortran   20%

Do you also have absolute numbers?
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig changed:

    What     | Removed | Added
    Severity | normal  | enhancement
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

David Edelsohn changed:

    What     | Removed     | Added
    Severity | enhancement | normal

--- Comment #3 from David Edelsohn ---
Conversion carries an overhead, but the overhead need not be worse than necessary. The conversion overhead for libgfortran is significantly worse than for competing, proprietary compilers.

-fconvert=big-endian relative to no conversion

Compiler        Slowdown

GFortran        1000%
IBM XLF         200%
Intel Fortran   20%
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig changed:

    What     | Removed | Added
    Severity | normal  | enhancement
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

Thomas Koenig changed:

    What | Removed | Added
    CC   |         | tkoenig at gcc dot gnu.org

--- Comment #2 from Thomas Koenig ---
https://gcc.gnu.org/onlinedocs/gfortran/CONVERT-specifier.html

"Using anything but the native representation for unformatted data carries a significant speed overhead. If speed in this area matters to you, it is best if you use this only for data that needs to be portable."
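What a non-native representation means at the byte level can be illustrated with the bytes visible in the strace output of comment #11. The Python sketch below decodes them and shows that the converted data is the native data with the bytes of each element reversed. The helper `swap_bytes` is hypothetical and is not libgfortran's implementation. Reading the 4-byte value as a record marker is an assumption.

```python
import struct

# Bytes copied from the strace output (octal escapes rewritten as hex).
NATIVE_DOUBLE = b"\x00\x00\x00\x00\x00\x00\xf0\x3f"    # "\0\0\0\0\0\0\360?"
SWAPPED_DOUBLE = b"\x3f\xf0\x00\x00\x00\x00\x00\x00"   # "?\360\0\0\0\0\0\0"
NATIVE_MARKER = b"\x00\x94\x35\x77"                    # "\0\2245w"
SWAPPED_MARKER = b"\x77\x35\x94\x00"                   # "w5\224\0"


def swap_bytes(raw: bytes, width: int) -> bytes:
    """Reverse the byte order of every width-byte element in raw."""
    if width <= 0 or len(raw) % width != 0:
        raise ValueError("length must be a multiple of a positive element width")
    return b"".join(raw[i:i + width][::-1] for i in range(0, len(raw), width))


if __name__ == "__main__":
    # "<" decodes as little-endian, ">" as big-endian.
    print(struct.unpack("<d", NATIVE_DOUBLE)[0])    # little-endian double
    print(struct.unpack(">d", SWAPPED_DOUBLE)[0])   # same value, big-endian
    print(struct.unpack("<i", NATIVE_MARKER)[0])    # 4-byte integer, little-endian
    print(swap_bytes(NATIVE_DOUBLE, 8) == SWAPPED_DOUBLE)
    print(swap_bytes(NATIVE_MARKER, 4) == SWAPPED_MARKER)
```

Both byte sequences for the 8-byte element decode to the double-precision value 1.0, and the 4-byte value decodes to 2000000000 in either byte order. The extra work in the converted case is this per-element reversal plus whatever buffering the library uses around it.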
[Bug libfortran/91030] Poor performance of I/O -fconvert=big-endian
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91030

David Edelsohn changed:

    What             | Removed     | Added
    Status           | UNCONFIRMED | NEW
    Last reconfirmed |             | 2019-06-28
    Ever confirmed   | 0           | 1

--- Comment #1 from David Edelsohn ---
Confirmed.