Milian, are these images for geomapping or digital pathology? I am guessing
one of the two, since you mentioned mipmap SubIFDs.

If so, there is a good 20-25 years of experience in how to efficiently
write and read these large files, and the solution has inevitably come down
to tiling the data - and absolutely not mmapping; see the rough sketch
further below. In fact, when you say the size of the images leaves you no
alternative, I think you should rather conclude the opposite: the size of
the images means mmapping should be out of the question.
Are you generating the TIFFs yourself, or are they produced as some kind of
well-known TIFF derivative? If the latter, there may well be SDKs or
well-known approaches for them. Also, if you are not generating the data
yourself, can you really ensure that the files meet all your criteria for
mmapping? If not, you will need to read/write through memory anyway -
similarly if your other software stacks are not all on the same machine, or
if you are using GPUs with limited memory.
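
For what it's worth, the tiled read path in libtiff is only a handful of
calls. A rough, untested sketch (the file name and tile coordinates are
placeholders, error handling is trimmed):

#include <stdint.h>
#include <tiffio.h>

/* Sketch: read one decoded tile of a tiled TIFF into a private buffer. */
int read_one_tile(void)
{
    TIFF *tif = TIFFOpen("slide.tif", "r");
    if (!tif)
        return -1;

    tmsize_t tile_bytes = TIFFTileSize(tif);   /* size of one decoded tile */
    void *buf = _TIFFmalloc(tile_bytes);

    /* Decode the tile covering pixel (x, y); libtiff handles compression,
     * byte order and the tile layout for us. */
    uint32_t x = 0, y = 0;
    if (TIFFReadTile(tif, buf, x, y, 0 /* z */, 0 /* sample */) < 0) {
        /* handle decode error */
    }

    _TIFFfree(buf);
    TIFFClose(tif);
    return 0;
}

Each TIFFReadTile() call decodes exactly one tile, so you only ever hold a
tile's worth of pixels in memory, no matter how large the image level is.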

Regards,
Kemp Watson

On Thu, Dec 16, 2021 at 9:44 AM Milian Wolff via Tiff <[email protected]>
wrote:

> > On Thursday, 16 December 2021 15:27:35 CET Bob Friesenhahn wrote:
> > On Thu, 16 Dec 2021, Milian Wolff via Tiff wrote:
> > > As you can see, these offsets are misaligned for a 2-byte/16-bit
> > > greyscale image.
> > >
> > > Looking at the libtiff API, we cannot find anything that would allow
> > > us to ensure that the SubIFDs are aligned correctly. Are we missing
> > > something or is this simply not possible currently?
> > >
> > > We think that it would only require a small change in the code base,
> > > namely ensuring that the seek at [1] ends at an aligned address based
> > > on the BitsPerSample for the current IFD.
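
The padding computation Milian describes would presumably amount to
rounding the write offset up to the next multiple of the sample size -
illustration only, none of the names below are libtiff API:

#include <stdint.h>

/* Hypothetical helper: round a file offset up to the next multiple of
 * BitsPerSample / 8 (e.g. 16 bits -> 2 bytes). */
static uint64_t align_up(uint64_t offset, uint16_t bits_per_sample)
{
    uint64_t sample_bytes = (bits_per_sample + 7) / 8;
    uint64_t rem = offset % sample_bytes;
    return rem ? offset + (sample_bytes - rem) : offset;
}

For 16-bit data that means at most one byte of dead space before each
SubIFD's image data.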
> >
> > The notion of writing 'aligned' data (which requires inserting some
> > dead space to assure alignment) is interesting and seems useful.
> > This is mostly useful when the data is not compressed.  I have not
> > heard of this before in the context of the TIFF format, but some
> > other formats take care to assure it.  Obviously, alignment could only
> > be assured if the file is written by a TIFF writer which assures it.
> >
> > You seem to be talking about aligning the TIFF data samples (a good
> > thing), but there may be other beneficial alignment factors such as
> > alignment to mmap memory page boundaries, or filesystem block-size
> > boundaries.
>
> Correct. The mmap alignment we can handle on the client/reader side; that
> is less of an issue. Obviously, it would be ideal if the offset were page
> aligned too, to reduce the memory "waste" at the start of the first page,
> but that's less of a concern imo.
>
> Filesystem block-size alignment could help as another optimization, but
> that sounds pretty low-level to me.
>
> The pixel/data sample alignment, on the other hand, is a crucial
> requirement. Without that, one cannot use the mmapped data after all, as
> it would lead to potential bus errors on ARM and UBSAN warnings about
> misalignment on x86.
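
Concretely, the reader side then wants to end up with something like the
sketch below - untested, and the data offset would have to come out of the
TileOffsets/StripOffsets tags of an uncompressed IFD:

#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: map the whole file and hand out a typed, zero-copy view of one
 * tile/strip.  Only valid if the writer aligned the data offset. */
static const uint16_t *map_samples(const char *path, size_t file_size,
                                   off_t data_offset)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    void *base = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return NULL;

    const unsigned char *p = (const unsigned char *)base + data_offset;

    /* This cast is exactly where an odd data_offset bites: undefined
     * behaviour per C, UBSAN reports on x86, potential SIGBUS on ARM. */
    if (((uintptr_t)p % _Alignof(uint16_t)) != 0) {
        munmap(base, file_size);
        return NULL;
    }
    return (const uint16_t *)p;
}

Which of course only works if the writer cooperated - hence the request.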
>
> > Regardless, take care not to be sucked into the vortex of using memory
> > mapping to read files.  When using memory mapping to read data, your
> > program loses control and the thread which is doing the reading is put
> > to sleep while the I/O is taking place.  The I/Os are usually the size
> > of a memory page, which is often just 4k.  This means that your
> > program gets put to sleep more often than desired, with many more
> > context switches than if a larger copy-based I/O was used.  If the
> > data has recently been read before then memory mapping seems great
> > since the data is likely to already be cached in memory.
>
> I am aware of all that. But the size of the images we are dealing with
> leaves us no alternative. It's simply not feasible for us to copy that
> many GB of data into memory. And because we are integrating with other
> software stacks, we also cannot pass through an API built around `TIFF*`
> to do the reading on demand either. We really need contiguous buffers...
>
> > When reading from a file, it is common for the operating system to try
> > to deduce if the reading is sequential or random.  If it is able to
> > deduce that the reading is sequential, then it may pre-read data in
> > order to lessen the hit (time spent sleeping) when the data is read in
> > order.  Operating systems may not have useful support for detecting
> > sequential reads when using mmap to do the reading.  TIFF requires
> > random access and so the operating system might be slow to detect and
> > optimize for a sequential read.
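
On Linux and the BSDs one can at least hint the kernel with madvise(); it
is purely advisory, but it targets exactly the readahead problem described
here:

#include <stddef.h>
#include <sys/mman.h>

/* Advisory hints on an existing mapping; the kernel is free to ignore them. */
static void hint_mapping(void *base, size_t length)
{
    madvise(base, length, MADV_SEQUENTIAL);  /* expect in-order access */
    madvise(base, length, MADV_WILLNEED);    /* start faulting pages in now */
}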
> >
> > If the operating system does not provide a "unified page cache" with
> > the filesystem, then there may be a filesystem data cache, and another
> > copy of the data for use with mmap.  This increases memory usage and
> > does not avoid a data copy.  It seems like the "unified page cache"
> > approach has fallen by the wayside since it is difficult to implement
> > with the many filesystems available.  Instead, operating systems have
> > moved toward offering "direct I/O" to lessen caching and data copies.
>
> We control who's reading the file, so there's just going to be the single
> mmap
> and no other copy of it.
>
> > In summary, the use of mmap and carefully aligned input data might not
> > provide actual benefit over larger programmed (or scheduled via
> > async-I/O) reads into an aligned buffer, even though it clearly
> > requires an extra memory copy.
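
The copy-based alternative is also only a few lines, for comparison -
again untested, and the 4096-byte alignment is just a convenient choice:

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: read one contiguous block of the file into a page-aligned buffer
 * with a single large pread().  The caller frees the result. */
static void *read_block_aligned(int fd, off_t offset, size_t length)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, length) != 0)
        return NULL;

    if (pread(fd, buf, length, offset) != (ssize_t)length) {
        free(buf);
        return NULL;
    }
    return buf;
}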
>
> What you are writing isn't wrong, but it's sadly completely beside the
> point for us.
>
> Cheers
>
> --
> Milian Wolff | [email protected] | Senior Software Engineer
> KDAB (Deutschland) GmbH, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt, C++ and OpenGL Experts
_______________________________________________
Tiff mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/tiff
