GEOS is pretty tied to the interleaved representation. LIke JTS, it uses Coordinate struct to pass around coordinates, with fields for X, Y, and Z values. If we wanted to support a separate-array representation, we'd either need to either change our Coordinate struct to store pointers to the X and Y values, or remove the Coordinate struct altogether and have algorithms ask a CoordinateSequence for X and Y values directly. Both seem like a large effort to eliminate the first of what is likely several copies.
That said, an updated CoordinateSequence could still bring about more incremental improvements in this space: - Eliminating the virtual function overhead of coordinate access (this is what the linked PR does) - Allowing the use of an external buffer instead of the one managed by std::vector, so we could have a zero-copy method of bringing data into GEOS from an interleaved buffer - Making Coordinate a bit more generic, so we could avoid storing the Z value (or requiring it in an external buffer) in the cases where we don't want it. Dan On Thu, Sep 29, 2022 at 11:38 AM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > We are currently having a discussion about interleaved (XYXY) vs separate > arrays (XX YY) in the GeoArrow context ( > https://github.com/geopandas/geo-arrow-spec/pull/26), and I was wondering > if the GEOS developers have thoughts on that topic (whether it was ever > specifically discussed if the current approach of CoordinateArraySequence > to store all dimensions interleaved is the best option for GEOS or if > separate arrays could be considered as well). But so the above already > touches upon that topic, and I also noticed related discussion at > https://github.com/libgeos/geos/pull/674#issuecomment-1245741058 > > In general, from the Shapely/GeoPandas side, we are certainly interested > in being able to create CoordinateSequences zero-copy from arrays of > coordinates (the current GEOSCoordSeq_copyFromBuffer_r et al already helps > a lot though, although it would be nice if it could use a single memcpy for > 2D as well (related to underlying memory storage above) or even avoid this > one memcpy). > > Best, > Joris > > On Wed, 24 Aug 2022 at 13:50, Daniel Baston <dbas...@gmail.com> wrote: > >> This has come up a few times and despite some pushback from users [1] I >> agree that the benefits of the simplification probably outweigh the costs. >> Replacing CoordinateSequence with a concrete implementation is top on my >> list for the GDAL grant work [2] that I will be available to begin on >> September 1. My general plan was to >> >> 1. Replace CoordinateSequence with CoordinateArraySequence and check >> performance improvement. Even without algorithmic improvements like the >> ones you're talking about, I would expect that devirtualizating Coordinate >> access, and the consequent enabling of inlining, will provide a good >> benefit across the board. >> 2. Replace CoordinateArraySequence with a class backed by a buffer of >> doubles to support future generalization of Coordinate dimension. This may >> also allow us to offer zero-copy to clients that happen to use an XYXY >> representation internally (PostGIS) >> 3. Experiment with a union or std::variant to sneak in an optimized >> stack-only implementation for Points >> >> I do wonder if our embrace of XYXY is the best choice here and if XXYY >> would enable vectorization in some cases. That would be a pretty invasive >> experiment to perform across the whole library, but maybe we could look at >> something like OrientationIndex in isolation. >> >> Dan >> >> [1] https://github.com/libgeos/geos/issues/564 >> [2] https://lists.osgeo.org/pipermail/geos-devel/2022-March/010672.html >> >> On Tue, Aug 23, 2022 at 3:22 PM Paul Ramsey <pram...@cleverelephant.ca> >> wrote: >> >>> One of the things that tinkering with the SegmenString layer of overlay >>> brought out to me was the extent to which we construct CoordinateSequence >>> almost exclusively out of CoordinateArraySequence. Like, all the time. At >>> yet, because we handle those CoordinateArraySequence at the API level >>> almost exclusively as CoordinateSequence we lose the ability to do some >>> handy optimizations. >>> >>> Like, if one were going to (as one does on every single >>> CoordinateSequence that enters the overlay code) >>> (1) test if there are repeated points and >>> (2a) remove any if there are >>> (2b) just return the untouched CoordinateSequence if there aren't >>> a useful pattern would be for ::hasRepeatedPoints() to return/populate a >>> list of indexes at which repeated points appear and for >>> ::removeRepeatedPoints() to do bulk copies of all the points in between >>> those indexes. This is foreclosed by the CoordinateSequence API, you can >>> play this trick nicely with a std::vector living underneath, but the API >>> doesn't let us see that (in fact) that's what we have 99.9% of the time. >>> >>> So, one obvious thing to do would be to remove the virtual methods in >>> CoordinateSequence and pull the implementation up to that level, >>> std::vector and all, and give up on the idea of an abstract interface that >>> we don't actually use. For a handful of use cases, where data access cost >>> is greater than computation cost (area, length, distance(?), some others >>> (?)) this might be "bad" in some theoretical way, but note that currently >>> we still don't actually have that abstract layer in place for a zero copy >>> computation. Removing the virtual methods and inheritance from >>> CoordinateSequence would foreclose an option that (a) we seem unlikely to >>> ever deliver on and (b) has narrow performance benefits even if we did >>> deliver on it. >>> >>> Meanwhile, the flip case seems to likely have a *lot* of performance >>> benefits just hanging around waiting to be harvested. Coordinate access >>> without going through the inheritance structure; access to some bulk >>> operations like the repeated points case. >>> >>> For the "zero copy" crew, I feel like a big chunk of gains for them >>> could be harvested by ensuring that point-based operations are available >>> and don't require construction of a full Point() object. So things like >>> PreparedGeometry->intersects(x, y). Sure, you still have to copy in your >>> polygon feature and prepare it, but much of the overhead in that would >>> still exist in a "zero copy" paradigm (all the internal index buildings). >>> Meanwhile you'd no longer need to create a full Point() to do a >>> point-in-poly test, and that would hopefully be a big win for most users. >>> >>> Random thoughs on a sunny day, >>> P >>> >>> _______________________________________________ >>> geos-devel mailing list >>> geos-devel@lists.osgeo.org >>> https://lists.osgeo.org/mailman/listinfo/geos-devel >>> >> _______________________________________________ >> geos-devel mailing list >> geos-devel@lists.osgeo.org >> https://lists.osgeo.org/mailman/listinfo/geos-devel >> > _______________________________________________ > geos-devel mailing list > geos-devel@lists.osgeo.org > https://lists.osgeo.org/mailman/listinfo/geos-devel >
_______________________________________________ geos-devel mailing list geos-devel@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/geos-devel