If you have a billion-molecule data source and would like to try an at-scale test using rdkit with Aurora PostgreSQL, I'd be willing to help out with provisioning the hardware, looking at the efficiency of the plans, etc.
If I understand how the rdkit GIST index filtering mechanism works for a given similarity metric, a parallel GIST index scan ought to scale almost linearly with the number of cores, provided that the RDBMS is built on a scalable storage subsystem. If so, the largest instance size that's currently supported has 96 cores, so we can run with a fairly high degree of parallelism.

On 6/5/20, 1:07 PM, "dmaziuk via Rdkit-discuss" <rdkit-discuss@lists.sourceforge.net> wrote:

On 6/5/2020 4:45 AM, Greg Landrum wrote:

> Having said that, the team behind ZINC used to use the RDKit cartridge with
> PostgreSQL as the backend for ZINC. They had the database sharded
> across multiple instances and managed to get the fingerprint indices to
> work there. I don't remember the substructure search performance being
> terrible, but it wasn't great either. They have since switched to a
> specialized system (Arthor from NextMove Software), which offers
> significantly better performance.

Generally speaking, a database of a billion rows needs hardware capable of running it. Buy a server with 1 TB of RAM, 64 cores, and a couple of U.2 NVMe drives and see how Postgres runs on that.

Then you need to look at the database itself: e.g. a query on an indexed billion-row table could be OK, but inserting the billion-and-first row will not be. If you want to scale to these kinds of volumes, you need to do some work.

(And much of the point of no-sql hadoop "cloud" workflows is that if you can parallelize what you're doing across multiple machines, at some data size they will start outperforming a centralized fast search engine.)
Dima

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
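
For anyone who wants to see the shape of the sharding and parallelization idea discussed in this thread, here is a minimal sketch in plain Python. It is not the RDKit cartridge, PostgreSQL, or the ZINC setup: fingerprints are stand-in Python integers used as bit vectors, and the names `tanimoto`, `search_shard`, and `sharded_search` are made up for illustration. Each shard is scanned independently and the per-shard hits are merged into one ranked list.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two bit-vector fingerprints stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def search_shard(shard, query, threshold):
    """Scan one shard of (mol_id, fingerprint) pairs.

    Returns (similarity, mol_id) pairs at or above the threshold.
    """
    hits = []
    for mol_id, fp in shard:
        sim = tanimoto(query, fp)
        if sim >= threshold:
            hits.append((sim, mol_id))
    return hits


def sharded_search(shards, query, threshold=0.5, top_k=10, workers=4):
    """Search every shard independently, then merge the best top_k hits.

    Assumption: threads are used only to keep the sketch short. In CPython
    they do not give CPU parallelism for this loop; a real deployment would
    put each shard on its own process, core, or machine.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_shard = pool.map(lambda s: search_shard(s, query, threshold), shards)
        all_hits = [hit for hits in per_shard for hit in hits]
    # Highest similarity first; ties are broken by mol_id.
    return heapq.nlargest(top_k, all_hits)
```

The merge step is cheap compared with the scans, which is why this kind of search tends to parallelize well as long as each shard can be read without contending for the same storage.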