Re: [Rdkit-discuss] Scalability of Postgres cartridge
Thank you, everyone, for the suggestions. For now I don't have immediate plans to adopt the cartridge, but it's good to know these things when the time comes.

Best,
Ivan

On Mon, Jun 8, 2020 at 6:49 PM Finnerty, Jim via Rdkit-discuss <rdkit-discuss@lists.sourceforge.net> wrote:

> If you have a billion-molecule data source and would like to try an at-scale test, I'd be willing to help out with provisioning the hardware, looking at the efficiency of the plans, etc., using RDKit with Aurora PostgreSQL.
>
> If I understand how the RDKit GiST index filtering mechanism works for a given similarity metric, a parallel GiST index scan ought to scale almost linearly with the number of cores, provided that the RDBMS is built on a scalable storage subsystem.
>
> If so, the largest instance size that's currently supported has 96 cores, so we can get a fairly high degree of parallelism.

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Scalability of Postgres cartridge
If you have a billion-molecule data source and would like to try an at-scale test, I'd be willing to help out with provisioning the hardware, looking at the efficiency of the plans, etc., using RDKit with Aurora PostgreSQL.

If I understand how the RDKit GiST index filtering mechanism works for a given similarity metric, a parallel GiST index scan ought to scale almost linearly with the number of cores, provided that the RDBMS is built on a scalable storage subsystem.

If so, the largest instance size that's currently supported has 96 cores, so we can get a fairly high degree of parallelism.

On 6/5/20, 1:07 PM, "dmaziuk via Rdkit-discuss" wrote:

> On 6/5/2020 4:45 AM, Greg Landrum wrote:
>
> > Having said that, the team behind ZINC used to use the RDKit cartridge with PostgreSQL as the backend for ZINC. They had the database sharded across multiple instances and managed to get the fingerprint indices to work there. I don't remember the substructure search performance being terrible, but it wasn't great either. They have since switched to a specialized system (Arthor from NextMove Software), which offers significantly better performance.
>
> Generally speaking, a database of a billion rows needs hardware capable of running it. Buy a server with 1 TB of RAM, 64 cores, and a couple of U.2 NVMe drives, and see how Postgres runs on that.
>
> Then you need to look at the database itself: a query against an indexed billion-row table may be OK, but inserting the billion-and-first row will not be. If you want to scale to these kinds of volumes, you need to do some work.
>
> (And much of the point of NoSQL/Hadoop "cloud" workflows is that if you can parallelize what you're doing across multiple machines, at some data size they will start outperforming a centralized fast search engine.)
> Dima
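As a sketch of what the at-scale test above might check, assuming a hypothetical table `mols` whose `mol` column `m` carries a cartridge GiST index (whether the planner actually chooses a parallel plan depends on the PostgreSQL version and cost settings):

```sql
-- Allow more workers per query, then inspect the plan the planner picks.
SET max_parallel_workers_per_gather = 8;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM mols
WHERE m @> 'c1ccc2ccccc2c1'::qmol;  -- naphthalene substructure, via the GiST index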
Re: [Rdkit-discuss] Scalability of Postgres cartridge
On 6/5/2020 4:45 AM, Greg Landrum wrote:

> Having said that, the team behind ZINC used to use the RDKit cartridge with PostgreSQL as the backend for ZINC. They had the database sharded across multiple instances and managed to get the fingerprint indices to work there. I don't remember the substructure search performance being terrible, but it wasn't great either. They have since switched to a specialized system (Arthor from NextMove Software), which offers significantly better performance.

Generally speaking, a database of a billion rows needs hardware capable of running it. Buy a server with 1 TB of RAM, 64 cores, and a couple of U.2 NVMe drives, and see how Postgres runs on that.

Then you need to look at the database itself: a query against an indexed billion-row table may be OK, but inserting the billion-and-first row will not be. If you want to scale to these kinds of volumes, you need to do some work.

(And much of the point of NoSQL/Hadoop "cloud" workflows is that if you can parallelize what you're doing across multiple machines, at some data size they will start outperforming a centralized fast search engine.)

Dima
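To make the insert-cost point concrete: with the cartridge, every insert also has to update the fingerprint-bearing GiST index, which is why bulk loads are normally done before indexing. A minimal sketch, with hypothetical table and index names:

```sql
CREATE EXTENSION IF NOT EXISTS rdkit;

CREATE TABLE mols (id serial PRIMARY KEY, m mol);

-- Bulk-load the molecules first (e.g. via COPY), *then* build the index:
-- constructing the GiST index once over a loaded table is far cheaper than
-- maintaining it across a billion single-row inserts.
CREATE INDEX molidx ON mols USING gist(m);
```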
Re: [Rdkit-discuss] Scalability of Postgres cartridge
Hi Ivan,

I have not pushed the cartridge towards storing billions of molecules. I did a blog post looking at performance with 10 million rows (http://rdkit.blogspot.com/2020/01/some-thoughts-on-performance-of-rdkit.html) but, as I mentioned there, I probably wouldn't choose a relational database for the billion-molecule case (you're unlikely to have multiple linked tables with data there, so there's not much point in using a relational DB).

Having said that, the team behind ZINC used to use the RDKit cartridge with PostgreSQL as the backend for ZINC. They had the database sharded across multiple instances and managed to get the fingerprint indices to work there. I don't remember the substructure search performance being terrible, but it wasn't great either. They have since switched to a specialized system (Arthor from NextMove Software), which offers significantly better performance.

Best,
-greg

On Thu, Jun 4, 2020 at 2:17 PM Ivan Tubert-Brohman <ivan.tubert-broh...@schrodinger.com> wrote:

> Hi,
>
> I've never tried the RDKit PostgreSQL cartridge, but I'm curious about it. In particular, I wonder how far people have pushed it in terms of database size. The documentation gives examples with several million rows; has anyone tried it with a couple billion rows? How fast are substructure queries with databases of that size? How much storage is needed after accounting for the fingerprints, etc.?
>
> Best regards,
> Ivan
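For context, a cartridge substructure query of the kind benchmarked in that post looks like the following (table name hypothetical); the `@>` containment operator is what the GiST fingerprint index accelerates:

```sql
SELECT id, m
FROM mols
WHERE m @> 'c1ccncc1'::qmol  -- molecules containing a pyridine ring
LIMIT 100;
```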
[Rdkit-discuss] Scalability of Postgres cartridge
Hi,

I've never tried the RDKit PostgreSQL cartridge, but I'm curious about it. In particular, I wonder how far people have pushed it in terms of database size. The documentation gives examples with several million rows; has anyone tried it with a couple billion rows? How fast are substructure queries with databases of that size? How much storage is needed after accounting for the fingerprints, etc.?

Best regards,
Ivan
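One way to answer the storage part of the question empirically, assuming the molecules live in a hypothetical table `mols` with a GiST index `molidx`:

```sql
-- Table plus all of its indexes and TOAST data:
SELECT pg_size_pretty(pg_total_relation_size('mols'));

-- The structure/fingerprint index on its own:
SELECT pg_size_pretty(pg_relation_size('molidx'));
```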