Re: [Rdkit-discuss] Scalability of Postgres cartridge

2020-06-10 Thread Ivan Tubert-Brohman
Thank you everyone for the suggestions. For now I don't have immediate
plans to adopt the cartridge but it's good to know these things when the
time comes.

Best,
Ivan

On Mon, Jun 8, 2020 at 6:49 PM Finnerty, Jim via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> If you have a billion molecule data source and would like to try an
> at-scale test, I'd be willing to help out with provisioning the hardware,
> looking at the efficiency of the plans, etc., using rdkit with Aurora
> PostgreSQL.
>
> If I understand how the rdkit GiST index filtering mechanism works for a
> given similarity metric, a parallel GiST index scan ought to be able to
> scale almost linearly with the number of cores, provided that the
> RDBMS is built on a scalable storage subsystem.
>
> If so, the largest instance size that's currently supported has 96 cores,
> so we can do a fairly high degree of parallelism.
>
> On 6/5/20, 1:07 PM, "dmaziuk via Rdkit-discuss" <
> rdkit-discuss@lists.sourceforge.net> wrote:
>
> On 6/5/2020 4:45 AM, Greg Landrum wrote:
>
> > Having said that, the team behind ZINC used to use the RDKit cartridge
> > with PostgreSQL as the backend for ZINC. They had the database sharded
> > across multiple instances and managed to get the fingerprint indices to
> > work there. I don't remember the substructure search performance being
> > terrible, but it wasn't great either. They have since switched to a
> > specialized system (Arthor from NextMove software), which offers
> > significantly better performance.
>
> Generally speaking a database of a billion rows needs hardware capable
> of running it. Buy a server with 1TB RAM and 64 cores and a couple of
> U.2 NVME drives and see how Postgres runs on that.
>
> Then you need to look at the database, e.g. a query on an indexed
> billion-row table could be OK, but inserting the billion-and-first row
> will not be.
>
> If you want to scale to these kinds of volumes, you need to do some work.
>
> (And much of the point of no-sql hadoop "cloud" workflows is that if you
> can parallelize what you're doing to multiple machines, at some data
> size they will start outperforming a centralized fast search engine.)
>
> Dima
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Scalability of Postgres cartridge

2020-06-08 Thread Finnerty, Jim via Rdkit-discuss
If you have a billion molecule data source and would like to try an at-scale 
test, I'd be willing to help out with provisioning the hardware, looking at the 
efficiency of the plans, etc., using rdkit with Aurora PostgreSQL.

If I understand how the rdkit GiST index filtering mechanism works for a given
similarity metric, a parallel GiST index scan ought to be able to scale almost
linearly with the number of cores, provided that the RDBMS is built on a
scalable storage subsystem.

If so, the largest instance size that's currently supported has 96 cores, so we 
can do a fairly high degree of parallelism.
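
To put rough numbers on "almost linearly", here is a minimal sketch using
Amdahl's law. The parallel fractions are made-up assumptions, not measurements
of the cartridge or of Aurora PostgreSQL, and the function name is
hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup from Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed parallel fractions; real values would have to be measured.
    for p in (0.90, 0.99, 0.999):
        print(f"p={p}: ideal speedup on 96 cores is {amdahl_speedup(p, 96):.1f}x")
```

In this model even a 1% serial share roughly halves the ideal 96x speedup, so
"almost linear" depends heavily on how little of the scan stays serial.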

On 6/5/20, 1:07 PM, "dmaziuk via Rdkit-discuss" wrote:

On 6/5/2020 4:45 AM, Greg Landrum wrote:

> Having said that, the team behind ZINC used to use the RDKit cartridge
> with PostgreSQL as the backend for ZINC. They had the database sharded
> across multiple instances and managed to get the fingerprint indices to
> work there. I don't remember the substructure search performance being
> terrible, but it wasn't great either. They have since switched to a
> specialized system (Arthor from NextMove software), which offers
> significantly better performance.

Generally speaking a database of a billion rows needs hardware capable
of running it. Buy a server with 1TB RAM and 64 cores and a couple of
U.2 NVME drives and see how Postgres runs on that.

Then you need to look at the database, e.g. a query on an indexed
billion-row table could be OK, but inserting the billion-and-first row
will not be.

If you want to scale to these kinds of volumes, you need to do some work.

(And much of the point of no-sql hadoop "cloud" workflows is that if you
can parallelize what you're doing to multiple machines, at some data
size they will start outperforming a centralized fast search engine.)

Dima




Re: [Rdkit-discuss] Scalability of Postgres cartridge

2020-06-05 Thread dmaziuk via Rdkit-discuss

On 6/5/2020 4:45 AM, Greg Landrum wrote:


> Having said that, the team behind ZINC used to use the RDKit cartridge with
> PostgreSQL as the backend for ZINC. They had the database sharded
> across multiple instances and managed to get the fingerprint indices to
> work there. I don't remember the substructure search performance being
> terrible, but it wasn't great either. They have since switched to a
> specialized system (Arthor from NextMove software), which offers
> significantly better performance.


Generally speaking a database of a billion rows needs hardware capable 
of running it. Buy a server with 1TB RAM and 64 cores and a couple of 
U.2 NVME drives and see how Postgres runs on that.


Then you need to look at the database, e.g. a query on an indexed
billion-row table could be OK, but inserting the billion-and-first row
will not be.


If you want to scale to these kinds of volumes, you need to do some work.

(And much of the point of no-sql hadoop "cloud" workflows is that if you 
can parallelize what you're doing to multiple machines, at some data 
size they will start outperforming a centralized fast search engine.)
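
The crossover point can be sketched with a toy cost model: one fast machine
scans at a fixed rate, while a cluster pays a fixed coordination overhead but
scans in parallel. All rates, the overhead, and the function names below are
hypothetical; this is arithmetic under stated assumptions, not a benchmark.

```python
def single_time(rows, rate):
    """Seconds for one machine scanning `rate` rows per second."""
    return rows / rate


def cluster_time(rows, machines, rate_per_machine, overhead_s):
    """Seconds for a cluster: fixed overhead plus a perfectly split scan."""
    return overhead_s + rows / (machines * rate_per_machine)


def crossover_rows(single_rate, machines, rate_per_machine, overhead_s):
    """Row count above which the cluster is faster, or None if never."""
    aggregate = machines * rate_per_machine
    if aggregate <= single_rate:
        return None  # the cluster never catches up in this model
    return overhead_s / (1.0 / single_rate - 1.0 / aggregate)


if __name__ == "__main__":
    # Assumed numbers: one engine at 10M rows/s versus 20 nodes at
    # 1M rows/s each, with 30 s of job startup and coordination overhead.
    n = crossover_rows(1e7, 20, 1e6, 30.0)
    print(f"cluster wins above roughly {n:.3g} rows")
```

With these assumed numbers the cluster only pulls ahead somewhere in the
hundreds of millions of rows; below that the overhead dominates.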


Dima




Re: [Rdkit-discuss] Scalability of Postgres cartridge

2020-06-05 Thread Greg Landrum
Hi Ivan,

I have not pushed the cartridge towards storing billions of molecules. I
did a blog post looking at performance with 10 million rows (
http://rdkit.blogspot.com/2020/01/some-thoughts-on-performance-of-rdkit.html)
but, as I mentioned there, I probably wouldn't choose a relational database
for the billion molecule case (you're unlikely to have multiple linked
tables with data there, so there's not much point in using a relational DB).

Having said that, the team behind ZINC used to use the RDKit cartridge with
PostgreSQL as the backend for ZINC. They had the database sharded
across multiple instances and managed to get the fingerprint indices to
work there. I don't remember the substructure search performance being
terrible, but it wasn't great either. They have since switched to a
specialized system (Arthor from NextMove software), which offers
significantly better performance.
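
The sharded setup described above can be pictured as a scatter-gather search.
This is only a minimal sketch: the shards are in-memory lists standing in for
separate PostgreSQL instances, fingerprints are plain Python ints, and all
names are hypothetical. It is not how ZINC or the cartridge implements it.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def search_shard(shard, query_fp, threshold):
    """Scan one shard; return (similarity, mol_id) hits at or above threshold."""
    hits = []
    for mol_id, fp in shard:
        sim = tanimoto(query_fp, fp)
        if sim >= threshold:
            hits.append((sim, mol_id))
    return hits


def search_all(shards, query_fp, threshold=0.5, top_k=10):
    """Scatter the query to every shard, then gather and rank the hits."""
    # Threads stand in for network calls to separate database instances.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = pool.map(lambda s: search_shard(s, query_fp, threshold), shards)
        merged = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(top_k, merged)


if __name__ == "__main__":
    shards = [
        [("mol-1", 0b1111), ("mol-2", 0b0001)],
        [("mol-3", 0b0111), ("mol-4", 0b1000)],
    ]
    print(search_all(shards, query_fp=0b0111, threshold=0.5))
```

Each shard only ever sees its own rows, so the per-shard index stays smaller;
the price is the extra merge step and the need to query every shard.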

Best,
-greg



On Thu, Jun 4, 2020 at 2:17 PM Ivan Tubert-Brohman <
ivan.tubert-broh...@schrodinger.com> wrote:

> Hi,
>
> I've never tried the RDKit PostgreSQL cartridge but I'm curious about it.
> In particular I wonder how far people have pushed it in terms of
> database size. The documentation gives examples with several million rows;
> has anyone tried it with a couple billion rows? How fast are substructure
> queries with databases of that size? How much storage is needed after
> accounting for the fingerprints etc.?
>
> Best regards,
> Ivan


[Rdkit-discuss] Scalability of Postgres cartridge

2020-06-04 Thread Ivan Tubert-Brohman
Hi,

I've never tried the RDKit PostgreSQL cartridge but I'm curious about it.
In particular I wonder how far people have pushed it in terms of
database size. The documentation gives examples with several million rows;
has anyone tried it with a couple billion rows? How fast are substructure
queries with databases of that size? How much storage is needed after
accounting for the fingerprints etc.?
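
For the storage part, a back-of-envelope estimate is easy to sketch. Every
number below (fingerprint length, average molecule size, per-row overhead) is
an assumption for illustration, not a figure from the cartridge
documentation, and index sizes are left out entirely.

```python
def estimate_storage_gb(rows, fp_bits, mol_bytes, overhead_bytes=0):
    """Rough table size in GB (10**9 bytes), excluding any indexes."""
    fp_bytes = (fp_bits + 7) // 8  # round the bit vector up to whole bytes
    return rows * (fp_bytes + mol_bytes + overhead_bytes) / 1e9


if __name__ == "__main__":
    # Assumed: 2 billion rows, one 2048-bit fingerprint per molecule,
    # about 200 bytes per stored molecule, 30 bytes of per-row overhead.
    size = estimate_storage_gb(2_000_000_000, 2048, 200, 30)
    print(f"about {size:.0f} GB before indexes")
```

Under these assumptions the table alone lands near a terabyte, so the index
size and the available RAM matter a lot at this scale.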

Best regards,
Ivan