If you have a billion-molecule data source and would like to try an at-scale 
test, I'd be willing to help out with provisioning the hardware, reviewing the 
efficiency of the query plans, etc., using rdkit with Aurora PostgreSQL.

If I understand how the rdkit GIST index filtering mechanism works for a given 
similarity metric, a parallel GIST index scan ought to be able to scale almost 
linearly with the number of cores, provided that the RDBMS is built on a 
scalable storage subsystem. 

If so, the largest instance size that's currently supported has 96 cores, so we 
can run with a fairly high degree of parallelism.
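
As a rough way to reason about what "almost linearly" could mean on a 96-core 
instance, here is a minimal sketch using Amdahl's law. The function name and 
the parallel fractions are assumptions chosen for illustration, not 
measurements of rdkit or Aurora PostgreSQL; the real parallel fraction of a 
GIST scan would have to come from an actual test.

```python
def amdahl_speedup(workers: int, parallel_fraction: float) -> float:
    """Estimate speedup from Amdahl's law.

    workers: number of parallel workers (e.g. cores used by the scan).
    parallel_fraction: share of the work that parallelizes (0.0 to 1.0);
        this is an assumed input, not something measured here.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical parallel fractions, purely for illustration.
    for fraction in (1.0, 0.99, 0.95):
        estimate = amdahl_speedup(96, fraction)
        print(f"parallel fraction {fraction:.2f}: ~{estimate:.1f}x on 96 workers")
```

Under these assumptions, even a 1% serial portion caps the estimate at roughly 
49x on 96 workers, so near-linear scaling depends on the scan and the storage 
layer having almost no serialized work.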

On 6/5/20, 1:07 PM, "dmaziuk via Rdkit-discuss" 
<rdkit-discuss@lists.sourceforge.net> wrote:

    On 6/5/2020 4:45 AM, Greg Landrum wrote:
    
    > Having said that, the team behind ZINC used to use the RDKit cartridge with
    > PostgreSQL as the backend for ZINC. They had the database sharded
    > across multiple instances and managed to get the fingerprint indices to
    > work there. I don't remember the substructure search performance being
    > terrible, but it wasn't great either. They have since switched to a
    > specialized system (Arthor from NextMove software), which offers
    > significantly better performance.
    
    Generally speaking a database of a billion rows needs hardware capable
    of running it. Buy a server with 1TB RAM and 64 cores and a couple of
    U.2 NVME drives and see how Postgres runs on that.
    
    Then you need to look at the database, e.g. a query against an indexed
    billion-row table could be OK, but inserting the billion-and-first row will not be.
    
    If you want to scale to these kinds of volumes, you need to do some work.
    
    (And much of the point of no-sql hadoop "cloud" workflows is that if you
    can parallelize what you're doing to multiple machines, at some data
    size they will start outperforming a centralized fast search engine.)
    
    Dima
    
    
    _______________________________________________
    Rdkit-discuss mailing list
    Rdkit-discuss@lists.sourceforge.net
    https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
    

