Re: [Rdkit-discuss] RDKit Postgres Cartridge Parallel Queries?

2018-06-01 Thread Greg Landrum
Hi Brian,

I just did a bit of looking here and either something has changed since my
first experiments with 9.6 or I was remembering incorrectly. The functions
exposed in the cartridge need to be marked as being "parallel safe" in
order to be usable in a parallel query. At the moment none of them are.
This would clearly be useful, so I'm going to start taking a look at adding
the relevant flags.

-greg
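
A minimal sketch of what adding the flags could look like. It assumes the function behind the @> operator is named substruct(mol, mol); that name is an assumption here, and the cartridge's SQL definitions are the authority on the real list. The helper parallel_safe_statements is hypothetical. ALTER FUNCTION ... PARALLEL SAFE and the pg_proc.proparallel column are standard in PostgreSQL 9.6 and later.

```python
# Hypothetical helper: build ALTER FUNCTION statements that mark
# functions as PARALLEL SAFE (PostgreSQL 9.6+ syntax).
def parallel_safe_statements(signatures):
    """Return one ALTER FUNCTION ... PARALLEL SAFE statement per signature."""
    return [f"ALTER FUNCTION {sig} PARALLEL SAFE;" for sig in signatures]


# Assumed example signature; check the cartridge's SQL files for the real ones.
EXAMPLE_SIGNATURES = ["substruct(mol, mol)"]

# Catalog query to inspect the current marking:
# proparallel is 's' (safe), 'r' (restricted) or 'u' (unsafe).
CHECK_QUERY = (
    "SELECT proname, proparallel FROM pg_proc WHERE proname = 'substruct';"
)

for stmt in parallel_safe_statements(EXAMPLE_SIGNATURES):
    print(stmt)
print(CHECK_QUERY)
```

The generated statements can be pasted into psql; functions left at the default marking ('u') keep the planner from choosing a parallel plan for queries that call them.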


On Fri, Jun 1, 2018 at 5:05 PM Brian Cole  wrote:

> Doesn't appear like ::mol parallelized either. Only seeing the following
> use 1 CPU in top.
>
> ligandlibrary=# explain analyze select count(*) from ligands where rdkit_mol@>'Br'::mol;
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Aggregate  (cost=50959.06..50959.07 rows=1 width=8) (actual time=791284.354..791284.354 rows=1 loops=1)
>    ->  Bitmap Heap Scan on ligands  (cost=3156.60..50926.74 rows=12927 width=0) (actual time=252201.744..790985.637 rows=667236 loops=1)
>          Recheck Cond: (rdkit_mol @> 'Br'::mol)
>          Rows Removed by Index Recheck: 13725739
>          Heap Blocks: exact=42169 lossy=1254494
>          ->  Bitmap Index Scan on rdkit_substructure_idx  (cost=0.00..3153.37 rows=12927 width=0) (actual time=252166.576..252166.576 rows=14511013 loops=1)
>                Index Cond: (rdkit_mol @> 'Br'::mol)
>  Planning time: 0.109 ms
>  Execution time: 791284.588 ms
> (9 rows)
>
> Time: 791385.473 ms (13:11.385)
> ligandlibrary=# select name, setting from pg_settings where name like 'dynamic_shared_memory_type';
>             name            | setting
> ----------------------------+---------
>  dynamic_shared_memory_type | posix
> (1 row)
>
> Time: 41.439 ms
> ligandlibrary=# select name, setting from pg_settings where name like 'max_parallel_workers_per_gather';
>               name               | setting
> ---------------------------------+---------
>  max_parallel_workers_per_gather | 2
> (1 row)
>
> Time: 0.926 ms
> ligandlibrary=#
>
> Maybe some other flag I need to specify. Only 2 cores in this system at
> the moment, maybe it only parallelizes when there's more than 2 cores?
>
> Thanks,
> Brian
>
>
> On Fri, Jun 1, 2018 at 10:07 AM, Greg Landrum 
> wrote:
>
>> I think they should. Does a ::mol query on the same table parallelize? If
>> it does but a ::qmol query does not maybe I forgot something in the SQL
>> function definitions
>>
>> On Fri, 1 Jun 2018 at 15:43, Brian Cole  wrote:
>>
>>> Hi Greg,
>>>
>>> Are SMARTS searches with the ::qmol type supposed to parallelize? They
>>> don't appear to be either.
>>>
>>> -Brian
>>>
>>> On Fri, Jun 1, 2018 at 1:46 AM, Greg Landrum 
>>> wrote:
>>>
>>>> Hi Brian,
>>>>
>>>> When the new parallel queries came out I checked that they actually
>>>> could be used and things seemed fine.
>>>> The problem (and it's a sizable one) is that parallel queries don't use
>>>> the index. Until parallel scans using GIST indices work, I don't think this
>>>> is really going to help much.
>>>>
>>>> -greg
>>>>
>>>>
>>>> On Fri, Jun 1, 2018 at 12:04 AM Brian Cole wrote:
>>>>
>>>>> It appears like Postgres 9.6+ supports parallel queries now to
>>>>> accelerate slow queries:
>>>>> https://www.postgresql.org/docs/10/static/parallel-query.html
>>>>>
>>>>> Has anyone successfully got this to accelerate substructure queries
>>>>> with the RDKit Postgres cartridge?
>>>>>
>>>>> Thanks,
>>>>> Brian
>>>>>
>
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

>>>
>


Re: [Rdkit-discuss] RDKit Postgres Cartridge Parallel Queries?

2018-06-01 Thread Brian Cole
Doesn't appear like ::mol parallelized either. Only seeing the following
use 1 CPU in top.

ligandlibrary=# explain analyze select count(*) from ligands where rdkit_mol@>'Br'::mol;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=50959.06..50959.07 rows=1 width=8) (actual time=791284.354..791284.354 rows=1 loops=1)
   ->  Bitmap Heap Scan on ligands  (cost=3156.60..50926.74 rows=12927 width=0) (actual time=252201.744..790985.637 rows=667236 loops=1)
         Recheck Cond: (rdkit_mol @> 'Br'::mol)
         Rows Removed by Index Recheck: 13725739
         Heap Blocks: exact=42169 lossy=1254494
         ->  Bitmap Index Scan on rdkit_substructure_idx  (cost=0.00..3153.37 rows=12927 width=0) (actual time=252166.576..252166.576 rows=14511013 loops=1)
               Index Cond: (rdkit_mol @> 'Br'::mol)
 Planning time: 0.109 ms
 Execution time: 791284.588 ms
(9 rows)

Time: 791385.473 ms (13:11.385)
ligandlibrary=# select name, setting from pg_settings where name like 'dynamic_shared_memory_type';
            name            | setting
----------------------------+---------
 dynamic_shared_memory_type | posix
(1 row)

Time: 41.439 ms
ligandlibrary=# select name, setting from pg_settings where name like 'max_parallel_workers_per_gather';
              name               | setting
---------------------------------+---------
 max_parallel_workers_per_gather | 2
(1 row)

Time: 0.926 ms
ligandlibrary=#

Maybe some other flag I need to specify. Only 2 cores in this system at the
moment, maybe it only parallelizes when there's more than 2 cores?

Thanks,
Brian
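
A rough sketch for reading a plan like the one above programmatically. A parallel plan would contain a Gather (or Gather Merge) node, so its absence confirms the query ran in a single process. The sketch also reports how many heap blocks were lossy, which usually means the bitmap did not fit in work_mem and every row on those pages had to be rechecked. The function name summarize_plan and the abbreviated PLAN text are illustrative only.

```python
import re


def summarize_plan(plan_text):
    """Extract a few diagnostics from EXPLAIN ANALYZE text output."""
    # Parallel plans contain a Gather or Gather Merge node.
    uses_parallel = re.search(r"\bGather\b", plan_text) is not None

    lossy_fraction = None
    blocks = re.search(r"Heap Blocks: exact=(\d+) lossy=(\d+)", plan_text)
    if blocks:
        exact, lossy = int(blocks.group(1)), int(blocks.group(2))
        lossy_fraction = lossy / (exact + lossy)

    removed = re.search(r"Rows Removed by Index Recheck: (\d+)", plan_text)
    return {
        "uses_parallel": uses_parallel,
        "lossy_fraction": lossy_fraction,
        "rows_removed_by_recheck": int(removed.group(1)) if removed else None,
    }


# Abbreviated copy of the plan posted in this message.
PLAN = """\
Aggregate  (cost=50959.06..50959.07 rows=1 width=8)
  ->  Bitmap Heap Scan on ligands  (cost=3156.60..50926.74 rows=12927 width=0)
        Recheck Cond: (rdkit_mol @> 'Br'::mol)
        Rows Removed by Index Recheck: 13725739
        Heap Blocks: exact=42169 lossy=1254494
        ->  Bitmap Index Scan on rdkit_substructure_idx
              Index Cond: (rdkit_mol @> 'Br'::mol)
"""

summary = summarize_plan(PLAN)
print(summary)
```

For this plan the lossy share is well above 90%, so raising work_mem for the session might be worth trying independently of the parallel-query question.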


On Fri, Jun 1, 2018 at 10:07 AM, Greg Landrum 
wrote:

> I think they should. Does a ::mol query on the same table parallelize? If
> it does but a ::qmol query does not maybe I forgot something in the SQL
> function definitions
>
> On Fri, 1 Jun 2018 at 15:43, Brian Cole  wrote:
>
>> Hi Greg,
>>
>> Are SMARTS searches with the ::qmol type supposed to parallelize? They
>> don't appear to be either.
>>
>> -Brian
>>
>> On Fri, Jun 1, 2018 at 1:46 AM, Greg Landrum 
>> wrote:
>>
>>> Hi Brian,
>>>
>>> When the new parallel queries came out I checked that they actually
>>> could be used and things seemed fine.
>>> The problem (and it's a sizable one) is that parallel queries don't use
>>> the index. Until parallel scans using GIST indices work, I don't think this
>>> is really going to help much.
>>>
>>> -greg
>>>
>>>
>>> On Fri, Jun 1, 2018 at 12:04 AM Brian Cole  wrote:
>>>
>>>> It appears like Postgres 9.6+ supports parallel queries now to
>>>> accelerate slow queries:
>>>> https://www.postgresql.org/docs/10/static/parallel-query.html
>>>>
>>>> Has anyone successfully got this to accelerate substructure queries
>>>> with the RDKit Postgres cartridge?
>>>>
>>>> Thanks,
>>>> Brian

 

>>>
>>


Re: [Rdkit-discuss] RDKit Postgres Cartridge Parallel Queries?

2018-06-01 Thread Greg Landrum
I think they should. Does a ::mol query on the same table parallelize? If
it does but a ::qmol query does not maybe I forgot something in the SQL
function definitions

On Fri, 1 Jun 2018 at 15:43, Brian Cole  wrote:

> Hi Greg,
>
> Are SMARTS searches with the ::qmol type supposed to parallelize? They
> don't appear to be either.
>
> -Brian
>
> On Fri, Jun 1, 2018 at 1:46 AM, Greg Landrum 
> wrote:
>
>> Hi Brian,
>>
>> When the new parallel queries came out I checked that they actually could
>> be used and things seemed fine.
>> The problem (and it's a sizable one) is that parallel queries don't use
>> the index. Until parallel scans using GIST indices work, I don't think this
>> is really going to help much.
>>
>> -greg
>>
>>
>> On Fri, Jun 1, 2018 at 12:04 AM Brian Cole  wrote:
>>
>>> It appears like Postgres 9.6+ supports parallel queries now to
>>> accelerate slow queries:
>>> https://www.postgresql.org/docs/10/static/parallel-query.html
>>>
>>> Has anyone successfully got this to accelerate substructure queries with
>>> the RDKit Postgres cartridge?
>>>
>>> Thanks,
>>> Brian
>>>
>>>
>>>
>>
>


Re: [Rdkit-discuss] RDKit Postgres Cartridge Parallel Queries?

2018-06-01 Thread Brian Cole
Hi Greg,

Are SMARTS searches with the ::qmol type supposed to parallelize? They
don't appear to be either.

-Brian

On Fri, Jun 1, 2018 at 1:46 AM, Greg Landrum  wrote:

> Hi Brian,
>
> When the new parallel queries came out I checked that they actually could
> be used and things seemed fine.
> The problem (and it's a sizable one) is that parallel queries don't use
> the index. Until parallel scans using GIST indices work, I don't think this
> is really going to help much.
>
> -greg
>
>
> On Fri, Jun 1, 2018 at 12:04 AM Brian Cole  wrote:
>
>> It appears like Postgres 9.6+ supports parallel queries now to accelerate
>> slow queries:
>> https://www.postgresql.org/docs/10/static/parallel-query.html
>>
>> Has anyone successfully got this to accelerate substructure queries with
>> the RDKit Postgres cartridge?
>>
>> Thanks,
>> Brian
>>
>> 
>>
>


Re: [Rdkit-discuss] RDKit Postgres Cartridge Parallel Queries?

2018-05-31 Thread Greg Landrum
Hi Brian,

When the new parallel queries came out I checked that they actually could
be used and things seemed fine.
The problem (and it's a sizable one) is that parallel queries don't use the
index. Until parallel scans using GIST indices work, I don't think this is
really going to help much.

-greg


On Fri, Jun 1, 2018 at 12:04 AM Brian Cole  wrote:

> It appears like Postgres 9.6+ supports parallel queries now to accelerate
> slow queries:
> https://www.postgresql.org/docs/10/static/parallel-query.html
>
> Has anyone successfully got this to accelerate substructure queries with
> the RDKit Postgres cartridge?
>
> Thanks,
> Brian
>
>
>


Re: [Rdkit-discuss] RDKit postgres cartridge building

2018-05-24 Thread Paolo Tosco

Dear Alfredo,

This file contains fairly comprehensive instructions on how to build and
install the cartridge:


https://github.com/rdkit/rdkit/blob/master/Code/PgSQL/rdkit/README

Please get back to me off-list if you still have issues in getting it to 
work for you.


Cheers,
p.


On 05/24/18 19:57, Alfredo Quevedo wrote:

> Thank you, Markus, for the reply.
>
> Listing the /usr/share/postgresql directory, I can see several folders
> apart from 10 (9.2, 9.3, etc.).
>
> I uninstalled the current Postgres installation and deleted all the
> folders under /usr/share/postgresql. Afterwards I reinstalled
> Postgres, and only /usr/share/postgresql/10 is now present.
>
> Afterwards I recompiled the RDKit source with:
>
> cmake -DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_CAIRO_SUPPORT=ON ..
>
> Is this enough to automatically build the RDKit Postgres package, or do I
> need some extra build flags (such as -DRDK_BUILD_PGSQL=ON)?
>
> Regards,
> Alfredo
>
> On 24/05/2018 at 15:36, Markus Sitzmann wrote:
>
>> Hi Alfredo,
>>
>> My first guess would be that you have another, older Postgres version on
>> your computer and you have built against this version. Take a look at
>> the /usr/share/postgresql directory and check whether there is
>> another directory instead of 10/.
>>
>> Markus
>>
>> -
>> |  Markus Sitzmann
>> | markus.sitzm...@gmail.com
>>
>> On 24. May 2018, at 18:24, Alfredo Quevedo wrote:
>>
>>> Good morning,
>>>
>>> I am trying to build RDKit from source, and succeeded with that by
>>> following the instructions provided in the documentation. However, I
>>> am trying to use the Postgres cartridge, which as far as I
>>> understand is built during the main build process.
>>>
>>> But after trying to create the extension for a database with:
>>>
>>> psql -c  'create extension rdkit'  molecules
>>>
>>> I am getting the following error:
>>>
>>> ERROR:  could not open extension control file
>>> "/usr/share/postgresql/10/extension/rdkit.control": No such file or
>>> directory
>>>
>>> It seems that the cartridge build is not being applied to
>>> my local Postgres installation?
>>>
>>> Any hint is highly appreciated,
>>>
>>> thanks in advance



Re: [Rdkit-discuss] RDKit postgres cartridge building

2018-05-24 Thread Alfredo Quevedo

Thank you, Markus, for the reply.

Listing the /usr/share/postgresql directory, I can see several folders
apart from 10 (9.2, 9.3, etc.).

I uninstalled the current Postgres installation and deleted all the
folders under /usr/share/postgresql. Afterwards I reinstalled Postgres,
and only /usr/share/postgresql/10 is now present.

Afterwards I recompiled the RDKit source with:

cmake -DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_CAIRO_SUPPORT=ON ..

Is this enough to automatically build the RDKit Postgres package, or do I
need some extra build flags (such as -DRDK_BUILD_PGSQL=ON)?


regards

Alfredo
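
A small sketch for checking whether the cartridge actually landed where PostgreSQL looks for it. As far as I know the cartridge is only built when -DRDK_BUILD_PGSQL=ON is passed to cmake, but the README in the RDKit source tree (Code/PgSQL/rdkit/README) is the authority on that. The helper names below are hypothetical; pg_config --sharedir is a standard PostgreSQL command and is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def pg_sharedir():
    """Ask pg_config (must be on PATH) for PostgreSQL's share directory."""
    result = subprocess.run(
        ["pg_config", "--sharedir"],
        capture_output=True, text=True, check=True,
    )
    return Path(result.stdout.strip())


def control_file(sharedir, extension="rdkit"):
    """Path where CREATE EXTENSION looks for the extension control file."""
    return Path(sharedir) / "extension" / f"{extension}.control"


def is_installed(sharedir, extension="rdkit"):
    """True if the control file exists under the given share directory."""
    return control_file(sharedir, extension).is_file()


# The directory named in the error message from this thread:
print(control_file("/usr/share/postgresql/10"))
```

Calling is_installed(pg_sharedir()) on the machine in question shows whether the install step copied rdkit.control into the directory of the PostgreSQL version that psql is talking to. If several versions are installed, the pg_config found first on the PATH may belong to a different one.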





On 24/05/2018 at 15:36, Markus Sitzmann wrote:

> Hi Alfredo,
>
> My first guess would be that you have another, older Postgres version on
> your computer and you have built against this version. Take a look at
> the /usr/share/postgresql directory and check whether there is
> another directory instead of 10/.
>
> Markus
>
> -
> |  Markus Sitzmann
> | markus.sitzm...@gmail.com
>
> On 24. May 2018, at 18:24, Alfredo Quevedo wrote:
>
>> Good morning,
>>
>> I am trying to build RDKit from source, and succeeded with that by
>> following the instructions provided in the documentation. However, I
>> am trying to use the Postgres cartridge, which as far as I understand
>> is built during the main build process.
>>
>> But after trying to create the extension for a database with:
>>
>> psql -c  'create extension rdkit'  molecules
>>
>> I am getting the following error:
>>
>> ERROR:  could not open extension control file
>> "/usr/share/postgresql/10/extension/rdkit.control": No such file or
>> directory
>>
>> It seems that the cartridge build is not being applied to
>> my local Postgres installation?
>>
>> Any hint is highly appreciated,
>>
>> thanks in advance





Re: [Rdkit-discuss] RDKit postgres cartridge building

2018-05-24 Thread Markus Sitzmann
Hi Alfredo,

My first guess would be that you have another, older Postgres version on your
computer and you have built against this version. Take a look at the
/usr/share/postgresql directory and check whether there is another directory
instead of 10/.

Markus

-
|  Markus Sitzmann
|  markus.sitzm...@gmail.com

> On 24. May 2018, at 18:24, Alfredo Quevedo wrote:
>
> Good morning,
>
> I am trying to build RDKit from source, and succeeded with that by following the
> instructions provided in the documentation. However, I am trying to use the
> Postgres cartridge, which as far as I understand is built during the main
> build process.
>
> But after trying to create the extension for a database with:
>
> psql -c  'create extension rdkit'  molecules
>
> I am getting the following error:
>
> ERROR:  could not open extension control file
> "/usr/share/postgresql/10/extension/rdkit.control": No such file or directory
>
> It seems that the cartridge build is not being applied to my local
> Postgres installation?
>
> Any hint is highly appreciated,
>
> thanks in advance
> 
> 
> 


Re: [Rdkit-discuss] rdkit postgres cartridge

2010-07-17 Thread Greg Landrum
After loading your dataset into a database on my linux machine, I'm
starting to wonder about my own answer below:

On Sat, Jul 17, 2010 at 6:02 AM, Greg Landrum greg.land...@gmail.com wrote:

> There are two parts here:
> 1) The RDKit does a lot of work when it reads a molecule, so it's
> comparatively slow. I generally expect that it will spend 1-4 seconds per
> thousand molecules (depending on cpu speed, obviously). Your set of 25K
> molecules takes (on my macbook) around 6 seconds per thousand. If I break
> that down by block, most of the time is spent on the first molecules:
> In [8]: for i in range(0,len(s),1000):
>    ...:   t1=time.time()
>    ...:   ms=[s[x] for x in range(i,min(i+1000,len(s)))]
>    ...:   t2=time.time()
>    ...:   print i,'%.2f'%(t2-t1)
>    ...:
> 0 10.75
> 1000 17.78
> 2000 11.03
> 3000 11.01
> 4000 7.73
> 5000 5.08
> 6000 4.62
> 7000 5.14
> 8000 4.44
> 9000 4.08
> ...
> without looking at them, I suspect you have the larger and more complex
> molecules at the beginning of the file? I will see if there are any real
> outliers in the dataset that I can use to suggest further optimizations to
> the molecule processing code.

I just re-ran this experiment on my linux box, which is not exactly
modern (4.5 years old, 2.8GHz Pentium D):
In [5]: for i in range(0,len(s),1000):
   ...:     t1=time.time()
   ...:     ms=[s[x] for x in range(i,min(i+1000,len(s)))]
   ...:     t2=time.time()
   ...:     print i,'%.2f'%(t2-t1)
   ...:
0 2.37
1000 4.10
2000 2.94
3000 2.85
4000 2.17
5000 1.49
6000 1.40
7000 1.52
...
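
For anyone repeating this measurement today, here is a sketch of the same per-block timing in current Python. In the transcript, s appears to be an indexable molecule supplier; this sketch instead takes any list plus a parse function. The names time_blocks and dummy_parse are made up for illustration, and the stand-in parser is there only so the example runs without RDKit. With RDKit installed, rdkit.Chem.MolFromSmiles could be passed instead.

```python
import time


def time_blocks(items, parse, block_size=1000):
    """Apply parse to consecutive blocks of items.

    Returns a list of (start_index, seconds) pairs, one per block.
    """
    timings = []
    for start in range(0, len(items), block_size):
        t1 = time.perf_counter()
        _ = [parse(x) for x in items[start:start + block_size]]
        timings.append((start, time.perf_counter() - t1))
    return timings


# Stand-in parser so the sketch runs without RDKit installed.
def dummy_parse(smiles):
    return len(smiles)


smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"] * 1000
for start, seconds in time_blocks(smiles_list, dummy_parse):
    print(start, "%.4f" % seconds)
```

Blocks that take much longer than their neighbours are the place to look for unusually large or complex molecules.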

These are numbers much more in line with what I expect. The resulting
database load takes a more reasonable amount of time (in my eyes):
tjtest=# \timing
Timing is on.
tjtest=# copy mols from '/home/glandrum/t.smi' delimiter ' ';
COPY 25855
Time: 44648.633 ms

And the indexing is also substantially faster than what you saw:
tjtest=# create index midx on mols using gist(m);
CREATE INDEX
Time: 119232.414 ms

Searches are also faster (and, please notice, now they're correct):

tjtest=# select count(id) from mols where m  @ 'c1c1C(=O)NC';
 count
---
   546
(1 row)

Time: 380.455 ms

Could it be that you built the RDKit in debug mode, that your machine
was heavily loaded at the time you ran your tests, or that your
linux box is even older than mine?

Meanwhile, I need to go check on my macbook to figure out what
happened there; I guess I was using a debug build, because the macbook
is normally faster than my linux box.

-greg

--
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] rdkit postgres cartridge

2010-07-16 Thread Greg Landrum
Hi TJ,

On Fri, Jul 16, 2010 at 4:02 PM, TJ O'Donnell t...@acm.org wrote:

> I'm having a good time playing with your new postgres cartridge. I've
> run into a few problems I thought you could help with.
>
> First a summary of what I did, then a few questions. I'm using postgres
> 8.4.4 on linux, your latest rdkit and cartridge from svn, as of last week.
>
> Create table rdmol (id integer, smiles text, m mol, mx mol)
> id and smiles from drugbank and first 22K of pubchem. see attached smi file
>
> update rdmol set m=smiles::mol
> 25855
> 171,693.016 ms
>
> update rdmol set mx=m
> 25855
> 1,512.227 ms
>
> create index molidx on rdmol using gist(m);
> 730,077.537 ms
>
> select count(id) from rdmol where mx @ 'c1c1C(=O)NC'
> 546
> 24,224.787 ms
>
> select count(id) from rdmol where m  @ 'c1c1C(=O)NC'
> 399
> 570.539 ms
>
> Is this slow speed to be expected when creating mol from smiles and
> gist(m)?

There are two parts here:

1) The RDKit does a lot of work when it reads a molecule, so it's
comparatively slow. I generally expect that it will spend 1-4 seconds per
thousand molecules (depending on cpu speed, obviously). Your set of 25K
molecules takes (on my macbook) around 6 seconds per thousand. If I break
that down by block, most of the time is spent on the first molecules:
In [8]: for i in range(0,len(s),1000):
   ...:   t1=time.time()
   ...:   ms=[s[x] for x in range(i,min(i+1000,len(s)))]
   ...:   t2=time.time()
   ...:   print i,'%.2f'%(t2-t1)
   ...:
0 10.75
1000 17.78
2000 11.03
3000 11.01
4000 7.73
5000 5.08
6000 4.62
7000 5.14
8000 4.44
9000 4.08
...
without looking at them, I suspect you have the larger and more complex
molecules at the beginning of the file? I will see if there are any real
outliers in the dataset that I can use to suggest further optimizations to
the molecule processing code.

2) Molecule indexing speed: this is determined by the speed (really the lack
thereof) of the layered fingerprinting code. The
fingerprinter enumerates all (branched and unbranched) molecular paths
containing from 1-7 bonds and hashes them. The inclusion of branched paths
makes the process slower, but (I believe) improves the screenout rate of the
fingerprint. There is a good amount of work left to be done on improving the
fingerprinter and its utility for substructure searching (SSS). We use a
different fingerprint at work for the index, so I haven't spent much time on this stuff.
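
To make the idea concrete, here is a toy sketch of a path-based fingerprint screen. It is not the RDKit layered fingerprint: it enumerates only linear (unbranched) paths, ignores bond orders and aromaticity, and hashes with zlib.crc32, all of which are simplifications chosen for illustration. The function names are hypothetical.

```python
import zlib


def enumerate_paths(labels, bonds, max_bonds=7):
    """Return the set of linear label paths with 1..max_bonds bonds.

    labels: dict mapping atom index -> element symbol
    bonds:  iterable of (i, j) index pairs (bond orders ignored here)
    """
    neighbors = {i: set() for i in labels}
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)

    paths = set()

    def extend(path):
        if len(path) > 1:
            forward = "-".join(labels[a] for a in path)
            backward = "-".join(labels[a] for a in reversed(path))
            paths.add(min(forward, backward))  # direction-independent key
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # simple paths only, no revisiting atoms
                extend(path + [nxt])

    for start in labels:
        extend([start])
    return paths


def fingerprint(paths, n_bits=2048):
    """Hash each path string to one bit of an integer used as a bit vector."""
    fp = 0
    for p in paths:
        fp |= 1 << (zlib.crc32(p.encode()) % n_bits)
    return fp


def may_contain(mol_fp, query_fp):
    """Screen: every bit set in the query must also be set in the molecule."""
    return (mol_fp & query_fp) == query_fp


# Toy molecule (acetamide heavy atoms) and a C-N query fragment.
mol_labels = {0: "C", 1: "C", 2: "O", 3: "N"}
mol_bonds = [(0, 1), (1, 2), (1, 3)]
query_labels = {0: "C", 1: "N"}
query_bonds = [(0, 1)]

mol_paths = enumerate_paths(mol_labels, mol_bonds)
query_paths = enumerate_paths(query_labels, query_bonds)
print(sorted(mol_paths))
print(may_contain(fingerprint(mol_paths), fingerprint(query_paths)))
```

A screen like this can only rule molecules out. Hash collisions and the loss of connectivity detail mean a molecule that passes still needs a full substructure match, and a true match should never be screened out.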


> More troubling is why are fewer superstructures found when the gist index
> is used?
>
> select smiles from rdmol where mx @ 'c1c1C(=O)NC'
>  except
> select smiles from rdmol where m  @ ' c1c1C(=O)NC '
>
> 147 rows
> c1ccc(C(Nc2cc(Cl)ccc2OCC(O)=O)=O)cc1
> CCN1C(=O)c2c2C1=O
> O=C(Nc1ccc(Br)cc1)c1c(O)c(Br)cc(Br)c1


Not good! not good at all. I'm willing to live with somewhat slower code for
preprocessing steps, but the results definitely should be correct. There's
clearly a parameter problem somewhere that's giving rise to this. I suspect
I know what it is and will fix it. Thanks for pointing this out!

Best Regards,
-greg