Re: External Hashing [was Re: matching strings in a large set of strings]

2010-05-01 Thread Jack
http://www.swizwatch.com/
All Cartier replica watches sold at Hotwristwatch.com are brand-new and high
quality. Each Cartier Replica Watch produced is examined carefully by our
quality test department and each watch is inspected again before being sent
to our customer. It is our desire that you do experience the joy of shopping
when buying one of our Cartier Replica Watches at our site. Some Cartier
Watches may need to be special ordered, please call for availability on
Cartier watches. Best service you will receive from us.

Helmut Jarausch jarau...@igpm.rwth-aachen.de 
??:840jkofai...@mid.dfncis.de...
I think one could apply an external hashing technique which would require 
only
 very few disk accesses per lookup.
 Unfortunately, I'm now aware of an implementation in Python.
 Does anybody know about a Python implementation of external hashing?

 Thanks,
 Helmut.

 -- 
 Helmut Jarausch

 Lehrstuhl fuer Numerische Mathematik
 RWTH - Aachen University
 D 52056 Aachen, Germany 


-- 
http://mail.python.org/mailman/listinfo/python-list


External Hashing [was Re: matching strings in a large set of strings]

2010-04-30 Thread Helmut Jarausch
I think one could apply an external hashing technique which would require only
very few disk accesses per lookup.
Unfortunately, I'm now aware of an implementation in Python.
Does anybody know about a Python implementation of external hashing?

Thanks,
Helmut.

-- 
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: External Hashing [was Re: matching strings in a large set of strings]

2010-04-30 Thread Tim Chase

On 04/30/2010 12:51 PM, Helmut Jarausch wrote:

I think one could apply an external hashing technique which would require only
very few disk accesses per lookup.
Unfortunately, I'm now aware of an implementation in Python.
Does anybody know about a Python implementation of external hashing?


While you don't detail what you're hashing, Stephan Behnel 
already suggested (in the parent thread) using one of Python's 
native dbm modules (I just use anydbm and let it choose).  The 
underlying implementations should be fairly efficient assuming 
you don't use the dumbdbm last-resort fallback).  With the anydbm 
interface, you can implement dict/set semantics as long as you 
take care that everything is marshalled into and out of strings 
for keys/values.


-tkc



--
http://mail.python.org/mailman/listinfo/python-list


Re: External Hashing [was Re: matching strings in a large set of strings]

2010-04-30 Thread Dave Angel

Helmut Jarausch wrote:

I think one could apply an external hashing technique which would require only
very few disk accesses per lookup.
Unfortunately, I'm now aware of an implementation in Python.
Does anybody know about a Python implementation of external hashing?

Thanks,
Helmut.

  
That's what databases are for.  But if you're trying to streamline it 
further than a database, you need to give some constraints.  In an 
earlier thread, you mentioned strings with a constant size.  I don't 
think you ever actually said that the match string was also of the same 
size, but I'll assume it is.  Final question is how often you'll be 
updating the list, versus searching it.  If you do both frequently, 
stick to a database.


On the other hand, if the list on disk is fixed, you could thoroughly 
optimize its makeup.  Perhaps the easiest efficient form would be to 
have a file in two parts.  The first part is the hash table, where each 
entry has a seek-offset and a size.  And the rest of the file is just a 
list of items, sorted by hash value.


If you make a hash large enough that the maximum collision list is a 
couple of k, then two seeks would get any record.  First seek to the 
hash entry, then use that to decide where to seek for the data, and how 
much to read.  Then you simply search that block in memory.


You could also play games with a two-level hash, where each block of 
records has its own mini hash table.  Doesn't seem worth it for a mere 
83million records.


DaveA
--
http://mail.python.org/mailman/listinfo/python-list