I still don't see the need for a dict-like structure holding 10M hashes just to decide which of some 10k lines to insert. Possible solutions:

1) If the files you're going to insert have fewer rows than the table, invert the logic: fetch only the table rows that could match the files. Instead of fetching and hashing 10M things, you hash 10k of them.

2) Choose proper primary keys and code a trigger (ON INSERT). Let the backend do the work (guess what, they're engineered to manage data!), not a single Python process that fills up its memory.

3) Store the hash in a separate column (or a separate table). Instead of fetching n rows * number-of-columns values and then hashing them, you fetch the already-computed hash.
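As a minimal sketch of option 3 combined with letting the backend enforce uniqueness: store a hash of the business-key columns in its own UNIQUE column, and let the database reject duplicates at insert time instead of comparing hashes in Python. This uses SQLite's `INSERT OR IGNORE` for brevity; other backends have equivalents (e.g. PostgreSQL's `INSERT ... ON CONFLICT DO NOTHING`). The table and column names are made up for the example.

```python
import hashlib
import sqlite3

def row_hash(*cols):
    """Hash the business-key columns of a record into a hex digest."""
    # A separator avoids collisions like ("ab", "c") vs ("a", "bc").
    joined = "\x1f".join(str(c) for c in cols)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE record (
        id INTEGER PRIMARY KEY,
        name TEXT,
        amount REAL,
        row_hash TEXT UNIQUE  -- the backend enforces uniqueness, not Python
    )
""")

# Third row is a duplicate of the first; the UNIQUE constraint drops it.
rows = [("alice", 10.0), ("bob", 20.0), ("alice", 10.0)]
for name, amount in rows:
    conn.execute(
        "INSERT OR IGNORE INTO record (name, amount, row_hash) VALUES (?, ?, ?)",
        (name, amount, row_hash(name, amount)),
    )
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]
print(count)  # 2
```

The point is that no Python-side dict of 10M hashes ever exists: the index on `row_hash` does the duplicate check, one row at a time.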
On Tuesday, March 17, 2015 at 12:14:20 AM UTC+1, LoveWeb2py wrote:
>
> Thank you for the feedback everyone.
>
> The main reason I fetch them all first is to make sure I'm not inserting
> duplicate records. We have a lot of files that have thousands of records
> and sometimes they're duplicates. I hash a few columns from each record and
> if the value is the same then I don't insert the record. If there is a more
> efficient way to do this please let me know.

--
Resources:
- http://web2py.com
- http://web2py.com/book (Documentation)
- http://github.com/web2py/web2py (Source code)
- https://code.google.com/p/web2py/issues/list (Report Issues)

---
You received this message because you are subscribed to the Google Groups "web2py-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.

