Re: [Imdbpy-devel] [sql] memory consumption
Hell yeah :D RESTORING imdbIDs values for movies... DONE! (restored 2102978 entries out of 2104144) # TIME restore movies : 0min, 47sec (wall) 0min, 0sec (user) 0min, 0sec (system) RESTORING imdbIDs values for people... DONE! (restored 3544106 entries out of 3545620) # TIME restore people : 1min, 19sec (wall) 0min, 0sec (user) 0min, 0sec (system) Freaking fast thank you for those quick fixs ! Also, thanks for the credits ;-) Le 19 févr. 2012 à 11:26, Davide Alberani a écrit : On Sun, Feb 12, 2012 at 21:17, Emmanuel Tabard m...@webitup.fr wrote: Well, that's a very interesting solution, thanks. :-) I made a test run with this solution. Time to save/restore : 6minutes Restoring success : - People : 99.8777% - Movies : 99.8845% That's great! Ok, I've committed a compromise solution: when we're using CSV (which is thought to give you a set of files that you can move and restore on completely different db altogether), I use my previous implementation, but using a cursor instead of an ORM object (it should be really fast, at least storing the imdbIDs). When not using CSV, I've implemented your solution (the version based on a dbm database is also used as a fallback, in case some db servers have problems). But this is just the default: with the -i command line argument (by the way: I've removed -t to select a tmp directory: the CSV one is used or the current directory) you can force one method or the other: -i dbm and -i table. I hope it works and it covers all the possible use cases. :-) -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Try before you buy = See our experts in action! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-dev2 ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard m...@webitup.fr wrote: Fair enough ! When it was selecting all the not null ids, the memory of the process grows up and the size of the .db never grows up. My theory is that dbm save on close ? Does that make sense ? Strange (even if, being anydbm a generic interface to various underlying modules, you can never tell). This simple snippet, on my system, creates a 1.2 Gb files and in the process the memory in not used much (besides for caches, but it doesn't matter): #!/usr/bin/env python import time import anydbm long_string = 'LALALALA' * 1024 db = anydbm.open('/tmp/big.db', 'n') for x in xrange(10): x = str(x) db[x] = long_string print 'INSERT' db.close() print 'CLOSE' time.sleep(10) print 'DONE' sys.exit() #== I fear that the leak is in the cycle on the result of the 'select'. :-/ -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Virtualization Cloud Management Using Capacity Planning Cloud computing makes use of virtualization - but cloud computing also focuses on allowing computing to be delivered as a service. http://www.accelacomm.com/jaw/sfnl/114/51521223/ ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
I was wondering, why don't you use the original dbs ? Something like that takes 3 seconds: CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL And use your query to restore. Should be freaking fast ... Le 12 févr. 2012 à 14:56, Davide Alberani a écrit : On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard m...@webitup.fr wrote: Fair enough ! When it was selecting all the not null ids, the memory of the process grows up and the size of the .db never grows up. My theory is that dbm save on close ? Does that make sense ? Strange (even if, being anydbm a generic interface to various underlying modules, you can never tell). This simple snippet, on my system, creates a 1.2 Gb files and in the process the memory in not used much (besides for caches, but it doesn't matter): #!/usr/bin/env python import time import anydbm long_string = 'LALALALA' * 1024 db = anydbm.open('/tmp/big.db', 'n') for x in xrange(10): x = str(x) db[x] = long_string print 'INSERT' db.close() print 'CLOSE' time.sleep(10) print 'DONE' sys.exit() #== I fear that the leak is in the cycle on the result of the 'select'. :-/ -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Virtualization Cloud Management Using Capacity Planning Cloud computing makes use of virtualization - but cloud computing also focuses on allowing computing to be delivered as a service. http://www.accelacomm.com/jaw/sfnl/114/51521223/ ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
Here is a little workarround : -- Extract imdb_id and md5sum (6sec) CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL; CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL; -- Add indexes (12sec) ALTER TABLE title_extract ADD INDEX md5sum_idx (md5sum) ALTER TABLE name_extract ADD INDEX md5sum_idx (md5sum) -- Reset imdb ids ... UPDATE title SET imdb_id = NULL; UPDATE name SET imdb_id = NULL; -- Restore imdb ids for movies (2min) UPDATE title INNER JOIN title_extract USING (md5sum) SET title.imdb_id = title_extract.imdb_id -- Restore imdb ids for people (5min) UPDATE name INNER JOIN name_extract USING (md5sum) SET name.imdb_id = name_extract.imdb_id Total time save/restore : less than 10minutes Le 12 févr. 2012 à 15:52, Emmanuel Tabard a écrit : I was wondering, why don't you use the original dbs ? Something like that takes 3 seconds: CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS NOT NULL CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT NULL And use your query to restore. Should be freaking fast ... Le 12 févr. 2012 à 14:56, Davide Alberani a écrit : On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard m...@webitup.fr wrote: Fair enough ! When it was selecting all the not null ids, the memory of the process grows up and the size of the .db never grows up. My theory is that dbm save on close ? Does that make sense ? Strange (even if, being anydbm a generic interface to various underlying modules, you can never tell). This simple snippet, on my system, creates a 1.2 Gb files and in the process the memory in not used much (besides for caches, but it doesn't matter): #!/usr/bin/env python import time import anydbm long_string = 'LALALALA' * 1024 db = anydbm.open('/tmp/big.db', 'n') for x in xrange(10): x = str(x) db[x] = long_string print 'INSERT' db.close() print 'CLOSE' time.sleep(10) print 'DONE' sys.exit() #== I fear that the leak is in the cycle on the result of the 'select'. :-/ -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Virtualization Cloud Management Using Capacity Planning Cloud computing makes use of virtualization - but cloud computing also focuses on allowing computing to be delivered as a service. http://www.accelacomm.com/jaw/sfnl/114/51521223/ ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
On Sun, Feb 12, 2012 at 16:50, Emmanuel Tabard m...@webitup.fr wrote: Here is a little workarround : Well, that's a very interesting solution, thanks. :-) Anyway, I have to think a little about it: storing the data in the filesystem granted us the possibility to split the CSV mode into 2 separated steps: first all data are put in CSV/pickle/dbm files, then the db is re-created. Hmmm... mumble, mumble... (d'oh, the weekend is over!) -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Virtualization Cloud Management Using Capacity Planning Cloud computing makes use of virtualization - but cloud computing also focuses on allowing computing to be delivered as a service. http://www.accelacomm.com/jaw/sfnl/114/51521223/ ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
Well, that's a very interesting solution, thanks. :-) I made a test run with this solution. Time to save/restore : 6minutes Restoring success : - People : 99.8777% - Movies : 99.8845% Anyway, I have to think a little about it: storing the data in the filesystem granted us the possibility to split the CSV mode into 2 separated steps: first all data are put in CSV/pickle/dbm files, then the db is re-created. I saw that you drop tables and not the entire database. So I created the extract tables, clear the imdb ids, run imdb2sql et restore the extract tables. IMHO, I recommend you to keep your datas (extracted) on the same database system. Less queries, less disk operations, less python processing. Anyhow I don't know if it's a cross dbs solution (tested only with postgres and mysql ...). Hmmm... mumble, mumble... (d'oh, the weekend is over!) Hope the snowstorms are over ! Thanks you very much for your help ;-) Le 12 févr. 2012 à 19:23, Davide Alberani a écrit : On Sun, Feb 12, 2012 at 16:50, Emmanuel Tabard m...@webitup.fr wrote: Here is a little workarround : Well, that's a very interesting solution, thanks. :-) Anyway, I have to think a little about it: storing the data in the filesystem granted us the possibility to split the CSV mode into 2 separated steps: first all data are put in CSV/pickle/dbm files, then the db is re-created. Hmmm... mumble, mumble... (d'oh, the weekend is over!) -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Virtualization Cloud Management Using Capacity Planning Cloud computing makes use of virtualization - but cloud computing also focuses on allowing computing to be delivered as a service. http://www.accelacomm.com/jaw/sfnl/114/51521223/ ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
On Tue, Feb 7, 2012 at 09:20, Davide Alberani davide.alber...@gmail.com wrote: As usual, I'm really busy right now... I hope to have time to give it a look this weekend. Ehi, snowstorms buy you a lot of free time... :-P It was easier that I thought, mostly thanks to the fact the we already have md5 checksum of names and title (a more or less recent feature). In the mercurial repository there's a draft of solution. How it works: - titles/names with imdbID are stored in a dbm database, using their md5 as keys. - at restore time, imdbIDs are restored in batches of 1 each time. Notes: - by default, the database are created in the current directory (and not deleted); there's now the '-t dir' command line argument, to specify a temporary directory. - I've not tested it with huge amounts of data: if it's slow or fails, let me know if it's while storing or restoring the IDs (and the error message). - 10.000 entries for a batch is *totally* arbitrary: we've to choose a good compromise between performances and the maximum size of a query. - the batch is executed as a single query, like: UPDATE table SET imdb_id = CASE md5sum WHEN 'md5_1' THEN 'imdbID1' ... END WHERE md5sum IN ('md5_1', md5_2', ...) I don't really know if this syntax is valid for every SQL databases... - I've simplified the code, maybe too much. - I've not tested it with CSV support. As usual, any test, bug report, comment and so on is welcome. -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Virtualization Cloud Management Using Capacity Planning Cloud computing makes use of virtualization - but cloud computing also focuses on allowing computing to be delivered as a service. http://www.accelacomm.com/jaw/sfnl/114/51521223/ ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
On Sat, Feb 11, 2012 at 22:06, Emmanuel Tabard m...@webitup.fr wrote: Looks nice !!! Well, it's not exactly nice clean code... let's hope it works. It seems that you load all the datas in memory before storing it in the temp databases. cls.select(ISNOTNULL(cls.q.imdbID)) Maybe you should save the imdbids by batch of 1entries ? If you believe at the theory, the return value of a 'select' method is a lazy object, and so it should not consume much memory. It's true that it's then cycled over, one item at a time, and so it may be slow (in which case, we can directly use a cursors and process the results in batches). Tell me if you need the complete database dump to test with tons of datas ! It won't hurt... :) A dump of SELECT md5sum, imdb_id from the 'name', 'title', 'char_name' and 'company_name' will be enough (not on the public list, please :)) Bye, -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Virtualization Cloud Management Using Capacity Planning Cloud computing makes use of virtualization - but cloud computing also focuses on allowing computing to be delivered as a service. http://www.accelacomm.com/jaw/sfnl/114/51521223/ ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
On Mon, Feb 6, 2012 at 09:10, Emmanuel Tabard m...@webitup.fr wrote: Feb 6 07:16:53 ks380342 kernel: Out of memory: Kill process 29782 (imdbpy2sql.py) score 718 or sacrifice child Feb 6 07:16:53 ks380342 kernel: Killed process 29782 (imdbpy2sql.py) total-vm:18212980kB, anon-rss:18149024kB, file-rss:64kB Greedy boy :D Eheh... :-) As usual, I'm really busy right now... I hope to have time to give it a look this weekend. -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
Feb 6 07:16:53 ks380342 kernel: Out of memory: Kill process 29782 (imdbpy2sql.py) score 718 or sacrifice child Feb 6 07:16:53 ks380342 kernel: Killed process 29782 (imdbpy2sql.py) total-vm:18212980kB, anon-rss:18149024kB, file-rss:64kB Greedy boy :D Le 27 janv. 2012 à 09:18, Davide Alberani a écrit : On Thu, Jan 26, 2012 at 18:50, Emmanuel Tabard m...@webitup.fr wrote: # TIME FINAL : 1223min, 59sec (wall) 1095min, 16sec (user) 13min, 7sec (system) :-) You can notice that : - title 84% success - name 99% success For sure movie titles change more and faster than user names (a user name can change only if a typo was found, if the person legally change his name or if a second person with the same name/surname is added to the db) Anyway, 84% is a little too high... I guess some kind of titles (tv series episodes, maybe?) are not handled correctly. Thanks for your help! -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Try before you buy = See our experts in action! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-dev2 ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
Could you provide the methods you use for update and initial db creation, or pastebin the code. On Thu, Jan 26, 2012 at 9:32 AM, Emmanuel Tabard m...@webitup.fr wrote: hi, First of all, thank you for imdbpy. This is really plug'n play, well done !!! Context : - Import all imdb database (from text dumps) - first time it's fast and ok - I have the imdb ids for 90% of titles and names (no need for companies and characters) - I'll pay imdb for using those datas (I need to control the datas instead of using their own api) My problem comes when imdbpy updates my database. It takes hours to save the imdb ids and it consumes a *lot* of memory. Almost all of my RAM (24go) ... Is there a way to optimize that step ? Why does it takes so much memory ? I can help you with this if you don't have much time, I'm python dev but I need input :) Cheers, - Emmanuel PS : Sorry about my first non-member thread, wrong email ... -- Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel -- - Ken -- Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
Re: [Imdbpy-devel] [sql] memory consumption
On Thu, Jan 26, 2012 at 15:32, Emmanuel Tabard m...@webitup.fr wrote: First of all, thank you for imdbpy. This is really plug'n play, well done !!! Thanks. :-P Context : - Import all imdb database (from text dumps) - first time it's fast and ok - I have the imdb ids for 90% of titles and names (no need for companies and characters) That's a lot of data. :) My problem comes when imdbpy updates my database. It takes hours to save the imdb ids and it consumes a *lot* of memory. Almost all of my RAM (24go) ... Is there a way to optimize that step ? Why does it takes so much memory ? It's so slow and takes so much memory because it was thought to work with a few hundreds of entries. :-D Wow, that's an interesting problem... I guess it can be heavily improved, especially if we can store some information to the disc. Anyway, it's not an easy task: the real problem is that we don't have a unique ID to identify a movie (that would be the ID that we're saving... but the problem is matching it to the other information of the row: title, year, imdb_index, kind, etc. etc.) Hmmm... I promise to think about it in the weekend. If anyone have a nice solution to this problem, any hint is welcome! -- Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47] http://www.mimante.net/ -- Keep Your Developer Skills Current with LearnDevNow! The most comprehensive online learning library for Microsoft developers is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3, Metro Style Apps, more. Free future releases when you subscribe now! http://p.sf.net/sfu/learndevnow-d2d ___ Imdbpy-devel mailing list Imdbpy-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/imdbpy-devel