Re: [Imdbpy-devel] [sql] memory consumption

2012-02-20 Thread Emmanuel Tabard
Hell yeah :D

RESTORING imdbIDs values for movies... DONE! (restored 2102978 entries out of 
2104144)
# TIME restore movies : 0min, 47sec (wall) 0min, 0sec (user) 0min, 0sec (system)
RESTORING imdbIDs values for people... DONE! (restored 3544106 entries out of 
3545620)
# TIME restore people : 1min, 19sec (wall) 0min, 0sec (user) 0min, 0sec (system)


Freaking fast 

thank you for those quick fixs ! Also, thanks for the credits ;-) 

Le 19 févr. 2012 à 11:26, Davide Alberani a écrit :

 On Sun, Feb 12, 2012 at 21:17, Emmanuel Tabard m...@webitup.fr wrote:
 
 Well, that's a very interesting solution, thanks. :-)
 
 I made a test run with this solution.
 
 Time to save/restore : 6minutes
 
 Restoring success :
 
  - People : 99.8777%
  - Movies : 99.8845%
 
 That's great!
 
 Ok, I've committed a compromise solution: when we're using CSV (which
 is thought to give you a set of files that you can move and restore on
 completely
 different db altogether), I use my previous implementation, but using
 a cursor instead
 of an ORM object (it should be really fast, at least storing the imdbIDs).
 
 When not using CSV, I've implemented your solution (the version based on
 a dbm database is also used as a fallback, in case some db servers
 have problems).
 
 But this is just the default: with the -i command line argument (by
 the way: I've
 removed -t to select a tmp directory: the CSV one is used or the
 current directory)
 you can force one method or the other: -i dbm and -i table.
 
 I hope it works and it covers all the possible use cases. :-)
 
 -- 
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


--
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-12 Thread Davide Alberani
On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard m...@webitup.fr wrote:

 Fair enough !
 When it was selecting all the not null ids, the memory of the process grows
 up and the size of the .db never grows up.
 My theory is that dbm save on close ? Does that make sense ?

Strange (even if, being anydbm a generic interface to various underlying
modules, you can never tell).

This simple snippet, on my system, creates a 1.2 Gb files and in the process
the memory in not used much (besides for caches, but it doesn't matter):

#!/usr/bin/env python
import time
import anydbm

long_string = 'LALALALA' * 1024
db = anydbm.open('/tmp/big.db', 'n')
for x in xrange(10):
x = str(x)
db[x] = long_string

print 'INSERT'
db.close()
print 'CLOSE'
time.sleep(10)
print 'DONE'
sys.exit()
#==

I fear that the leak is in the cycle on the result of the 'select'. :-/


-- 
Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

--
Virtualization  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-12 Thread Emmanuel Tabard
I was wondering, why don't you use the original dbs ?

Something like that takes 3 seconds: 

CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS 
NOT NULL
CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT 
NULL


And use your query to restore.

Should be freaking fast ...
 
Le 12 févr. 2012 à 14:56, Davide Alberani a écrit :

 On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard m...@webitup.fr wrote:
 
 Fair enough !
 When it was selecting all the not null ids, the memory of the process grows
 up and the size of the .db never grows up.
 My theory is that dbm save on close ? Does that make sense ?
 
 Strange (even if, being anydbm a generic interface to various underlying
 modules, you can never tell).
 
 This simple snippet, on my system, creates a 1.2 Gb files and in the process
 the memory in not used much (besides for caches, but it doesn't matter):
 
 #!/usr/bin/env python
 import time
 import anydbm
 
 long_string = 'LALALALA' * 1024
 db = anydbm.open('/tmp/big.db', 'n')
 for x in xrange(10):
x = str(x)
db[x] = long_string
 
 print 'INSERT'
 db.close()
 print 'CLOSE'
 time.sleep(10)
 print 'DONE'
 sys.exit()
 #==
 
 I fear that the leak is in the cycle on the result of the 'select'. :-/
 
 
 -- 
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


--
Virtualization  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-12 Thread Emmanuel Tabard
Here is a little workarround :

-- Extract imdb_id and md5sum  (6sec)
CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id IS 
NOT NULL;
CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS NOT 
NULL;

-- Add indexes (12sec)
ALTER TABLE title_extract ADD INDEX md5sum_idx (md5sum)
ALTER TABLE name_extract ADD INDEX md5sum_idx (md5sum)


-- Reset imdb ids ...
UPDATE title SET imdb_id = NULL;
UPDATE name SET imdb_id = NULL;

-- Restore imdb ids for movies (2min)
UPDATE title
INNER JOIN title_extract USING (md5sum)
SET title.imdb_id = title_extract.imdb_id

-- Restore imdb ids for people (5min)
UPDATE name
INNER JOIN name_extract USING (md5sum)
SET name.imdb_id = name_extract.imdb_id



Total time save/restore : less than 10minutes



Le 12 févr. 2012 à 15:52, Emmanuel Tabard a écrit :

 I was wondering, why don't you use the original dbs ?
 
 Something like that takes 3 seconds: 
 
 CREATE TABLE title_extract SELECT imdb_id, md5sum FROM title WHERE imdb_id 
 IS NOT NULL
 CREATE TABLE name_extract SELECT imdb_id, md5sum FROM name WHERE imdb_id IS 
 NOT NULL
 
 
 And use your query to restore.
 
 Should be freaking fast ...
 
 Le 12 févr. 2012 à 14:56, Davide Alberani a écrit :
 
 On Sun, Feb 12, 2012 at 14:20, Emmanuel Tabard m...@webitup.fr wrote:
 
 Fair enough !
 When it was selecting all the not null ids, the memory of the process grows
 up and the size of the .db never grows up.
 My theory is that dbm save on close ? Does that make sense ?
 
 Strange (even if, being anydbm a generic interface to various underlying
 modules, you can never tell).
 
 This simple snippet, on my system, creates a 1.2 Gb files and in the process
 the memory in not used much (besides for caches, but it doesn't matter):
 
 #!/usr/bin/env python
 import time
 import anydbm
 
 long_string = 'LALALALA' * 1024
 db = anydbm.open('/tmp/big.db', 'n')
 for x in xrange(10):
   x = str(x)
   db[x] = long_string
 
 print 'INSERT'
 db.close()
 print 'CLOSE'
 time.sleep(10)
 print 'DONE'
 sys.exit()
 #==
 
 I fear that the leak is in the cycle on the result of the 'select'. :-/
 
 
 -- 
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/
 


--
Virtualization  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-12 Thread Davide Alberani
On Sun, Feb 12, 2012 at 16:50, Emmanuel Tabard m...@webitup.fr wrote:

 Here is a little workarround :

Well, that's a very interesting solution, thanks. :-)

Anyway, I have to think a little about it: storing the data in the
filesystem granted
us the possibility to split the CSV mode into 2 separated steps:
first all data are
put in CSV/pickle/dbm files, then the db is re-created.

Hmmm... mumble, mumble... (d'oh, the weekend is over!)

-- 
Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

--
Virtualization  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-12 Thread Emmanuel Tabard
 Well, that's a very interesting solution, thanks. :-)

I made a test run with this solution.

Time to save/restore : 6minutes

Restoring success :

 - People : 99.8777%
 - Movies : 99.8845%


 Anyway, I have to think a little about it: storing the data in the
 filesystem granted
 us the possibility to split the CSV mode into 2 separated steps:
 first all data are
 put in CSV/pickle/dbm files, then the db is re-created.

I saw that you drop tables and not the entire database. So I created the 
extract tables, clear the imdb ids, run imdb2sql et restore the extract tables.
IMHO, I recommend you to keep your datas (extracted) on the same database 
system. Less queries, less disk operations, less python processing.

Anyhow I don't know if it's a cross dbs solution (tested only with postgres and 
mysql ...). 

 Hmmm... mumble, mumble... (d'oh, the weekend is over!)


Hope the snowstorms are over !
Thanks you very much for your help ;-)



Le 12 févr. 2012 à 19:23, Davide Alberani a écrit :

 On Sun, Feb 12, 2012 at 16:50, Emmanuel Tabard m...@webitup.fr wrote:
 
 Here is a little workarround :
 
 Well, that's a very interesting solution, thanks. :-)
 
 Anyway, I have to think a little about it: storing the data in the
 filesystem granted
 us the possibility to split the CSV mode into 2 separated steps:
 first all data are
 put in CSV/pickle/dbm files, then the db is re-created.
 
 Hmmm... mumble, mumble... (d'oh, the weekend is over!)
 
 -- 
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


--
Virtualization  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-11 Thread Davide Alberani
On Tue, Feb 7, 2012 at 09:20, Davide Alberani davide.alber...@gmail.com wrote:

 As usual, I'm really busy right now... I hope to have time to give it
 a look this weekend.

Ehi, snowstorms buy you a lot of free time... :-P

It was easier that I thought, mostly thanks to the fact the we already have
md5 checksum of names and title (a more or less recent feature).

In the mercurial repository there's a draft of solution.

How it works:
- titles/names with imdbID are stored in a dbm database, using their
md5 as keys.
- at restore time, imdbIDs are restored in batches of 1 each time.

Notes:
- by default, the database are created in the current directory (and
not deleted);
  there's now the '-t dir' command line argument, to specify a
temporary directory.
- I've not tested it with huge amounts of data: if it's slow or fails,
let me know
  if it's while storing or restoring the IDs (and the error message).
- 10.000 entries for a batch is *totally* arbitrary: we've to choose a
good compromise
  between performances and the maximum size of a query.
- the batch is executed as a single query, like:
  UPDATE table SET imdb_id = CASE md5sum WHEN 'md5_1' THEN
'imdbID1' ... END WHERE md5sum IN ('md5_1', md5_2', ...)
  I don't really know if this syntax is valid for every SQL databases...
- I've simplified the code, maybe too much.
- I've not tested it with CSV support.

As usual, any test, bug report, comment and so on is welcome.


-- 
Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

--
Virtualization  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-11 Thread Davide Alberani
On Sat, Feb 11, 2012 at 22:06, Emmanuel Tabard m...@webitup.fr wrote:

 Looks nice !!!

Well, it's not exactly nice  clean code... let's hope it works.

 It seems that you load all the datas in memory before storing it in the temp
 databases.
 cls.select(ISNOTNULL(cls.q.imdbID))

 Maybe you should save the imdbids by batch of 1entries ?

If you believe at the theory, the return value of a 'select' method is
a lazy object, and so it should not consume much memory.
It's true that it's then cycled over, one item at a time, and so it
may be slow (in which case, we can directly use a cursors and
process the results in batches).

 Tell me if you need the complete database dump to test with tons of datas !

It won't hurt... :)
A dump of SELECT md5sum, imdb_id from the 'name', 'title', 'char_name'
and 'company_name' will be enough (not on the public list, please :))


Bye,

-- 
Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

--
Virtualization  Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing 
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-07 Thread Davide Alberani
On Mon, Feb 6, 2012 at 09:10, Emmanuel Tabard m...@webitup.fr wrote:

 Feb  6 07:16:53 ks380342 kernel: Out of memory: Kill process 29782 
 (imdbpy2sql.py) score 718 or sacrifice child
 Feb  6 07:16:53 ks380342 kernel: Killed process 29782 (imdbpy2sql.py) 
 total-vm:18212980kB, anon-rss:18149024kB, file-rss:64kB

 Greedy boy :D

Eheh... :-)
As usual, I'm really busy right now... I hope to have time to give it
a look this weekend.


-- 
Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

--
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-02-06 Thread Emmanuel Tabard
Feb  6 07:16:53 ks380342 kernel: Out of memory: Kill process 29782 
(imdbpy2sql.py) score 718 or sacrifice child
Feb  6 07:16:53 ks380342 kernel: Killed process 29782 (imdbpy2sql.py) 
total-vm:18212980kB, anon-rss:18149024kB, file-rss:64kB


Greedy boy :D 



Le 27 janv. 2012 à 09:18, Davide Alberani a écrit :

 On Thu, Jan 26, 2012 at 18:50, Emmanuel Tabard m...@webitup.fr wrote:
 
 # TIME FINAL : 1223min, 59sec (wall) 1095min, 16sec (user) 13min, 7sec
 (system)
 
 :-)
 
 You can notice that :
  - title 84% success
  - name 99% success
 
 For sure movie titles change more and faster than user names (a user name
 can change only if a typo was found, if the person legally change his name or
 if a second person with the same name/surname is added to the db)
 
 Anyway, 84% is a little too high... I guess some kind of titles (tv
 series episodes,
 maybe?) are not handled correctly.
 
 
 Thanks for your help!
 
 
 -- 
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/


--
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-01-26 Thread Garland, Ken R
Could you provide the methods you use for update and initial db creation,
or pastebin the code.

On Thu, Jan 26, 2012 at 9:32 AM, Emmanuel Tabard m...@webitup.fr wrote:

 hi,

 First of all, thank you for imdbpy. This is really plug'n play, well done
 !!!

 Context :
 - Import all imdb database (from text dumps) - first time it's fast and ok
 - I have the imdb ids for 90% of titles and names (no need for companies
 and characters)
 - I'll pay imdb for using those datas (I need to control the datas instead
 of using their own api)

 My problem comes when imdbpy updates my database. It takes hours to save
 the imdb ids and it consumes a *lot* of memory. Almost all of my RAM (24go)
 ...

 Is there a way to optimize that step ? Why does it takes so much memory ?

 I can help you with this if you don't have much time, I'm python dev but I
 need input :)


 Cheers,

 - Emmanuel

 PS : Sorry about my first non-member thread, wrong email ...

 --
 Keep Your Developer Skills Current with LearnDevNow!
 The most comprehensive online learning library for Microsoft developers
 is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
 Metro Style Apps, more. Free future releases when you subscribe now!
 http://p.sf.net/sfu/learndevnow-d2d
 ___
 Imdbpy-devel mailing list
 Imdbpy-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/imdbpy-devel




-- 
- Ken
--
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel


Re: [Imdbpy-devel] [sql] memory consumption

2012-01-26 Thread Davide Alberani
On Thu, Jan 26, 2012 at 15:32, Emmanuel Tabard m...@webitup.fr wrote:

 First of all, thank you for imdbpy. This is really plug'n play, well done !!!

Thanks. :-P

 Context :
 - Import all imdb database (from text dumps) - first time it's fast and ok
 - I have the imdb ids for 90% of titles and names (no need for companies and 
 characters)

That's a lot of data. :)

 My problem comes when imdbpy updates my database. It takes hours to save the 
 imdb ids and it consumes a *lot* of memory. Almost all of my RAM (24go) ...

 Is there a way to optimize that step ? Why does it takes so much memory ?

It's so slow and takes so much memory because it was thought to work with
a few hundreds of entries. :-D

Wow, that's an interesting problem... I guess it can be heavily improved,
especially if we can store some information to the disc.
Anyway, it's not an easy task: the real problem is that we don't have a
unique ID to identify a movie (that would be the ID that we're saving... but
the problem is matching it to the other information of the row: title, year,
imdb_index, kind, etc. etc.)

Hmmm... I promise to think about it in the weekend.  If anyone have a
nice solution to this problem, any hint is welcome!

-- 
Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/

--
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
___
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel