Thanks Luke, Steve, Brett, Lloyd & Alan for your prompt responses & sharing your wisdom.
I <3 the Python community... You (we?) folks are AWESOME.

I cross-posted this query on comp.lang.python. I bet most of you hang out at
c.l.p too, but just in case, here is the link to the discussion there:
https://groups.google.com/d/topic/comp.lang.python/i816mDMSoXM/discussion

Thanks again for the amazing help & advice.

cheers
ashish

On Mon, Dec 20, 2010 at 5:13 PM, Alan Gauld <alan.ga...@btinternet.com> wrote:

> "ashish makani" <ashish.mak...@gmail.com> wrote
>
>> I am looking for a specific element. There are several tens or hundreds
>> of occurrences of that element in the 1 GB file.
>>
>> I need to detect them, & then for each one I need to copy all the content
>> between the element's start & end tags & create a smaller xml
>
> This is exactly what SAX and its kin are for. If you wanted to manipulate
> the XML data and recreate the original file, tree-based is better, but for
> this kind of one-shot processing SAX will be much, much faster.
>
> The concept is simple enough if you have ever used awk to process
> text files (or the Python HTMLParser). You define a function that gets
> triggered when the parser detects a matching tag.
>
>> My hardware setup: I have a Win7 Pro box with 8 GB of RAM & an Intel
>> Core 2 Quad CPU (Q9400).
>> On this I am running Sun VirtualBox (3.2.12), with Ubuntu 10.10 (Maverick)
>> as guest OS, with 23 GB disk space & 2 GB (2048 MB) RAM assigned to the
>> guest Ubuntu OS.
>
> Obviously running the code in the virtual machine is limiting your
> ability to deal with the data, but in this case you would be pushing
> hard to build the entire tree in RAM anyway, so it probably doesn't
> matter.
>
>> 4. I then investigated some streaming libraries, but am confused - there
>> is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] ,
>>
>> Which one is the best for my situation?
>
> I've only used SAX - I tried minidom once but couldn't get it to work
> as I wanted, so went back to SAX... There are lots of examples of
> XML parsing using SAX, both in Python and Java - just google.
>
>> Should I instead just open the file & use regex to look for the element
>> I need?
>
> Unless the XML is very simple you would probably find yourself
> creating a bigger problem. Regexes are not good at handling the
> kinds of recursive data structures that can be found in SGML-based
> languages.
>
> HTH,
>
> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
>
> _______________________________________________
> Tutor maillist - Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor

"We act as though comfort and luxury were the chief requirements of life,
when all that we need to make us happy is something to be enthusiastic
about." -- Albert Einstein
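
A minimal sketch of the SAX approach Alan describes, using Python's standard xml.sax module. The element name "record", the class name ElementExtractor and the sample data are all made up for illustration; the thread never names the real element. The handler forwards the events for each matching element to an XMLGenerator, so every occurrence ends up as its own small XML document:

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ElementExtractor(xml.sax.ContentHandler):
    """Collect each occurrence of one element as its own small XML string."""

    def __init__(self, target):
        super().__init__()
        self.target = target   # element name to look for (hypothetical)
        self.depth = 0         # > 0 while we are inside a target element
        self.buffer = None     # text buffer for the fragment being copied
        self.writer = None     # XMLGenerator writing into self.buffer
        self.fragments = []    # one XML string per occurrence found

    def startElement(self, name, attrs):
        # Start a new fragment only when we are not already inside one.
        if self.depth == 0 and name == self.target:
            self.buffer = io.StringIO()
            self.writer = XMLGenerator(self.buffer, encoding="utf-8")
            self.writer.startDocument()
        if self.writer is not None:
            self.depth += 1
            self.writer.startElement(name, attrs)

    def characters(self, content):
        if self.writer is not None:
            self.writer.characters(content)

    def endElement(self, name):
        if self.writer is None:
            return
        self.writer.endElement(name)
        self.depth -= 1
        if self.depth == 0:
            # The outermost target element just closed: fragment is complete.
            self.writer.endDocument()
            self.fragments.append(self.buffer.getvalue())
            self.buffer = self.writer = None


# Tiny made-up document standing in for the 1 GB file.
sample = (b"<root><skip>x</skip>"
          b"<record id='1'><a>one</a></record>"
          b"<record id='2'>two &amp; three</record></root>")

handler = ElementExtractor("record")
xml.sax.parseString(sample, handler)
for number, fragment in enumerate(handler.fragments, 1):
    print(number, fragment)
```

For the real file, xml.sax.parse(path, handler) streams from disk instead of parseString, and each fragment could be written to its own file as soon as it is complete rather than being kept in a list.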