Re: [PHP] Sequential access of XML nodes.
Richard Quadling wrote:
> It seems that the SimpleXMLIterator is perfect for me.
> [...]

Interesting, I forget that's there... I must have a play with it
sometime. Thanks for resurfacing it :)

--
Ross McKay, Toronto, NSW Australia
"Let the laddie play wi the knife - he'll learn"
- The Wee Book of Calvin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
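For readers following the thread, a minimal sketch of the SimpleXMLIterator usage under discussion (the feed structure and element names here are invented for illustration; note that SimpleXML loads the whole document into memory, so the streaming XMLReader approach later in the thread scales better for very large feeds):

```php
<?php
// Iterate the child elements of a small feed with SimpleXMLIterator.
$xmlString = <<<XML
<catalogue>
  <Product><ID>1</ID><Name>Widget</Name></Product>
  <Product><ID>2</ID><Name>Sprocket</Name></Product>
</catalogue>
XML;

$it = new SimpleXMLIterator($xmlString);

$names = array();
foreach ($it as $product) {           // iterates the <Product> children
    $names[] = (string) $product->Name;
}
print implode(',', $names);           // Widget,Sprocket
```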
Re: [PHP] Sequential access of XML nodes.
On 27 September 2011 03:38, Ross McKay wrote:
> On Mon, 26 Sep 2011 14:17:43 -0400, Adam Richardson wrote:
>
>> I believe the XMLReader allows you to pull node by node, and it's really
>> easy to work with:
>> http://www.php.net/manual/en/intro.xmlreader.php
>>
>> In terms of dealing with various forms of compression, I believe you can
>> use the compression streams to handle this:
>> http://stackoverflow.com/questions/1190906/php-open-gzipped-xml
>> http://us3.php.net/manual/en/wrappers.compression.php
>
> +1 here. XMLReader is easy and fast, and will do the job you want, albeit
> without the nice foreach(...) loop Richard specs. You just loop over the
> XML, checking the node type and watching the state of your stream to see
> how to handle each iteration.
>
> e.g. (assuming $xml is an open XMLReader and $db is a PDO connection)
>
> $text = '';
> $haveRecord = FALSE;
> $records = 0;
>
> // prepare insert statement
> $sql = '
>     insert into Product (ID, Product, ...)
>     values (:ID, :Product, ...)
> ';
> $cmd = $db->prepare($sql);
>
> // set list of allowable fields and their parameter types
> $fields = array(
>     'ID'      => PDO::PARAM_INT,
>     'Product' => PDO::PARAM_STR,
>     ...
> );
>
> while ($xml->read()) {
>     switch ($xml->nodeType) {
>         case XMLReader::ELEMENT:
>             if ($xml->name === 'Product') {
>                 // start of Product element,
>                 // reset command parameters to empty
>                 foreach ($fields as $name => $type) {
>                     $cmd->bindValue(":$name", NULL, PDO::PARAM_NULL);
>                 }
>                 $haveRecord = TRUE;
>             }
>             $text = '';
>             break;
>
>         case XMLReader::END_ELEMENT:
>             if ($xml->name === 'Product') {
>                 // end of Product element, save record
>                 if ($haveRecord) {
>                     $result = $cmd->execute();
>                     $records++;
>                 }
>                 $haveRecord = FALSE;
>             }
>             elseif ($haveRecord) {
>                 // still inside a Product element,
>                 // record field value and move on
>                 $name = $xml->name;
>                 if (array_key_exists($name, $fields)) {
>                     $cmd->bindValue(":$name", $text, $fields[$name]);
>                 }
>             }
>             $text = '';
>             break;
>
>         case XMLReader::TEXT:
>         case XMLReader::CDATA:
>             // record value (or part value) of text or cdata node
>             $text .= $xml->value;
>             break;
>
>         default:
>             break;
>     }
> }
>
> return $records;

Thanks for all of that.

It seems that the SimpleXMLIterator is perfect for me.

I need to see if the documents I'm needing to process have multiple
namespaces. If they do, then I'm not exactly sure what to do at this
stage.

Richard.

--
Richard Quadling
Twitter : EE : Zend : PHPDoc
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea
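On the open namespace question: SimpleXML can address namespaced children explicitly via getNamespaces() and children(). A minimal sketch (the feed structure and the http://example.com/product namespace are invented for illustration):

```php
<?php
// Read namespaced children with SimpleXML.
$xmlString = <<<XML
<feed xmlns:p="http://example.com/product">
  <p:Product><p:ID>1</p:ID></p:Product>
</feed>
XML;

$sxe = new SimpleXMLElement($xmlString);

// Map of prefixes to namespace URIs actually used in the document,
// e.g. array('p' => 'http://example.com/product')
$ns = $sxe->getNamespaces(true);

$ids = array();
foreach ($sxe->children($ns['p']) as $product) {
    // children() must be asked for the namespace again at each level
    $ids[] = (string) $product->children($ns['p'])->ID;
}
print implode(',', $ids);   // 1
```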
Re: [PHP] Sequential access of XML nodes.
On Mon, 26 Sep 2011 14:17:43 -0400, Adam Richardson wrote:

> I believe the XMLReader allows you to pull node by node, and it's really
> easy to work with:
> http://www.php.net/manual/en/intro.xmlreader.php
>
> In terms of dealing with various forms of compression, I believe you can
> use the compression streams to handle this:
> http://stackoverflow.com/questions/1190906/php-open-gzipped-xml
> http://us3.php.net/manual/en/wrappers.compression.php

+1 here. XMLReader is easy and fast, and will do the job you want, albeit
without the nice foreach(...) loop Richard specs. You just loop over the
XML, checking the node type and watching the state of your stream to see
how to handle each iteration.

e.g. (assuming $xml is an open XMLReader and $db is a PDO connection)

$text = '';
$haveRecord = FALSE;
$records = 0;

// prepare insert statement
$sql = '
    insert into Product (ID, Product, ...)
    values (:ID, :Product, ...)
';
$cmd = $db->prepare($sql);

// set list of allowable fields and their parameter types
$fields = array(
    'ID'      => PDO::PARAM_INT,
    'Product' => PDO::PARAM_STR,
    ...
);

while ($xml->read()) {
    switch ($xml->nodeType) {
        case XMLReader::ELEMENT:
            if ($xml->name === 'Product') {
                // start of Product element,
                // reset command parameters to empty
                foreach ($fields as $name => $type) {
                    $cmd->bindValue(":$name", NULL, PDO::PARAM_NULL);
                }
                $haveRecord = TRUE;
            }
            $text = '';
            break;

        case XMLReader::END_ELEMENT:
            if ($xml->name === 'Product') {
                // end of Product element, save record
                if ($haveRecord) {
                    $result = $cmd->execute();
                    $records++;
                }
                $haveRecord = FALSE;
            }
            elseif ($haveRecord) {
                // still inside a Product element,
                // record field value and move on
                $name = $xml->name;
                if (array_key_exists($name, $fields)) {
                    $cmd->bindValue(":$name", $text, $fields[$name]);
                }
            }
            $text = '';
            break;

        case XMLReader::TEXT:
        case XMLReader::CDATA:
            // record value (or part value) of text or cdata node
            $text .= $xml->value;
            break;

        default:
            break;
    }
}

return $records;

--
Ross McKay, Toronto, NSW Australia
"Tuesday is Soylent Green day"
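The compression streams Adam mentions combine directly with XMLReader: the compress.zlib:// wrapper lets XMLReader::open() read a gzipped feed without extracting it first. A minimal sketch (the temp file stands in for a downloaded .gz feed; element names are illustrative):

```php
<?php
// Stream a gzipped XML file through XMLReader via compress.zlib://.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, gzencode('<r><Product/><Product/></r>'));

$xml = new XMLReader();
$xml->open('compress.zlib://' . $path);   // decompresses on the fly

$count = 0;
while ($xml->read()) {
    if ($xml->nodeType === XMLReader::ELEMENT && $xml->name === 'Product') {
        $count++;
    }
}
$xml->close();
unlink($path);
print $count;   // 2
```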
Re: [PHP] Sequential access of XML nodes.
On Mon, Sep 26, 2011 at 12:24 PM, Richard Quadling wrote:

> Hi.
>
> I've got a project which will need to iterate over some very large
> XML files (around 250 files ranging in size from around 50MB to
> several hundred MB - 2 of them are in excess of 500MB).
>
> The XML files have a root node and then a collection of products. In
> total, across all the files, there are going to be several million
> product details. Each XML feed will have a different structure, as it
> relates to a different source of data.
>
> I plan to have an abstract reader class, with the concrete classes
> being extensions of this, each covering the specifics of the format
> being received and able to return a standardised view of the data for
> importing into MySQL and eventually MongoDB.
>
> I want to use an XML iterator so that I can say something along the
> lines of ...
>
> 1 - Instantiate the XML iterator with the XML's URL.
> 2 - Iterate the XML, getting back one node at a time without keeping
>     all the nodes in memory.
>
> e.g.
>
> $o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
> foreach ($o_XML as $o_Product) {
>     // Process product.
> }
>
> Add to this that some of the XML feeds come gzipped (.gz); I want to
> be able to stream the XML out of the .gz file without having to
> extract the entire file first.
>
> I've not got access to the XML feeds yet (they are coming from the
> various affiliate networks around, and I'm a remote user so need to
> get credentials and the like).
>
> If you have any pointers on the capabilities of the various XML reader
> classes, based upon this scenario, then I'd be very grateful.
>
> In this instance, the memory limitation is important. The current code
> is string based, and whilst it works, you can imagine the complexity
> of it.
>
> The structure of each product internally will be different, but I will
> be happy to get back a nested array or an XML fragment, as long as the
> iterator is only holding onto 1 array/fragment at a time and not
> caching the massive number of products per file.
>
> Thanks.
>
> Richard.

I believe the XMLReader allows you to pull node by node, and it's really
easy to work with:
http://www.php.net/manual/en/intro.xmlreader.php

In terms of dealing with various forms of compression, I believe you can
use the compression streams to handle this:
http://stackoverflow.com/questions/1190906/php-open-gzipped-xml
http://us3.php.net/manual/en/wrappers.compression.php

Adam

--
Nephtali: A simple, flexible, fast, and security-focused PHP framework
http://nephtaliproject.com
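Richard's "one fragment at a time" requirement maps well onto XMLReader::expand(), which converts only the current node into a DOM fragment while the rest of the file stays on the stream. A minimal sketch (feed structure invented; only one product's fragment exists in memory per iteration):

```php
<?php
// Stream with XMLReader, but hand each <Product> over as a
// SimpleXMLElement fragment for convenient access.
$xml = new XMLReader();
$xml->XML('<r><Product><ID>7</ID></Product><Product><ID>8</ID></Product></r>');

$ids = array();
while ($xml->read()) {
    if ($xml->nodeType === XMLReader::ELEMENT && $xml->name === 'Product') {
        // expand() copies just this node's subtree into DOM
        $doc  = new DOMDocument();
        $node = $doc->importNode($xml->expand(), true);
        $doc->appendChild($node);
        $sx = simplexml_import_dom($node);
        $ids[] = (string) $sx->ID;
    }
}
print implode(',', $ids);   // 7,8
```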
Re: [PHP] Sequential access of XML nodes.
On 26 Sep 2011, at 17:24, Richard Quadling wrote:

> I've got a project which will need to iterate over some very large
> XML files (around 250 files ranging in size from around 50MB to
> several hundred MB - 2 of them are in excess of 500MB).
>
> The XML files have a root node and then a collection of products. In
> total, across all the files, there are going to be several million
> product details. Each XML feed will have a different structure, as it
> relates to a different source of data.
>
> I plan to have an abstract reader class, with the concrete classes
> being extensions of this, each covering the specifics of the format
> being received and able to return a standardised view of the data for
> importing into MySQL and eventually MongoDB.
>
> I want to use an XML iterator so that I can say something along the
> lines of ...
>
> 1 - Instantiate the XML iterator with the XML's URL.
> 2 - Iterate the XML, getting back one node at a time without keeping
>     all the nodes in memory.
>
> e.g.
>
> $o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
> foreach ($o_XML as $o_Product) {
>     // Process product.
> }
>
> Add to this that some of the XML feeds come gzipped (.gz); I want to
> be able to stream the XML out of the .gz file without having to
> extract the entire file first.
>
> I've not got access to the XML feeds yet (they are coming from the
> various affiliate networks around, and I'm a remote user so need to
> get credentials and the like).
>
> If you have any pointers on the capabilities of the various XML reader
> classes, based upon this scenario, then I'd be very grateful.
>
> In this instance, the memory limitation is important. The current code
> is string based, and whilst it works, you can imagine the complexity
> of it.
>
> The structure of each product internally will be different, but I will
> be happy to get back a nested array or an XML fragment, as long as the
> iterator is only holding onto 1 array/fragment at a time and not
> caching the massive number of products per file.

As far as I'm aware, XML Parser can handle all of this.

http://php.net/xml

It's a SAX parser, so you can feed it the data chunk by chunk. You can
use gzopen to open gzipped files and manually feed the data into
xml_parse. Be sure to read the docs carefully, because there's a lot to
be aware of when parsing an XML document in pieces.

-Stuart

--
Stuart Dallas
3ft9 Ltd
http://3ft9.com/
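A minimal sketch of Stuart's suggestion: gzopen a compressed feed and feed it to XML Parser chunk by chunk, with the final-chunk flag set on the last xml_parse() call (the temp file stands in for a real .gz feed; handlers and element names are illustrative):

```php
<?php
// SAX-style chunked parsing of a gzipped feed with XML Parser.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, gzencode('<r><Product/><Product/><Product/></r>'));

$count  = 0;
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$count) {
        // XML Parser upper-cases element names by default (case folding)
        if ($name === 'PRODUCT') {
            $count++;
        }
    },
    function ($parser, $name) {}   // end-element handler (unused here)
);

$fp = gzopen($path, 'rb');
while (!gzeof($fp)) {
    $chunk = gzread($fp, 8192);
    // third argument marks the final chunk so the parser can finish
    xml_parse($parser, $chunk, gzeof($fp));
}
gzclose($fp);
xml_parser_free($parser);
unlink($path);
print $count;   // 3
```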