Re: [PHP] Sequential access of XML nodes.

2011-09-28 Thread Richard Quadling
On 27 September 2011 03:38, Ross McKay ro...@zeta.org.au wrote:
 On Mon, 26 Sep 2011 14:17:43 -0400, Adam Richardson wrote:

I believe the XMLReader allows you to pull node by node, and it's really
easy to work with:
http://www.php.net/manual/en/intro.xmlreader.php

In terms of dealing with various forms of compression, I believe you can use
the compression streams to handle this:
http://stackoverflow.com/questions/1190906/php-open-gzipped-xml
http://us3.php.net/manual/en/wrappers.compression.php

 +1 here. XMLReader is easy and fast, and will do the job you want, albeit
 without the nice foreach(...) loop Richard specifies. You just loop over
 reading the XML and checking the node type, watching the state of your
 stream to see how to handle each iteration.

 e.g. (assuming $xml is an open XMLReader, $db is PDO in example)

 $text = '';
 $haveRecord = FALSE;
 $records = 0;

 // prepare insert statement
 $sql = '
 insert into Product (ID, Product, ...)
 values (:ID, :Product, ...)
 ';
 $cmd = $db->prepare($sql);

 // set list of allowable fields and their parameter type
 $fields = array(
    'ID' => PDO::PARAM_INT,
    'Product' => PDO::PARAM_STR,
    ...
 );

 while ($xml->read()) {
    switch ($xml->nodeType) {
        case XMLReader::ELEMENT:
            if ($xml->name === 'Product') {
                // start of Product element,
                // reset command parameters to empty
                foreach ($fields as $name => $type) {
                    $cmd->bindValue(":$name", NULL, PDO::PARAM_NULL);
                }
                $haveRecord = TRUE;
            }
            $text = '';
            break;

        case XMLReader::END_ELEMENT:
            if ($xml->name === 'Product') {
                // end of Product element, save record
                if ($haveRecord) {
                    $result = $cmd->execute();
                    $records++;
                }
                $haveRecord = FALSE;
            }
            elseif ($haveRecord) {
                // still inside a Product element,
                // record field value and move on
                $name = $xml->name;
                if (array_key_exists($name, $fields)) {
                    $cmd->bindValue(":$name", $text, $fields[$name]);
                }
            }
            $text = '';
            break;

        case XMLReader::TEXT:
        case XMLReader::CDATA:
            // record value (or part value) of text or cdata node
            $text .= $xml->value;
            break;

        default:
            break;
    }
 }

 return $records;

Thanks for all of that.

It seems that the SimpleXMLIterator is perfect for me.

I need to see whether the documents I'll be processing have multiple
namespaces. If they do, then I'm not exactly sure what to do at this
stage.
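
For reference, a minimal sketch of checking for and reading namespaced
nodes with SimpleXMLIterator (the feed URL and the namespace URI below
are made up for illustration):

<?php
// The URL and the namespace URI are placeholders only.
$xml = new SimpleXMLIterator('http://www.example.com/data.xml', 0, TRUE);

// List every namespace declared anywhere in the document.
var_dump($xml->getDocNamespaces(TRUE));

foreach ($xml as $product) {
    // Children in the default (un-prefixed) namespace...
    $plain = $product->children();

    // ...and children in a specific namespace, selected by URI.
    $extra = $product->children('http://www.example.com/ns/product');
}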

Richard.
-- 
Richard Quadling
Twitter : EE : Zend : PHPDoc
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea




Re: [PHP] Sequential access of XML nodes.

2011-09-28 Thread Ross McKay
Richard Quadling wrote:

It seems that the SimpleXMLIterator is perfect for me.
[...]

Interesting, I forget that's there... I must have a play with it
sometime. Thanks for resurfacing it :)
-- 
Ross McKay, Toronto, NSW Australia
Let the laddie play wi the knife - he'll learn
- The Wee Book of Calvin




[PHP] Sequential access of XML nodes.

2011-09-26 Thread Richard Quadling
Hi.

I've got a project which will be needing to iterate some very large
XML files (around 250 files ranging in size from around 50MB to
several hundred MB - 2 of them are in excess of 500MB).

The XML files have a root node and then a collection of products. In
total, in all the files, there are going to be several million product
details. Each XML feed will have a different structure as it relates
to a different source of data.

I plan to have an abstract reader class with the concrete classes
being extensions of this, each covering the specifics of the format
being received and able to return a standardised view of the data
for importing into MySQL and eventually MongoDB.

I want to use an XML iterator so that I can say something along the lines of ...

1 - Instantiate the XML iterator with the XML's URL.
2 - Iterate the XML getting back one node at a time without keeping
all the nodes in memory.

e.g.

<?php
$o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
foreach($o_XML as $o_Product) {
 // Process product.
}


Add to this that some of the XML feeds come as .gz files, and I want to
be able to stream the XML out of the .gz file without having to extract
the entire file first.

I've not got access to the XML feeds yet (they are coming from the
various affiliate networks around, and I'm a remote user so need to
get credentials and the like).

If you have any pointers on the capabilities of the various XML reader
classes, based upon this scenario, then I'd be very grateful.


In this instance, the memory limitation is important. The current code
is string based and whilst it works, you can imagine the complexity of
it.

The structure of each product internally will be different, but I will
be happy to get back a nested array or an XML fragment, as long as the
iterator is only holding onto 1 array/fragment at a time and not
caching the massive number of products per file.
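
To make it concrete, here's a rough sketch of the kind of wrapper I have
in mind, built on XMLReader (the 'product' element name, the use of
SimpleXML fragments and the compress.zlib:// usage are assumptions, not
requirements):

<?php
// Rough sketch only; 'product' and the SimpleXML return type are placeholders.
class SomeExtendedXMLReader implements Iterator
{
    private $url;
    private $elementName;
    private $reader;
    private $current;
    private $key = 0;

    public function __construct($url, $elementName = 'product')
    {
        $this->url = $url;
        $this->elementName = $elementName;
    }

    public function rewind()
    {
        $this->reader = new XMLReader();
        // A compress.zlib:// URL streams a .gz feed without unpacking it first.
        $this->reader->open($this->url);
        $this->key = 0;
        $this->fetch();
    }

    public function valid()   { return $this->current !== NULL; }
    public function current() { return $this->current; }
    public function key()     { return $this->key; }

    public function next()
    {
        $this->key++;
        $this->fetch();
    }

    // Advance to the next matching element; only one fragment is held at a time.
    private function fetch()
    {
        $this->current = NULL;
        while ($this->reader->read()) {
            if ($this->reader->nodeType === XMLReader::ELEMENT
                && $this->reader->name === $this->elementName) {
                // Assumes product elements are not nested inside each other.
                $this->current = simplexml_load_string($this->reader->readOuterXml());
                return;
            }
        }
        $this->reader->close();
    }
}

// e.g. $o_XML = new SomeExtendedXMLReader('compress.zlib://data.xml.gz');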

Thanks.

Richard.


-- 
Richard Quadling
Twitter : EE : Zend : PHPDoc
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea




Re: [PHP] Sequential access of XML nodes.

2011-09-26 Thread Stuart Dallas
On 26 Sep 2011, at 17:24, Richard Quadling wrote:
 I've got a project which will be needing to iterate some very large
 XML files (around 250 files ranging in size from around 50MB to
 several hundred MB - 2 of them are in excess of 500MB).
 
 The XML files have a root node and then a collection of products. In
 total, in all the files, there are going to be several million product
 details. Each XML feed will have a different structure as it relates
 to a different source of data.
 
 I plan to have an abstract reader class with the concrete classes
 being extensions of this, each covering the specifics of the format
 being received and able to return a standardised view of the data
 for importing into MySQL and eventually MongoDB.
 
 I want to use an XML iterator so that I can say something along the lines of 
 ...
 
 1 - Instantiate the XML iterator with the XML's URL.
 2 - Iterate the XML getting back one node at a time without keeping
 all the nodes in memory.
 
 e.g.
 
 <?php
 $o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
 foreach($o_XML as $o_Product) {
 // Process product.
 }
 
 
 Add to this that some of the XML feeds come as .gz files, and I want to
 be able to stream the XML out of the .gz file without having to extract
 the entire file first.
 
 I've not got access to the XML feeds yet (they are coming from the
 various affiliate networks around, and I'm a remote user so need to
 get credentials and the like).
 
 If you have any pointers on the capabilities of the various XML reader
 classes, based upon this scenario, then I'd be very grateful.
 
 
 In this instance, the memory limitation is important. The current code
 is string based and whilst it works, you can imagine the complexity of
 it.
 
 The structure of each product internally will be different, but I will
 be happy to get back a nested array or an XML fragment, as long as the
 iterator is only holding onto 1 array/fragment at a time and not
 caching the massive number of products per file.

As far as I'm aware, XML Parser can handle all of this.

http://php.net/xml

It's a SAX parser so you can feed it the data chunk by chunk. You can use 
gzopen to open gzipped files and manually feed the data into xml_parse. Be sure 
to read the docs carefully because there's a lot to be aware of when parsing an 
XML document in pieces.
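
A bare-bones sketch of that approach (the 'product' element name, the
file name and the 4KB chunk size are placeholders; case folding is
switched off so the lowercase name matches):

<?php
// Count <product> elements in a gzipped feed by feeding xml_parse()
// one chunk at a time; 'product' and the file name are assumptions.
$products = 0;

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attribs) use (&$products) {
        if ($name === 'product') {
            $products++;
        }
    },
    function ($parser, $name) {
        // end-element handler (unused in this sketch)
    }
);

$fp = gzopen('data.xml.gz', 'rb');
while (!gzeof($fp)) {
    $chunk = gzread($fp, 4096);
    // The final argument tells the parser whether this is the last chunk.
    if (!xml_parse($parser, $chunk, gzeof($fp))) {
        die(xml_error_string(xml_get_error_code($parser)));
    }
}
gzclose($fp);
xml_parser_free($parser);

echo "$products products\n";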

-Stuart

-- 
Stuart Dallas
3ft9 Ltd
http://3ft9.com/



Re: [PHP] Sequential access of XML nodes.

2011-09-26 Thread Adam Richardson
On Mon, Sep 26, 2011 at 12:24 PM, Richard Quadling rquadl...@gmail.com wrote:

 Hi.

 I've got a project which will be needing to iterate some very large
 XML files (around 250 files ranging in size from around 50MB to
 several hundred MB - 2 of them are in excess of 500MB).

 The XML files have a root node and then a collection of products. In
 total, in all the files, there are going to be several million product
 details. Each XML feed will have a different structure as it relates
 to a different source of data.

 I plan to have an abstract reader class with the concrete classes
 being extensions of this, each covering the specifics of the format
 being received and able to return a standardised view of the data
 for importing into MySQL and eventually MongoDB.

 I want to use an XML iterator so that I can say something along the lines
 of ...

 1 - Instantiate the XML iterator with the XML's URL.
 2 - Iterate the XML getting back one node at a time without keeping
 all the nodes in memory.

 e.g.

 <?php
 $o_XML = new SomeExtendedXMLReader('http://www.site.com/data.xml');
 foreach($o_XML as $o_Product) {
  // Process product.
 }


 Add to this that some of the XML feeds come as .gz files, and I want to
 be able to stream the XML out of the .gz file without having to extract
 the entire file first.

 I've not got access to the XML feeds yet (they are coming from the
 various affiliate networks around, and I'm a remote user so need to
 get credentials and the like).

 If you have any pointers on the capabilities of the various XML reader
 classes, based upon this scenario, then I'd be very grateful.


 In this instance, the memory limitation is important. The current code
 is string based and whilst it works, you can imagine the complexity of
 it.

 The structure of each product internally will be different, but I will
 be happy to get back a nested array or an XML fragment, as long as the
 iterator is only holding onto 1 array/fragment at a time and not
 caching the massive number of products per file.

 Thanks.

 Richard.


 --
 Richard Quadling
 Twitter : EE : Zend : PHPDoc
 @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY : bit.ly/lFnVea



I believe the XMLReader allows you to pull node by node, and it's really
easy to work with:
http://www.php.net/manual/en/intro.xmlreader.php

In terms of dealing with various forms of compression, I believe you can use
the compression streams to handle this:
http://stackoverflow.com/questions/1190906/php-open-gzipped-xml
http://us3.php.net/manual/en/wrappers.compression.php
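
For example, a small sketch (the .gz file name is made up):

<?php
// XMLReader can read straight from a compression stream wrapper,
// so a .gz feed never has to be unpacked to disk first.
$reader = new XMLReader();
$reader->open('compress.zlib://products.xml.gz');

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        echo $reader->name, "\n";
    }
}

$reader->close();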

Adam

-- 
Nephtali:  A simple, flexible, fast, and security-focused PHP framework
http://nephtaliproject.com


Re: [PHP] Sequential access of XML nodes.

2011-09-26 Thread Ross McKay
On Mon, 26 Sep 2011 14:17:43 -0400, Adam Richardson wrote:

I believe the XMLReader allows you to pull node by node, and it's really
easy to work with:
http://www.php.net/manual/en/intro.xmlreader.php

In terms of dealing with various forms of compression, I believe you can use
the compression streams to handle this:
http://stackoverflow.com/questions/1190906/php-open-gzipped-xml
http://us3.php.net/manual/en/wrappers.compression.php

+1 here. XMLReader is easy and fast, and will do the job you want, albeit
without the nice foreach(...) loop Richard specifies. You just loop over
reading the XML and checking the node type, watching the state of your
stream to see how to handle each iteration.

e.g. (assuming $xml is an open XMLReader, $db is PDO in example)

$text = '';
$haveRecord = FALSE;
$records = 0;

// prepare insert statement
$sql = '
insert into Product (ID, Product, ...)
values (:ID, :Product, ...)
';
$cmd = $db->prepare($sql);

// set list of allowable fields and their parameter type
$fields = array(
    'ID' => PDO::PARAM_INT,
    'Product' => PDO::PARAM_STR,
    ...
);

while ($xml->read()) {
    switch ($xml->nodeType) {
        case XMLReader::ELEMENT:
            if ($xml->name === 'Product') {
                // start of Product element,
                // reset command parameters to empty
                foreach ($fields as $name => $type) {
                    $cmd->bindValue(":$name", NULL, PDO::PARAM_NULL);
                }
                $haveRecord = TRUE;
            }
            $text = '';
            break;

        case XMLReader::END_ELEMENT:
            if ($xml->name === 'Product') {
                // end of Product element, save record
                if ($haveRecord) {
                    $result = $cmd->execute();
                    $records++;
                }
                $haveRecord = FALSE;
            }
            elseif ($haveRecord) {
                // still inside a Product element,
                // record field value and move on
                $name = $xml->name;
                if (array_key_exists($name, $fields)) {
                    $cmd->bindValue(":$name", $text, $fields[$name]);
                }
            }
            $text = '';
            break;

        case XMLReader::TEXT:
        case XMLReader::CDATA:
            // record value (or part value) of text or cdata node
            $text .= $xml->value;
            break;

        default:
            break;
    }
}

return $records;
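
(For completeness, $xml and $db could be set up beforehand along these
lines; the feed name and the DSN are placeholders only:)

<?php
// Placeholder setup only; the .gz feed name and the DSN are made up.
$xml = new XMLReader();
$xml->open('compress.zlib://products.xml.gz');

$db = new PDO('mysql:host=localhost;dbname=catalogue', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
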
-- 
Ross McKay, Toronto, NSW Australia
Tuesday is Soylent Green day
