Re: [xml] Recovering from errors in an XML "stream"

2019-09-24 Thread Webb Scales

On 9/24/19 5:49 PM, Liam R E Quin wrote:

This isn’t true in general in XML, so beware.

That was why I was asking  :-)

(And, it's why I really want LibXML2 to do as much of the thinking here 
as possible!)



        Thanks,

            Webb



--

Webb Scales
Principal Software Architect
603-673-2306
www.ursasecure.com 
w...@ursasecure.com 

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Recovering from errors in an XML "stream"

2019-09-24 Thread Webb Scales

Thanks, Eric -- that's an interesting suggestion.

Does this work for you because the '<' character is not permitted in the 
stream except as the opening of a tag (which makes it very 
straightforward to locate each tag) and the root tag is not permitted to 
appear inside the document (or, are you doing a nesting count?)?  I'm 
trying to ensure that my code doesn't have to know too much about XML or 
maintain too much state (that's what I'm using LibXML2 for!  :-) ).



            Thanks,

                Webb



On 9/24/19 5:14 PM, Eric Eberhard wrote:


You can easily read the XML using TCP/IP yourself and find the ending 
tag, process, read the next document, process, etc. We do that always 
(much easier than other ideas).  You know the ending tag from the 
starting tag and there are issues about blocking and non-blocking 
reads.  We read one byte blocking and as soon as we get something we 
read until the ending tag and pause for processing.  Eric


From: xml [mailto:xml-boun...@gnome.org] On Behalf Of Webb Scales
Sent: Monday, September 09, 2019 9:30 PM
To: Liam R E Quin ; xml@gnome.org
Subject: Re: [xml] Recovering from errors in an XML "stream"

I'm OK with making small on-the-fly "edits" to the input (such as 
removing the initial comment, or removing all comments), but trying to 
make my code discern the overall structure (such as picking out the 
boundaries between the documents) is starting to step over into 
actually parsing it, which defeats the purpose of using LibXML2.


If the TextReader didn't insist upon reading beyond the root end-tag, 
that would enable me to solve my problem, I think. (I don't understand 
why it does that.)  In the absence of any other options, I'm going to 
experiment with the SAX interface and see if that will allow me to 
stop the parse at the right spot.


Anyway, thanks for your replies, Liam.


            Webb


On 9/10/19 12:19 AM, Liam R E Quin wrote:

On Mon, 2019-09-09 at 22:41 -0400, Webb Scales wrote:

the
fact remains that I don't control the text that I'm trying to parse,
and I still need to parse it, even though it's not "well-formed".

You may need to write some form of pre-processor that fixes the
problems. As you say, that may reduce the need for an XML parser.

I haven't investigated error recovery with libxml, so someone else
might have better ideas.

Liam






Re: [xml] Recovering from errors in an XML "stream"

2019-09-24 Thread Eric Eberhard
Like I said, read into a string, then parse that.  You can skip the garbage 
like CR/LF … in our case if it all goes into the string in one read then so 
what, we still parse them one at a time …  Eric

 

From: xml [mailto:xml-boun...@gnome.org] On Behalf Of Webb Scales
Sent: Monday, September 09, 2019 7:41 PM
To: Liam R. E. Quin ; xml@gnome.org
Subject: Re: [xml] Recovering from errors in an XML "stream"

 

On 9/7/19 12:37 AM, Liam R. E. Quin wrote:



On Fri, 2019-09-06 at 01:57 -0400, Webb Scales wrote:

The first issue is that the XML parser seems to balk entirely at the 
fact that the document is preceded by a comment before the XML 
declaration.  (I'm less than shocked, but it is kind of
disappointing.) 

 
I'd be sad if it accepted it - it's not allowed.

Thanks for the BNF and the pointer to the specification.  However, the fact 
remains that I don't control the text that I'm trying to parse, and I still 
need to parse it, even though it's not "well-formed".





The next issue is that the XML parser reports an error near the end
of  the document, when it notices that the document is followed by an
XML declaration.  (I'm a little closer to shocked by this.)

 
Feed the parser XML without errors and this won't happen. Or are you
saying there are multiple documents in the same input stream?

I've got a stream of bytes; it contains text which is "XML-like".  I would love 
to break it up into chunks which are well-formed (or otherwise acceptable) XML 
documents and then feed it to a LibXML2 function, but I need to do so without 
making too many assumptions about the input and without having to teach my code 
too much about XML (otherwise, there'd be no point using LibXML2).

As it happens, there are newlines between the documents, so I tweaked my custom 
I/O handler to return only up to the next newline.  However, after receiving 
the text for a complete document, the TextReader still calls my handler again 
and then issues an error because there is text after the closing tag for the 
root...if it hadn't made the extra call, it wouldn't have been prompted to fail 
like that!





the offending text doesn't appear
until after the closing tag for the root.)

isn't that the point?

The point is that the TextReader is (I thought...) supposed to return the nodes 
or elements as they are parsed...so why does it report errors in text that is 
well beyond the current node (which, in fact, it had to issue an extra I/O 
request to get)??

Without that lookahead, I could have stopped the parse when it reached the end 
of the document, and started a new reader for the next document.  But, instead, 
the current reader consumes some of the text which belongs to the next 
document, and then goes into an endless cycle where it returns errors without 
advancing to the next node.





Is there some other approach which is better for my situation than
the xmlTextReader?

 
XSLT 3 provides a streaming mode which does what it sounds like you
might need, but libxml supports only XSLT 1. However, it, too, needs
well-formed XML input without errors. There's also STX. Or use a SAX
parser and keep only what you need, but again you need well-formed
input. By the time you've written a program to fix the input, your
program might well be able to do what you need anyway, no??

Yes, I'm trying to avoid reinventing the wheel:  if I write code which is able 
to transform my input into well-formed XML, I won't need LibXML to parse it for 
me.


I was hoping that there was a way to handle the errors encountered by the 
TextReader, recover from them, and continue with the parse, but it sounds like 
that's not practical.


Webb



 






Re: [xml] Recovering from errors in an XML "stream"

2019-09-24 Thread Eric Eberhard
You can easily read the XML using TCP/IP yourself and find the ending tag, 
process, read the next document, process, etc.  We do that always (much easier 
than other ideas).  You know the ending tag from the starting tag and there are 
issues about blocking and non-blocking reads.  We read one byte blocking and as 
soon as we get something we read until the ending tag and pause for processing. 
 Eric
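
A minimal sketch of the kind of scan described above, written in Python rather than C and not taken from Eric's code. It takes the root name from the first start tag and counts nesting until the matching end tag; whether Eric's implementation counts nesting is not stated in the thread. The function name find_document_end is hypothetical, and the scan assumes ASCII tag names and that neither the root name after '<' nor a '>' appears inside comments, CDATA sections, or attribute values.

```python
import re


def find_document_end(buf: bytes) -> int:
    """Return the offset just past the root end tag, or -1 if incomplete.

    Assumptions (hypothetical, see above): ASCII tag names, and no
    '<rootname' or '>' hidden in comments, CDATA or attribute values.
    """
    # First start tag that is not a declaration, PI or comment.
    start = re.search(rb"<([A-Za-z_][\w.:-]*)", buf)
    if start is None:
        return -1
    name = re.escape(start.group(1))
    # Matches <name ...>, <name .../> and </name>, but not <name-other>.
    tag = re.compile(rb"<(/?)" + name + rb"(?=[\s/>])[^>]*>")
    depth = 0
    for m in tag.finditer(buf, start.start()):
        if m.group(1):                          # end tag
            depth -= 1
        elif not m.group(0).endswith(b"/>"):    # start tag, not self-closing
            depth += 1
        if depth == 0:
            return m.end()
    return -1                                   # need to read more bytes


if __name__ == "__main__":
    buf = (b'<?xml version="1.0"?><log><log>x</log></log>\n'
           b'<?xml version="1.0"?><log/>')
    end = find_document_end(buf)
    print(buf[:end])   # first document, ready for the parser
    print(buf[end:])   # bytes belonging to the next document
```

A return value of -1 means the buffer does not yet hold a complete document, so the caller would read more bytes and try again. The assumptions listed above are exactly the XML knowledge that this approach pushes back into the calling code.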

 

From: xml [mailto:xml-boun...@gnome.org] On Behalf Of Webb Scales
Sent: Monday, September 09, 2019 9:30 PM
To: Liam R E Quin ; xml@gnome.org
Subject: Re: [xml] Recovering from errors in an XML "stream"

 

I'm OK with making small on-the-fly "edits" to the input (such as removing the 
initial comment, or removing all comments), but trying to make my code discern 
the overall structure (such as picking out the boundaries between the 
documents) is starting to step over into actually parsing it, which defeats the 
purpose of using LibXML2.

If the TextReader didn't insist upon reading beyond the root end-tag, that 
would enable me to solve my problem, I think.  (I don't understand why it does 
that.)  In the absence of any other options, I'm going to experiment with the 
SAX interface and see if that will allow me to stop the parse at the right spot.

Anyway, thanks for your replies, Liam.


Webb




On 9/10/19 12:19 AM, Liam R E Quin wrote:

On Mon, 2019-09-09 at 22:41 -0400, Webb Scales wrote:

the 
fact remains that I don't control the text that I'm trying to parse,
and I still need to parse it, even though it's not "well-formed".

 
You may need to write some form of pre-processor that fixes the
problems. As you say, that may reduce the need for an XML parser.
 
I haven't investigated error recovery with libxml, so someone else
might have better ideas.
 
Liam
 

 






Re: [xml] Recovering from errors in an XML "stream"

2019-09-09 Thread Liam R E Quin
On Tue, 2019-09-10 at 00:29 -0400, Webb Scales wrote:
> 
> If the TextReader didn't insist upon reading beyond the root end-tag,

All XML parsers do that, as the spec requires them to check if anything
follows it and raise an error if so.

Liam


-- 
Liam Quin - web slave for https://www.fromoldbooks.org/
with fabulous vintage art and fascinating texts to read.
Click here to have the slave rewarded with clean chains.



Re: [xml] Recovering from errors in an XML "stream"

2019-09-09 Thread Webb Scales
I'm OK with making small on-the-fly "edits" to the input (such as 
removing the initial comment, or removing all comments), but trying to 
make my code discern the overall structure (such as picking out the 
boundaries between the documents) is starting to step over into actually 
parsing it, which defeats the purpose of using LibXML2.


If the TextReader didn't insist upon reading beyond the root end-tag, 
that would enable me to solve my problem, I think.  (I don't understand 
why it does that.)  In the absence of any other options, I'm going to 
experiment with the SAX interface and see if that will allow me to stop 
the parse at the right spot.
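
The thread does not say how the SAX experiment turned out. As an illustration of the event-driven idea only, here is a sketch using Python's standard expat binding (not libxml2's SAX interface). It tracks element depth, records where the root end tag begins, and treats the parser's complaint about trailing content as expected. The name split_first_document is hypothetical, and the sketch assumes the root element is not an empty-element tag with a '>' inside an attribute value.

```python
import xml.parsers.expat


def split_first_document(data: bytes):
    """Return (first_document, remainder) for concatenated XML documents."""
    parser = xml.parsers.expat.ParserCreate()
    depth = 0
    root_end = None  # byte offset of the '<' that opens the root end tag

    def on_start(name, attrs):
        nonlocal depth
        depth += 1

    def on_end(name):
        nonlocal depth, root_end
        depth -= 1
        if depth == 0 and root_end is None:
            # Inside a callback this is the start of the triggering event.
            root_end = parser.CurrentByteIndex

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        # Expected when another document follows the root end tag;
        # a real failure only if the root element never closed.
        if root_end is None:
            raise
    if root_end is None:
        raise ValueError("no complete root element found")
    cut = data.index(b">", root_end) + 1  # just past the root end tag
    return data[:cut], data[cut:]


if __name__ == "__main__":
    stream = (b'<?xml version="1.0"?><a><b>1</b></a>\n'
              b'<?xml version="1.0"?><c/>')
    first, rest = split_first_document(stream)
    print(first)
    print(rest.lstrip())  # hand this to a fresh parser next
```

The parser still reads past the root end tag, but the callback has already recorded the boundary by then, so the error can be ignored and the remainder handed to a new parser.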


Anyway, thanks for your replies, Liam.


            Webb



On 9/10/19 12:19 AM, Liam R E Quin wrote:

On Mon, 2019-09-09 at 22:41 -0400, Webb Scales wrote:

the
fact remains that I don't control the text that I'm trying to parse,
and I still need to parse it, even though it's not "well-formed".

You may need to write some form of pre-processor that fixes the
problems. As you say, that may reduce the need for an XML parser.

I haven't investigated error recovery with libxml, so someone else
might have better ideas.

Liam






Re: [xml] Recovering from errors in an XML "stream"

2019-09-09 Thread Liam R E Quin
On Mon, 2019-09-09 at 22:41 -0400, Webb Scales wrote:
> the 
> fact remains that I don't control the text that I'm trying to parse,
> and I still need to parse it, even though it's not "well-formed".

You may need to write some form of pre-processor that fixes the
problems. As you say, that may reduce the need for an XML parser.

I haven't investigated error recovery with libxml, so someone else
might have better ideas.

Liam

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Web slave for vintage clipart http://www.fromoldbooks.org/



Re: [xml] Recovering from errors in an XML "stream"

2019-09-09 Thread Webb Scales

On 9/7/19 12:37 AM, Liam R. E. Quin wrote:

On Fri, 2019-09-06 at 01:57 -0400, Webb Scales wrote:

The first issue is that the XML parser seems to balk entirely at the
fact that the document is preceded by a comment before the XML
declaration.  (I'm less than shocked, but it is kind of
disappointing.)

I'd be sad if it accepted it - it's not allowed.

Thanks for the BNF and the pointer to the specification.  However, the 
fact remains that I don't control the text that I'm trying to parse, and 
I still need to parse it, even though it's not "well-formed".
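
For the comment that precedes the XML declaration, a small pre-processing step might be enough. The sketch below is in Python, not libxml2, and the helper name strip_leading_comments is hypothetical. It assumes an ASCII-compatible encoding such as UTF-8 and removes only comments and whitespace that come before the first other markup.

```python
import xml.etree.ElementTree as ET


def strip_leading_comments(data: bytes) -> bytes:
    """Drop whitespace and comments that precede the XML declaration."""
    data = data.lstrip()
    while data.startswith(b"<!--"):
        end = data.find(b"-->")
        if end < 0:
            break  # unterminated comment: leave it for the parser to report
        data = data[end + 3:].lstrip()
    return data


if __name__ == "__main__":
    raw = b'<!-- generated by tool -->\n<?xml version="1.0"?><r>ok</r>'
    clean = strip_leading_comments(raw)
    print(clean)
    print(ET.fromstring(clean).text)
```

This leaves everything after the leading comments untouched, so the parser still does all of the real XML work.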




The next issue is that the XML parser reports an error near the end
of  the document, when it notices that the document is followed by an
XML declaration.  (I'm a little closer to shocked by this.)

Feed the parser XML without errors and this won't happen. Or are you
saying there are multiple documents in the same input stream?

I've got a stream of bytes; it contains text which is "XML-like".  I 
would love to break it up into chunks which are well-formed (or 
otherwise acceptable) XML documents and then feed it to a LibXML2 
function, but I need to do so without making too many assumptions about 
the input and without having to teach my code too much about XML 
(otherwise, there'd be no point using LibXML2).


As it happens, there are newlines between the documents, so I tweaked my 
custom I/O handler to return only up to the next newline.  However, 
after receiving the text for a complete document, the TextReader still 
calls my handler /again/ and then issues an error because there is text 
after the closing tag for the root...if it hadn't made the extra call, 
it wouldn't have been prompted to fail like that!
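
One way around the extra read, sketched here in Python rather than with the xmlTextReader, is to cut the stream at the newlines first and give each chunk to its own parse call, so that no parser ever sees bytes from the next document. With libxml2 the counterpart would presumably be a memory-based call such as xmlReadMemory on each chunk. The function name parse_line_delimited is hypothetical, and the sketch assumes that a newline never occurs inside a document, which the thread does not confirm.

```python
import io
import xml.etree.ElementTree as ET


def parse_line_delimited(stream):
    """Yield one parsed root element per non-empty line of `stream`.

    Assumption (not confirmed in the thread): newlines occur only
    between documents, never inside one.
    """
    for line in stream:
        chunk = line.strip()
        if not chunk:
            continue  # skip blank separator lines
        # A fresh parse per chunk: the parser reaches end of input
        # right after the root end tag instead of the next document.
        yield ET.fromstring(chunk)


if __name__ == "__main__":
    data = io.BytesIO(
        b'<?xml version="1.0"?><msg id="1">first</msg>\n'
        b'<?xml version="1.0"?><msg id="2">second</msg>\n'
    )
    for root in parse_line_delimited(data):
        print(root.tag, root.get("id"), root.text)
```

If a document can span several lines, this split is no longer safe and the boundary has to come from the markup itself.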




the offending text doesn't appear
until after the closing tag for the root.)

isn't that the point?

The point is that the TextReader is (I thought...) supposed to return 
the nodes or elements /as they are parsed/...so why does it report 
errors in text that is well beyond the current node (which, in fact, it 
had to issue an /extra/ I/O request to get)??


Without that lookahead, I could have stopped the parse when it reached 
the end of the document, and started a /new/ reader for the next 
document.  But, instead, the current reader consumes some of the text 
which belongs to the next document, and then goes into an endless cycle 
where it returns errors without advancing to the next node.




Is there some other approach which is better for my situation than
the xmlTextReader?

XSLT 3 provides a streaming mode which does what it sounds like you
might need, but libxml supports only XSLT 1. However, it, too, needs
well-formed XML input without errors. There's also STX. Or use a SAX
parser and keep only what you need, but again you need well-formed
input. By the time you've written a program to fix the input, your
program might well be able to do what you need anyway, no??

Yes, I'm trying to avoid reinventing the wheel:  if I write code which 
is able to transform my input into well-formed XML, I won't need LibXML 
to parse it for me.



I was hoping that there was a way to handle the errors encountered by 
the TextReader, recover from them, and continue with the parse, but it 
sounds like that's not practical.



            Webb

