On Thu, Aug 23, 2012 at 5:32 PM, Daniel Veillard <veill...@redhat.com> wrote:
> On Thu, Aug 23, 2012 at 09:03:44PM +0800, Daniel Veillard wrote:
>> On Wed, Aug 22, 2012 at 10:40:20PM +0200, Johan Corveleyn wrote:
> [...]
>> > Is there some (preferable concise) reading about basic concepts like
>> > how xml schemas automata work, how these are built, etc? What should
>> > the automata in this case look like (in terms of the data structures
>> > inside the code)? How can I see that the automata is non-determinist,
>> > and what would a determinist automata look like, ... ?
>>
>>     Johan,
>>
>>   sorry for the slow answer,
>>
>> I will look at this, promise, but all that code is rooted in automata
>> theory, there is quite some litterature about it, and well unless you
>> go though the process of learning that code and possibly the theory
>> behind it, it's not something that can be hacked trivially. Basically
>> I run this in debug mode, look at the automata generated, there is pro
>> bably an error there and then this need fixing and a lot of checking.
>> To be honnest that one of the hardest part of the library to change
>> with the core of the parser.
>>   Theory is usually learnt though the 'Dragon Book', by Aho and Al.
>> and the code is actually more complex since using extended automata
>> operations not found in basic automata ...
>>
>>     If you have time looking at this may be interesting, but
>> it's really not trivial at least more complex than other parts of the
>> code :-)
>
>   Okay simce people may get curious:
>     1/ I fixed the issue see the tiny patch attached
>     2/ following is an explanation of what I did to debug the issue
>        in case people may want to fix other issues there of if they are
>        just curious how a developper may spend half an hour of his time :-)
>
> Steps:
> 0/ make minimal xsd and xml the one given were quite good
> 1/ think about the pattern that is being expressed, it's
>    rule * (specialRule rule *) ?
>    that should be determinist actually so yes that looks like a bug
> 2/ run xmllint --schema tst.xsd tst.xml
>    clearly libxml2 regexp think it's not determinist, and there is
>    a construction problem as hinted before
> 3/ edit xmlregexp.c comment off the 4 debugging macros at the top
>    recompile xmllint and run it on the example
> 4/ start to study the nice output given:
>
>     -------------------------------
>     thinkpad:~/XML -> xmllint --schema tst.xsd tst.xml
>     Add trans from 0 to 1  atom: string once 'rule'
>     Add trans from 1 to 1  atom: string once 'rule'
>     Add trans from 0 to 1 epsilon transition
>     Add trans from 1 to 2  atom: string once 'specialRule'
>     Add trans from 2 to 3  atom: string once 'rule'
>     Add trans from 3 to 3  atom: string once 'rule'
>     Add trans from 2 to 3 epsilon transition
>     Add trans from 1 to 3 epsilon transition
>     Found epsilon trans 1 from 2 to 3
>     xmlFAReduceEpsilonTransitions(2, 3)
>     State 3 is final, so 2 becomes final
>     Add trans from 2 to 3  atom: string once 'rule'
>     Found epsilon trans 2 from 1 to 3
>     xmlFAReduceEpsilonTransitions(1, 3)
>     State 3 is final, so 1 becomes final
>     Add trans from 1 to 3  atom: string once 'rule'
>     Found epsilon trans 1 from 0 to 1
>     xmlFAReduceEpsilonTransitions(0, 1)
>     State 1 is final, so 0 becomes final
>     Add trans from 0 to 1  atom: string once 'rule'
>     Add trans from 0 to 2  atom: string once 'specialRule'
>     Add trans from 0 to 3  atom: string once 'rule'
>     xmlFAComputesDeterminism
>      ctxt: '(null)'
>     5 atoms:
>      00  atom: string once 'rule'
>      01  atom: string once 'rule'
>      02  atom: string once 'specialRule'
>      03  atom: string once 'rule'
>      04  atom: string once 'rule'
>     4 states: start: 0
>      state: FINAL 0, 5 transitions:
>       trans: atom 0, to 1
>       trans: removed
>       trans: atom 1, to 1
>       trans: atom 2, to 2
>       trans: atom 4, to 3
>      state: FINAL 1, 4 transitions:
>       trans: atom 1, to 1
>       trans: atom 2, to 2
>       trans: removed
>       trans: atom 4, to 3
>      state: FINAL 2, 3 transitions:
>       trans: atom 3, to 3
>       trans: removed
>       trans: atom 4, to 3
>      state: FINAL 3, 1 transitions:
>       trans: atom 4, to 3
>     0 counters:
>     Final: 4 states
>     Final: 2 atoms
>     Indet: state 0 trans 4, atom 4 to 3 : 0 to 3
>            previous to is 2
>     tst.xsd:4: element complexType: Schemas parser error : local complex
>     type: The content model is not determinist.
>     ----------------
>
> 5/ make the graph of the automata as it is created and then transformed
>    by epsilon rules reductions, c.f. the picture P1020568.JPG attached
>    something is wrong when adding the 3rd epsilon transition from state
>    1 to state 3, as the indeterminism is created there, suddenly a rule
>    while in state 0 or 1 could be consumed to go to 1 or to 3.
>
> 6/ there is a need to separate the state from the end of the sequence
>    from state 3 to avoid the problem
>
> 7/ make the patch adding an extra state at end of sequences in that
>    configuration and connected to the last state of the sequence by
>    an espilon
>
> 8/ notice we did just that in another sequence case with a nice comment
>    copy the comment too
>
> 9/ rebuild xmllint, and rerun, watch the new automata being generated
>    and the output, looks it fixes the issue, good !
>
> 10/ recomment the verbose debug macros, rebuild everything, watch 'make check'
>     with a bit of anxiety
>
> 11/ run runsuite for good measure, no divergence, so one can run
>     30000+ tests without hitting such a basic regexp construction
>     error. Actually that had been around for nearly a decade without
>     anybody noticing, isn't code amazing !
>
> 12/ Prepare to push to git
>
> 13/ prepare a mail to the list with explanations.
>
> 14/ commit, push and send mail

Great! Thanks a lot. And thanks for the explanation.

I'll give your patch a try on my real-world sample (from which the
rules example was deduced), just to be sure.

You may want to resolve / close the issue
https://bugzilla.gnome.org/show_bug.cgi?id=670865.

-- 
Johan
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml

Reply via email to