On Thu, Aug 23, 2012 at 5:32 PM, Daniel Veillard <veill...@redhat.com> wrote: > On Thu, Aug 23, 2012 at 09:03:44PM +0800, Daniel Veillard wrote: >> On Wed, Aug 22, 2012 at 10:40:20PM +0200, Johan Corveleyn wrote: > [...] >> > Is there some (preferable concise) reading about basic concepts like >> > how xml schemas automata work, how these are built, etc? What should >> > the automata in this case look like (in terms of the data structures >> > inside the code)? How can I see that the automata is non-determinist, >> > and what would a determinist automata look like, ... ? >> >> Johan, >> >> sorry for the slow answer, >> >> I will look at this, promise, but all that code is rooted in automata >> theory, there is quite some litterature about it, and well unless you >> go though the process of learning that code and possibly the theory >> behind it, it's not something that can be hacked trivially. Basically >> I run this in debug mode, look at the automata generated, there is pro >> bably an error there and then this need fixing and a lot of checking. >> To be honnest that one of the hardest part of the library to change >> with the core of the parser. >> Theory is usually learnt though the 'Dragon Book', by Aho and Al. >> and the code is actually more complex since using extended automata >> operations not found in basic automata ... >> >> If you have time looking at this may be interesting, but >> it's really not trivial at least more complex than other parts of the >> code :-) > > Okay simce people may get curious: > 1/ I fixed the issue see the tiny patch attached > 2/ following is an explanation of what I did to debug the issue > in case people may want to fix other issues there of if they are > just curious how a developper may spend half an hour of his time :-) > > Steps: > 0/ make minimal xsd and xml the one given were quite good > 1/ think about the pattern that is being expressed, it's > rule * (specialRule rule *) ? > that should be determinist actually so yes that looks like a bug > 2/ run xmllint --schema tst.xsd tst.xml > clearly libxml2 regexp think it's not determinist, and there is > a construction problem as hinted before > 3/ edit xmlregexp.c comment off the 4 debugging macros at the top > recompile xmllint and run it on the example > 4/ start to study the nice output given: > > ------------------------------- > thinkpad:~/XML -> xmllint --schema tst.xsd tst.xml > Add trans from 0 to 1 atom: string once 'rule' > Add trans from 1 to 1 atom: string once 'rule' > Add trans from 0 to 1 epsilon transition > Add trans from 1 to 2 atom: string once 'specialRule' > Add trans from 2 to 3 atom: string once 'rule' > Add trans from 3 to 3 atom: string once 'rule' > Add trans from 2 to 3 epsilon transition > Add trans from 1 to 3 epsilon transition > Found epsilon trans 1 from 2 to 3 > xmlFAReduceEpsilonTransitions(2, 3) > State 3 is final, so 2 becomes final > Add trans from 2 to 3 atom: string once 'rule' > Found epsilon trans 2 from 1 to 3 > xmlFAReduceEpsilonTransitions(1, 3) > State 3 is final, so 1 becomes final > Add trans from 1 to 3 atom: string once 'rule' > Found epsilon trans 1 from 0 to 1 > xmlFAReduceEpsilonTransitions(0, 1) > State 1 is final, so 0 becomes final > Add trans from 0 to 1 atom: string once 'rule' > Add trans from 0 to 2 atom: string once 'specialRule' > Add trans from 0 to 3 atom: string once 'rule' > xmlFAComputesDeterminism > ctxt: '(null)' > 5 atoms: > 00 atom: string once 'rule' > 01 atom: string once 'rule' > 02 atom: string once 'specialRule' > 03 atom: string once 'rule' > 04 atom: string once 'rule' > 4 states: start: 0 > state: FINAL 0, 5 transitions: > trans: atom 0, to 1 > trans: removed > trans: atom 1, to 1 > trans: atom 2, to 2 > trans: atom 4, to 3 > state: FINAL 1, 4 transitions: > trans: atom 1, to 1 > trans: atom 2, to 2 > trans: removed > trans: atom 4, to 3 > state: FINAL 2, 3 transitions: > trans: atom 3, to 3 > trans: removed > trans: atom 4, to 3 > state: FINAL 3, 1 transitions: > trans: atom 4, to 3 > 0 counters: > Final: 4 states > Final: 2 atoms > Indet: state 0 trans 4, atom 4 to 3 : 0 to 3 > previous to is 2 > tst.xsd:4: element complexType: Schemas parser error : local complex > type: The content model is not determinist. > ---------------- > > 5/ make the graph of the automata as it is created and then transformed > by epsilon rules reductions, c.f. the picture P1020568.JPG attached > something is wrong when adding the 3rd epsilon transition from state > 1 to state 3, as the indeterminism is created there, suddenly a rule > while in state 0 or 1 could be consumed to go to 1 or to 3. > > 6/ there is a need to separate the state from the end of the sequence > from state 3 to avoid the problem > > 7/ make the patch adding an extra state at end of sequences in that > configuration and connected to the last state of the sequence by > an espilon > > 8/ notice we did just that in another sequence case with a nice comment > copy the comment too > > 9/ rebuild xmllint, and rerun, watch the new automata being generated > and the output, looks it fixes the issue, good ! > > 10/ recomment the verbose debug macros, rebuild everything, watch 'make check' > with a bit of anxiety > > 11/ run runsuite for good measure, no divergence, so one can run > 30000+ tests without hitting such a basic regexp construction > error. Actually that had been around for nearly a decade without > anybody noticing, isn't code amazing ! > > 12/ Prepare to push to git > > 13/ prepare a mail to the list with explanations. > > 14/ commit, push and send mail
Great! Thanks a lot. And thanks for the explanation. I'll give your patch a try on my real-world sample (from which the rules example was deduced), just to be sure. You may want to resolve / close the issue https://bugzilla.gnome.org/show_bug.cgi?id=670865. -- Johan _______________________________________________ xml mailing list, project page http://xmlsoft.org/ xml@gnome.org https://mail.gnome.org/mailman/listinfo/xml