Here is a simple Perl script that will check XML files for nesting errors 
(following the assumptions listed below). I think the diagnostics are 
relatively easy to understand. 

#!/usr/bin/perl
# BSD License

use strict;
use warnings;

my $lineNum = 0;
my $element = "";
my $tagName = "";
my @tagStack = ();
lines: while (<>) {
  $lineNum++;

  # While there is a tag on the line
  # remove and process it
  while (s/<([^>]+)>//o) {
    $element = $1;

    # self closed tags and <? ... ?> instructions are skipped
    if ($element =~ /(.*\/|\?.*\?)$/) {
      next;
    }

    # end tags have to nest properly
    # thus match stack top
    if ($element =~ /^\/([^\s]+).*$/o) {
      $tagName = $1;

      # an end tag with nothing open cannot match anything
      if (!@tagStack) {
        print "Error on line $lineNum: saw </$tagName>, but no element is open\n";
        last lines;
      }

      my ($topTagName, $topLineNum, $topElement) = @{ pop @tagStack };
      if ($topTagName ne $tagName) {
        print "Error on line $lineNum: saw </$tagName>, but expected </$topTagName> to close the element from line $topLineNum (element: <$topElement>)\n";
        last lines;
      }
    } else {
      # Found a start element
      $element =~ /^([^\s]+).*$/o;
      $tagName = $1;
      push @tagStack, [ $tagName, $lineNum, $element ];
    }
  }
}

# Anything still on the stack was opened but never closed
foreach my $location (reverse @tagStack) {
  my ($topTagName, $topLineNum, $topElement) = @{$location};
  print "unmatched $topTagName from line $topLineNum (element: <$topElement>)\n";
}
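
The script above stops at the first mismatch and does not look deeper into 
the stack. The quoted message below suggests doing so: if the end tag's name 
is deeper in the stack it is a nesting problem, and if it is not on the stack 
at all the element was never started or was ended twice. Here is a rough 
sketch of that check. The name diagnose_end_tag is made up for illustration, 
and it assumes the same [ tagName, lineNum, element ] stack entries as above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the script above.
# $stack   : array ref of [ tagName, lineNum, element ], top of stack last
# $tagName : name of the end tag that was just seen
sub diagnose_end_tag {
    my ($stack, $tagName) = @_;

    # Search from the top of the stack down for the nearest open $tagName.
    for (my $i = $#{$stack}; $i >= 0; $i--) {
        next unless $stack->[$i][0] eq $tagName;

        return "<$tagName> matches the top of the stack" if $i == $#{$stack};

        # Everything above index $i was opened later and is still open.
        my @unclosed = map { "<$_->[0]> (line $_->[1])" }
                       reverse @{$stack}[ $i + 1 .. $#{$stack} ];
        return "nesting problem: still open inside <$tagName>: "
             . join(", ", @unclosed);
    }

    # The name is not on the stack at all.
    return "<$tagName> was never started, or was ended twice";
}
```

To use it with the script above, the mismatch branch would have to look at 
the stack before popping it (or push the popped entry back) and then print 
diagnose_end_tag(\@tagStack, $tagName).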

On Sep 21, 2012, at 1:27 PM, DM Smith <dmsm...@crosswire.org> wrote:

> So far the discussion is around whether the xml is well-formed.
> Once you get that working, then you need to make sure it is valid wrt the 
> OSIS schema.
> 
> There's an old tool that will convert sgml to well-formed xml. I think it was 
> James Clark's "sx". I've used it successfully on initial conversions and 
> getting something that will work within xml tools.
> 
> Finally, OSIS has the notion of milestones for start and end elements. There 
> are semantic rules regarding this that cannot be checked by standard xml 
> tools. Osis2mod tries to handle this. When you get to that point, I can help 
> unravel the logging options.
> 
> The purpose of milestoned elements is to allow for two competing document 
> models to be in the same xml document: BSP and BCV (names we've given it here 
> and in the wiki).
> 
> We recommend using BSP (book, chapter, section, paragraph, poetry, lists to 
> all be containers, not milestoned) and verse elements be milestoned.
> 
> Note, the OSIS manual says that if you have one element milestoned, then all 
> other elements with the same tag name have to be milestoned. Practically 
> speaking, this does not matter. SWORD and JSword don't care. Having verses 
> milestoned only if necessary is probably a better way to create a good XML 
> document. Start out with all of them as containers and each place where that 
> causes a problem, either fix the xml or if otherwise correct, convert to 
> milestoned verses.
> 
> Generally speaking these BSP elements should not start just inside or at the 
> end of a verse. Rather they should be between verse elements or within the 
> text. When they are placed just after the verse start, they often will cause 
> the verse number to be orphaned. When they are placed just before the verse 
> end, then it is generally not noticeable (just bad form).
> 
> Quotes will create the biggest grief in the above. They often cross 
> boundaries. Certainly, the Beatitudes do, starting in one chapter and 
> ending a couple of chapters later. For this reason, using the milestoned 
> version is necessary.
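
Milestone pairs are invisible to a nesting check, because each milestone is 
an empty element. A separate pass can at least confirm that every start has 
an end. This is only a sketch and is not what osis2mod does. It assumes the 
milestoned verses carry sID and eID attributes, that each tag sits on one 
line with quoted attribute values, and the name check_verse_milestones is 
made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report verse milestones that do not pair up.
# Takes the lines of a document, returns a list of problem messages.
sub check_verse_milestones {
    my (@lines) = @_;
    my %open;        # sID value => line number where the verse started
    my @problems;
    my $lineNum = 0;

    for my $line (@lines) {
        $lineNum++;

        # Visit every <verse ...> start or empty tag on the line.
        while ($line =~ /<verse\b([^>]*)>/g) {
            my $attrs = $1;

            # \b keeps osisID="..." from being mistaken for sID="...".
            if ($attrs =~ /\bsID="([^"]+)"/) {
                push @problems, "line $lineNum: verse $1 started twice"
                    if exists $open{$1};
                $open{$1} = $lineNum;
            } elsif ($attrs =~ /\beID="([^"]+)"/) {
                if (exists $open{$1}) {
                    delete $open{$1};
                } else {
                    push @problems,
                        "line $lineNum: verse $1 ended but not started";
                }
            }
            # Container verses (no sID/eID) are left to the nesting check.
        }
    }

    # Whatever is still open never saw its eID.
    for my $id (sort { $open{$a} <=> $open{$b} || $a cmp $b } keys %open) {
        push @problems, "line $open{$id}: verse $id never ended";
    }
    return @problems;
}
```

It could be run as: print "$_\n" for check_verse_milestones(<>);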
> 
> If your document follows some simple rules (some required by xml, others 
> simplifications), then checking nesting is a simple matter of having a 
> push/pop stack of elements. The simple rules:
> 1) All attributes when present have quoted values.
> 2) All entities are properly formed and used when needed. Also, < and > are 
> not in attribute values. 
> 3) Tags are marked with < ... >, </ ... >, or < ... />, and there are no new 
> lines between < and >.
> 
> If this is true then a simple perl script can be written to find the problems 
> in the file:
> Look for < ... /> and skip them. They cause no problems.
> Look for < xxx ... > and push the tag name along with its location in the 
> file on to the stack.
> Look for </ xxx >, compare xxx to the top element on the stack. If it doesn't 
> match, then it is an error.
> When you get to the end of the document and the stack is not empty, then the 
> elements on the stack are not closed properly.
> 
> Printing out the stack (elements and locations) would help find what the 
> problem is.
> 
> For example:
>       if xxx is deeper in the stack, then there is a problem with nesting. 
> Look at all the elements above the xxx on the stack for problems.
>       if it is not in the stack, then the element was not started prior to 
> that point or it may have been ended twice.
> 
> Here is a simple perl script (that I wrote), which doesn't do that, but could 
> be adapted to do it. This creates a histogram/dictionary of tag and attribute 
> names.
> 
> #!/usr/bin/perl
> 
> use strict;
> 
> my %tags = ();
> my %attrs = ();
> while (<>)
>   {
> #print;
>     # While there is a tag on the line
>     while (/<[^\/\s>]+[\/\s>]/o)
>     {
>       # While there is an attribute in the tag
>       while (/<[^\/\s>]+\s+[^\=\/\>]+=\"[^\"]+\"/o)
>       {
>         # remove the attribute
>         s/<([^\/\s>]+)\s+([^\=\/\>]+)(\="[^\"]+\")(.*)/<$1 $4/o;
>         my ($t, $a, $v, $r) = ($1, $2, $3, $4);
>         $attrs{"$t.$a"}++;
>       }
>       # remove the tag
>       s/<([^\/\s>]+)[\/\s>]//o;
>       $tags{$1}++;
> #print("do next tag on line\n");
>     }
> #print("do next line\n");
>   }
> 
> foreach my $tag (sort keys %tags)
>   {
>     print("$tag\n");
>   }
> 
> foreach my $attr (sort keys %attrs)
>   {
>     print("$attr\n");
>   }
> 
> Hope this helps,
>       DM
> 
> On Sep 21, 2012, at 10:52 AM, Andrew Thule <thules...@gmail.com> wrote:
> 
>> Thanks everyone for suggestions.  I'll give them all a try. 
>> 
>> That said, the emacs recommendation is nearly a religious conversion 
>> recommendation.  (I'm on the vi side of the vi versus emacs debate.  I 
>> suppose as long as it doesn't kill me I should give it a try, though I'm not 
>> certain what impact it will have on the health of my soul ... :D )
>> 
>> ~A
>> 
>> 
>> On Thursday, September 20, 2012, Daniel Owens wrote:
>> I use jEdit with the XML plugin installed. I find it helps me find problems 
>> fairly easily.
>> 
>> Daniel
>> 
>> On 09/20/2012 05:26 PM, Greg Hellings wrote:
>> There are a number of pieces of software out there that will
>> pretty-print the XML for you, with indenting and whatnot. Overly
>> indented for what you would want in production but decent for
>> debugging mismatching nesting and the like.
>> 
>> For example, 'xmllint --format' will properly indent the file, etc. I
>> don't know how it will handle poorly formed XML.
>> 
>> GUI editors can do wonders as well. On Windows I use Notepad++ and
>> manually set it to display XML. gEdit and Geany - I believe - both
>> support similar display worlds. And there are some plugins for Eclipse
>> that might handle what you need as well.
>> 
>> --Greg
>> 
>> On Thu, Sep 20, 2012 at 4:19 PM, Karl Kleinpaste <k...@kleinpaste.org> wrote:
>> Andrew Thule <thules...@gmail.com> writes:
>> One of my least favourite things is finding mismatched tags in OSIS.xml files
>> Has anyone successfully climbed this summit?
>> XEmacs and xml-mode (and font-lock-mode).  M-C-f and M-C-b execute
>> sgml-forward-element and -backward-.  That is, sitting at the beginning
>> of <tag>, M-C-f (meta-control-f) moves forward to the matching </tag>,
>> properly handling nested tags.
>> 
>> _______________________________________________
>> sword-devel mailing list: sword-devel@crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
