Re: [NTG-context] ignore not closed tags in XML input

2022-05-21 Thread Pablo Rodriguez via ntg-context
On 5/19/22 17:33, juh via ntg-context wrote:
> Dear Pablo,
>
> sorry for answering late as I am on holidays learning Spanish in
> Salamanca. :-)

Many thanks for your reply, Jan-Ulrich.

I hope you are enjoying your experience in Spain.

> Am Wed, May 18, 2022 at 06:00:20PM +0200 schrieb Pablo Rodriguez via 
> ntg-context:
>> Sorry for explaining myself so poorly.
>>
>> One of the not irrelevant tasks for me is finding examples of XML code.
>
> As I know that you are fluent in German I would recommend
>
> https://deutschestextarchiv.de/

Good advice, since the DTA contains TEI XML sources.

> It is a collection of many, many texts in German with expired
> copyright in TEI XML and other formats.
>
> I had a hard time converting even one text to ConTeXt, but I've got it
> to work. I had the crazy idea of setting up a process where I can simply
> download the TEI XML source and make a nice book of the text.

Just a comment. My experience with computers is that the first time
doing anything is the hardest one.

Many thanks for your help,

Pablo
___
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki : http://contextgarden.net
___


Re: [NTG-context] ignore not closed tags in XML input

2022-05-21 Thread Pablo Rodriguez via ntg-context
On 5/19/22 00:09, Bruce Horrocks via ntg-context wrote:
>> On 18 May 2022, at 17:00, Pablo Rodriguez via ntg-context wrote:
>> Sorry for explaining myself so poorly.
>>
>> One of the not irrelevant tasks for me is finding examples of XML code.
>
> Perhaps you could start by typesetting a technical source rather
> than prose?
>
> I suggest trying to typeset the UK Meteorological Office's Shipping
> Forecast :-)
>
> [...]
>
> It's a good (in my opinion) source because it is amenable to being
> printed in several different ways: one might be to simply copy the
> webpage's layout, while another could be to use columns to fit more
> onto a single page of text.

Hi Bruce,

many thanks for your advice.

This could be a good way to practice things that I’m not used to.

After all, the things you can do with pandoc are rather limited when
considered from XML.

> Alternatively, a much more demanding exercise would be to typeset the
> user manual for the XML editing software "Oxygen".
>
> The XML source for the manual is here:
>

Many thanks for your tip, but I’m afraid this isn’t my cup of tea.

But this reminded me of the Guidelines from the Text Encoding Initiative
(https://tei-c.org).

The PDF version of these Guidelines is over 2000 pages long.

It could also be a good (and demanding) exercise.

Many thanks for your help,

Pablo


Re: [NTG-context] ignore not closed tags in XML input

2022-05-21 Thread Pablo Rodriguez via ntg-context
On 5/18/22 19:14, Thangalin via ntg-context wrote:
> Hey Pablo,
>
>> One of the not irrelevant tasks for me is finding examples of XML code.
>
> To clarify, XHTML documents /are/ XML documents. XHTML happens to use a
> standardized set of XML element and attribute names. All XHTML examples
> are also XML examples.

Hi Dave,

many thanks for the explanation.

>> But my worries came from having to sanitize HTML sources (which aren’t
>> strictly XML-compliant).
>
> That was discussed in the blog post: finding a source of well-formed
> XHTML documents. There are a number of tools to sanitize HTML, as
> mentioned in the thread. KeenWrite uses the Java-based JSoup library
> https://jsoup.org/  to sanitize HTML and then create
> an XHTML version.

After dealing with other (X)HTML sources, I have found that quite a few
of them contain sloppily encoded data (as Taco pointed out).

There are even some mismatches that xmllint doesn’t fix automatically
(as Taco already mentioned too).

Now I understand that I will also have to curate tidy XML sources to
typeset them with ConTeXt.

Many thanks for your help again,

Pablo


Re: [NTG-context] ignore not closed tags in XML input

2022-05-19 Thread juh via ntg-context
Dear Pablo,

sorry for answering late as I am on holidays learning Spanish in
Salamanca. :-)

Am Wed, May 18, 2022 at 06:00:20PM +0200 schrieb Pablo Rodriguez via 
ntg-context:
> Sorry for explaining myself so poorly.
> 
> One of the not irrelevant tasks for me is finding examples of XML code.

As I know that you are fluent in German I would recommend

https://deutschestextarchiv.de/

It is a collection of many, many texts in German with expired
copyright in TEI XML and other formats.

I had a hard time converting even one text to ConTeXt, but I've got it
to work. I had the crazy idea of setting up a process where I can simply
download the TEI XML source and make a nice book of the text.

Saludos!
juh

-- 
Autoren-Homepage: . http://literatur.hasecke.com
Satiren & Essays: . http://www.sudelbuch.de
Privater Blog:  http://www.hasecke.eu
Netzliteratur-Projekt:  http://www.generationenprojekt.de






Re: [NTG-context] ignore not closed tags in XML input

2022-05-18 Thread Bruce Horrocks via ntg-context


> On 18 May 2022, at 17:00, Pablo Rodriguez via ntg-context wrote:
> 
> 
> Sorry for explaining myself so poorly.
> 
> One of the not irrelevant tasks for me is finding examples of XML code.

Perhaps you could start by typesetting a technical source rather than prose?

I suggest trying to typeset the UK Meteorological Office's Shipping Forecast :-)

- web page version
  


- XML source data
  


- as broadcast on the Radio
  

It's a good (in my opinion) source because it is amenable to being printed in 
several different ways: one might be to simply copy the webpage's layout, while 
another could be to use columns to fit more onto a single page of text.



Alternatively, a much more demanding exercise would be to typeset the user 
manual for the XML editing software "Oxygen".
  

The XML source for the manual is here:
  


—
Bruce Horrocks
Hampshire, UK



Re: [NTG-context] ignore not closed tags in XML input

2022-05-18 Thread Thangalin via ntg-context
Hey Pablo,

> One of the not irrelevant tasks for me is finding examples of XML code.

To clarify, XHTML documents *are* XML documents. XHTML happens to use a
standardized set of XML element and attribute names. All XHTML examples are
also XML examples.

> But my worries came from having to sanitize HTML sources (which aren’t
> strictly XML-compliant).

That was discussed in the blog post: finding a source of well-formed XHTML
documents. There are a number of tools to sanitize HTML, as mentioned in
the thread. KeenWrite uses the Java-based JSoup library https://jsoup.org/
to sanitize HTML and then create an XHTML version.
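
For the narrow case that started this thread (unclosed meta and link tags), the clean-up step can be sketched in a few lines of Python using only the standard library. This is not how JSoup or xmllint work internally; it is a minimal sketch that assumes the only problem is HTML "void" elements written without a closing slash. The names VoidCloser and close_void_elements are made up for illustration. Comments and the doctype are dropped, and unclosed non-void elements (for example a missing closing p tag) are not repaired.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that HTML defines as "void": they never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class VoidCloser(HTMLParser):
    """Re-serialise HTML, writing void elements as self-closed XML tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _attrs(self, attrs):
        # Attributes without a value (e.g. "hidden") get an empty value.
        return "".join(f" {name}={quoteattr(value or '')}"
                       for name, value in attrs)

    def handle_starttag(self, tag, attrs):
        end = "/>" if tag in VOID else ">"
        self.out.append(f"<{tag}{self._attrs(attrs)}{end}")

    def handle_startendtag(self, tag, attrs):
        # Already written as <tag/> in the source: keep it self-closed.
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        # A stray </meta> or </br> would now be unbalanced, so drop it.
        if tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.out.append(escape(data))


def close_void_elements(html):
    """Return the HTML text with every void element self-closed."""
    parser = VoidCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling close_void_elements on the text of a downloaded page returns a string that can be saved and then loaded as XML, provided the void elements were the only well-formedness problem. For anything messier, a real sanitizer such as the tools named in this thread is the safer choice.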

All the best!


Re: [NTG-context] ignore not closed tags in XML input

2022-05-18 Thread Pablo Rodriguez via ntg-context
On 5/18/22 03:23, Thangalin via ntg-context wrote:
> […]
>   I wanted to write an introduction on how to typeset XML sources with
>   ConTeXt (at least, in Spanish).
>
> See:
> https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-projects/
>
> It's English, but describes a fair amount of what you're probably
> looking to accomplish, and there are all sorts of free translation
> services now.

Hi Dave,

many thanks for your reply.

Your introduction clearly states
(https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-projects/#xhtml-to-markdown):

  Even though ConTeXt can typeset XML documents, we’ll use XSLT—the
  verbose language only gurus grok without gripes—to convert XHTML into
  a Markdown document that pandoc can read to produce a native ConTeXt
  file.

I’m afraid I’m interested in typesetting XML documents with ConTeXt.

Actually, I have been typesetting XHTML documents (generated by pandoc
from Markdown sources) for years now.

Sorry for having explained myself like crap. I wanted to write an
introduction on how to typeset XML sources in ConTeXt. I cannot see how
free translation services may be of help here.

>   One of the main issues I face is to find examples.
>
> See:
>
> https://wiki.contextgarden.net/XML
> https://wiki.contextgarden.net/Getting_Started_with_XML_and_ConTeXt_using_TEXML
>
> And themes for my text editor, KeenWrite, in particular:
>
> https://github.com/DaveJarvis/keenwrite-themes/tree/main/xhtml
> https://github.com/DaveJarvis/keenwrite-themes/tree/main/tarmes
> https://github.com/DaveJarvis/keenwrite-themes/tree/main/boschet

Sorry for explaining myself so poorly.

One of the not irrelevant tasks for me is finding examples of XML code.

>   Maybe all XML handling is way more complex than I originally thought.
>
> It takes some elbow grease. Conceptually, it's essentially mapping XML
> elements to xmlsetups, which are used to apply typesetting instructions.

I agree, this is basically the idea.

But my worries came from having to sanitize HTML sources (which aren’t
strictly XML-compliant).

Many thanks for your help,

Pablo


Re: [NTG-context] ignore not closed tags in XML input

2022-05-17 Thread Thangalin via ntg-context
> I wanted to write an introduction on how to typeset XML sources with
> ConTeXt (at least, in Spanish).
>

See: https://dave.autonoma.ca/blog/2020/04/11/project-gutenberg-projects/

It's English, but describes a fair amount of what you're probably looking
to accomplish, and there are all sorts of free translation services now.


> One of the main issues I face is to find examples.
>

See:

https://wiki.contextgarden.net/XML
https://wiki.contextgarden.net/Getting_Started_with_XML_and_ConTeXt_using_TEXML

And themes for my text editor, KeenWrite, in particular:

https://github.com/DaveJarvis/keenwrite-themes/tree/main/xhtml
https://github.com/DaveJarvis/keenwrite-themes/tree/main/tarmes
https://github.com/DaveJarvis/keenwrite-themes/tree/main/boschet


> Maybe all XML handling is way more complex than I originally thought.
>

It takes some elbow grease. Conceptually, it's essentially mapping XML
elements to xmlsetups, which are used to apply typesetting instructions.
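
The mapping idea can be illustrated outside ConTeXt with a small Python analogy. This is not ConTeXt's API and says nothing about how xmlsetups are implemented; it only shows the shape of "one handler per element name, with a default that passes content through". The handler names, the HANDLERS table and the choice of elements are all made up for illustration, and TeX special characters in the text are not escaped.

```python
import xml.etree.ElementTree as ET


def render(element, handlers):
    """Render an element with the handler registered for its tag."""
    handler = handlers.get(element.tag, flush)
    return handler(element, handlers)


def flush(element, handlers):
    """Default handler: emit the element's text and all of its children."""
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child, handlers))
        parts.append(child.tail or "")
    return "".join(parts)


def heading(element, handlers):
    return "\\section{" + flush(element, handlers) + "}\n\n"


def paragraph(element, handlers):
    return flush(element, handlers) + "\n\n"


def emphasis(element, handlers):
    return "\\emph{" + flush(element, handlers) + "}"


def ignore(element, handlers):
    """Handler for elements that should produce no output at all."""
    return ""


# Element name -> handler; unlisted elements fall back to flush().
HANDLERS = {"h1": heading, "p": paragraph, "em": emphasis, "head": ignore}


def typeset(xml_text):
    """Parse well-formed XML and render it through the handler table."""
    return render(ET.fromstring(xml_text), HANDLERS)
```

Note that the input has to parse before any handler runs, which is why ignoring head in the table cannot rescue a document whose head is not well-formed.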

Cheers!


Re: [NTG-context] ignore not closed tags in XML input

2022-05-17 Thread Pablo Rodriguez via ntg-context
On 5/16/22 20:13, Taco Hoekwater via ntg-context wrote:
>> On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context wrote:
>> [...]
>> If I want to typeset the whole book
>> (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will
>> have to download and sanitize over 20 HTML files.
>
> Which can be done with a couple of command lines. Xmllint usually does a good
> job of cleaning up dodgy html input:
>
>   xmllint --html --xmlout input.html > output.xml

Many thanks for your reply, Taco.

Since I have to download the site recursively (with "wget -r"), I hope I
can find a way to chain the download and the clean-up in a single
invocation.

>> It is really a pity that ConTeXt cannot totally ignore any given XML
>> elements.
>
> This statement is a little unfair: the problem is exactly that your input
> is NOT proper XML.

My apologies. I really think ConTeXt rocks.

I wanted to write an introduction on how to typeset XML sources with
ConTeXt (at least, in Spanish).

One of the main issues I face is finding examples.

It seemed natural to me to use texts published as HTML. But it turned
out to be way trickier than I first thought.

Texts published as HTML could be eye candy for some potentially
interested people. But if one has to add a web crawler plus an XML
sanitizer to the dependencies, this makes it way harder (even for myself).

> If it was proper XML, ConTeXt would not have problems with it. ConTeXt
> explicitly has the capability to handle XML files, which your input
> simply is not. In fact, it is sloppy HTML-esque data that modern
> webbrowsers happen to be able to handle more or less correctly. It is
> not valid HTML either, because valid HTML has to be valid SGML, which
> your input clearly is not.

I agree my input isn’t proper XML, but it is valid SGML. One of the main
differences between both is that SGML allows unclosed tags.

This is why cases such as this one are corner-cases:
https://validator.w3.org/nu/?doc=https%3A%2F%2Fseumasjeltzz.github.io%2FLinguaeGraecaePerSeIllustrata%2F.

Since I considered this a corner-case, I thought that a command such as
\xmlignore{#1}{head/(meta|link)} would make sense.

> That said, tools like xmllint exist for this stuff. Just write a small
> batch driver file in some scripting language ((power)shell, lua, python,
> perl, etc.) to preprocess the HTML stuff into clean XML, and you should
> be fine.

Many thanks for your reply again.

Maybe all XML handling is way more complex than I originally thought.

Many thanks for your help,

Pablo


Re: [NTG-context] ignore not closed tags in XML input

2022-05-16 Thread Taco Hoekwater via ntg-context


> On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context wrote:
> 
> On 5/16/22 17:30, Hans van der Meer via ntg-context wrote:
>> Can't you use an editor with grep, searching for something like the
>> pattern ?
> 
> Many thanks for your reply, dr. van der Meer.
> 
> If I want to typeset the whole book
> (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will
> have to download and sanitize over 20 HTML files.

Which can be done with a couple of command lines. Xmllint usually does a good
job of cleaning up dodgy html input:

  xmllint --html --xmlout input.html > output.xml

(As good as can be expected from a program, anyway).

> It is really a pity that ConTeXt cannot totally ignore any given XML elements.

This statement is a little unfair: the problem is exactly that your input
is NOT proper XML.

If it was proper XML, ConTeXt would not have problems with it. ConTeXt
explicitly has the capability to handle XML files, which your input simply
is not. In fact, it is sloppy HTML-esque data that modern webbrowsers
happen to be able to handle more or less correctly. It is not valid HTML
either, because valid HTML has to be valid SGML, which your input clearly
is not.

That said, tools like xmllint exist for this stuff. Just write a small
batch driver file in some scripting language ((power)shell, lua, python,
perl, etc.) to preprocess the HTML stuff into clean XML, and you should
be fine.
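
Such a batch driver could look roughly like the following Python sketch. It assumes that xmllint is installed and on the PATH, that the HTML files already sit in a local directory tree (for instance one produced by "wget -r"), and that writing each result next to its source with an .xml extension is acceptable. The function names xmllint_command and convert_directory are made up for illustration; only the --html and --xmlout options come from the command line shown earlier in this message.

```python
import subprocess
from pathlib import Path


def xmllint_command(source):
    """Build the xmllint call that turns one HTML file into XML on stdout."""
    return ["xmllint", "--html", "--xmlout", str(source)]


def convert_directory(directory, run=subprocess.run):
    """Convert every *.html file below directory; return the files written.

    The run parameter exists only so the external call can be swapped out,
    for example in a test that must not depend on xmllint being installed.
    """
    written = []
    for source in sorted(Path(directory).rglob("*.html")):
        target = source.with_suffix(".xml")
        result = run(xmllint_command(source), capture_output=True)
        # Keep whatever document came back on stdout; anything the tool
        # printed on stderr stays available in result.stderr.
        if result.stdout:
            target.write_bytes(result.stdout)
            written.append(target)
    return written
```

A call such as convert_directory("seumasjeltzz.github.io") (a hypothetical local directory name) would then leave one .xml file beside each downloaded .html file, ready to be loaded by ConTeXt. As noted elsewhere in the thread, the output is only as good as the tool's repair of the input, so a spot check of the results remains worthwhile.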

Taco

— 
Taco Hoekwater  E: t...@bittext.nl
genderfluid (all pronouns)





Re: [NTG-context] ignore not closed tags in XML input

2022-05-16 Thread Pablo Rodriguez via ntg-context
On 5/16/22 17:30, Hans van der Meer via ntg-context wrote:
> Can't you use an editor with grep, searching for something like the
> pattern ?

Many thanks for your reply, dr. van der Meer.

If I want to typeset the whole book
(https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will
have to download and sanitize over 20 HTML files.

And I’m afraid this is only for a single PDF output.

It is really a pity that ConTeXt cannot totally ignore any given XML
elements.

Many thanks for your help,

Pablo


Re: [NTG-context] ignore not closed tags in XML input

2022-05-16 Thread Pablo Rodriguez via ntg-context
On 5/16/22 17:22, mf via ntg-context wrote:
> See HTML-tidy,
>
> https://www.html-tidy.org/
>
> it could help you pre-processing your HTML files.

Hi Massi,

the problem is that they aren’t my HTML files and that this is a very
common error.

I’m afraid that pre-processing could work for a few files, but it
wouldn’t be a general solution if I wanted to use it with any HTML file
that I might need.

Many thanks for your help,

Pablo


Re: [NTG-context] ignore not closed tags in XML input

2022-05-16 Thread Hans van der Meer via ntg-context
Can't you use an editor with grep, searching for something like the pattern 
 (with appropriate escapes of course).

dr. Hans van der Meer


> On 16 May 2022, at 17:08, Pablo Rodriguez via ntg-context wrote:
> 
> Dear list,
> 
> I would like to feed
> https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/001.html as
> XML input for ConTeXt.
> 
> The problem is that (as many other XML files that I haven’t generated
> myself) some <meta> and <link> tags aren’t closed, such as in:
> 
>  
>  https://fonts/css?greek; rel="stylesheet">
>  
> 
> So, all that I get is the following message:
> 
>  invalid xml file - parsed text
> 
> Unsuccessfully I have tried the following:
> 
>  \xmlsetsetup{#1}{html/head/(meta|link)}{-}
> 
> Is there no way to make ConTeXt more tolerant, so that it is able to
> ignore those tags?
> 
> Many thanks for your help,
> 
> Pablo


Re: [NTG-context] ignore not closed tags in XML input

2022-05-16 Thread mf via ntg-context

See HTML-tidy,

https://www.html-tidy.org/

it could help you pre-processing your HTML files.

Massi

Il 16/05/22 17:08, Pablo Rodriguez via ntg-context ha scritto:

Dear list,

I would like to feed
https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/001.html as
XML input for ConTeXt.

The problem is that (as many other XML files that I haven’t generated
myself) some <meta> and <link> tags aren’t closed, such as in:

   
   https://fonts/css?greek; rel="stylesheet">
   

So, all that I get is the following message:

   invalid xml file - parsed text

Unsuccessfully I have tried the following:

   \xmlsetsetup{#1}{html/head/(meta|link)}{-}

Is there no way to make ConTeXt more tolerant, so that it is able to
ignore those tags?
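
The failure described above is not specific to ConTeXt: any conforming XML parser has to reject an element that is opened and never closed, and it does so before any per-element rule can be applied. A minimal Python sketch with the standard library shows the same behaviour. The two sample strings are made up for illustration and are not taken from the page in question.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# HTML style: <meta> is opened and never closed.
UNCLOSED = ('<html><head><meta charset="utf-8">'
            '<title>t</title></head><body/></html>')

# XML style: the same element, self-closed.
CLOSED = ('<html><head><meta charset="utf-8"/>'
          '<title>t</title></head><body/></html>')

if __name__ == "__main__":
    print("unclosed meta:", is_well_formed(UNCLOSED))
    print("closed meta:  ", is_well_formed(CLOSED))
```

In the first string the parser still considers meta open when it meets the closing head tag, so it reports a mismatched tag and stops. That is why a setup that targets head/meta cannot help: the tree it would operate on is never built.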

Many thanks for your help,

Pablo
