Re: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-28 Thread Christopher Creutzig
Idris Samawi Hamid wrote:

But you should also explore DocBook-in-ConTeXt, which
uses ConTeXt's native XML processing capabilities.
 
 
 Is it possible to create a Word template that is isomorphic with a DocBook 
 format?

 You can write a Word template isomorphic to a (pretty large) subset of
DocBook, although I believe Word does not allow you to introduce new
types of crossreferences, so you can't reach everything DocBook has.
Whether you can make your authors use it consistently is a different
matter – DocBook uses, for example, different elements for different
types of what ConTeXt calls typing: code for inline code fragments,
command for something you invoke (use option for its options and
symbol for the placeholders to be replaced by actual values),
computeroutput should be obvious but there are also screen and
screenshot – the difference is a bit subtle; programmers might also
use constant, errorcode, errorname, errortext, exceptionname,
funcdef, funcprototype, funcsynopsis, cmdsynopsis,
constructorsynopsis, arg, function, methodname, methodparam,
methodsynopsis, ooclass, ooexception, oointerface,
programlisting and its annotated cousin programlistingco, sgmltag,
structfield, structname, varargs, varname; envar denotes
environment variables, filename is almost superfluous since it is a
special case of a systemitem (yes, many of these elements carry
further meta-information in their attributes), then there are the
unspecific literal and literallayout elements and also markup,
userinput, and finally there is also uri to format URLs and other URIs.

 Somewhat related elements also abound: keycap is used to denote keys
on the keyboard, keycombo for combinations of those keys.  guibutton
is used for the text on a button in a GUI, guilabel, guimenu,
guimenuitem and guisubmenu and many others.

 Note that I do not question this abundance of possibilities.  After
all, it is logical markup taken to an extreme and probably no one really
uses all of it, yet all the parts are already there if you want them.
I do question the likelihood of the average Word user (who, let's face
it, probably never used formats since the introductory course) making
good use of this.  Sure it is nice to have authors' first and last names
explicitly marked in your text, but someone has to go there and do that,
and if they don't see any difference on their screens after doing it,
they will get lazy and not do it for the fifteenth person they name.
Additionally, most of the DocBook elements may only appear nested in the
correct places in other elements, which makes using an isomorphic Word
template rather challenging even for the advanced user.
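
 As a rough illustration of what such a template would have to encode,
here is a minimal Python sketch. It assumes hypothetical Word character
style names (the keys of STYLE_TO_DOCBOOK are invented for this example;
only the DocBook element names come from the list above) and wraps each
styled text run in the matching inline element. It does not check
DocBook's nesting rules at all, which is exactly the hard part.

```python
from xml.sax.saxutils import escape

# Hypothetical Word character style names mapped to DocBook inline elements.
# The style names are assumptions; the element names are DocBook's.
STYLE_TO_DOCBOOK = {
    "Code": "code",
    "Command": "command",
    "Option": "option",
    "File Name": "filename",
    "Environment Variable": "envar",
    "Key": "keycap",
    "GUI Button": "guibutton",
}

def runs_to_docbook(runs):
    """Convert (style, text) runs into a DocBook <para> string.

    Runs with an unknown or missing style are emitted as plain text.
    """
    parts = []
    for style, text in runs:
        element = STYLE_TO_DOCBOOK.get(style)
        if element is None:
            parts.append(escape(text))
        else:
            parts.append("<%s>%s</%s>" % (element, escape(text), element))
    return "<para>" + "".join(parts) + "</para>"

example = runs_to_docbook([
    (None, "Run "),
    ("Command", "texexec"),
    (None, " on "),
    ("File Name", "article.tex"),
    (None, "."),
])
print(example)
```

 The sketch only works if authors apply the character styles in the
first place, so the lookup table is the easy half of the job.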

 If you would like to browse through the long list of markup items in
DocBook, please see http://docbook.org/tdg/en/html/docbook.html – and do
not be afraid; as I said above, very much of what is there is absolutely
special-purpose stuff.

 Adam (privately) suggested hiring someone to write a structured format for 
 authors. Is that where docbook comes in?

 Actually, I would not orient the thing to fit to DocBook.  DocBook is
an extremely flexible beast, so if, after designing the structured
format best suited for your needs (this does not need to involve any
xml), you want to map it to DocBook, that should not be any problem.

 Basically, authors in the humanities use Word and it's virtually a lost cause 
 getting them to switch to anything else, even free tools like OO.o (let alone 
 ConTeXt). It would have to be something where I could do 
 word => docbook => ConTeXt.

 As I said: Offering Word is obviously a must, but if that were the only
option you offered, you'd be actively adding your part to making sure
the situation does not change.  And getting Word to export DocBook will
certainly be much harder than using OOo for that part.


Christopher
___
ntg-context mailing list
ntg-context@ntg.nl
http://www.ntg.nl/mailman/listinfo/ntg-context


Re: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-28 Thread Christopher Creutzig
Idris Samawi Hamid wrote:
 Ok, you guys have lost me now-) Maybe the best thing to do is try something 

 Just ignore the detail of what xslt can and can't do for the moment.
That just influences the choice of tools for one particular step and we
all agree that there are tools for this step.

 it to ConTeXt. From what I gather so far the process goes something like
 
 doc  => rtf 
 rtf  => OO.o
 OO.o => xml

 No need for rtf.  That would lose lots of information anyway, wouldn't it?

 \startHans
 converting open office xml is not always easy; stay away from tab's and use 
 high level constructs as much as possible
 \stopHans

 I'm not really sure what Hans meant by this.  I assume he does have a
valid point, since so far I only had a short and theoretical look at the
format, but I can only guess what it is.  Hans, could you give an
example or two?

From this discussion it seems that I (as an xml ignoramus) would be better 
 off converting to ConTeXt code rather than processing pure xml blocks (but 
 maybe I'm wrong).

 XML is much, much easier to parse than just about anything else.  That
means that whatever your conversion process uses, you can simply reuse
an XML parser in whatever language you want to use.  (Interpreting the
file may be easy or hard, depending on the xml structure at hand.)  The
only exception I can see right now would be a rather large and
error-prone “Visual” Basic program to create a sort of export filter for
Word to write ConTeXt.  I certainly don't think that's easier.
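
 To illustrate the point about reusing a stock parser, here is a minimal
sketch using the XML parser from Python's standard library. The article
format and its element names (article, section, title, para) are made up
for the example; the parsing itself is a single call, and everything
after that is interpretation.

```python
import xml.etree.ElementTree as ET

# A made-up article format; the element names are assumptions for illustration.
SAMPLE = """<article>
  <title>On Transliteration</title>
  <section><title>Introduction</title><para>First paragraph.</para></section>
  <section><title>Method</title><para>Second paragraph.</para></section>
</article>"""

def section_titles(xml_text):
    """Return the titles of all top-level sections, in document order."""
    root = ET.fromstring(xml_text)
    return [sec.findtext("title") for sec in root.findall("section")]

titles = section_titles(SAMPLE)
print(titles)  # ['Introduction', 'Method']
```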

 Once I get a sane xml file (this seems to be the biggest problem) what is the 
 best tool to convert this to ConTeXt?

 It depends on who is going to write the conversion.  From the languages
I've used so far, it's probably easiest to do in xslt, but if you
are/have at hand a programmer who's good at ruby but would have to learn
xslt first, the whole thing may not be big enough to warrant learning
another language first.  Unless that programmer wants to, which would be
a very good sign.  Learning a new language per year is not really a bad
idea.

 We are all extremely busy, of course, but if anyone finds this interesting I 
 can send a sample doc article from my journal. Maybe we can do a MyWay or 
 something to document this process for ourselves and others, as well as find 

 It might be a pretty specific thing, though.  My guess is that you
could make more progress by thinking about what sort of structural elements you
would like to have, rather than looking at what you have right now.


Christopher


Re: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-28 Thread Duncan Hothersall
  No need for rtf.  That would lose lots of information anyway, wouldn't it?

RTF can capture everything that .doc can (MS update it every time they
rev the .doc format), and it has the advantage that it is defined in a
spec with a grammar, which means that importing routines (like the one
in OO.o) tend to be better than for the binary .doc format. So I would
usually use .rtf as the Save As... from Word, rather than relying on
OO.o's reverse engineering of the .doc format. Others' experiences may
vary, of course, and perhaps I do an injustice to OO.o's Word imports,
which have certainly improved. But RTF is a fairly safe bet, and
additionally it is 'human readable' so that helps debugging.

\startHans
converting open office xml is not always easy; stay away from tab's and use 
high level constructs as much as possible
\stopHans

I would add to this - make sure you use either OO.o 1.1.5 or a 2.0 Beta,
since earlier versions used a file format which was a lot trickier to
post-process (problems with conflating styles into paragraph formats).

Once I get a sane xml file (this seems to be the biggest problem) what is the 
best tool to convert this to ConTeXt?

Well you might not need to - remember that ConTeXt can process XML
natively now, which is why I suggested you look at the
DocBook-in-ConTeXt project, which uses this feature. You wouldn't
necessarily have to use the DocBook standard, but you could use the
principles of that project to define a nice output from your own
(simple) brand of XML.

Duncan


Re: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-28 Thread Christopher Creutzig
Duncan Hothersall wrote:

 RTF can capture everything that .doc can (MS update it every time they
 rev the .doc format), and it has the advantage that it is defined in a
 spec with a grammar, which means that importing routines (like the one

 Oh, yes, the RTF spec.  It really makes you wonder what Microsoft
employees understand by the word “spec.”  Word breaks almost every
single rule in that spec and has done so for ages:  “The LetterSequence
is made up of lowercase alphabetic characters (a-z). RTF is case
sensitive.  The following Word 97-2000 keywords do not currently follow
the requirement that keywords may not contain any uppercase alphabetic
characters.  ...”  But I should be happy that these violations are
actually documented.
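
 For anyone who wants to see how often a given file breaks the lowercase
rule quoted above, here is a minimal Python sketch. It treats a control
word as a backslash followed by a run of letters, and the sample fragment
is made up for illustration rather than taken from a real Word export.
It is deliberately naive: it does not handle escaped backslashes or
embedded binary data, so treat the result as a rough count only.

```python
import re

# A control word is taken to be a backslash followed by a run of letters.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)")

def uppercase_keywords(rtf_text):
    """Return the distinct control words containing an uppercase letter, sorted."""
    words = set(CONTROL_WORD.findall(rtf_text))
    return sorted(w for w in words if w != w.lower())

# Made-up fragment for illustration; not taken from a real Word export.
SAMPLE = r"{\rtf1\ansi\trowd\clNoWrap\cellx1000 text\cell\clFitText\cellx2000\row}"

found = uppercase_keywords(SAMPLE)
print(found)  # ['clFitText', 'clNoWrap']
```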

 in OO.o) tend to be better than for the binary .doc format. So I would

 Okay; I did not know that whatever Microsoft currently calls RTF is
actually able to save all Word files losslessly.  (I am in the lucky
position not to have any Word files to convert.)  Makes me wonder if
there really is any need for an XML step in between.  Can OOo convert
RTF to XML without user intervention, such as clicking somewhere with a
mouse?  Maybe rtf2fo.com, http://www.infinity-loop.de/products/upcast/,
or http://sourceforge.net/projects/majix/ are good alternatives for this
step?  (I never used any one of them.)

 which have certainly improved. But RTF is a fairly safe bet, and
 additionally it is 'human readable' so that helps debugging.

 Asking a human to read RTF is certainly inhuman.  :-)

 But there is another advantage of using RTF: Authors can use almost any
word processor they want. :-)

 Well you might not need to - remember that ConTeXt can process XML
 natively now, which is why I suggested you look at the

 But unless I'm mistaken, this is based on a streaming model, which has
its advantages, but also disadvantages.  So, the question is whether the
xml format is close enough to the order in which ConTeXt would like to
get the bits and pieces.  Since the format has not been defined yet,
this question should be kept in mind.
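
 A small Python sketch of the ordering concern, using the standard
library's streaming parser. The input format is made up, and
\ArticleTitle and \ArticleAuthor are hypothetical user-defined macros,
not ConTeXt built-ins. Because the author element comes after the body
in the input, a purely streaming converter can only emit it last, unless
it buffers, which gives up the benefit of streaming.

```python
import io
import xml.etree.ElementTree as ET

# Made-up format: the author comes *after* the body text.
SAMPLE = b"""<article>
  <title>A Title</title>
  <para>Body text.</para>
  <author>A. Author</author>
</article>"""

def stream_convert(data):
    """Emit one output line per element, in the order the elements end."""
    out = []
    for event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "title":
            # \ArticleTitle is a hypothetical macro, assumed to be defined elsewhere.
            out.append("\\ArticleTitle{%s}" % elem.text)
        elif elem.tag == "para":
            out.append(elem.text)
        elif elem.tag == "author":
            # \ArticleAuthor is likewise hypothetical.
            out.append("\\ArticleAuthor{%s}" % elem.text)
        elem.clear()  # drop the element's content once it has been handled
    return out

lines = stream_convert(SAMPLE)
print("\n".join(lines))
```

 If the typeset result needs the author before the body, either the xml
format has to supply it in that order or the converter has to hold the
body in memory until the author arrives.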


Christopher


Re: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-27 Thread Christopher Creutzig
Duncan Hothersall wrote:
Question: Is it possible to design a doc or rtf template that Open Office can 
convert to a sane, consistent xml format? 
 
 
 OpenOffice.org does allow you to attach an XSLT stylesheet to an export
 process which therefore allows you to do a (limited) transformation from
 the visual markup which is its native format to a more structured one

 Why „limited“?  Complicated things are just, well, a bit complicated to
achieve.  It is certainly possible to get a structured document from,
say, an average xhtml file.  I would prefer not to write that code,
though.  It would be rather boring and full of hard-to-read special cases.

 which you would need. But the biggest challenge is that all
 wordprocessors are designed for visual editing, meaning that there are,
 for example, 15 or so different ways to get a bulleted list in Word,
 creating 15 or so different RTF constructs, and coping with this can be
 a nightmare.

 Yes, it can.  (Although RTF is completely unrelated to this problem,
since OOo would read the Word file.  And the OOo step greatly simplifies
the problem, since iirc the OOo format has just one or maybe two ways of
saving bulleted lists.  Or were you referring to different bullets?)  The
stricter your rules for the authors are, the easier it is to write the
required xslt program.  If your authors expect to be able to write
chapter headers by manually switching to a font in the range of 20 to 24
pt and adding a number in front, you've got a hell of a coding session
in front of you.  If, otoh, you take the dictatorial approach of
telling them in advance that manual font changes (maybe apart from
pseudo-italics and pseudo-bold which will be mapped to \em in the end)
will simply be ignored, your code will be much easier but you may have a
problem with the authors.
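
 The strict approach can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the input is a made-up,
simplified xml (h, p and span elements with level and style attributes),
not the real OOo file format, and TeX special characters in the text are
not escaped. Headings are recognised by their declared level, italic and
bold spans both become \em, and every other manual formatting is
silently dropped while its text is kept.

```python
import xml.etree.ElementTree as ET

EMPHASIS_STYLES = {"italic", "bold"}  # both are mapped to \em
HEADINGS = {"1": "\\chapter", "2": "\\section", "3": "\\subsection"}

def inline(elem):
    """Flatten an element's mixed content to ConTeXt text."""
    parts = [elem.text or ""]
    for child in elem:
        inner = inline(child)
        if child.tag == "span" and child.get("style") in EMPHASIS_STYLES:
            parts.append("{\\em %s}" % inner)
        else:
            parts.append(inner)  # other manual formatting is ignored
        parts.append(child.tail or "")
    return "".join(parts)

def convert(xml_text):
    """Convert the simplified document to ConTeXt source text."""
    blocks = []
    for block in ET.fromstring(xml_text):
        if block.tag == "h":
            command = HEADINGS.get(block.get("level"), "\\subsection")
            blocks.append("%s{%s}" % (command, inline(block)))
        elif block.tag == "p":
            blocks.append(inline(block))
    return "\n\n".join(blocks)

# Made-up sample input; the "size-22pt" style stands for a manual font change.
SAMPLE = """<doc>
<h level="1">Introduction</h>
<p>Some <span style="italic">emphasised</span> and <span style="size-22pt">huge</span> text.</p>
</doc>"""

result = convert(SAMPLE)
print(result)
```

 The code stays this short only because the rules are strict. Guessing
headings from font sizes would replace the HEADINGS lookup with a pile
of heuristics.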

 The FO approach (Paul Tremblay's focus) is one way to process XML to
 paginated output, but there are many others. Personally I don't like the
 FO approach, for a variety of reasons, but I'm sure others have had
 success with it. But you should also explore DocBook-in-ConTeXt, which
 uses ConTeXt's native XML processing capabilities. And don't rule out

 The advantage of using DocBook is that you get a very rich set of
capabilities.  The disadvantage can be described in almost the same
words, plus, as I said before, DocBook is one of the most verbose
formats in common use.  If you only use the format as an intermediate
step, that is irrelevant, but if your authors will send in files that
way, it is not.

 using a separate scripting language to convert XML into ConTeXt as a
 batch process, since that will give you the ultimate flexibility in
 accessing all of ConTeXt's abilities.

 Personally, I'd use xslt for that.  Navigating the xml tree is
extremely easy and writing out text instead of xml is not really a problem.

Question: Does the entire journal have to be programmed in xml or can 
ConTeXt process xml locally? For example, I may have my own article done in 
ConTeXt mixed with other articles done in rtf=xml.
 
 
 You can just put XML into \startXMLdata ... \stopXMLdata blocks. I do
 this for MathML processing within a larger ConTeXt document.

 I'd approach Idris' problem the other way round: Transform the xml
files to ConTeXt and leave the ConTeXt files as is.  Then, texexec the
whole thing.
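
 A minimal sketch of that workflow in Python, assuming that every article
ends up as a body-only .tex file (no \starttext of its own) and that the
file names are hypothetical. It only plans the build: xml sources are
listed for conversion, ConTeXt sources are left alone, and a driver file
inputs all of them in order. The conversion step itself and the texexec
call are left out.

```python
import os

def plan_build(sources):
    """Split sources into (xml, tex) conversion jobs and names to input."""
    to_convert, inputs = [], []
    for src in sources:
        base, ext = os.path.splitext(src)
        if ext == ".xml":
            to_convert.append((src, base + ".tex"))
        inputs.append(base)
    return to_convert, inputs

def driver_text(inputs):
    """Return the text of a ConTeXt driver file that inputs each article."""
    lines = ["\\starttext"]
    lines += ["\\input %s" % name for name in inputs]
    lines.append("\\stoptext")
    return "\n".join(lines) + "\n"

# Hypothetical file names: one article written in ConTeXt, one arriving as xml.
jobs, names = plan_build(["hamid.tex", "smith.xml"])
print(jobs)
print(driver_text(names))
```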

Any other advice (and/or pitfalls to watch for) would be appreciated. This 
sounds very promising!
 
 
 Horses for courses. It's possible to get sucked into things like an FO
 implementation or an XML conversion and find that you have spent months
 perfecting it and it only shaves half an hour off your production time!

 Amen.

 Also, don't limit your authors to Word.  Offering Word is obviously a
requirement, but if you go the way through OOo, there would be no point
in not offering an OOo template file.  If you are using a standard xml
format, such as (a subset of) DocBook or TEI, you probably should accept
articles in that format, too.  And, of course, ConTeXt.


Christopher


Re: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-27 Thread Duncan Hothersall
Slightly OT, sorry:

OpenOffice.org does allow you to attach an XSLT stylesheet to an export
process which therefore allows you to do a (limited) transformation from
the visual markup which is its native format to a more structured one
 
  Why „limited“?  

Well, XSLT seems to have been designed, and certainly tends to be
implemented, as a tool for simple transformations of small XML chunks.
Obviously complex transformations can be constructed from a bunch of
simple transformations, but there comes a point when you should really
just use a better tool - though these tend to cost serious money (e.g.
OmniMark). Also, most XSLT implementations use the DOM model, which is
fine for a 50Kb file but will be incredibly resource-hungry if you're
processing files of 5Mb. At that point you want a streaming model, and
for a streaming model you want a better suited language than XSLT. As I
say, horses for courses. For article-length pieces and simple
transforms, XSLT might suffice.

  Also, don't limit your authors to Word.  Offering Word is obviously a
 requirement, but if you go the way through OOo, there would be no point
 in not offering an OOo template file.  If you are using a standard xml
 format, such as (a subset of) DocBook or TEI, you probably should accept
 articles in that format, too.  And, of course, ConTeXt.

Absolutely; particularly if you can offer authors an incentive or direct
benefit from adopting OO.o, such as speed of turnaround of proofs, etc.


Re: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-27 Thread Christopher Creutzig
Duncan Hothersall wrote:
 Well, XSLT seems to have been designed, and certainly tends to be
 implemented, as a tool for simple transformations of small XML chunks.

 No, xslt is a tool for arbitrary xml -> xml conversions (and a little
more than that).  With a good implementation (say, saxon), working with
moderately large trees is pretty fast.  The stylesheet is actually
compiled before running.

 Obviously complex transformations can be constructed from a bunch of
 simple transformations, but there comes a point when you should really

 Just about any programming language gives you simple operations to
build whatever you want from.

 just use a better tool - though these tend to cost serious money (e.g.

 „Better“ depends on your task at hand.

 OmniMark). Also, most XSLT implementations use the DOM model, which is

 XSLT uses a DOM model, which is different from the W3C DOM model.

 fine for a 50Kb file but will be incredibly resource-hungry if you're
 processing files of 5Mb. At that point you want a streaming model, and

 That depends on what you want to do with your data.  For many of my
needs, a streaming model simply wouldn't work without keeping lots of
information (to be processed later) in memory, defeating the model.

 I have found splitting my data into files that form conceptual units
to be a good way, both for editing the files and for turnaround times.
(I am using Makefiles, so the granularity of finding unchanged items for
me is the file.)  We are talking about almost 15MB here, which I regard
as pretty much, considering it is almost pure text.
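
 The per-file granularity mentioned above boils down to a timestamp
comparison. Here is a minimal Python sketch of the rule Make applies,
with hypothetical source and target names. It is only an illustration of
the idea, not a replacement for a Makefile.

```python
import os

def needs_rebuild(source, target):
    """True if target is missing or older than source (Make's basic rule)."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)

def stale_targets(pairs):
    """Return the targets from (source, target) pairs that must be rebuilt."""
    return [target for source, target in pairs if needs_rebuild(source, target)]
```

 Called with pairs such as ("chapter1.xml", "chapter1.tex"), it returns
only the targets whose sources changed, so an edit to one chapter does
not trigger a conversion of all the others.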

 Again, I don't mind using something else on XML data.  I'm doing it
myself.  It all depends on what you want to do.  In the case of
transforming xml to ConTeXt, I would go for an xslt implementation, but
ymmv.  After all, the choice of tools always depends on many factors,
including familiarity.  (I've continued using perl instead of ruby for
ages, until recently, for that reason.)

 for a streaming model you want a better suited language than XSLT. As I
 say, horses for courses. For article-length pieces and simple
 transforms, XSLT might suffice.

 For number crunching, xslt is certainly inadequate.  Transforming books
of average length (say, 300-500 pages) is certainly doable, although I
would go for a transformation chapter-by-chapter, especially considering
that we are talking about a process where crossreferences etc. are going
to be handled later in the chain.  But I thought we were talking about
article-length pieces anyway?


Christopher


RE: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-27 Thread Idris Samawi Hamid
Hi Christopher, Duncan, Hans, and Adam,

Thank you so much for your detailed comments and suggestions. Again, I'm 
completely new to xml and feel like a fish out of water. OTOH I use sooo much 
time just manually extracting text (with innumerable transliteration 
diacritics) and then copying-pasting to WinEDT that I am willing to explore 
the xml approach if it can be made sane enough...

= Original Message From Christopher Creutzig [EMAIL PROTECTED] =
Duncan Hothersall wrote:
 Well, XSLT seems to have been designed, and certainly tends to be
 implemented, as a tool for simple transformations of small XML chunks.

 No, xslt is a tool for arbitrary xml -> xml conversions (and a little
more than that).

Ok, you guys have lost me now-) Maybe the best thing to do is try something 
practical: take an average word article and see what's involved in converting 
it to ConTeXt. From what I gather so far the process goes something like

doc  => rtf 
rtf  => OO.o
OO.o => xml

But here things get dicey because

\startHans
converting open office xml is not always easy; stay away from tab's and use 
high level constructs as much as possible
\stopHans

Question: Will a proper doc (or OO.o) template solve this problem or is this a 
post-OO.o-processing problem no matter what I do beforehand?

From this discussion it seems that I (as an xml ignoramus) would be better 
off converting to ConTeXt code rather than processing pure xml blocks (but 
maybe I'm wrong).

Once I get a sane xml file (this seems to be the biggest problem) what is the 
best tool to convert this to ConTeXt?

We are all extremely busy, of course, but if anyone finds this interesting I 
can send a sample doc article from my journal. Maybe we can do a MyWay or 
something to document this process for ourselves and others, as well as find 
the most practical approach to creating a sane workflow. Besides, this kind of 
project seems to be exactly the kind of thing to illustrate the full power of 
ConTeXt.

This is a mid-term project so no urgency (I'll keep copying and pasting for 
now-)

Thanks again you all for your advice.

Best
Idris


Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523



RE: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-27 Thread Idris Samawi Hamid
Hi Duncan,

I know little about xml and virtually nothing about Word (except that it's 
crap) so please forgive me if this is a stupid or clueless question-)

But you should also explore DocBook-in-ConTeXt, which
uses ConTeXt's native XML processing capabilities.

Is it possible to create a Word template that is isomorphic with a DocBook 
format?

Adam (privately) suggested hiring someone to write a structured format for 
authors. Is that where docbook comes in?

Basically, authors in the humanities use Word and it's virtually a lost cause 
getting them to switch to anything else, even free tools like OO.o (let alone 
ConTeXt). It would have to be something where I could do 
word => docbook => ConTeXt.

Sigh

Best
Idris


Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523



Re: [NTG-context] DOC/RTF to ConTeXt via XML

2005-09-27 Thread Adam Lindsay
Idris Samawi Hamid said this at Tue, 27 Sep 2005 09:10:27 -0600:

Adam (privately) suggested hiring someone to write a structured format for 
authors. Is that where docbook comes in?

Ah, sorry about that. I meant you *could* hire someone to design a
format, but the bigger point was that it would be rather futile without
a user-level authoring tool backing it up!
-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Adam T. Lindsay, Computing Dept.     [EMAIL PROTECTED]
 Lancaster University, InfoLab21      +44(0)1524/510.514
 Lancaster, LA1 4WA, UK          Fax: +44(0)1524/510.492
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
