Re: [NTG-context] DOC/RTF to ConTeXt via XML
Idris Samawi Hamid wrote:

>> But you should also explore DocBook-in-ConTeXt, which uses ConTeXt's native XML processing capabilities.
>
> Is it possible to create a Word template that is isomorphic with a DocBook format?

You can write a Word template isomorphic to a (pretty large) subset of DocBook, although I believe Word does not allow you to introduce new types of cross-references, so you can't reach everything DocBook has. Whether you can make your authors use it consistently is a different matter. DocBook uses, for example, different elements for different types of what ConTeXt calls typing: code for inline code fragments; command for something you invoke (use option for its options and symbol for the placeholders to be replaced by actual values); computeroutput should be obvious, but there are also screen and screenshot, and the difference is a bit subtle. Programmers might also use constant, errorcode, errorname, errortext, exceptionname, funcdef, funcprototype, funcsynopsis, cmdsynopsis, constructorsynopsis, arg, function, methodname, methodparam, methodsynopsis, ooclass, ooexception, oointerface, programlisting and its annotated cousin programlistingco, sgmltag, structfield, structname, varargs, varname. envar denotes environment variables; filename is almost superfluous since it is a special case of a systemitem (yes, many of these elements carry further meta-information in their attributes); then there are the unspecific literal and literallayout elements and also markup and userinput; and finally there is uri to format URLs and other URIs.

Somewhat related elements also abound: keycap is used to denote keys on the keyboard, keycombo for combinations of those keys. guibutton is used for the text on a button in a GUI, and there are guilabel, guimenu, guimenuitem, guisubmenu and many others.

Note that I do not question this abundance of possibilities. After all, it is logical markup taken to an extreme, and probably no one really uses all of it, yet all the parts are already there if you want them. I do question the likelihood of the average Word user (who, let's face it, has probably never used formats since the introductory course) making good use of this. Sure, it is nice to have authors' first and last names explicitly marked in your text, but someone has to go there and do that, and if they don't see any difference on their screens after doing it, they will get lazy and not do it for the fifteenth person they name. Additionally, most of the DocBook elements may only appear nested in the correct places in other elements, which makes using an isomorphic Word template rather challenging even for the advanced user.

If you would like to browse through the long list of markup items in DocBook, please see http://docbook.org/tdg/en/html/docbook.html – and do not be afraid; as I said above, much of what is there is absolutely special-purpose stuff.

> Adam (privately) suggested hiring someone to write a structured format for authors. Is that where docbook comes in?

Actually, I would not orient the thing to fit DocBook. DocBook is an extremely flexible beast, so if, after designing the structured format best suited to your needs (this does not need to involve any xml), you want to map it to DocBook, that should not be any problem.

> Basically, authors in the humanities use Word and it's virtually a lost cause getting them to switch to anything else, even free tools like OO.o (let alone ConTeXt). It would have to be something where I could do word => docbook => ConTeXt.

As I said: Offering Word is obviously a must, but if that were the only option you offered, you'd be actively adding your part to making sure the situation does not change. And getting Word to export DocBook will certainly be much harder than using OOo for that part.
Christopher ___ ntg-context mailing list ntg-context@ntg.nl http://www.ntg.nl/mailman/listinfo/ntg-context
Re: [NTG-context] DOC/RTF to ConTeXt via XML
Idris Samawi Hamid wrote:

> Ok, you guys have lost me now-) Maybe the best thing to do is try something

Just ignore the detail of what xslt can and can't do for the moment. That just influences the choice of tools for one particular step, and we all agree that there are tools for this step.

> it to ConTeXt. From what I gather so far the process goes something like doc => rtf, rtf => OO.o, OO.o => xml

No need for rtf. That would lose lots of information anyway, wouldn't it?

> \startHans converting open office xml is not always easy; stay away from tabs and use high level constructs as much as possible \stopHans

I'm not really sure what Hans meant by this. I assume he does have a valid point, since so far I have only had a short and theoretical look at the format, but I can only guess what it is. Hans, could you give an example or two?

> From this discussion it seems that I (as an xml ignoramus) would be better off converting to ConTeXt code rather than processing pure xml blocks (but maybe I'm wrong).

XML is much, much easier to parse than just about anything else. That means that whatever your conversion process uses, you can simply reuse an XML parser in whatever language you want to use. (Interpreting the file may be easy or hard, depending on the xml structure at hand.) The only exception I can see right now would be a rather large and error-prone “Visual” Basic program to create a sort of export filter for Word to write ConTeXt. I certainly don't think that's easier.

> Once I get a sane xml file (this seems to be the biggest problem) what is the best tool to convert this to ConTeXt?

It depends on who is going to write the conversion. Of the languages I've used so far, it's probably easiest to do in xslt, but if you are (or have at hand) a programmer who's good at ruby but would have to learn xslt first, the whole thing may not be big enough to warrant learning another language first. Unless that programmer wants to, which would be a very good sign. Learning a new language per year is not really a bad idea.

> We are all extremely busy, of course, but if anyone finds this interesting I can send a sample doc article from my journal. Maybe we can do a MyWay or something to document this process for ourselves and others, as well as find

It might be a pretty specific thing, though. My guess is that you could make more progress by thinking about what sort of structural elements you would like to have, rather than looking at what you have right now.

Christopher
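To make the point about reusing an existing XML parser concrete, here is a minimal sketch in Python (any language with an XML parser would do just as well). The input format is purely hypothetical: the element names article, section, title, para and emph are assumptions made up for this illustration, not OOo's or DocBook's actual vocabulary, and only a handful of TeX special characters are escaped.

```python
import xml.etree.ElementTree as ET

# Hypothetical input format: <article>/<section>/<title>/<para>/<emph>
# are invented for this sketch; a real journal format would differ.
SAMPLE = (
    "<article><section><title>Intro</title>"
    "<para>A <emph>very</emph> short 100% test.</para>"
    "</section></article>"
)

# Escape a few characters that TeX treats specially (not exhaustive).
TEX_ESCAPES = str.maketrans({
    "\\": "\\backslash ",
    "%": "\\%", "#": "\\#", "$": "\\$", "&": "\\&",
    "{": "\\{", "}": "\\}",
})


def esc(text):
    """Escape plain text; ElementTree uses None for missing text."""
    return (text or "").translate(TEX_ESCAPES)


def inline(elem):
    """Render the mixed content of an element, mapping <emph> to {\\em ...}."""
    parts = [esc(elem.text)]
    for child in elem:
        body = inline(child)
        parts.append("{\\em " + body + "}" if child.tag == "emph" else body)
        parts.append(esc(child.tail))
    return "".join(parts)


def to_context(xml_text):
    """Turn the hypothetical article format into a small ConTeXt source."""
    root = ET.fromstring(xml_text)
    out = ["\\starttext"]
    for section in root.findall("section"):
        title = section.find("title")
        heading = inline(title) if title is not None else ""
        out.append("\\section{" + heading + "}")
        for para in section.findall("para"):
            out.append("")  # a blank line starts a new paragraph
            out.append(inline(para))
    out.append("")
    out.append("\\stoptext")
    return "\n".join(out)


if __name__ == "__main__":
    print(to_context(SAMPLE))
```

The parsing itself is one call; all the remaining work is deciding what each element should mean, which is why a small, strictly defined structure pays off.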
Re: [NTG-context] DOC/RTF to ConTeXt via XML
> No need for rtf. That would lose lots of information anyway, wouldn't it?

RTF can capture everything that .doc can (MS update it every time they rev the .doc format), and it has the advantage that it is defined in a spec with a grammar, which means that importing routines (like the one in OO.o) tend to be better than for the binary .doc format. So I would usually use .rtf as the Save As... from Word, rather than relying on OO.o's reverse engineering of the .doc format. Others' experiences may vary, of course, and perhaps I do an injustice to OO.o's Word imports, which have certainly improved. But RTF is a fairly safe bet, and additionally it is 'human readable' so that helps debugging.

> \startHans converting open office xml is not always easy; stay away from tabs and use high level constructs as much as possible \stopHans

I would add to this: make sure you use either OO.o 1.1.5 or a 2.0 Beta, since earlier versions used a file format which was a lot trickier to post-process (problems with conflating styles into paragraph formats).

> Once I get a sane xml file (this seems to be the biggest problem) what is the best tool to convert this to ConTeXt?

Well, you might not need to. Remember that ConTeXt can process XML natively now, which is why I suggested you look at the DocBook-in-ConTeXt project, which uses this feature. You wouldn't necessarily have to use the DocBook standard, but you could use the principles of that project to define a nice output from your own (simple) brand of XML.

Duncan
Re: [NTG-context] DOC/RTF to ConTeXt via XML
Duncan Hothersall wrote:

> RTF can capture everything that .doc can (MS update it every time they rev the .doc format), and it has the advantage that it is defined in a spec with a grammar, which means that importing routines (like the one

Oh, yes, the RTF spec. It really makes you wonder what Microsoft employees understand by the word “spec.” Word breaks almost every single rule in that spec and has done so for ages:

“The LetterSequence is made up of lowercase alphabetic characters (a-z). RTF is case sensitive. The following Word 97-2000 keywords do not currently follow the requirement that keywords may not contain any uppercase alphabetic characters. ...”

But I should be happy that these violations are actually documented.

> in OO.o) tend to be better than for the binary .doc format. So I would

Okay; I did not know that whatever Microsoft currently calls RTF is actually able to save all Word files losslessly. (I am in the lucky position not to have any Word files to convert.) Makes me wonder if there really is any need for an XML step in between. Can OOo convert RTF to XML without user intervention, such as clicking somewhere with a mouse? Maybe rtf2fo.com, http://www.infinity-loop.de/products/upcast/, or http://sourceforge.net/projects/majix/ are good alternatives for this step? (I have never used any of them.)

> which have certainly improved. But RTF is a fairly safe bet, and additionally it is 'human readable' so that helps debugging.

Asking a human to read RTF is certainly inhuman. :-) But there is another advantage of using RTF: Authors can use almost any word processor they want. :-)

> Well you might not need to - remember that ConTeXt can process XML natively now, which is why I suggested you look at the

But unless I'm mistaken, this is based on a streaming model, which has its advantages, but also disadvantages. So, the question is whether the xml format is close enough to the order in which ConTeXt would like to get the bits and pieces. Since the format has not been defined yet, this question should be kept in mind.

Christopher
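The lowercase-keyword rule quoted from the RTF spec above is easy to check mechanically. The following Python sketch lists control words that contain uppercase letters. It is a rough tokenizer written for this illustration, not a full RTF parser: it skips control symbols such as an escaped backslash, but it does not treat binary data specially. The sample string is made up, and \clFitText is used on the assumption that it is among the mixed-case keywords the spec lists.

```python
import re

# A control word is a backslash followed by letters and an optional numeric
# parameter; any other backslash sequence is treated as a control symbol
# (for example an escaped backslash or brace) and skipped.
RTF_TOKEN = re.compile(r"\\(?:([A-Za-z]+)(-?\d+)?|.)", re.DOTALL)

# Made-up sample input; \clFitText is assumed to be one of the
# mixed-case keywords mentioned by the spec.
SAMPLE = r"{\rtf1\ansi\trowd\clFitText\cellx1000 a\\Path\cell\row}"


def uppercase_keywords(rtf_text):
    """Return control words that break the lowercase-only rule, in order."""
    found = []
    for match in RTF_TOKEN.finditer(rtf_text):
        word = match.group(1)
        if word is None:
            continue  # control symbol, not a control word
        if word != word.lower() and word not in found:
            found.append(word)
    return found


if __name__ == "__main__":
    print(uppercase_keywords(SAMPLE))
```

Note that the escaped backslash in front of "Path" is consumed as a control symbol, so the plain text after it is not mistaken for a keyword.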
Re: [NTG-context] DOC/RTF to ConTeXt via XML
Duncan Hothersall wrote:

>> Question: Is it possible to design a doc or rtf template that Open Office can convert to a sane, consistent xml format?
>
> OpenOffice.org does allow you to attach an XSLT stylesheet to an export process which therefore allows you to do a (limited) transformation from the visual markup which is its native format to a more structured one

Why „limited“? Complicated things are just, well, a bit complicated to achieve. It is certainly possible to get a structured document from, say, an average xhtml file. I would prefer not to write that code, though. It would be rather boring and full of hard-to-read special cases.

> which you would need. But the biggest challenge is that all wordprocessors are designed for visual editing, meaning that there are, for example, 15 or so different ways to get a bulleted list in Word, creating 15 or so different RTF constructs, and coping with this can be a nightmare.

Yes, it can. (Although RTF is completely unrelated to this problem, since OOo would read the Word file. And the OOo step greatly simplifies the problem, since iirc the OOo format has just one or maybe two ways of saving bulleted lists. Or were you referring to different bullets?) The stricter your rules for the authors are, the easier it is to write the required xslt program. If your authors expect to be able to write chapter headers by manually switching to a font in the range of 20 to 24 pt and adding a number in front, you've got a hell of a coding session in front of you. If, otoh, you take the dictatorial approach of telling them in advance that manual font changes (maybe apart from pseudo-italics and pseudo-bold, which will be mapped to \em in the end) will simply be ignored, your code will be much easier but you may have a problem with the authors.

> The FO approach (Paul Tremblay's focus) is one way to process XML to paginated output, but there are many others. Personally I don't like the FO approach, for a variety of reasons, but I'm sure others have had success with it. But you should also explore DocBook-in-ConTeXt, which uses ConTeXt's native XML processing capabilities. And don't rule out

The advantage of using DocBook is that you get a very rich set of capabilities. The disadvantage can be described in almost the same words, plus, as I said before, DocBook is one of the most verbose formats in common use. If you only use the format as an intermediate step, that is irrelevant, but if your authors will send in files that way, it is not.

> using a separate scripting language to convert XML into ConTeXt as a batch process, since that will give you the ultimate flexibility in accessing all of ConTeXt's abilities.

Personally, I'd use xslt for that. Navigating the xml tree is extremely easy and writing out text instead of xml is not really a problem.

>> Question: Does the entire journal have to be programmed in xml or can ConTeXt process xml locally? For example, I may have my own article done in ConTeXt mixed with other articles done in rtf => xml.
>
> You can just put XML into \startXMLdata ... \stopXMLdata blocks. I do this for MathML processing within a larger ConTeXt document.

I'd approach Idris' problem the other way round: Transform the xml files to ConTeXt and leave the ConTeXt files as is. Then, texexec the whole thing.

>> Any other advice (and/or pitfalls to watch for) would be appreciated. This sounds very promising!
>
> Horses for courses. It's possible to get sucked into things like an FO implementation or an XML conversion and find that you have spent months perfecting it and it only shaves half an hour off your production time!

Amen. Also, don't limit your authors to Word. Offering Word is obviously a requirement, but if you go the way through OOo, there would be no point in not offering an OOo template file. If you are using a standard xml format, such as (a subset of) DocBook or TEI, you probably should accept articles in that format, too. And, of course, ConTeXt.

Christopher
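The "dictatorial" rule described above (manual font changes are ignored, while pseudo-italics and pseudo-bold both become \em) is short enough to sketch. The example below uses Python only for illustration; in an xslt stylesheet the same rule would be a pair of templates. The data model, a list of (text, properties) runs with keys such as italic, bold, size and font, is a hypothetical simplification and not the actual OOo file structure.

```python
def runs_to_context(runs):
    """Convert (text, properties) runs into ConTeXt inline markup.

    Hypothetical data model: properties is a dict that may contain
    'italic', 'bold', 'size' or 'font'. Only italic and bold are honoured,
    and both map to {\\em ...}; every other manual override is dropped.
    """
    out = []
    for text, props in runs:
        if props.get("italic") or props.get("bold"):
            out.append("{\\em " + text + "}")
        else:
            out.append(text)  # font name and size changes are ignored on purpose
    return "".join(out)


# Made-up paragraph with a manual size change, an italic run and a font change.
SAMPLE_RUNS = [
    ("A ", {}),
    ("big", {"size": 22}),
    (" and ", {}),
    ("slanted", {"italic": True}),
    (" word", {"font": "Arial"}),
]

if __name__ == "__main__":
    print(runs_to_context(SAMPLE_RUNS))
```

The code stays this small only because the rule is strict; recognising a 22 pt run as a chapter heading would need exactly the kind of special-casing the message warns about.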
Re: [NTG-context] DOC/RTF to ConTeXt via XML
Slightly OT, sorry:

>> OpenOffice.org does allow you to attach an XSLT stylesheet to an export process which therefore allows you to do a (limited) transformation from the visual markup which is its native format to a more structured one
>
> Why „limited“?

Well, XSLT seems to have been designed, and certainly tends to be implemented, as a tool for simple transformations of small XML chunks. Obviously complex transformations can be constructed from a bunch of simple transformations, but there comes a point when you should really just use a better tool - though these tend to cost serious money (e.g. OmniMark). Also, most XSLT implementations use the DOM model, which is fine for a 50 KB file but will be incredibly resource-hungry if you're processing files of 5 MB. At that point you want a streaming model, and for a streaming model you want a better suited language than XSLT. As I say, horses for courses. For article-length pieces and simple transforms, XSLT might suffice.

> Also, don't limit your authors to Word. Offering Word is obviously a requirement, but if you go the way through OOo, there would be no point in not offering an OOo template file. If you are using a standard xml format, such as (a subset of) DocBook or TEI, you probably should accept articles in that format, too. And, of course, ConTeXt.

Absolutely; particularly if you can offer authors an incentive or direct benefit from adopting OO.o, such as speed of turnaround of proofs, etc.
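For readers unfamiliar with the distinction between a tree (DOM-style) model and a streaming model, the following Python sketch shows the streaming side using the standard library's iterparse. It counts words per paragraph and discards each element once it has been handled, so the whole document never has to be held as one tree. The tag name para and the sample document are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document; <para> and <emph> are hypothetical tag names.
SAMPLE = b"<doc><para>one two</para><para>three <emph>four</emph> five</para></doc>"


def count_words_streaming(source, tag="para"):
    """Count words in every <tag> element while reading the input incrementally.

    'source' is a filename or a binary file-like object. Each matching
    element is cleared after use, which releases its text and children.
    (The emptied elements stay attached to their parent, so memory use is
    much lower than for a full tree but not strictly constant.)
    """
    total = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            total += len("".join(elem.itertext()).split())
            elem.clear()
    return total


if __name__ == "__main__":
    print(count_words_streaming(io.BytesIO(SAMPLE)))
```

This works well when each unit can be processed on its own. A task that needs information from later in the file, such as resolving forward references, would have to buffer that information anyway, which is where a tree model becomes the simpler choice.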
Re: [NTG-context] DOC/RTF to ConTeXt via XML
Duncan Hothersall wrote:

> Well, XSLT seems to have been designed, and certainly tends to be implemented, as a tool for simple transformations of small XML chunks.

No, xslt is a tool for arbitrary xml -> xml conversions (and a little more than that). With a good implementation (say, saxon), working with moderately large trees is pretty fast. The stylesheet is actually compiled before running.

> Obviously complex transformations can be constructed from a bunch of simple transformations, but there comes a point when you should really

Just about any programming language gives you simple operations to build whatever you want from.

> just use a better tool - though these tend to cost serious money (e.g.

„Better“ depends on your task at hand.

> OmniMark). Also, most XSLT implementations use the DOM model, which is

XSLT uses a DOM model, which is different from the W3C DOM model.

> fine for a 50 KB file but will be incredibly resource-hungry if you're processing files of 5 MB. At that point you want a streaming model, and

That depends on what you want to do with your data. For many of my needs, a streaming model simply wouldn't work without keeping lots of information (to be processed later) in memory, defeating the model. I have found splitting my data into files that form conceptual units to be a good way, both for editing the files and for turnaround times. (I am using Makefiles, so the granularity of finding unchanged items for me is the file.) We are talking about almost 15 MB here, which I regard as quite a lot, considering it is almost pure text.

Again, I don't mind using something else on XML data. I'm doing it myself. It all depends on what you want to do. In the case of transforming xml to ConTeXt, I would go for an xslt implementation, but ymmv. After all, the choice of tools always depends on many factors, including familiarity. (I continued using perl instead of ruby for ages, until recently, for that reason.)

> for a streaming model you want a better suited language than XSLT. As I say, horses for courses. For article-length pieces and simple transforms, XSLT might suffice.

For number crunching, xslt is certainly inadequate. Transforming books of average length (say, 300-500 pages) is certainly doable, although I would go for a chapter-by-chapter transformation, especially considering that we are talking about a process where cross-references etc. are going to be handled later in the chain. But I thought we were talking about article-length pieces anyway?

Christopher
RE: [NTG-context] DOC/RTF to ConTeXt via XML
Hi Christopher, Duncan, Hans, and Adam,

Thank you so much for your detailed comments and suggestions. Again, I'm completely new to xml and feel like a fish out of water. OTOH I spend so much time just manually extracting text (with innumerable transliteration diacritics) and then copying-pasting to WinEDT that I am willing to explore the xml approach if it can be made sane enough...

= Original Message From Christopher Creutzig [EMAIL PROTECTED] =

>> Duncan Hothersall wrote: Well, XSLT seems to have been designed, and certainly tends to be implemented, as a tool for simple transformations of small XML chunks.
>
> No, xslt is a tool for arbitrary xml -> xml conversions (and a little more than that).

Ok, you guys have lost me now-) Maybe the best thing to do is try something practical: take an average word article and see what's involved in converting it to ConTeXt. From what I gather so far the process goes something like

doc => rtf
rtf => OO.o
OO.o => xml

But here things get dicey because

\startHans converting open office xml is not always easy; stay away from tabs and use high level constructs as much as possible \stopHans

Question: Will a proper doc (or OO.o) template solve this problem or is this a post-OO.o-processing problem no matter what I do beforehand?

From this discussion it seems that I (as an xml ignoramus) would be better off converting to ConTeXt code rather than processing pure xml blocks (but maybe I'm wrong).

Once I get a sane xml file (this seems to be the biggest problem) what is the best tool to convert this to ConTeXt?

We are all extremely busy, of course, but if anyone finds this interesting I can send a sample doc article from my journal. Maybe we can do a MyWay or something to document this process for ourselves and others, as well as find the most practical approach to creating a sane workflow. Besides, this kind of project seems to be exactly the kind of thing to illustrate the full power of ConTeXt.

This is a mid-term project so no urgency (I'll keep copying and pasting for now-)

Thanks again to you all for your advice.

Best
Idris

Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523
RE: [NTG-context] DOC/RTF to ConTeXt via XML
Hi Duncan,

I know little about xml and virtually nothing about Word (except that it's crap) so please forgive me if this is a stupid or clueless question-)

> But you should also explore DocBook-in-ConTeXt, which uses ConTeXt's native XML processing capabilities.

Is it possible to create a Word template that is isomorphic with a DocBook format? Adam (privately) suggested hiring someone to write a structured format for authors. Is that where docbook comes in?

Basically, authors in the humanities use Word and it's virtually a lost cause getting them to switch to anything else, even free tools like OO.o (let alone ConTeXt). It would have to be something where I could do word => docbook => ConTeXt.

Sigh

Best
Idris

Professor Idris Samawi Hamid
Department of Philosophy
Colorado State University
Fort Collins, CO 80523
Re: [NTG-context] DOC/RTF to ConTeXt via XML
Idris Samawi Hamid said this at Tue, 27 Sep 2005 09:10:27 -0600:

> Adam (privately) suggested hiring someone to write a structured format for authors. Is that where docbook comes in?

Ah, sorry about that. I meant you *could* hire someone to design a format, but the bigger point was that it would be rather futile without a user-level authoring tool backing it up!

--
Adam T. Lindsay, Computing Dept.     [EMAIL PROTECTED]
Lancaster University, InfoLab21      +44(0)1524/510.514
Lancaster, LA1 4WA, UK               Fax: +44(0)1524/510.492