Re: [WSG] Making PDF and Word files accessible

2005-06-04 Thread Anders Nawroth

I have good experience with Tidy:
http://tidy.sourceforge.net/

/Anders

George S. Williams wrote:


On Fri, 2005-06-03 at 06:36, Angela Galvin wrote:

 

Secondly, with the Word documents, is there an easier way to convert 
them to HTML? 
   



I use an open source program, antiword, to convert the Word docs to text
and then just add the necessary markup. (And, of course, edit out the
Word weirdness!) I've found this to be about 5 times faster than cut and
paste.

This is on a Linux box, but a Windows version of antiword seems to be
available at:

http://www.informatik.uni-frankfurt.de/~markus/antiword/

George

**
The discussion list for  http://webstandardsgroup.org/

See http://webstandardsgroup.org/mail/guidelines.cfm
for some hints on posting to the list & getting help
**




 





Re: [WSG] Making PDF and Word files accessible

2005-06-04 Thread Hope Stewart
Hi Angela,

On 3/6/05 8:36 PM, Angela Galvin [EMAIL PROTECTED] wrote:

 Secondly, with the Word documents, is there an easier way to convert
 them to HTML? At the moment I am saving as HTML from Word, taking them
 into Dreamweaver and using 'Clean up Word HTML'. After that I use 'Find
 and replace' to strip out all <font>, <span> and attributes from <p>
 such as class and style. At which point I still have to mark up the
 document with proper headings, bulleted lists, etc. A little
 time-consuming and fiddly to say the least!

I see that your email was sent using Apple Mail. Assuming you are also using
Dreamweaver on a Mac, you can try what I do: cut & paste the Word doc into
AppleWorks. Then either save the AppleWorks doc as HTML or cut & paste from
AppleWorks into Dreamweaver. AppleWorks strips out all the Word rubbish.

HTH,
Hope Stewart




Re: [WSG] Making PDF and Word files accessible

2005-06-04 Thread Zulema

Hope Stewart wrote:

Hi Angela,

I see that your email was sent using Apple Mail. Assuming you are also using
Dreamweaver on a Mac, you can try what I do: cut & paste the Word doc into
AppleWorks. Then either save the AppleWorks doc as HTML or cut & paste from
AppleWorks into Dreamweaver. AppleWorks strips out all the Word rubbish.

HTH,
Hope Stewart

This sounds familiar, oh yeah! When doing this on a PC, I found out 
just yesterday that cutting and pasting Word text into Notepad, THEN 
cutting and pasting from Notepad into Dreamweaver, seems to work for me. 
/But/ it removes those pesky (tm) and (r) symbols and sometimes curly 
quotes.


The link sent by heretic [http://textism.com/wordcleaner/] works 
wonders!! I just tried it on a one-page Word doc. I then ran the same 
file through TidyGUI; it didn't add the <ul> and <li> elements where 
needed, and it left some artifacts:


   <p><o:p>
   <p>&nbsp;</p>
   </o:p></p>
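
Leftovers of this kind can also be stripped with a small script. The following is only a rough sketch in Python, assuming the artifacts are Word's <o:p> tags plus paragraphs that contain nothing but &nbsp;. The function name is made up for illustration and is not part of Tidy or any other tool mentioned in this thread:

```python
import re

# Matches a paragraph that holds only whitespace and/or &nbsp; entities.
EMPTY_P = re.compile(r"<p\b[^>]*>(?:\s|&nbsp;)*</p>", re.IGNORECASE)


def strip_word_artifacts(markup):
    """Remove <o:p> tags and empty paragraphs left over from Word HTML."""
    # Drop opening and closing <o:p> tags, keeping any text between them.
    markup = re.sub(r"</?o:p\b[^>]*>", "", markup, flags=re.IGNORECASE)
    # Removing one empty paragraph can leave its parent empty too,
    # so repeat until nothing changes.
    previous = None
    while previous != markup:
        previous = markup
        markup = EMPTY_P.sub("", markup)
    return markup


if __name__ == "__main__":
    sample = "<p><o:p>\n<p>&nbsp;</p>\n</o:p></p>\n<p>Real text</p>"
    print(strip_word_artifacts(sample).strip())
```

Regular expressions are fine for this sort of predictable debris, but they are not an HTML parser, so validate the result afterwards.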


It's nice to know we have tools at our disposal to help make our lives 
easier (and cut the time spent coding)! :D


regards,

Z u l e m a  O r t i z
w e b  d e s i g n e r
email : [EMAIL PROTECTED]
website : http://zoblue.com/
weblog : http://blog.zoblue.com/
browser : http://getfirefox.com/ 






Re: [WSG] Making PDF and Word files accessible

2005-06-03 Thread Mary Krieger

At 05:36 AM 6/3/2005, you wrote:
<snip>
Secondly, with the Word documents, is there an easier way to convert 
them to HTML? At the moment I am saving as HTML from Word, taking them 
into Dreamweaver and using 'Clean up Word HTML'. After that I use 'Find 
and replace' to strip out all <font>, <span> and attributes from <p> such 
as class and style. At which point I still have to mark up the 
document with proper headings, bulleted lists, etc. A little 
time-consuming and fiddly to say the least!


Am I doing this right or is there another way to make these files 
accessible? (and make my life easier, after all it is Friday :-) )


Angela


Angela Galvin

Worth Media
15-17 Middle Street
Brighton BN1 1AL
T: 01273 201149
F: 01273 710004

-

www.worthmedia.net


I would skip the part where you save from Word into HTML. Why give yourself 
the grief?


If you copy and paste the text into the 'content' part of your standard 
page, the line breaks will show you where the paragraphs and headings are. 
I'm using Homesite, so I just select and repeat the similar code (first <p>, 
then <h1>, <h2>, etc.) from one end of the document to the other.


Generally the only things missing then are the use of bold and italic 
within the text (not part of the heading structure) and any tables or lists 
within the text.


Validate to catch any stray weirdness and on to the next.

Perhaps not the most interesting type of web coding but listening to music 
of your taste, you can work up a good rhythm and code a whack of stuff 
relatively cleanly. Not a bad way to spend a Friday.


Mary Krieger
Winnipeg Manitoba Canada
http://www.mts.net/~mkrieger




Re: [WSG] Making PDF and Word files accessible

2005-06-03 Thread designer

Angela Galvin wrote:

Hello all,

I have the task of adding a bunch of PDF and Word files to a web site I 
work on, that currently conforms to WAI Priority 1 guidelines.


My first question is that if I convert the PDF files to HTML to make 
them more accessible, am I right in thinking that this is only half my 
job done? If the original file wasn't marked up correctly in the first 
place before being saved as PDF (with headings, etc) does this mean that 
it's still not really accessible?


Secondly, with the Word documents, is there an easier way to convert 
them to HTML? At the moment I am saving as HTML from Word, taking them 
into Dreamweaver and using 'Clean up Word HTML'. After that I use 'Find 
and replace' to strip out all <font>, <span> and attributes from <p> 
such as class and style. At which point I still have to mark up the 
document with proper headings, bulleted lists, etc. A little 
time-consuming and fiddly to say the least!


Am I doing this right or is there another way to make these files 
accessible? (and make my life easier, after all it is Friday :-) )


Angela


Hi Angela,

No easy way, but the most reliable is to cut and paste from Word into 
the design view of Dreamweaver. Using the design view ensures that all 
the spacing is preserved and indeed, all the quotes etc are presented as 
the correct codes.


I didn't know this myself until recently, when someone on this list told 
me about it.


Hope this helps,


--
Bob McClelland
Cornwall (U.K.)
www.gwelanmor-internet.co.uk



Re: [WSG] Making PDF and Word files accessible

2005-06-03 Thread George S. Williams
On Fri, 2005-06-03 at 06:36, Angela Galvin wrote:

 
 Secondly, with the Word documents, is there an easier way to convert 
 them to HTML? 

I use an open source program, antiword, to convert the Word docs to text
and then just add the necessary markup. (And, of course, edit out the
Word weirdness!) I've found this to be about 5 times faster than cut and
paste.

This is on a Linux box, but a Windows version of antiword seems to be
available at:

http://www.informatik.uni-frankfurt.de/~markus/antiword/
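
For anyone wanting to script that first step, here is a minimal sketch in Python. It assumes antiword is installed and on the PATH, and relies on antiword printing the document as plain text to standard output when given a file name. The function names and the file name are hypothetical:

```python
import shutil
import subprocess


def antiword_command(doc_path):
    """Build the command line; by default antiword writes text to stdout."""
    return ["antiword", doc_path]


def doc_to_text(doc_path):
    """Return the plain-text content of a Word .doc file via antiword."""
    if shutil.which("antiword") is None:
        raise FileNotFoundError("antiword is not installed or not on PATH")
    result = subprocess.run(
        antiword_command(doc_path),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        check=True,           # raise if antiword exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # "report.doc" is a placeholder file name.
    print(doc_to_text("report.doc"))
```

The markup still has to be added by hand (or by further scripting) afterwards.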

George




RE: [WSG] Making PDF and Word files accessible

2005-06-03 Thread Jona Decker
Mary Krieger wrote:


If you copy and paste the text into the 'content' part of your standard
page, the line breaks will show you where the paragraphs and headings
are. 
I'm using Homesite, so I just select and repeat the similar code (first
<p>, then <h1>, <h2>, etc.) from one end of the document to the other.


Depending on your version of MS Office, copying from displayed text may
bring in a bunch of inline styles. Yes, even pasting into a text
document! Ack!

So, I usually save Word files as plain text (no line breaks) first.

Next I use a good text editor with regular expression searching (I use
TextPad; there are many others) to wrap text chunks in paragraph tags
(e.g. ^ is the beginning of a line, $ is the end, \n is a line break,
etc.).

And last, I do a search and replace for weird apostrophes, quotes,
dashes, etc...
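
The same two steps can be scripted. What follows is only a sketch in Python, not the TextPad setup described above: it assumes the plain-text export puts one paragraph on each line, and the character table is an illustrative guess at the usual "weird" characters. The function and variable names are invented for the example:

```python
import html
import re

# Typographic characters Word commonly inserts, mapped to HTML entities.
# This list is an assumption; extend it to suit your own documents.
REPLACEMENTS = {
    "\u2018": "&lsquo;",  # left single quote
    "\u2019": "&rsquo;",  # right single quote / apostrophe
    "\u201c": "&ldquo;",  # left double quote
    "\u201d": "&rdquo;",  # right double quote
    "\u2013": "&ndash;",  # en dash
    "\u2014": "&mdash;",  # em dash
}


def text_to_paragraphs(text):
    """Wrap every non-empty line in <p>...</p> and replace Word characters."""
    # Escape &, < and > first so the entities added below are not re-escaped.
    text = html.escape(text, quote=False)
    for char, entity in REPLACEMENTS.items():
        text = text.replace(char, entity)
    # With re.MULTILINE, ^ and $ match at the start and end of each line.
    return re.sub(r"^[ \t]*(\S.*?)[ \t]*$", r"<p>\1</p>", text,
                  flags=re.MULTILINE)


if __name__ == "__main__":
    print(text_to_paragraphs("First \u201cline\u201d\n\nA & B\n"))
```

Headings still need to be picked out by hand, since plain text carries no hint of which lines were headings.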


Generally the only things missing then are the use of bold and italic
within the text (not part of the heading structure) and any tables or
lists within the text.


If you save as text, you'll still have tabs and funky characters for
lists, which can also be regular expression searched and replaced with
the right tags. I actually create a batch action for each contributor
role that regularly sends me Word documents, which runs most of the
standard searches one after another (and in the right order, which I can
screw up if it's been a while) with the press of a hotkey. This allows me
to include foreign characters for certain contributors, em dashes for
others, different list designators for Macs vs. PCs, etc.
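
As a rough illustration of the list step, here is a Python sketch that turns runs of bullet lines into an unordered list. The set of bullet markers is an assumption (the exact characters depend on the Word version and platform), and the names are made up for the example:

```python
import re

# A bullet line: optional indent, one marker character, then spaces or a tab.
# Assumed markers: bullet, middle dot, asterisk, hyphen.
BULLET = re.compile(r"^[ \t]*[\u2022\u00b7*-][ \t]+(.*\S)[ \t]*$")


def convert_lists(lines):
    """Wrap consecutive bullet lines in <ul>/<li>; pass other lines through."""
    out = []
    in_list = False
    for line in lines:
        match = BULLET.match(line)
        if match:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append(f"  <li>{match.group(1)}</li>")
        else:
            if in_list:
                out.append("</ul>")
                in_list = False
            out.append(line)
    if in_list:  # close a list that runs to the end of the input
        out.append("</ul>")
    return out


if __name__ == "__main__":
    sample = ["Intro", "\u2022\tFirst", "\u2022\tSecond", "Outro"]
    print("\n".join(convert_lists(sample)))
```

A per-contributor setup like the one described could keep a different marker pattern for each source and run the matching one.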

The newest Acrobat (7 Pro) also exports to plain text quite
effectively...not just RTF. It ostensibly offers an html w/css option,
but uses inline styles extensively, so the plain text route is more
efficient.

Jona Decker
Madison, WI USA




Re: [WSG] Making PDF and Word files accessible

2005-06-03 Thread heretic
Hi there,

 My first question is that if I convert the PDF files to HTML to make
 them more accessible, am I right in thinking that this is only half my
 job done? If the original file wasn't marked up correctly in the first
 place before being saved as PDF (with headings, etc) does this mean
 that it's still not really accessible?

As an extremely broad generalisation, yes - bad source gets bad
output. However every case is different so you'll have to check your
resulting (X)HTML to make sure it's standards compliant/accessible.
 
 Secondly, with the Word documents, is there an easier way to convert
 them to HTML? At the moment I am saving as HTML from Word, taking them
 into Dreamweaver and using 'Clean up Word HTML'. 

Try http://textism.com/wordcleaner/ - I've found it's pretty good,
esp. in conjunction with the DW tricks you mention.

If you have a large amount of this sort of work, you might like to
invest in http://cita.disability.uiuc.edu/software/office/

cheers

h

-- 
--- http://www.200ok.com.au/
--- The future has arrived; it's just not 
--- evenly distributed. - William Gibson