Dear Sigrid and Others:
When you scan a page of text, what you get, as already explained, is an
image.
Going a bit deeper, this is a matrix of dots, not computer text.
Computer text is essentially a string of numeric codes, usually written
in hexadecimal. Each character in a code page is associated with a
specific hex number. This dictates the character, but not the font, size,
colour, style or associated language. All of these attributes are stored
by other non-printable codes.
As can be seen, this is radically different from a raster image.
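
As a rough illustration of that difference, here is a minimal Python sketch (assuming Python 3.8 or later). The word "Scan" and the tiny 5x5 grid for the letter T are made up for the example; a real scan would be a far larger grid, with no character codes in it at all.

```python
# Text is stored as numeric codes: one code per character.
text = "Scan"
code_points = [f"U+{ord(ch):04X}" for ch in text]   # character codes in hex
utf8_hex = text.encode("utf-8").hex(" ")             # the bytes as stored on disk

# A raster image is only a grid of dots (1 = ink, 0 = paper).
# This hand-drawn "T" is a hypothetical stand-in for scanner output.
raster_t = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

print(code_points)
print(utf8_hex)
for row in raster_t:
    print("".join("#" if dot else "." for dot in row))
```

Nothing in the grid says "this is a T"; working that out from the dots is the job of OCR.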
OCR:
However, if that raster image (assuming it is of adequate graphic
quality, size and resolution) is then passed through OCR (Optical
Character Recognition) software, the result is a text file in one of
several formats, which can normally all be opened in Writer, regardless
of the operating platform. OCR software is a separate application and
may be rather processor demanding at times. The resulting text should
then be correctly language coded in Writer, spell checked and proofread.
(Often, a device sold to support Windows includes separate OCR software
that is really not part of the Windows driver, as it may appear. It is
simply linked to the driver for user convenience.
In Linux, the software may be available, but is likely not linked in the
same way, depending on how the device back end has been written.)
Linux is more complicated than Windows, but also far more powerful and
versatile.
Versatility and complexity generally go hand in hand everywhere.
In short, OCR is like a fast, automated typist - fast, but at times
stupid. Accuracy can be anywhere from 0% to 100%, depending on image
quality, content and many other factors.
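
To put a number on that accuracy, one rough approach is to type out a known page by hand and compare it with what the OCR produced. The sketch below uses Python's standard difflib module; the function name character_accuracy and the sample strings are made up for illustration, and the score is a similarity ratio rather than a formal OCR error rate.

```python
from difflib import SequenceMatcher


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return a rough similarity score between 0.0 and 100.0 percent."""
    if not reference and not ocr_output:
        return 100.0  # two empty texts agree completely
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    return 100.0 * matcher.ratio()


# Hypothetical example: typical OCR slips such as "1" for "i" and "vv" for "w".
reference = "The quick brown fox"
ocr_output = "Tne qu1ck brovvn fox"
print(f"{character_accuracy(reference, ocr_output):.1f}%")
```

A score near 100 means little proofreading is needed; a low score usually means the scan itself should be redone at better quality.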
My experience tells me that this is an area where you get what you pay for.
Although there are a couple of Open Source OCR applications in Linux (at
least for Fedora 12+), I have not yet seen a GUI front end for them.
Generally they want to be run from a terminal command line, which can
make their use very unwieldy. For that reason, I generally do OCR with
commercial paid software on a Windows platform.
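
For anyone who does want to try the command-line route, a small wrapper script can take some of the unwieldiness out of it. The following is only a sketch, assuming gocr is installed and accepts the -i (input file) and -o (output file) options described in its man page, and that the scan has been saved in a format gocr can read, such as PNM. The function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_gocr_command(image: Path, output: Path) -> list[str]:
    """Build the gocr command line (assumes gocr's -i and -o options)."""
    return ["gocr", "-i", str(image), "-o", str(output)]


def ocr_image(image: Path) -> Path:
    """Run gocr on one scanned image and return the path of the text file."""
    if shutil.which("gocr") is None:
        raise RuntimeError("gocr was not found on the PATH; install it first")
    output = image.with_suffix(".txt")
    subprocess.run(build_gocr_command(image, output), check=True)
    return output


# Hypothetical usage: ocr_image(Path("page1.pnm")) would write page1.txt,
# which can then be opened in Writer for spell checking and proofreading.
```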
My main machine can boot either Fedora 13 x86-64 or Windows XP, as
selected in the BIOS (firmware).
SANE (Scanner Access Now Easy) is a common front end for scanners in
Linux, but each model of scanner or multifunction device must have its
own SANE back end. Since there are so many models around, there has not
been the programmers' time available to create many of the back ends.
Those models which are supported (fully or partially) are listed online
in a table on the SANE website.
If one installs a back end for a composite device, such as a scanner,
printer and/or fax, that back end may support all or any part of the
existing features of the hardware and firmware in the machine.
Each type of feature will normally work with the front end appropriate
to the type of device and the desktop it is intended for. Hence the
printer in such a device is normally supported by CUPS (Common Unix
Printing System), while the scanner (and possibly fax) in the same
machine is supported by the front end for its own type.
Generally, the more expensive the hardware is/was, the more likely it is
to be supported.
This is for several plausible reasons.
If a user has an expensive piece of hardware, the high price he paid for
it will likely make him (or his company) do what is necessary to ensure
that it is supported, whether by choice of a more expensive model to
begin with or by having his own programmer do the programming work,
which then normally gets shared all over.
On the other hand, a cheap device that has been produced by the
zillions often uses proprietary protocols to control it.
Due to the low cost, there is less general interest in struggling with
the proprietary protocol, and it may be a copyright violation to reverse
engineer it. The amount of time and specialized work required to reverse
engineer the device for the information needed to write a back end is
also much greater than in the case of an expensive model, which uses a
standard protocol that has likely already been reverse engineered and
needs only minimal patching in the programming.
Also, since the owners paid little for the device, they are less likely
to be willing and able to help pay for or work on support for that device.
Most of this information is public and on various websites, but it is up
to each user to do his own research.
Added to this, some of the extra software may be open source, but hosted
on paid repositories. The fee helps cover the high cost of hosting, even
though the software itself is open source.
Use and cost of these repositories will be described on their web pages.
Some of them may also offer compiled RPMs or DEBs for download to
configure your system to use their facilities.
Others may simply provide a tarball for you to compile according to
your own individual needs.
Finally, I hope this can help keep SANE from appearing /in/sane!
On 08/07/2010 12:30 PM, Sigrid Carrera wrote:
Hi,
2010/7/8 Gordon<[email protected]>:
On 08/07/2010 17:08, Marcello Romani wrote:
When the scan is complete, you'll have an ordinary odt document with an
image in it.
Regardless of whether the image has been scanned or inserted by hand, it
will be editable with the usual limitations of OOo (brightness,
contrast, tone, etc.), which have nothing to do with scanning of course.
I think we're drifting off here!
What I am really after is this.
If I connect to the Officejet in Windows and put a document on the scanner,
there is an option to scan the document in as a Word document, not as an
image file, and the resultant file can be opened and edited just like a
normal Word document.
Now unfortunately, the Linux driver for this all-in-one machine doesn't
allow that function so I have had to install a third-party scanning utility
such as Xsane. I haven't found yet a scanning utility that will allow the
scanned image to be "saved as" an editable word-processing document and not
as an uneditable image file.
I'm wondering if any such thing actually exists in Linux...
Yes, it does exist. Search for OCR functionality, for example gocr.
I don't have any experience with this, so I don't know how good it is.
But at least I know that it exists. ;)
Sigrid