Re: Scanning Suggestions (Bookmarks & Colour)

2021-09-10 Thread Paul Koning via cctalk



> On Sep 10, 2021, at 12:21 PM, Paul Flo Williams via cctalk wrote:
> 
>> ...
> 
> https://hisdeedsaredust.com/2021/09/10/colour-separations-with-graphicsmagick.html

Paul,

You said "At the moment, I’ve been picking the colour by choosing an image with 
a large area of blue, pulling it into Gimp and successively resampling down 
until I’ve got a tiny image on which I use the colour dropper tool.".

There's a simpler way.  The GIMP color dropper has a "radius" setting, which 
you can set to be however big a region you want it to sample.  It will use the 
average of that region.
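
As a rough illustration of what sampling with a radius amounts to, here is a minimal Python sketch that averages the RGB values in a region around a point. It is not GIMP's code; the square region, the function name and the list-of-rows image layout are all assumptions made for the example.

```python
def average_colour(pixels, cx, cy, radius):
    """Mean RGB of the square region within `radius` pixels of (cx, cy).

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    The region is clipped to the image bounds (an assumption of this sketch).
    """
    height, width = len(pixels), len(pixels[0])
    region = [
        pixels[y][x]
        for y in range(max(0, cy - radius), min(height, cy + radius + 1))
        for x in range(max(0, cx - radius), min(width, cx + radius + 1))
    ]
    count = len(region)
    # Average each channel separately and round to the nearest integer.
    return tuple(round(sum(p[i] for p in region) / count) for i in range(3))
```

A larger radius smooths out paper texture and halftone noise, which is the same effect as repeatedly resampling the image down before picking.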

By the way, I tried to send an off-list reply to you; that was rejected by some 
dumb outfit called Spamhaus SBL:

Diagnostic-Code: smtp; 550-X-AAISP-blacklisted: 2001:558:fe21:29:69:252:207:42 
listed in Spamhaus SBL

A bizarre message, because that address doesn't appear in the mail routing of 
the rejected message, but it maps back to one of the comcast.net mail servers.  
Does this mean you have a mail filter that rejects everything from comcast.net 
customers?  I remember nonsense like that back in the 1990s, but I didn't 
realize anyone was still perpetrating such absurd notions.

paul




Re: Scanning Suggestions (Bookmarks & Colour)

2021-09-10 Thread Paul Flo Williams via cctalk
On Fri, 3 Sep 2021 10:10:50 +0100
Antonio Carlini via cctalk wrote:

> On 02/09/2021 20:51, Paul Flo Williams via cctalk wrote:
> >
> > I haven't finished writing this up, but my workflow tends to be to
> > produce a Group4 TIFF from the colour scan by simple thresholding
> > (or first dropping the other colours to white, if they are quite
> > dark), and then produce all the other separations by dropping black
> > out, converting your spot colour to black and then thresholding.
> > This way you get two or more images:
> >
> > 1) PNG(s) containing pixels that are all either white or your spot
> > colour,
> > 2) a G4 TIFF for the black and white layer.  
> 
> As I'm in the process of scanning manuals right now, and I'd like to 
> preserve the colour, I'm looking forward to the write up.

https://hisdeedsaredust.com/2021/09/10/colour-separations-with-graphicsmagick.html

You're welcome to send me any sample scans of your RSX manuals,
Antonio, because I'd like to see the results with other colours. I've
been processing light blue, green and brown separations on other
manuals and doing a lot of coding, generalising my toolchain at the
moment and filing issues with GraphicsMagick on the way.

(Incidentally, I've improved that PDF from a week ago to add bookmarks
and proper page labels. Link in article; much more needing to be
written.)

Regards,
Paul


Re: Scanning Suggestions (Bookmarks & Colour)

2021-09-03 Thread Antonio Carlini via cctalk

On 02/09/2021 20:51, Paul Flo Williams via cctalk wrote:

> With apologies for breaking the threading, as I've just rejoined and I'm
> responding to something I've just spotted in the archive ...



Welcome back!



> I haven't finished writing this up, but my workflow tends to be to
> produce a Group4 TIFF from the colour scan by simple thresholding (or
> first dropping the other colours to white, if they are quite dark), and
> then produce all the other separations by dropping black out,
> converting your spot colour to black and then thresholding. This way
> you get two or more images:
>
> 1) PNG(s) containing pixels that are all either white or your spot
> colour,
> 2) a G4 TIFF for the black and white layer.


As I'm in the process of scanning manuals right now, and I'd like to 
preserve the colour, I'm looking forward to the write up.


The ones I'm working on right now are mostly RSX-11 or VMS V4 and 
earlier, so they tend to highlight typed input by presenting that text 
in red.


They also have blocks of grey as background on which text is printed 
(this looks OK even with a bilevel TIFF) and also blocks of red/pink as 
background for black text (and maybe red text too). I'm sure there's at 
least one manual that has blue text thrown into the mix too.




> I've just scanned another document with some blue diagrams and table
> backgrounds, if you'd like to see an example:
>
> https://vt100.net/dec/ek-0la75-ug-002.pdf

That looks really good. It copes very well with black text on blue 
background, so I imagine it would work well for the black text on 
red/pink shading case.



What's the input to the process? G4 bilevel TIFF @ 600 dpi from the 
looks of your example, but how do you scan the colour pages? 600dpi to 
PNG? Or something else?



Thanks


Antonio


--
Antonio Carlini
anto...@acarlini.com



Scanning Suggestions (Bookmarks & Colour)

2021-09-02 Thread Paul Flo Williams via cctalk
With apologies for breaking the threading, as I've just rejoined and I'm
responding to something I've just spotted in the archive ...

Regarding colour separations for scanned documents, GraphicsMagick is
quite capable of producing the required individual colour layers. If
you identify the colours you wish to pull out, you can use the "-fuzz"
and "-opaque" operators to change any given colour range (fuzz uses
Euclidean distance in RGB space) into another one (the current "-fill"
colour).
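
To make the fuzz idea concrete, here is a minimal Python sketch of "replace every pixel within a given Euclidean RGB distance of a target colour with a fill colour". It illustrates the description above and is not GraphicsMagick's implementation; the function name, the image layout and the use of 0-255 channel units for the fuzz value are assumptions.

```python
import math

def replace_colour(pixels, target, fill, fuzz):
    """Replace pixels within `fuzz` (Euclidean RGB distance) of `target` by `fill`.

    `pixels` is a list of rows of (r, g, b) tuples; a new image is returned.
    """
    return [
        [fill if math.dist(p, target) <= fuzz else p for p in row]
        for row in pixels
    ]
```

For example, with a target of (0, 0, 255) and a fuzz of 25, a slightly off-blue pixel such as (10, 10, 240) is about 20.6 units away and gets replaced, while a red pixel is left alone.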

I haven't finished writing this up, but my workflow tends to be to
produce a Group4 TIFF from the colour scan by simple thresholding (or
first dropping the other colours to white, if they are quite dark), and
then produce all the other separations by dropping black out,
converting your spot colour to black and then thresholding. This way
you get two or more images:

1) PNG(s) containing pixels that are all either white or your spot
colour,
2) a G4 TIFF for the black and white layer.
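
The separation step can be sketched in Python as follows. This is a simplified stand-in for the workflow described above, not the actual toolchain: it marks a pixel as spot colour when it lies within a fuzz distance of the chosen colour, and marks the remaining pixels as black when their mean channel value falls below a threshold. The function name, the threshold of 128 and the simple mean used as brightness are assumptions.

```python
import math

def separate_layers(pixels, spot, fuzz, threshold=128):
    """Split an RGB image into (black_layer, spot_layer) bilevel masks.

    Each layer is a list of rows of booleans, True meaning "ink here".
    """
    black_layer, spot_layer = [], []
    for row in pixels:
        # Spot layer: pixels close enough to the spot colour.
        is_spot = [math.dist(p, spot) <= fuzz for p in row]
        # Black layer: treat spot pixels as white first, then threshold the rest.
        is_black = [
            (not s) and (sum(p) / 3) < threshold
            for p, s in zip(row, is_spot)
        ]
        black_layer.append(is_black)
        spot_layer.append(is_spot)
    return black_layer, spot_layer
```

Dropping the spot colour before thresholding matters when the colour is dark: a blue such as (40, 60, 200) has a mean of 100 and would otherwise land in the black layer.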

The PNG must be saved as a two-colour paletted image so that they can
be used as masks in the final PDF. I always apply the black and white
(text) layer on top of every page, so that the fuzzing of the colour
layers doesn't reduce the clarity of the text.

This might sound awkward, but I've found that one fuzz value tends to
work for all the pages when extracting a given colour, so you can
process all pages in a loop. I use the Perl module PDF::Builder to put
my scans together, but I think tumble is capable of overlays too.

PNGs are compressed with deflate. If the spot colours you are
processing apply to text in the document, my first thought was that I
could save a bunch of Group4 TIFFs, one for each colour, and mask those
into the PDF, because Group4 compression is impressive for text. It took
some frustrating experiments before realising the Group4 compression
isn't defined for two colour images in general; it is specifically for
images that are black and white, and PDF won't let you circumvent that!

I've just scanned another document with some blue diagrams and table
backgrounds, if you'd like to see an example:

https://vt100.net/dec/ek-0la75-ug-002.pdf

I might reprocess this later, but for now, I didn't even bother
separating out pages that contain blue from ones that don't; every page
has a blue layer, even if it's blank. If you're wide awake, you may
spot that the blue layer on page 41 doesn't extend to the bottom of the
table. This isn't a processing flaw; the document is actually printed
like that.

Regards,
Paul


Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-30 Thread Christian Corti via cctalk

On Fri, 27 Aug 2021, Al Kossow wrote:
> I didn't see an obvious example of ocrmypdf doing OCR in parallel on a 
> single document


It does that by default. At least, it always uses all cores when I process 
a document with ocrmypdf.


Christian


Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread Paul Koning via cctalk



> On Aug 28, 2021, at 12:22 PM, Al Kossow via cctalk wrote:
> 
> On 8/28/21 8:57 AM, Antonio Carlini via cctalk wrote:
> 
>> Neatly solved in the document's future (but our past and present) by having 
>> documents that are born digital.
> 
> Good luck translating documents that HP, DEC and IBM produced in their 
> proprietary "bookreader" formats so they don't look like crap.

The same goes for PDF, in many cases.  But I have found that running a PDF 
document through an OCR program that handles page formatting (like FineReader) 
can work quite well.  The OCR function itself of course works very nicely when 
you have input like that -- no variability in the letter shapes.  So you're 
really dealing with the conversion from page geometry to text flow that those 
programs also offer.

paul




Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread Al Kossow via cctalk

Re: scanning resolution

I've never seen a perceivable difference beyond 600-800dpi in the material I 
work with.

You have to consider the media you are working with. Even 600 is overkill for a 
DEC pulp handbook from the 60's, while a high clay litho magazine may have 
half-toning to that level.




Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread Al Kossow via cctalk

On 8/28/21 8:57 AM, Antonio Carlini via cctalk wrote:


> Neatly solved in the document's future (but our past and present) by having 
> documents that are born digital.

Good luck translating documents that HP, DEC and IBM produced in their 
proprietary "bookreader" formats so they don't look like crap.

I have the source tapes with the files for hundreds of HP manuals created with 
Interleaf. I've not been able to recreate their workflow.

"Born Digital" is no panacea.



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread Antonio Carlini via cctalk

On 28/08/2021 12:44, emanuel stiebler via cctalk wrote:


> I always fail to understand this ...
> With prices for hard drives like they are, and compared to the amount
> of work it really is to scan a manual, I would recommend scanning at
> the best resolution you have, and keeping those files as your "original scans".
> Then you apply whatever tricks you have in your bin to "publish" those
> scans.


Well the scanner claims 4800 dpi optical, so that's 1.6GiB per page.

(Actually the scanner claims 4800 x 9600 dpi optical  but I can't see 
how to ask it to do that).


So there's a question of what's practical. I only have about 4TiB of 
free space, so that's 2500 colour pages at most.
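
The arithmetic behind that page count can be checked with a few lines of Python. The figures are the ones quoted above; treating "100kB" as 100,000 bytes is an assumption.

```python
# Work in integers: 4 TiB of free space expressed in GiB,
# and 1.6 GiB per page expressed in tenths of a GiB.
free_gib = 4 * 1024
page_tenths_of_gib = 16

pages = free_gib * 10 // page_tenths_of_gib
print(pages)  # 2560, i.e. roughly 2500 pages

# Size ratio between a 1.6 GiB scan and a ~100 kB born-digital page.
ratio = round(1.6 * 2**30 / 100_000)
print(ratio)  # 17180, so the scan is about 17,000 times larger
```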


It's also incredibly inefficient: that same information, if it had been 
born digital, would take 100kB per page or so.


I've not tried opening a 100GiB document lately but I assume that any 
PDF reader will have some issues.




> Probably, one day there will be a nice tool, to do whatever you
> expected, and you have the scan already on your drive, and the original
> manual is digitized and preserved already.

Neatly solved in the document's future (but our past and present) by 
having documents that are born digital.



That just leaves a few hundred years of printed matter to deal with. 
Luckily noteshrink seems to do a good job.



Antonio

--
Antonio Carlini
anto...@acarlini.com



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread Antonio Carlini via cctalk

On 28/08/2021 09:21, æstrid smith via cctalk wrote:

> i've achieved satisfactory results paletteizing scans of low-color-depth 
> material using a tool called 'noteshrink':
>
> https://mzucker.github.io/2016/09/20/noteshrink.html

Well as a guide the 66 page AA-CJ39A-TE (VAX-11 RSX Installation Guide 
and Release Notes) is 8.8MB. That's with the front and rear cover 
scanned as 300 dpi JPG and also 12 colour pages as 300 dpi JPG.


Each of the 600dpi PNG pages comes out at 26MB.

I tried optipng first. Even "-o 7" (which I ran overnight but I forgot 
to time ...) only dropped a page down to 19MB. So completely impractical 
for even this small number of pages.



Noteshrink (which I've seen before but never bothered to try!) knocked a 
26MB PNG down to 700kB. The only issue is that the red looks quite a bit 
more brown than it should. I'll look into it a bit more as it looks 
good. Thanks



Antonio


--
Antonio Carlini
anto...@acarlini.com



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread emanuel stiebler via cctalk
On 2021-08-28 04:21, æstrid smith via cctalk wrote:
> i've achieved satisfactory results paletteizing scans of low-color-depth 
> material using a tool called 'noteshrink':
> 
> https://mzucker.github.io/2016/09/20/noteshrink.html

Even if you don't use the tool, it is worth reading ;-)




Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread emanuel stiebler via cctalk
On 2021-08-27 16:50, Antonio Carlini via cctalk wrote:

> For photographs or shaded areas that don't necessarily come out well
> under those settings, I plan to use 8-bit greyscale. I'd prefer to use
> 600dpi but I may have to fall back to 300dpi if the per-page file size
> shoots up too much.

I always fail to understand this ...
With prices for hard drives like they are, and compared to the amount
of work it really is to scan a manual, I would recommend scanning at
the best resolution you have, and keeping those files as your "original scans".
Then you apply whatever tricks you have in your bin to "publish" those
scans.

Probably, one day there will be a nice tool, to do whatever you
expected, and you have the scan already on your drive, and the original
manual is digitized and preserved already.

Just my .001 cents ;-)

And I was talking about pictures/halftone etc.

The recommendations for b/w & text/line drawings are clear ...


Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread æstrid smith via cctalk
i've achieved satisfactory results paletteizing scans of low-color-depth 
material using a tool called 'noteshrink':

https://mzucker.github.io/2016/09/20/noteshrink.html

-- 
æstrid smith (she/her)
=<[ c y b e r ]>=
antique telephone collectors association member #4870



On Fri, Aug 27, 2021, at 13:50, Antonio Carlini via cctalk wrote:
> I have a few manuals to scan and I'm looking for suggestions, about how 
> to add bookmarks and how to handle colour.
> 
> Bookmarks should be easier, so let's start with that. I want to add 
> bookmarks (or whatever they are called) so that it is easy to navigate 
> to page "2-48" or "C-17" in a document. Many of the PDFs on bitsavers 
> have that and I've found it very useful so I'd like to do that for my 
> future scans. I've tried with pdftk (the Java port as the original is no 
> longer available on my distro) but that failed. So I tried GhostScript 
> and that also failed, while also rewriting the PDF to be considerably 
> larger. Is there a simple way to achieve this (ideally from the CLI)?
> 
> 
> Now for the scanning itself.
> 
> For manuals that are simple monochrome, I plan to scan at 600dpi bilevel 
> G4 encoded, wrapped in PDF.
> For photographs or shaded areas that don't necessarily come out well 
> under those settings, I plan to use 8-bit greyscale. I'd prefer to use 
> 600dpi but I may have to fall back to 300dpi if the per-page file size 
> shoots up too much.
> 
> The real issue is colour. I know that various people have looked at the 
> issue of how to efficiently scan pages that are mostly black and white 
> but have some coloured text (RSX-11 manuals and early VMS manuals did 
> this to highlight terminal input, for example). I don't think this is a 
> solved problem and I'm not expecting a solution, what I'm really looking 
> for is to check that what I'm about to produce will have all the 
> information that a future efficient algorithm is likely to need.
> 
> I'm going to start by scanning the whole manual as though it had no 
> colour (so 600 dpi bilevel G4 encoded, except for pages with photos and 
> shading and so on). Then I'm going to go back and rescan the pages that 
> have colour and scan those at 600 dpi and save as a JPG. Then I'll 
> produce a final PDF with the colour pages inserted. I'll also produce a 
> PDF with the B&W pages that were replaced by colour pages (I assume OCR 
> will be better served by non-jaggy scans).
> 
> So the final outputs will be:
> manual.pdf  - the whole manual, including whole pages scanned as colour 
> if any colour is present on them
> manual_BW.pdf  - the G4-encoded bilevel pages that were replaced by 
> colour pages
> 
> Thanks
> 
> 
> Antonio
> 
> 
> -- 
> 
> Antonio Carlini
> anto...@acarlini.com
> 
> 


Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-28 Thread Michael Mulhern via cctalk
I have been scanning to 400 and 600 dpi TIFF and using scantailor-advanced
to post-process. Where manuals have had colour highlighting I’ve had good
results selecting indexed colour as the output type for those pages.

Unfortunately, a pathetic rural internet connection means I’ve only been
pushing optimised PDFs to the internet archive, but all original and
processed scans are kept and backed up.

I’m not just doing manuals, but whatever I find interesting in my library
at the time.

So far I’ve only done 500+ in the last 12 months, but it’s also therapeutic
for me as well as for sharing.

For your education and/or reading pleasure
https://archive.org/details/@jongleur

Cheers

Michael.

On Sat, 28 Aug 2021 at 8:05 am, Antonio Carlini via cctalk <
cctalk@classiccmp.org> wrote:

> On 27/08/2021 22:10, Al Kossow via cctalk wrote:
> > On 8/27/21 2:05 PM, Paul Koning via cctalk wrote:
> >
> >> For material such as the RSX manuals you mentioned, the tool needed
> >> is a compression algorithm that handles color with hard edges
> >> faithfully.  Basically that means a lossless compression scheme.
> >> That should be fine, since pages like that should compress very well,
> >> at least if the scan has been touched up just a bit to make the page
> >> background reasonably pure white.
> >
> > Ethan worked on a filter a long time ago for DEC manuals. J David
> > Bryan's work was mentioned recently.
>
> I did see it, but it didn't look like a cookie-cutter recipe. I'd be
> happy to be proved wrong though. What I don't want to have to do is
> manually process each page (beyond having to decide which to scan in
> colour). I would be looking for an algorithm or process that I can just
> point at scanner data for a page and have it spit out the optimised PDF
> page. I'm sure that will appear at some stage, but I don't think it
> exists yet. The RSX-11M/M-PLUS Error Logging Manual, for example, has
> somewhere between 20 and 50 pages with colour present. I can pick those
> out and re-scan them and I can relatively easily merge those pages with
> the original B&W scan, but if I have to manually examine each page, I'll
> never make it to whatever manual is in my list after that one :-)
>
> >
> > It is trivial to add page bookmarks with Eric Smith's tumble with the
> > -b %F option
> >
> Thanks, I'll look into tumble.
>
>
> Antonio
>
> --
> Antonio Carlini
> anto...@acarlini.com
>
> --


Blog: RetroRetrospective – Fun today with yesterday's gear
Podcast: Retro Computing Roundtable (Co-Host)


Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Al Kossow via cctalk



On 8/27/21 6:01 PM, Paul Koning wrote:
> I once looked for open source OCR, and found one (an obscure GNU project I 
> think) but it didn't do much.

Most of the world uses tesseract, including ocrmypdf.

I didn't see an obvious example of ocrmypdf doing OCR in parallel on a single 
document.


Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Don Stalkowski via cctalk
On Fri, Aug 27, 2021 at 05:55:44PM -0700, Al Kossow via cctalk wrote:
> I was also just thinking you would probably have to have a layer (black) with 
> all of the stuff to OCR including the stuff in red and blue, then overlay the 
> color on that after the pass through whatever you're using to do the OCR.
> 
> The one bottleneck I would really like to fix is getting the 24 cores on my 
> machine doing OCR on 24 different pages at the same time.
> 
> 

The documentation for ocrmypdf describes how to do that.

https://ocrmypdf.readthedocs.io/en/latest/

and

https://ocrmypdf.readthedocs.io/en/latest/batch.html

Don


Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Paul Koning via cctalk



> On Aug 27, 2021, at 8:55 PM, Al Kossow via cctalk wrote:
> 
> I was also just thinking you would probably have to have a layer (black) with 
> all of the stuff to OCR including the stuff in red and blue, then overlay the 
> color on that after the pass through whatever you're using to do the OCR.
> 
> The one bottleneck I would really like to fix is getting the 24 cores on my 
> machine doing OCR on 24 different pages at the same time.

What OCR do you use?  It's been a while, but I've been using ABBYY FineReader.  
I once looked for open source OCR, and found one (an obscure GNU project I 
think) but it didn't do much.

paul



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Al Kossow via cctalk

I was also just thinking you would probably have to have a layer (black) with 
all of the stuff to OCR including the stuff in red and blue, then overlay the 
color on that after the pass through whatever you're using to do the OCR.

The one bottleneck I would really like to fix is getting the 24 cores on my 
machine doing OCR on 24 different pages at the same time.
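
One generic way to keep many cores busy is to hand each page to a worker from a pool. The Python sketch below only shows the fan-out pattern: ocr_page is a hypothetical placeholder, not a real OCR call, and would need to be replaced by a call to whatever engine is in use (for instance one external process per page).

```python
import os
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_path):
    """Hypothetical stand-in for a per-page OCR call.

    In practice this would launch the OCR engine on one page image
    and return the recognised text.
    """
    return f"text of {page_path}"

def ocr_all(page_paths, workers=None):
    """Run ocr_page over all pages concurrently, preserving page order."""
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return list(pool.map(ocr_page, page_paths))
```

Threads are enough here on the assumption that each worker spends its time waiting for an external OCR process; if the OCR ran in pure Python, a process pool would be the better fit.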



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Paul Koning via cctalk



> On Aug 27, 2021, at 8:32 PM, Paul Koning via cctalk wrote:
> 
> ...
> 
>> It turns out though that if you drive it with a computer then you also get 
>> the choice of TIFF or PNG as additional choices. TIFF is likely to be quite 
>> a bit too big. I'll try PNG and see how big the files it generates are. I've 
>> no idea what the default compression is straight out of the software but as 
>> long as it's lossless I can hopefully post-process to squeeze things down if 
>> possible.
> 
> TIFF is (normally) lossless.  I think PNG also, or at least can be, but I 
> don't understand it as well.

I should have checked first, but at least now I know: yes, PNG is lossless, it 
uses Deflate compression.  And TIFF is also lossless, with a variety of 
compression schemes.  In particular, it is the way to go for bitonal images 
with the CCITT compression scheme, which is specifically optimized for that 
case.  

paul



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Paul Koning via cctalk



> On Aug 27, 2021, at 5:36 PM, Antonio Carlini  wrote:
> 
> On 27/08/2021 22:05, Paul Koning wrote:
>> JPG is the wrong tool for pages with color text or color line art. As I've 
>> mentioned before, JPG is fit ONLY for photos, not for any image with hard 
>> edges. Text compressed with JPG will suffer badly. 
> 
> 
> Yes, true. I thought that for colour, all I could get was JPEG. It certainly 
> seems to be the case that the HP PhotoSmart I have scans everything as JPEG 
> 300 dpi when you use the front panel to scan to a memory stick. Post 
> processing wouldn't make that any better, which is why I thought I was stuck 
> with JPEG.

Wow, that's crazy.  Perhaps they thought the product was only going to be used 
by consumers who have no clue.

> It turns out though that if you drive it with a computer then you also get 
> the choice of TIFF or PNG as additional choices. TIFF is likely to be quite a 
> bit too big. I'll try PNG and see how big the files it generates are. I've no 
> idea what the default compression is straight out of the software but as long 
> as it's lossless I can hopefully post-process to squeeze things down if 
> possible.

TIFF is (normally) lossless.  I think PNG also, or at least can be, but I don't 
understand it as well.

TIFF is actually a container and inside it can be any number of encodings.  
Compression schemes can be simple ones like run length coding, or more complex 
ones like LZ.  Either way, if there are patterns, especially significant areas 
of the same color, the compression works very well indeed.
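
To see why large areas of one color compress so well, here is a minimal run-length coding sketch in Python. It illustrates the general idea only and is not the encoding any particular TIFF writer uses; the function name is made up for the example.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]
```

A scan line of 8 white, 2 black and 6 white pixels becomes just three pairs, whereas a line whose background wobbles between 250 and 255 produces a new pair at almost every pixel.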

A raw scan probably won't compress well.  But something as simple as a white 
point adjustment to make the bulk of the background be full white will make the 
file very much smaller.  If you tweak the black point some as well, so areas 
meant to be black are in fact full black rather than slightly-varying grays, 
you will gain still more.  As a bonus, the resulting image will also be much 
crisper and easier to read.
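
A white point and black point adjustment is just a linear stretch with clamping. The Python sketch below shows it for a single 8-bit grey value; the function name and the example points are assumptions, and image editors offer the same thing as a "levels" tool.

```python
def adjust_levels(value, black_point, white_point):
    """Map an 8-bit grey value so black_point -> 0 and white_point -> 255.

    Values at or beyond either point are clamped to pure black or pure white.
    """
    if value <= black_point:
        return 0
    if value >= white_point:
        return 255
    return round((value - black_point) * 255 / (white_point - black_point))
```

With a black point of 50 and a white point of 250, every background pixel between 250 and 255 becomes exactly 255, which is what lets a lossless compressor treat the background as one long run.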

The other day there was a mention of open source tools at leptonica.org.  From 
the examples given in the intro, for example here: 
http://www.leptonica.org/binarization.html it looks like a very nice tool kit 
for cleaning up images well and easily.  While I don't see it mentioned, the 
cleaned up images will certainly compress very effectively in TIFF.

paul



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Paul Koning via cctalk



> On Aug 27, 2021, at 5:19 PM, Al Kossow via cctalk wrote:
> 
> Probably the way to deal with DEC tri-color would be to define rectangular 
> regions and have three separate bitonal layers colored black red and blue, 
> and a third JPEG-2 for grayscale or color images. Doing the layer 
> separations would be the non-fun part.

I haven't looked at tools in, say, GIMP to create spot-color layers from RGB 
images.  If it can do that, a good way to do what you describe is to give it a 
spot color definition that matches what the color print looks like.  Then the 
black layer and the red (or whatever) layer are the two you want, and you could 
run each through a threshold operation to make them bitonal.  Then the image 
converted back to RGB would compress really well with TIFF.

paul



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Van Snyder via cctalk
On Sat, 2021-08-28 at 01:40 +0200, Torfinn Ingolfsen via cctalk wrote:
> On Sat, Aug 28, 2021 at 12:43 AM Van Snyder via cctalk
>  wrote:
> > 
> > 
> > This isn't a default part of Debian distributions, and apt-get
> > can't
> > find it.
> > 
> > I found it on github
> > 
> > https://github.com/brouhaha/tumble
> > 
> > I had to install several packages that are not default parts of
> > Debian "Buster" 10, such as bison, flex, libtiff-dev, libjpeg-dev and
> > libnetpbm11-dev.
> > 
> > I wasn't able to compile it on Debian "Buster" 10.  Ultimately,
> > "make" gave up:
> 
> FWIW, I got it to compile on Debian 11  by applying the fixes in pull
> request 7
> https://github.com/brouhaha/tumble/pull/7
> Unfortunately, I no longer have any Debian 10 machines to test on.

Thanks for the tip.

That did the trick. Compile worked. Haven't tested the executable yet.

Van

> 
> HTH



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Torfinn Ingolfsen via cctalk
On Sat, Aug 28, 2021 at 12:43 AM Van Snyder via cctalk
 wrote:
>
>
> This isn't a default part of Debian distributions, and apt-get can't
> find it.
>
> I found it on github
>
> https://github.com/brouhaha/tumble
>
> I had to install several packages that are not default parts of Debian
> "Buster" 10, such as bison, flex, libtiff-dev, libjpeg-dev and
> libnetpbm11-dev.
>
> I wasn't able to compile it on Debian "Buster" 10.  Ultimately, "make"
> gave up:

FWIW, I got it to compile on Debian 11  by applying the fixes in pull request 7
https://github.com/brouhaha/tumble/pull/7
Unfortunately, I no longer have any Debian 10 machines to test on.

HTH
-- 
Regards,
Torfinn Ingolfsen


Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Van Snyder via cctalk
On Fri, 2021-08-27 at 14:10 -0700, Al Kossow via cctalk wrote:
> It is trivial to add page bookmarks with Eric Smith's tumble with the
> -b %F option

This isn't a default part of Debian distributions, and apt-get can't
find it.

I found it on github

https://github.com/brouhaha/tumble

I had to install several packages that are not default parts of Debian
"Buster" 10, such as bison, flex, libtiff-dev, libjpeg-dev and
libnetpbm11-dev.

I wasn't able to compile it on Debian "Buster" 10.  Ultimately, "make"
gave up:

# make clean; make
... bison and flex...
... a bunch of successful cc compiles to get .o files
cc tumble.o semantics.o tumble_input.o tumble_tiff.o tumble_jpeg.o
tumble_pbm.o tumble_png.o tumble_blank.o bitblt.o bitblt_g4.o
bitblt_tables.o g4_tables.o pdf.o pdf_util.o pdf_prim.o pdf_name_tree.o
pdf_bookmark.o pdf_page_label.o pdf_text.o pdf_g4.o pdf_jpeg.o
pdf_png.o scanner.o parser.tab.o -ltiff -ljpeg -lnetpbm -lz -lm -o
tumble
/usr/bin/ld: tumble_input.o:(.bss+0x0): multiple definition of
`blank_handler'; tumble.o:(.bss+0x20): first defined here
/usr/bin/ld: tumble_tiff.o:(.bss+0x20): multiple definition of
`blank_handler'; tumble.o:(.bss+0x20): first defined here
/usr/bin/ld: tumble_jpeg.o:(.bss+0x0): multiple definition of
`blank_handler'; tumble.o:(.bss+0x20): first defined here
/usr/bin/ld: tumble_pbm.o:(.bss+0x0): multiple definition of
`blank_handler'; tumble.o:(.bss+0x20): first defined here
/usr/bin/ld: tumble_png.o:(.bss+0x0): multiple definition of
`blank_handler'; tumble.o:(.bss+0x20): first defined here
/usr/bin/ld: tumble_blank.o:(.data.rel.local+0x0): multiple definition
of `blank_handler'; tumble.o:(.bss+0x20): first defined here
/usr/bin/ld: scanner.o:(.bss+0x18): multiple definition of `yyin';
semantics.o:(.bss+0x50): first defined here
collect2: error: ld returned 1 exit status
make: *** [Makefile:127: tumble] Error 1

Is it available as a statically-linked executable, or only as source
from github?



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Antonio Carlini via cctalk

On 27/08/2021 22:10, Al Kossow via cctalk wrote:

> On 8/27/21 2:05 PM, Paul Koning via cctalk wrote:
>
>> For material such as the RSX manuals you mentioned, the tool needed 
>> is a compression algorithm that handles color with hard edges 
>> faithfully.  Basically that means a lossless compression scheme.  
>> That should be fine, since pages like that should compress very well, 
>> at least if the scan has been touched up just a bit to make the page 
>> background reasonably pure white.
>
> Ethan worked on a filter a long time ago for DEC manuals. J David 
> Bryan's work was mentioned recently.

I did see it, but it didn't look like a cookie-cutter recipe. I'd be 
happy to be proved wrong though. What I don't want to have to do is 
manually process each page (beyond having to decide which to scan in 
colour). I would be looking for an algorithm or process that I can just 
point at scanner data for a page and have it spit out the optimised PDF 
page. I'm sure that will appear at some stage, but I don't think it 
exists yet. The RSX-11M/M-PLUS Error Logging Manual, for example, has 
somewhere between 20 and 50 pages with colour present. I can pick those 
out and re-scan them and I can relatively easily merge those pages with 
the original B&W scan, but if I have to manually examine each page, I'll 
never make it to whatever manual is in my list after that one :-)

> It is trivial to add page bookmarks with Eric Smith's tumble with the 
> -b %F option



Thanks, I'll look into tumble.


Antonio

--
Antonio Carlini
anto...@acarlini.com



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Antonio Carlini via cctalk

On 27/08/2021 22:05, Paul Koning wrote:
> JPG is the wrong tool for pages with color text or color line art. As 
> I've mentioned before, JPG is fit ONLY for photos, not for any image 
> with hard edges. Text compressed with JPG will suffer badly.



Yes, true. I thought that for colour, all I could get was JPEG. It 
certainly seems to be the case that the HP PhotoSmart I have scans 
everything as JPEG 300 dpi when you use the front panel to scan to a 
memory stick. Post processing wouldn't make that any better, which is 
why I thought I was stuck with JPEG.



It turns out though that if you drive it with a computer then you also 
get the choice of TIFF or PNG as additional choices. TIFF is likely to 
be quite a bit too big. I'll try PNG and see how big the files it 
generates are. I've no idea what the default compression is straight out 
of the software but as long as it's lossless I can hopefully 
post-process to squeeze things down if possible.



If this turns out to work without ballooning file sizes, then I can just 
not bother preserving the B&W pages, as lossless colour PNG should OCR 
as well as B&W (I would think).



I have a few pages in the scanner now at 600dpi PNG so I'll soon know 
how that compares to JPG or B&W.



Thanks


Antonio


--
Antonio Carlini
anto...@acarlini.com



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Al Kossow via cctalk



Probably the way to deal with DEC tri-color would be to define rectangular 
regions and have three separate bitonal layers colored black, red and blue, 
and a third JPEG-2 for grayscale or color images. Doing the layer separations 
would be the non-fun part.
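
As a rough illustration of the layer separation step only, here is a minimal sketch that assigns each pixel to its nearest ink colour and emits one 0/1 mask per ink. It is not anyone's existing filter. The palette values are assumptions (real spot colours would have to be measured from a scan), the function names are hypothetical, and the rectangular-region handling is left out entirely.

```python
# Assumed ink colours; measure the real ones from an actual scan.
PALETTE = {
    "white": (255, 255, 255),
    "black": (0, 0, 0),
    "red": (200, 30, 40),
    "blue": (30, 60, 180),
}


def nearest_ink(rgb, palette=PALETTE):
    """Return the palette name closest to rgb (squared RGB distance)."""
    def dist2(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, palette[name]))
    return min(palette, key=dist2)


def separate_layers(rows, palette=PALETTE):
    """Split a page into one bitonal mask per non-white ink.

    rows is a list of rows of (r, g, b) tuples. The result maps each ink
    name to rows of 0/1, where 1 marks a pixel belonging to that ink.
    """
    inks = [name for name in palette if name != "white"]
    layers = {ink: [] for ink in inks}
    for row in rows:
        labels = [nearest_ink(px, palette) for px in row]
        for ink in inks:
            layers[ink].append([1 if label == ink else 0 for label in labels])
    return layers


# Tiny hypothetical 3x2 "page": paper, black text, red and blue spot colour.
page = [
    [(252, 251, 249), (18, 20, 19), (210, 40, 50)],
    [(25, 55, 170), (248, 250, 247), (10, 12, 11)],
]
layers = separate_layers(page)
for ink, mask in layers.items():
    print(ink, mask)
```

Each mask could then be stored as its own bitonal image. Anti-aliased edge pixels, which sit between two inks, are where a nearest-colour rule like this is most likely to go wrong.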




Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Al Kossow via cctalk

On 8/27/21 2:05 PM, Paul Koning via cctalk wrote:


For material such as the RSX manuals you mentioned, the tool needed is a 
compression algorithm that handles color with hard edges faithfully.  Basically 
that means a lossless compression scheme.  That should be fine, since pages 
like that should compress very well, at least if the scan has been touched up 
just a bit to make the page background reasonably pure white.


Ethan worked on a filter a long time ago for DEC manuals. J David Bryan's work 
was mentioned recently.

It is trivial to add page bookmarks with Eric Smith's tumble with the -b %F 
option



Re: Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Paul Koning via cctalk



> On Aug 27, 2021, at 4:50 PM, Antonio Carlini via cctalk 
>  wrote:
> 
> I have a few manuals to scan and I'm looking for suggestions, about how to 
> add bookmarks and how to handle colour.
> ...
> For photographs or shaded areas that don't necessarily come out well under 
> those settings, I plan to use 8-bit greyscale. I'd prefer to use 600dpi but I 
> may have to fall back to 300dpi if the per-page file size shoots up too much.

Depending on the resolution used, given that the photos are printed as halftone 
(black dots of various sizes), you may get weird scan artifacts.  Some scan 
programs may have tools to convert a halftone image to the equivalent grayscale 
image; such a thing is likely to be helpful.

> The real issue is colour. I know that various people have looked at the issue 
> of how to efficiently scan pages that are mostly black and white but have 
> some coloured text (RSX-11 manuals and early VMS manuals did this to 
> highlight terminal input, for example). I don't think this is a solved 
> problem and I'm not expecting a solution, what I'm really looking for is to 
> check that what I'm about to produce will have all the information that a 
> future efficient algorithm is likely to need.
> 
> I'm going to start by scanning the whole manual as though it had no colour 
> (so 600 dpi bilevel G4 encoded, except for pages with photos and shading and 
> so on). Then I'm going to go back and rescan the pages that have colour and 
> scan those at 600 dpi and save as a JPG.

JPG is the wrong tool for pages with color text or color line art.  As I've 
mentioned before, JPG is fit ONLY for photos, not for any image with hard 
edges.  Text compressed with JPG will suffer badly.

For material such as the RSX manuals you mentioned, the tool needed is a 
compression algorithm that handles color with hard edges faithfully.  Basically 
that means a lossless compression scheme.  That should be fine, since pages 
like that should compress very well, at least if the scan has been touched up 
just a bit to make the page background reasonably pure white.  With more effort 
it would be possible to reconstruct the original three-color material (white, 
black, red or whatever), but that's a fair amount harder and probably not 
necessary for adequate compression.  But please, make it a practice to avoid 
JPG except in those cases (rare or non-existent in document scanning work) 
where you're actually dealing with a continuous tone photograph.
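
To illustrate the point about touching up the background, here is a minimal sketch using only the Python standard library. It assumes an 8-bit greyscale page held as raw bytes, uses zlib as a stand-in for the DEFLATE stage inside PNG, and runs on synthetic data rather than a real scan. The threshold of 230 and the function name are arbitrary choices for the example.

```python
import random
import zlib


def whiten_background(gray, threshold=230):
    """Snap near-white pixels to pure white (255); leave darker ones alone."""
    return bytes(255 if v >= threshold else v for v in gray)


random.seed(1)

# Hypothetical 8-bit greyscale "page": noisy paper plus a block of dark ink.
noisy_page = bytearray(random.randint(235, 255) for _ in range(20000))
noisy_page[5000:6000] = bytes([15]) * 1000

clean_page = whiten_background(bytes(noisy_page))

# Compare lossless compressed sizes before and after the touch-up.
raw_size = len(zlib.compress(bytes(noisy_page), 9))
clean_size = len(zlib.compress(clean_page, 9))
print("noisy background:", raw_size, "bytes")
print("pure white background:", clean_size, "bytes")
```

The ink pixels are untouched, so nothing with a hard edge is lost; only the paper noise, which a lossless coder cannot compress, is removed.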

paul



Scanning Suggestions (Bookmarks & Colour)

2021-08-27 Thread Antonio Carlini via cctalk
I have a few manuals to scan and I'm looking for suggestions, about how 
to add bookmarks and how to handle colour.


Bookmarks should be easier, so let's start with that. I want to add 
bookmarks (or whatever they are called) so that it is easy to navigate 
to page "2-48" or "C-17" in a document. Many of the PDFs on bitsavers 
have that and I've found it very useful so I'd like to do that for my 
future scans. I've tried with pdftk (the Java port as the original is no 
longer available on my distro) but that failed. So I tried GhostScript 
and that also failed, while also rewriting the PDF to be considerably 
larger. Is there a simple way to achieve this (ideally from the CLI)?
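
For reference, the kind of input pdftk's update_info operation takes for bookmarks can be generated with a few lines of script. This is a minimal sketch, not a tested recipe: the helper name pdftk_bookmarks is made up, and it assumes the labelled pages are consecutive in the PDF, starting at a known PDF page number.

```python
def pdftk_bookmarks(labels, first_pdf_page=1, level=1):
    """Build pdftk update_info text with one bookmark per page label.

    labels are the printed page labels (e.g. "2-48") in PDF page order;
    first_pdf_page is the 1-based PDF page number of the first label.
    """
    lines = []
    for offset, label in enumerate(labels):
        lines += [
            "BookmarkBegin",
            f"BookmarkTitle: {label}",
            f"BookmarkLevel: {level}",
            f"BookmarkPageNumber: {first_pdf_page + offset}",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical example: printed pages 2-47 to 2-49 are PDF pages 59 to 61.
labels = ["2-47", "2-48", "2-49"]
info = pdftk_bookmarks(labels, first_pdf_page=59)
print(info)
```

Saved to a file (say bookmarks.txt, a name chosen here only for illustration), the output would be applied with something like "pdftk manual.pdf update_info bookmarks.txt output manual_bm.pdf".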



Now for the scanning itself.

For manuals that are simple monochrome, I plan to scan at 600dpi bilevel 
G4 encoded, wrapped in PDF.
For photographs or shaded areas that don't necessarily come out well 
under those settings, I plan to use 8-bit greyscale. I'd prefer to use 
600dpi but I may have to fall back to 300dpi if the per-page file size 
shoots up too much.


The real issue is colour. I know that various people have looked at the 
issue of how to efficiently scan pages that are mostly black and white 
but have some coloured text (RSX-11 manuals and early VMS manuals did 
this to highlight terminal input, for example). I don't think this is a 
solved problem and I'm not expecting a solution, what I'm really looking 
for is to check that what I'm about to produce will have all the 
information that a future efficient algorithm is likely to need.


I'm going to start by scanning the whole manual as though it had no 
colour (so 600 dpi bilevel G4 encoded, except for pages with photos and 
shading and so on). Then I'm going to go back and rescan the pages that 
have colour and scan those at 600 dpi and save as a JPG. Then I'll 
produce a final PDF with the colour pages inserted. I'll also produce a 
PDF with the B&W pages that were replaced by colour pages (I assume OCR 
will be better served by non-jaggy scans).


So the final outputs will be:
manual.pdf  - the whole manual, including whole pages scanned as colour 
if any colour is present on them
manual_BW.pdf  - the G4-encoded bilevel pages that were replaced by 
colour pages


Thanks


Antonio


--

Antonio Carlini
anto...@acarlini.com