Thanks much for looking at this.

WRT the frequency of soft hyphens, I've noticed that while many older PDFs
I have indeed do not use them, those that I OCR with the latest version of
Acrobat do use them.

There is apparently something of a controversy about the semantics of soft
hyphens in general, because ISO 8859-1 and Unicode standards take slightly
different views.

However, for PDF the semantics are defined in the ISO 32000-1 (PDF 1.7)
standard, clause 14.8.2.2.3, as follows:

> *Hyphenation*. Among the artifacts introduced by text layout is the hyphen
> marking the incidental division of a word at the end of a line. In Tagged
> PDF, such an incidental word division shall be represented by a *soft
> hyphen* character, which the Unicode mapping algorithm (see “Unicode
> Mapping in Tagged PDF” in 14.8.2.4, “Extraction of Character Properties”)
> translates to the Unicode value U+00AD. (This character is distinct from an
> ordinary *hard hyphen*, whose Unicode value is U+002D.) *The producer of
> a Tagged PDF document shall distinguish explicitly between soft and hard
> hyphens so that the consumer does not have to guess which type a given
> character represents.*


I emphasized the last sentence, as it seems most relevant to the issue at
hand. My interpretation is that not all PDF producers have actually
followed the standard, but that the situation has improved over time.
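
The trimming discussed in this thread can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not Skim's
actual code: the function name `strip_soft_hyphen_breaks` is hypothetical,
and it assumes the copied string marks incidental word divisions with
U+00AD followed by optional whitespace, as the quoted clause describes.
Hard hyphens (U+002D) are left untouched.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD, the soft hyphen named in the quoted clause

# Assumption: a soft hyphen plus any whitespace right after it (the space
# or line break left by the layout) is an artifact that can be dropped.
_SOFT_BREAK = re.compile(SOFT_HYPHEN + r"\s*")


def strip_soft_hyphen_breaks(text: str) -> str:
    """Rejoin words split by a soft hyphen; hard hyphens are not matched."""
    return _SOFT_BREAK.sub("", text)


sample = "an inci\u00ad dental word divi\u00ad\nsion, not a hard-hyphen"
print(strip_soft_hyphen_breaks(sample))
```

A regular hyphen at a line end is ambiguous (it may belong to the word), so
this sketch deliberately handles only the soft-hyphen case.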

All best,

M.

On Tue, Mar 29, 2022 at 6:18 PM Christiaan Hofman <cmhof...@gmail.com>
wrote:

> No need to apologize, I was the one who responded on the wrong thread.
>
> As I mentioned there, in the next release we will attempt (as much as we
> can) to remove hyphens at the end of the line. BTW, looking for soft
> hyphens isn’t very useful, as PDFs almost always use regular hyphen
> characters (U+002D) to break words. In fact, I haven’t yet encountered any
> PDF using soft hyphens. So we will consider both.
>
> Christiaan
>
> > On 29 Mar 2022, at 04:36, Mark Roberts <mroberts1...@gmail.com> wrote:
> >
> > Let me try this again. It seems I shouldn't have broached this on
> another thread, and for that I do apologize.
> >
> > I'd like to ask about the possibility of Skim automatically trimming
> soft hyphens when I create notes.
> >
> > Using Skim 1.6.9, if I create a note for a passage in the PDF that
> includes a hyphen at the line break, it still includes a soft hyphen
> (U+00AD) and a space, and I have to trim each of these by hand. FWIW, the
> PDF was OCR'd from scanned pages using the latest version of Adobe Acrobat.
> >
> > I checked some other PDF readers and found that Adobe Acrobat Pro, FoxIt
> Reader, and PDF Expert all trim out the soft hyphen, as well as the space,
> while Skim and Preview do not.
> >
> > It seems that if there are any soft hyphens (U+00AD) followed by a space
> in a string copied from a PDF, these two characters can safely be trimmed
> out. Regular hyphens in the PDFs I've checked are represented by U+002D, so
> there should be no danger of losing them if Skim were to perform this
> operation on strings.
> >
> > These soft hyphens appear in Skim notes as zero-width characters, so to
> clean up each note I must first place the cursor in front of the preceding
> character, advance, and then hit delete twice. I.e., point, click, and then
> hit three keys in a row to clear each one. Over time, this becomes rather
> tedious.
> >
> > I did a web search and found discussion on the interwebs about OCR'd
> text and soft hyphens, with many people asking how they can fix this
> problem with various apps. It seems to be a common issue, and — I submit —
> Acrobat Pro, FoxIt Reader, and PDF Expert handle it properly while Skim
> does not.
> >
> > Is this something that could be fixed?
> >
> > Thanks again,
> >
> > M.
>
>
>
> _______________________________________________
> Skim-app-users mailing list
> Skim-app-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/skim-app-users
>