Thanks for going into more detail.

Sorry if I wasn't clear, but I'm not trying to ask for changes to the
behavior of the Skim app itself.

I'm mainly wondering if there is a way to somehow access or export this
information about notes, based on the data that is saved plus what's in the
PDF.

If this information could be exported somehow (e.g., via skimnotes or some
other tool), then I could write my own app to read it and help fix up the
notes. I can easily code something to parse XML and manipulate text, but
digging into the PDF data structures is more difficult. I have investigated
doing this and indeed it's non-trivial.
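
To make the "parse the exported XML" step concrete, here is a rough Python sketch using the standard plistlib module. It assumes the file came from the skimnotes/plutil commands quoted further down and that the top level is an array of note dictionaries. The key names ("type", "pageIndex", "string", "quadrilateralPoints") are assumptions on my part and should be checked against a real export or the Wiki.

```python
import plistlib

# ASSUMPTION: these key names are guesses about the exported plist layout;
# check them against an actual export before relying on them.
TEXT_KEY = "string"
QUADS_KEY = "quadrilateralPoints"


def load_notes(plist_bytes):
    """Parse plist data (XML or binary) and return the list of note dicts."""
    notes = plistlib.loads(plist_bytes)
    if not isinstance(notes, list):
        raise ValueError("expected a top-level array of notes")
    return notes


def summarize(note):
    """Pick out the fields discussed in this thread; missing keys become None."""
    return {
        "type": note.get("type"),
        "page": note.get("pageIndex"),
        "text": note.get(TEXT_KEY),
        "quads": note.get(QUADS_KEY),
    }
```

Usage would be along the lines of reading the exported file in binary mode, passing the bytes to load_notes(), and calling summarize() on each entry.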

Skim is one of the very few PDF readers aside from Adobe Acrobat that
actually handles page numbers correctly, AND has lots more flexibility for
the export of annotations, so this is why I'm inquiring here.

As you explain, internally PDF is rather messy and there can be
various degenerate cases, but I assume that for scholarly research 99.9% of
the time the PDF media will be journal articles or books in portrait
orientation. Indeed, there will be degenerate cases, and if the selected
text does not include full lines, then I assume hyphens could fall outside
of the selected text and the detection would of course fail. Still, what I
see on the screen is that lines of text are being recognized and
highlighted correctly 99% of the time, so it seems clear that somewhere
(either in Skim or in PDFKit, I cannot say), all of this machinery for
capturing notes text is working pretty well.
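
For the single concatenated string that the notes hold today, the heuristic suggested earlier in this thread (look for a hyphen followed by a space and remove both) could be sketched roughly as below. Restricting the match to lowercase letters on both sides is my own assumption to cut down on false positives; it is not something Skim does.

```python
import re

# Lowercase letter, hyphen, whitespace, lowercase letter, as in "eius- mod".
# ASSUMPTION: the lowercase-only restriction is an illustrative choice.
_BROKEN_WORD = re.compile(r"(?<=[a-z])-\s+(?=[a-z])")


def join_hyphen_space(text):
    """Remove a hyphen plus the following whitespace between lowercase letters."""
    return _BROKEN_WORD.sub("", text)
```

As pointed out in the thread, this cannot tell a line-break hyphen from a real one: "well-known" is left alone, but "pre- and post-war" would wrongly become "preand post-war".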

I don't know the details of the lower level APIs, but if they could provide
lines of text instead of just a single string of all the selected lines
concatenated, then it might be possible.
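
If the text were available line by line, the clean-up I have in mind would be roughly the following Python sketch. The function name and the rules (drop the hyphen when the next line starts with a lowercase letter, otherwise keep it and join without a space) are my own illustrative assumptions, not anything that Skim or PDFKit provides.

```python
def join_lines(lines):
    """Join text lines with spaces, merging words split by a trailing hyphen."""
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not out:
            out = line
        elif len(out) >= 2 and out[-1] == "-" and out[-2].isalpha():
            if line[0].islower():
                out = out[:-1] + line  # likely a line-break hyphen: drop it
            else:
                out = out + line       # keep the hyphen, e.g. "Anglo-American"
        else:
            out = out + " " + line
    return out
```

A genuine hyphen that happens to fall at the end of a line (say "well-" followed by "known") would still be removed, so this only narrows the problem down rather than solving it.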

Thanks again for explaining this!

M.

On Wed, Mar 16, 2022 at 8:38 AM Christiaan Hofman <cmhof...@gmail.com>
wrote:

> BTW, I should say that some time ago we *did* try to remove hyphens and
> combine broken lines. But it was simply too unreliable based on the
> available information, and went wrong far too often.
>
> Christiaan
>
> On 16 Mar 2022, at 00:35, Christiaan Hofman <cmhof...@gmail.com> wrote:
>
> Perhaps you can get some information about the placement of the lines. But
> we don’t even get information from the PDF about what the orientation is
> (sometimes PDFs use rotated coordinate systems, e.g. in landscape pages).
> Also, the selected text may not consist of full lines, so the end of the
> text may not be the end of a line. Also, a hyphen does not need to be a
> line break, it can also just be a hyphen in the text. Perhaps it is
> possible to (almost) figure out how some parts of text are placed in the
> laid out text on the page, but then you have to first figure out precisely
> what the lines are and compare all the character ranges in the text. And
> even getting lines can be a real mess, as there is no guarantee that the
> text is simply laid out in nice lines. We certainly don’t get this
> information from the PDF.
>
> Christiaan
>
> On 15 Mar 2022, at 23:11, Mark Roberts <mroberts1...@gmail.com> wrote:
>
> I understand what PDF is about, so I guess I don't see what the issue is
> with getting lines of text.
>
> Looking at the PDF level, there are postscript commands placing characters
> or strings on a page. These have bounding boxes. Meanwhile, the Skim notes
> have highlights. In this case, text highlights. Nothing fancy.
> QuadrilateralPoints.
>
> Is it not possible to detect intersections between text on the page and
> the highlights?
>
> This is just simple math — computing the intersection of bounding boxes —
> right?
>
> If this can be done, then an app can compute what the lines of text are.
> And if we know what the lines of text are, then we can test whether a
> hyphen falls at the end of a line. It doesn't matter if we know what the
> "underlying text" was or not. If we find a hyphen at the end of a line of
> text, then that's a candidate for removal. We only need to have lines of
> text. Once the individual lines are assembled into a string, it becomes
> more difficult to detect this.
>
> Now, of course this all depends on the internal APIs, in this case PDFKit,
> I guess — is that the issue?
>
> Thanks again!
>
> M.
>
> On Tue, Mar 15, 2022 at 11:36 PM Christiaan Hofman <cmhof...@gmail.com>
> wrote:
>
>> No, it is a limitation of the PDF format. You just get a string for the
>> characters. The hyphen is also just one of the characters. There is no
>> information about the underlying text that was used to generate the PDF.
>> You should realize that PDF is an output format.
>>
>> Christiaan
>>
>> On 15 Mar 2022, at 14:37, Mark Roberts <mroberts1...@gmail.com> wrote:
>>
>> I sort of half understand what you are explaining, but maybe an example
>> would help.
>>
>> Let's say I have four lines of text in a PDF, e.g.:
>>
>> Lorem ipsum dolor sit amet, *consectetur adipiscing elit, sed do eius-*
>> *mod tempor incididunt ut labore et dolore magna aliqua.* Ut enim ad
>> minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
>> quip ex ea commodo consequat.
>>
>> And let's say I select the text "consectetur adipiscing elit, sed do
>> eiusmod tempor incididunt ut labore et dolore magna aliqua" (on two lines),
>> and the word "eiusmod" is broken by a hyphen.
>>
>> Inside the PDF, there are in fact sequences of characters which form
>> lines that can be selected.
>>
>> Question: when Skim creates a note, I assume(?) it calls PDFKit and gets
>> some data back. Is it a single string, including the hyphen?
>>
>> I.e., is the limitation in the API for PDFKit, or ... ?
>>
>> Thanks again,
>>
>> M.
>>
>>
>> On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com>
>> wrote:
>>
>>> I don’t know. It is just not possible to work with information that does
>>> not exist. All of this is just trying to be smart in interpreting whatever
>>> data exists. In this case, the information does not exist, and never
>>> existed. Again, the highlighted text is never part of the data of the note,
>>> it is data in the PDF that you may associate with it because of geometry. And
>>> when we set the text by default, the PDF does not provide sufficient
>>> information to tell us about the exact text and the flow of it, because PDF
>>> is primarily a graphic format. So we have no way of knowing when there is a
>>> hyphen, and whether it is a line break or not. You could try to parse the text
>>> and look for hyphens followed by spaces, and remove that from the text. We
>>> don’t do that automatically, as we cannot know whether that is correct.
>>>
>>> Christiaan
>>>
>>> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>
>>> Thanks for clarifying.
>>>
>>> I guess my question remains: how can I fix up these hyphenated lines in
>>> my notes? I can parse and process the XML output from skimnotes, but it
>>> seems there isn't enough data to identify lines.
>>>
>>> The issue is that full-text search of the notes won't work if words are
>>> broken up with hyphens.
>>>
>>> Whatever Skim is doing to handle line breaks isn't working for me — I
>>> still see words broken up by hyphens everywhere.
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>>
>>> M.
>>>
>>> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com>
>>> wrote:
>>>
>>>> You should realize that the text of the note is a completely separate
>>>> data element from the highlighted text. The highlighted text is not part of
>>>> the note, it is just the text that happens to lie behind the highlight in
>>>> the PDF. We just set the text of the note to the text you highlight by
>>>> default, and we already do some cleaning, including trying to handle
>>>> line-breaks, before we set the text. And you can set it to whatever you
>>>> want. So there is no way to relate the geometry of the highlight in any way
>>>> to the text, as there does not exist a relation.
>>>>
>>>> Christiaan
>>>>
>>>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> This is very helpful — thanks !!
>>>>
>>>> I just tried your suggestion and got an XML file as expected. I more or
>>>> less understand all the elements of the XML, but it seems the entire note
>>>> is in a <string> element, while the quadrilateralPoints for the
>>>> highlighting boxes are separate.
>>>>
>>>> What I was hoping to do is somehow get each line of my note and then
>>>> look for a hyphen at the end of each line, and then trim that hyphen, as
>>>> necessary. The objective is to try and clean up the skim note to eliminate
>>>> line-break hyphens in the source text.
>>>>
>>>> Any ideas about how I could do this?
>>>>
>>>> Thanks again,
>>>>
>>>> M.
>>>>
>>>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>
>>>>> Is there some way to get more detailed information about skim notes,
>>>>> i.e., other than the code framework?
>>>>>
>>>>> I have tried the skimnotes command line tool (e.g., the 'get' and
>>>>> 'format' commands), but it seems to only output the basic information
>>>>> about notes, such as the note type, page number, and note text.
>>>>>
>>>>> Perhaps(?) there's another mode for the skimnotes tool, but I couldn't
>>>>> find it from reading the documentation.
>>>>>
>>>>> I'd like to get more complete data on each note, such as a timestamp,
>>>>> the coordinates of the boxes that are highlighted in the PDF file, the
>>>>> highlight color, and the text contained in each box.
>>>>>
>>>>> I assume(?) this data is in the notes file, but the skimnotes app
>>>>> ignores it for now.
>>>>>
>>>>> I'm wondering about this because if possible I'd like to make a script
>>>>> that gathers my notes for a PDF file, and tries to fix words that were
>>>>> broken by hyphenation in the original PDF. If I can get the highlight
>>>>> boxes in the notes file, and the text in each box, then it should be
>>>>> possible to check for a hyphen character at the end of each line, and
>>>>> then stitch together the words that were split across lines.
>>>>>
>>>>> Any suggestions?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> M.
>>>>>
>>>>>
>>>>> The skimnotes tool is not a tool that can interpret the data. It only
>>>>> copies the data around to various locations that are supported (such as
>>>>> between extended attributes, .skim files, or within a .pdfd bundle). There
>>>>> is no tool to interpret the data. The Wiki has information about how the
>>>>> data is formatted. You could try to build your own tool to unarchive the
>>>>> data from that, but that would be quite a bit of work.
>>>>>
>>>>> Christiaan
>>>>>
>>>>>
>>>>> I can also note that in the near future the skim notes will be saved
>>>>> in a plist format, which can be read by various tools and apps, including
>>>>> AppleScript. You can already have Skim do that by activating a hidden
>>>>> preference, see the Wiki for details.
>>>>>
>>>>> Christiaan
>>>>>
>>>>>
>>>>> I just remembered that the skimnotes tool *can* convert to the plist
>>>>> format, which you may be able to read, using the 'skimnotes format'
>>>>> command: 'skimnotes format plist SKIM_FILE' can do that. The help for
>>>>> skimnotes does not say so, but you can immediately also get the skim notes
>>>>> plist format from the skimnotes tool as follows:
>>>>>
>>>>> skimnotes get plist PDF_FILE SKIM_FILE
>>>>>
>>>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools
>>>>> to read it you have to change the extension to .plist. You could also then
>>>>> pass it through plutil to convert the binary plist to xml plist (plutil
>>>>> -convert xml1 PLIST_FILE), which would even be human readable. You could
>>>>> combine that to get the skimnotes in xml format as follows:
>>>>>
>>>>> skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>>>
>>>>> Christiaan
>>>>>
>>>>>
>>>>> Small correction, I messed up ‘-format’ arguments to the commands. It
>>>>> should be added in skimnotes, and in plutil it is -convert:
>>>>>
>>>>> skimnotes get -format plist PDF_FILE SKIM_FILE
>>>>>
>>>>> plutil -convert xml1 PLIST_FILE
>>>>>
>>>>> skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
>>>>>
>>>>> If you want to go to the reverse, and write the xml plist data as skim
>>>>> notes, you could do:
>>>>>
>>>>> plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
>>>>>
>>>>> Christiaan
>>>>>
>>>>
> _______________________________________________
> Skim-app-users mailing list
> Skim-app-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/skim-app-users
>