Thanks for clarifying. I think I understand now.

I looked at the API for PDFKit and found several methods in class PDFPage
that might make it possible to fetch lines of text, e.g., selection(). I
guess(?) I could try to take the selection rectangles from the skim notes,
and then try using PDFKit to get lines of text from the PDF file, though it
seems a bit circuitous.

In any case, if I want to get PDF page numbers for the notes, it looks like
I may need to invoke PDFKit anyway. It appears that the 'pageIndex' key in
a skim note is the sequential page number in the PDF file, and I would need
to call PDFKit functions to convert that to a page label (e.g., pageIndex
of 9 -> page ix).
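
For reference, a rough sketch of what that index-to-label conversion involves,
written in Python rather than against PDFKit (where, as far as I can tell,
PDFPage exposes a `label` property for this). The `ranges` layout below is a
made-up example, loosely modeled on how a PDF assigns a numbering style and a
starting number to runs of pages; `to_roman` and `page_label` are hypothetical
helper names, not anything from Skim or PDFKit.

```python
def to_roman(n):
    """Lowercase roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, symbol in pairs:
        while n >= value:
            out += symbol
            n -= value
    return out


def page_label(page_index, ranges):
    """Map a 0-based page index to a label.

    ranges: list of (first_page_index, style, first_number), where style is
    "D" (decimal), "r" (lowercase roman) or "R" (uppercase roman).
    Assumes at least one range starts at or before page_index.
    """
    start, style, first = max(
        (r for r in ranges if r[0] <= page_index), key=lambda r: r[0]
    )
    number = first + (page_index - start)
    if style == "r":
        return to_roman(number)
    if style == "R":
        return to_roman(number).upper()
    return str(number)


# Hypothetical book: a cover page, 16 front-matter pages, then the body.
ranges = [(0, "D", 1), (1, "r", 1), (17, "D", 1)]
print(page_label(9, ranges))   # ix
print(page_label(17, ranges))  # 1
```

In practice the ranges would have to come from the PDF itself, which is the
part that needs PDFKit or some other PDF library.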

Am I understanding this correctly?

Thanks,

M.

On Wed, Mar 16, 2022 at 6:37 PM Christiaan Hofman <cmhof...@gmail.com>
wrote:

> No, there is no way to export that data, as that data does not exist. The
> data of the notes is data from the notes, and what the notes know about
> themselves is prescribed by Adobe in the PDF specifications. And that
> contains just a text string.
>
> Christiaan
>
> On 16 Mar 2022, at 02:23, Mark Roberts <mroberts1...@gmail.com> wrote:
>
> Thanks for going into more detail.
>
> Sorry if I wasn't clear, but I'm not trying to ask for changes to the
> behavior of the Skim app itself.
>
> I'm mainly wondering if there is a way to somehow access or export this
> information about notes, based on the data that is saved plus what's in the
> PDF.
>
> If this information could be exported somehow (e.g., via skimnotes or some
> other tool), then I could write my own app to read it and help fix up the
> notes. I can easily code something to parse XML and manipulate text, but
> digging into the PDF data structures is more difficult. I have investigated
> doing this and indeed it's non-trivial.
>
> Skim is one of the very few PDF readers aside from Adobe Acrobat that
> actually handles page numbers correctly, AND has lots more flexibility for
> the export of annotations, so this is why I'm inquiring here.
>
> As you explain, internally PDF is rather messy and there can be
> various degenerate cases, but I assume that for scholarly research 99.9% of
> the time the PDF media will be journal articles or books in portrait
> orientation. Indeed, there will be degenerate cases, and if the selected
> text does not include full lines, then I assume hyphens could fall outside
> of the selected text and the detection would of course fail. Still, what I
> see on the screen is that lines of text are being recognized and
> highlighted correctly 99% of the time, so it seems clear that somewhere
> (either in Skim or in PDFKit, I cannot say), all of this machinery for
> capturing notes text is working pretty well.
>
> I don't know the details of the lower level APIs, but if they could
> provide lines of text instead of just a single string of all the selected
> lines concatenated, then it might be possible.
>
> Thanks again for explaining this !
>
> M.
>
> On Wed, Mar 16, 2022 at 8:38 AM Christiaan Hofman <cmhof...@gmail.com>
> wrote:
>
>> BTW, I should say that some time ago we *did* try to remove hyphens and
>> combine broken lines. But it was simply too unreliable based on the
>> available information, and went wrong far too often.
>>
>> Christiaan
>>
>> On 16 Mar 2022, at 00:35, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>
>> Perhaps you can get some information about the placement of the lines.
>> But we don’t even get information from the PDF about what the orientation
>> is (sometimes PDFs use rotated coordinate systems, e.g. in landscape
>> pages). Also, the selected text may not consist of full lines, so the end
>> of the text may not be the end of a line. Also, a hyphen does not need to
>> be a line break, it can also just be a hyphen in the text. Perhaps it is
>> possible to (almost) figure out how some parts of text are placed in the
>> laid out text on the page, but then you have to first figure out precisely
>> what the lines are and compare all the character ranges in the text. And
>> even getting lines can be a real mess, as there is no guarantee that the
>> text is simply laid out in nice lines. We certainly don’t get this
>> information from the PDF.
>>
>> Christiaan
>>
>> On 15 Mar 2022, at 23:11, Mark Roberts <mroberts1...@gmail.com> wrote:
>>
>> I understand what PDF is about, so I guess I don't see what the issue is
>> with getting lines of text.
>>
>> Looking at the PDF level, there are postscript commands placing
>> characters or strings on a page. These have bounding boxes. Meanwhile, the
>> Skim notes have highlights. In this case, text highlights. Nothing fancy.
>> QuadrilateralPoints.
>>
>> Is it not possible to detect intersections between text on the page and
>> the highlights?
>>
>> This is just simple math — computing the intersection of bounding boxes —
>> right?
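
The bounding-box test itself is indeed short. A minimal sketch in Python,
assuming both the highlight and each piece of text have already been reduced to
axis-aligned rectangles in the same page coordinates (something the replies
above point out a PDF does not guarantee). `Rect`, `intersects`, and
`boxes_under_highlight` are illustrative names, not part of Skim or PDFKit.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle: (x0, y0) lower-left, (x1, y1) upper-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def intersects(a, b):
    """True if the two rectangles overlap with non-zero area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def boxes_under_highlight(highlight, boxes):
    """Return the text boxes that overlap the highlight rectangle."""
    return [box for box in boxes if intersects(highlight, box)]


highlight = Rect(100, 700, 400, 712)
boxes = [Rect(90, 701, 150, 711), Rect(90, 680, 150, 690)]
print(boxes_under_highlight(highlight, boxes))
```

Rectangles that only touch along an edge are treated as not overlapping, so a
highlight does not pick up the line directly above or below it.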
>>
>> If this can be done, then an app can compute what the lines of text are.
>> And if we know what the lines of text are, then we can test whether a
>> hyphen falls at the end of a line. It doesn't matter if we know what the
>> "underlying text" was or not. If we find a hyphen at the end of a line of
>> text, then that's a candidate for removal. We only need to have lines of
>> text. Once the individual lines are assembled into a string, it becomes
>> more difficult to detect this.
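
To make the "we only need lines of text" step concrete, here is a small Python
sketch. It assumes the per-line strings have already been obtained somehow,
which is the hard part discussed in this thread; `join_lines` is a hypothetical
helper. A trailing hyphen is only treated as a candidate for removal when the
next line starts with a lowercase letter, and even then it may be a genuine
hyphen.

```python
def join_lines(lines):
    """Join lines of text, merging words split by a trailing hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if out.endswith("-") and line[0].islower():
            # Candidate line-break hyphen: drop it and join without a space.
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


lines = [
    "consectetur adipiscing elit, sed do eius-",
    "mod tempor incididunt ut labore et dolore magna aliqua.",
]
print(join_lines(lines))
```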
>>
>> Now, of course this all depends on the internal APIs, in this case
>> PDFKit, I guess — is that the issue?
>>
>> Thanks again!
>>
>> M.
>>
>> On Tue, Mar 15, 2022 at 11:36 PM Christiaan Hofman <cmhof...@gmail.com>
>> wrote:
>>
>>> No, it is a limitation of the PDF format. You just get a string for the
>>> characters. The hyphen is also just one of the characters. There is no
>>> information about the underlying text that was used to generate the PDF.
>>> You should realize that PDF is an output format.
>>>
>>> Christiaan
>>>
>>> On 15 Mar 2022, at 14:37, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>
>>> I sort of half understand what you are explaining, but maybe an example
>>> would help.
>>>
>>> Let's say I have four lines of text in a PDF, e.g.:
>>>
>>> Lorem ipsum dolor sit amet, *consectetur adipiscing elit, sed do eius-*
>>> *mod tempor incididunt ut labore et dolore magna aliqua.* Ut enim ad
>>> minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
>>> quip ex ea commodo consequat.
>>>
>>> And let's say I select the text "consectetur adipiscing elit, sed do
>>> eiusmod tempor incididunt ut labore et dolore magna aliqua" (on two lines),
>>> and the word "eiusmod" is broken by a hyphen.
>>>
>>> Inside the PDF, there are in fact sequences of characters which form
>>> lines that can be selected.
>>>
>>> Question: when Skim creates a note, I assume(?) it calls PDFKit and gets
>>> some data back. Is it a single string, including the hyphen?
>>>
>>> I.e., is the limitation in the API for PDFKit, or ... ?
>>>
>>> Thanks again,
>>>
>>> M.
>>>
>>>
>>> On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com>
>>> wrote:
>>>
>>>> I don’t know. It is just not possible to work with information that
>>>> does not exist. All of this is just trying to be smart in interpreting
>>>> whatever data exists. In this case, the information does not exist, and
>>>> never existed. Again, the highlighted text is never part of the data of the
>>>> note, it is data in the PDF that you may associate to it because of
>>>> geometry. And when we set the text by default, the PDF does not provide
>>>> sufficient information to tell us about the exact text and the flow of it,
>>>> because PDF is primarily a graphic format. So we have no way of knowing
>>>> when there is a hyphen, and whether it is a line break or not. You could
>>>> try to parse the text and look for hyphens followed by spaces, and remove
>>>> that from the text. We don’t do that automatically, as we cannot know
>>>> whether that is correct.
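
A minimal Python sketch of that "hyphen followed by spaces" heuristic, working
on the single note string. The pattern is an assumption about how a line-break
hyphen shows up in the note text (letter, hyphen, whitespace, lowercase
letter); `dehyphenate` is a hypothetical helper name. As noted above, this
cannot know whether a given hyphen is really a line break, so it will
sometimes be wrong.

```python
import re

# Assumed shape of a line-break hyphen: "eius- mod" -> "eiusmod".
BREAK_HYPHEN = re.compile(r"(?<=[A-Za-z])-\s+(?=[a-z])")


def dehyphenate(text):
    """Remove hyphens that look like line-break hyphens."""
    return BREAK_HYPHEN.sub("", text)


print(dehyphenate("elit, sed do eius- mod tempor incididunt"))
# Hyphens inside a word, with no following space, are left alone:
print(dehyphenate("a well-known result"))
```

One known failure mode: a suspended hyphen such as "pre- and post-test" matches
the same pattern and would be collapsed to "preand post-test", so the output
is worth reviewing rather than trusting blindly.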
>>>>
>>>> Christiaan
>>>>
>>>> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>
>>>> Thanks for clarifying.
>>>>
>>>> I guess my question remains: how can I fix up these hyphenated lines in
>>>> my notes? I can parse and process the XML output from skimnotes, but it
>>>> seems there isn't enough data to identify lines.
>>>>
>>>> The issue is that full-text search of the notes won't work if words are
>>>> broken up with hyphens.
>>>>
>>>> Whatever Skim is doing to handle line breaks isn't working for me — I
>>>> still see words broken up by hyphens everywhere.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks,
>>>>
>>>> M.
>>>>
>>>> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com>
>>>> wrote:
>>>>
>>>>> You should realize that the text of the note is a completely separate
>>>>> data element from the highlighted text. The highlighted text is not part
>>>>> of the note, it is just the text that happens to lie behind the highlight
>>>>> in the PDF. We just set the text of the note to the text you highlight by
>>>>> default, and we already do some cleaning, including trying to handle
>>>>> line-breaks, before we set the text. And you can set it to whatever you
>>>>> want. So there is no way to relate the geometry of the highlight in any
>>>>> way to the text, as there does not exist a relation.
>>>>>
>>>>> Christiaan
>>>>>
>>>>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> This is very helpful — thanks !!
>>>>>
>>>>> I just tried your suggestion and got an XML file as expected. I more
>>>>> or less understand all the elements of the XML, but it seems the entire
>>>>> note is in a <string> element, while the quadrilateralPoints for the
>>>>> highlighting boxes are separate.
>>>>>
>>>>> What I was hoping to do is somehow get each line of my note and then
>>>>> look for a hyphen at the end of each line, and then trim that hyphen, as
>>>>> necessary. The objective is to try and clean up the skim note to eliminate
>>>>> line-break hyphens in the source text.
>>>>>
>>>>> Any ideas about how I could do this?
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> M.
>>>>>
>>>>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Is there some way to get more detailed information about skim notes,
>>>>>> i.e., other than the code framework?
>>>>>>
>>>>>> I have tried the skimnotes command line tool (e.g., the 'get' and
>>>>>> 'format' commands), but it seems to only output the basic information
>>>>>> about notes, such as the note type, page number, and note text.
>>>>>>
>>>>>> Perhaps(?) there's another mode for the skimnotes tool, but I
>>>>>> couldn't find it from reading the documentation.
>>>>>>
>>>>>> I'd like to get more complete data on each note, such as a timestamp,
>>>>>> the coordinates of the boxes that are highlighted in the PDF file, the
>>>>>> highlight color, and the text contained in each box.
>>>>>>
>>>>>> I assume(?) this data is in the notes file, but the skimnotes app
>>>>>> ignores it for now.
>>>>>>
>>>>>> I'm wondering about this because if possible I'd like to make a
>>>>>> script that gathers my notes for a PDF file, and tries to fix words that
>>>>>> were broken by hyphenation in the original PDF. If I can get the
>>>>>> highlight boxes in the notes file, and the text in each box, then it
>>>>>> should be possible to check for a hyphen character at the end of each
>>>>>> line, and then stitch together the words that were split across lines.
>>>>>>
>>>>>> Any suggestions?
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> M.
>>>>>>
>>>>>>
>>>>>> The skimnotes tool is not a tool that can interpret the data. It only
>>>>>> copies the data around to various locations that are supported (such as
>>>>>> between extended attributes, .skim files, or within a .pdfd bundle).
>>>>>> There is no tool to interpret the data. The Wiki has information about
>>>>>> how the data is formatted. You could try to build your own tool to
>>>>>> unarchive the data from that, but that would be quite a bit of work.
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>>
>>>>>> I can also note that in the near future the skim notes will be saved
>>>>>> in a plist format, which can be read by various tools and apps, including
>>>>>> AppleScript. You can already have Skim do that by activating a hidden
>>>>>> preference, see the Wiki for details.
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>>
>>>>>> I just remembered that the skimnotes tool *can* convert to the plist
>>>>>> format, which you may be able to read, using the 'skimnotes format'
>>>>>> command. 'skimnotes format plist SKIM_FILE' can do that. The help for
>>>>>> skimnotes does not say so, but you can also get the skim notes plist
>>>>>> format directly from the skimnotes tool as follows:
>>>>>>
>>>>>> skimnotes get plist PDF_FILE SKIM_FILE
>>>>>>
>>>>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools
>>>>>> to read it you have to change the extension to .plist. You could also
>>>>>> then pass it through plutil to convert the binary plist to xml plist
>>>>>> (plutil -convert xml1 PLIST_FILE), which would even be human readable.
>>>>>> You could combine that to get the skimnotes in xml format as follows:
>>>>>>
>>>>>> skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>>
>>>>>> Small correction, I messed up ‘-format’ arguments to the commands. It
>>>>>> should be added in skimnotes, and in plutil it is -convert:
>>>>>>
>>>>>> skimnotes get -format plist PDF_FILE SKIM_FILE
>>>>>>
>>>>>> plutil -convert xml1 PLIST_FILE
>>>>>>
>>>>>> skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
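
Once the notes are in plist form, they can be read without any PDF parsing. A
minimal Python sketch using the standard-library plistlib, which accepts both
the binary and the xml1 variants. It assumes the top level of the plist is an
array of dictionaries, one per note. The 'pageIndex' key is the one mentioned
in this thread; 'type' and 'string' are assumed key names and should be
checked against the Wiki or a real export. `parse_notes` and `summarize` are
hypothetical helpers.

```python
import plistlib


def parse_notes(data):
    """Parse plist bytes (binary or XML) into a list of note dicts."""
    notes = plistlib.loads(data)
    if not isinstance(notes, list):
        raise ValueError("expected a top-level array of notes")
    return notes


def summarize(notes):
    """Return (pageIndex, type, text) for each note; key names are assumed."""
    return [
        (note.get("pageIndex"), note.get("type"), note.get("string", ""))
        for note in notes
    ]


# Stand-in for the bytes read from PLIST_FILE (open(path, "rb").read()).
sample = plistlib.dumps(
    [{"pageIndex": 9, "type": "Highlight", "string": "sed do eius- mod"}]
)
for page_index, note_type, text in summarize(parse_notes(sample)):
    print(page_index, note_type, text)
```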
>>>>>>
>>>>>> If you want to go the other way, and write the xml plist data back as
>>>>>> skim notes, you could do:
>>>>>>
>>>>>> plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>
> _______________________________________________
> Skim-app-users mailing list
> Skim-app-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/skim-app-users
>