Re: [Skim-app-users] Question about accessing Skim Notes

Christiaan Hofman Tue, 15 Mar 2022 16:36:34 -0700

Perhaps you can get some information about the placement of the lines. But we 
don’t even get information from the PDF about what the orientation is 
(sometimes PDFs use rotated coordinate systems, e.g. in landscape pages). Also, 
the selected text may not consist of full lines, so the end of the text may not 
be the end of a line. Also, a hyphen does not need to mark a line break; it can 
also just be a hyphen in the text. Perhaps it is possible to (almost) figure 
out how some parts of the text are placed in the laid-out text on the page, but 
then you have to first figure out precisely what the lines are and compare all 
the character ranges in the text. And even getting lines can be a real mess, as 
there is no guarantee that the text is simply laid out in nice lines. We 
certainly don’t get this information from the PDF.


Christiaan
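
As a rough illustration of the heuristic suggested further down the thread 
(look for a hyphen followed by a space and remove it), here is a minimal 
Python sketch. The function name join_hyphen_breaks and the optional 
known_words set are hypothetical and are not part of Skim or skimnotes. It 
assumes that a line break inside the highlighted text shows up as whitespace 
in the note text. As explained above, the result is only a guess: without a 
word list, a genuine hyphen followed by a space is joined as well.

```python
import re

# A word fragment, a hyphen, whitespace (assumed to be the former line break),
# then a lowercase continuation. Requiring lowercase is only a heuristic.
_BROKEN_WORD = re.compile(r"(\w+)-\s+([a-z]\w*)")


def join_hyphen_breaks(text, known_words=None):
    """Join words that look like they were split by a line-break hyphen.

    If known_words (a set of lowercase words) is given, a candidate is joined
    only when the joined form is in that set; otherwise every candidate is
    joined, which can be wrong for genuine hyphens.
    """
    def replace(match):
        joined = match.group(1) + match.group(2)
        if known_words is None or joined.lower() in known_words:
            return joined
        return match.group(0)  # leave the text untouched

    return _BROKEN_WORD.sub(replace, text)
```

Passing a word list (for example, the words found elsewhere in the same 
document) makes the heuristic more conservative, at the cost of missing 
breaks in words that are not in the list.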

> On 15 Mar 2022, at 23:11, Mark Roberts <mroberts1...@gmail.com> wrote:
> 
> I understand what PDF is about, so I guess I don't see what the issue is with 
> getting lines of text.
> 
> Looking at the PDF level, there are postscript commands placing characters or 
> strings on a page. These have bounding boxes. Meanwhile, the Skim notes have 
> highlights. In this case, text highlights. Nothing fancy. QuadrilateralPoints.
> 
> Is it not possible to detect intersections between text on the page and the 
> highlights?
> 
> This is just simple math — computing the intersection of bounding boxes — 
> right?
> 
> If this can be done, then an app can compute what the lines of text are. And 
> if we know what the lines of text are, then we can test whether a hyphen 
> falls at the end of a line. It doesn't matter if we know what the "underlying 
> text" was or not. If we find a hyphen at the end of a line of text, then 
> that's a candidate for removal. We only need to have lines of text. Once the 
> individual lines are assembled into a string, it becomes more difficult to 
> detect this.
> 
> Now, of course this all depends on the internal APIs, in this case PDFKit, I 
> guess — is that the issue?
> 
> Thanks again!
> 
> M.
> 
> On Tue, Mar 15, 2022 at 11:36 PM Christiaan Hofman <cmhof...@gmail.com 
> <mailto:cmhof...@gmail.com>> wrote:
> No, it is a limitation of the PDF format. You just get a string for the 
> characters. The hyphen is also just one of the characters. There is no 
> information about the underlying text that was used to generate the PDF. You 
> should realize that PDF is an output format.
> 
> Christiaan
> 
>> On 15 Mar 2022, at 14:37, Mark Roberts <mroberts1...@gmail.com 
>> <mailto:mroberts1...@gmail.com>> wrote:
>> 
>> I sort of half understand what you are explaining, but maybe an example 
>> would help.
>> 
>> Let's say I have four lines of text in a PDF, e.g.:
>> 
>> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eius-
>> mod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad 
>> minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
>> quip ex ea commodo consequat.
>> 
>> And let's say I select the text "consectetur adipiscing elit, sed do eiusmod 
>> tempor incididunt ut labore et dolore magna aliqua" (on two lines), and the 
>> word "eiusmod" is broken by a hyphen.
>> 
>> Inside the PDF, there are in fact sequences of characters which form lines 
>> that can be selected.
>> 
>> Question: when Skim creates a note, I assume(?) it calls PDFKit and gets 
>> some data back. Is it a single string, including the hyphen?
>> 
>> I.e., is the limitation in the API for PDFKit, or ... ?
>> 
>> Thanks again,
>> 
>> M.
>> 
>> 
>> On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com 
>> <mailto:cmhof...@gmail.com>> wrote:
>> I don’t know. It is just not possible to work with information that does not 
>> exist. All of this is just trying to be smart in interpreted whatever data 
>> exists. In this case, the information does not exist, and never existed. 
>> Again, the highlighted text is never part of the data of the note, it is 
>> data in the PDF that you may associate to it because of geometry. And when 
>> we set the text by default, the PDF does not provide sufficient information 
>> go tell us about the exact text and the flow of it, because PDF is primarily 
>> a graphic format. So we have no way of knowing when there is a hyphen, and 
>> whether it is breaks or not. You could try to parse the text and look for 
>> hyphens followed by spaces, and remove that from the text. We don’t do that 
>> automatically, as we cannot know whether that is correct.
>> 
>> Christiaan
>> 
>>> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com 
>>> <mailto:mroberts1...@gmail.com>> wrote:
>>> 
>>> Thanks for clarifying.
>>> 
>>> I guess my question remains: how can I fix up these hyphenated lines in my 
>>> notes? I can parse and process the XML output from skimnotes, but it seems 
>>> there isn't enough data to identify lines.
>>> 
>>> The issue is that full-text search of the notes won't work if words are 
>>> broken up with hyphens.
>>> 
>>> Whatever Skim is doing to handle line breaks isn't working for me — I still 
>>> see words broken up by hyphens everywhere.
>>> 
>>> Any ideas?
>>> 
>>> Thanks,
>>> 
>>> M.
>>> 
>>> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com 
>>> <mailto:cmhof...@gmail.com>> wrote:
>>> You should realize that the text of the note is a completely separate data 
>>> element from the highlighted text. The highlighted text is not part of the 
>>> note, it is just the text that happens to lie behind the highlight in the 
>>> PDF. We just set the text of the note to the text you highlight by default, 
>>> and we already do some cleaning, including trying to handle line-breaks, 
>>> before we set the text. And you can set it to whatever you want. So there 
>>> is no way to relate the geometry of the highlight in any way to the text, 
>>> as there does not exist a relation.
>>> 
>>> Christiaan
>>> 
>>>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com 
>>>> <mailto:mroberts1...@gmail.com>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> This is very helpful — thanks !!
>>>> 
>>>> I just tried your suggestion and got an XML file as expected. I more or 
>>>> less understand all the elements of the XML, but it seems the entire note 
>>>> is in a <string> element, while the quadrilateralPoints for the 
>>>> highlighting boxes are separate.
>>>> 
>>>> What I was hoping to do is somehow get each line of my note and then look 
>>>> for a hyphen at the end of each line, and then trim that hyphen, as 
>>>> necessary. The objective is to try and clean up the skim note to eliminate 
>>>> line-break hyphens in the source text.
>>>> 
>>>> Any ideas about how I could do this?
>>>> 
>>>> Thanks again,
>>>> 
>>>> M.
>>>> 
>>>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com 
>>>> <mailto:cmhof...@gmail.com>> wrote:
>>>> 
>>>> 
>>>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com 
>>>>> <mailto:cmhof...@gmail.com>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com 
>>>>>> <mailto:cmhof...@gmail.com>> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com 
>>>>>>> <mailto:cmhof...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com 
>>>>>>>> <mailto:mroberts1...@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>> Is there some way to get more detailed information about skim notes, 
>>>>>>>> i.e., other than the code framework?
>>>>>>>> 
>>>>>>>> I have tried the skimnotes command line tool (e.g., the 'get' and 
>>>>>>>> 'format' commands), but it seems to only output the basic information 
>>>>>>>> about notes, such as the note type, page number, and note text.
>>>>>>>> 
>>>>>>>> Perhaps(?) there's another mode for the skimnotes tool, but I couldn't 
>>>>>>>> find it from reading the documentation.
>>>>>>>> 
>>>>>>>> I'd like to get more complete data on each note, such as a timestamp, 
>>>>>>>> the coordinates of the boxes that are highlighted in the PDF file, the 
>>>>>>>> highlight color, and the text contained in each box.
>>>>>>>> 
>>>>>>>> I assume(?) this data is in the notes file, but the skimnotes app 
>>>>>>>> ignores it for now.
>>>>>>>> 
>>>>>>>> I'm wondering about this because if possible I'd like to make a script 
>>>>>>>> that gathers my notes for a PDF file, and tries to fix words that were 
>>>>>>>> broken by hyphenation in the original PDF. If I can get the highlight 
>>>>>>>> boxes in the notes file, and the text in each box, then it should be 
>>>>>>>> possible to check for a hyphen character at the end of each line, and 
>>>>>>>> then stitch together the words that were split across lines.
>>>>>>>> 
>>>>>>>> Any suggestions?
>>>>>>>> 
>>>>>>>> Thanks in advance,
>>>>>>>> 
>>>>>>>> M.
>>>>>>> 
>>>>>>> The skimnotes tool is not a tool that can interpret the data. It only 
>>>>>>> copies the data around to various locations that are supported (such as 
>>>>>>> between extended attributes, .skim files, or within a .pdfd bundle). 
>>>>>>> There is no tool to interpret the data. The Wiki has information about 
>>>>>>> how the data is formatted. You could try to build your own tool to 
>>>>>>> unarchive the data from that, but that would be quite a bit of work.
>>>>>>> 
>>>>>>> Christiaan
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I can also note that in the near future the skim notes will be saved in 
>>>>>> a plist format, which can be read by various tools and apps, including 
>>>>>> AppleScript. You can already have Skim do that by activating a hidden 
>>>>>> preference, see the Wiki for details. 
>>>>>> 
>>>>>> Christiaan
>>>>>> 
>>>>> 
>>>>> 
>>>>> I just remembered that the skimnotes tool *can* convert to the plist 
>>>>> format, which you may be able to read, using the 'skimnotes format' 
>>>>> command: 'skimnotes format plist SKIM_FILE' can do that. The help for 
>>>>> skimnotes does not say so, but you can also get the skim notes plist 
>>>>> format directly from the skimnotes tool as follows:
>>>>> 
>>>>> skimnotes get plist PDF_FILE SKIM_FILE
>>>>> 
>>>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools to 
>>>>> read it you have to change the extension to .plist. You could also then 
>>>>> pass it through plutil to convert the binary plist to xml plist (plutil 
>>>>> -convert xml1 PLIST_FILE), which would even be human readable. You could 
>>>>> combine that to get the skimnotes in xml format as follows:
>>>>> 
>>>>> skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>>> 
>>>>> Christiaan
>>>>> 
>>>> 
>>>> 
>>>> Small correction: I messed up the '-format' arguments to the commands. It 
>>>> should be added in skimnotes, and in plutil it is -convert:
>>>> 
>>>> skimnotes get -format plist PDF_FILE SKIM_FILE
>>>> 
>>>> plutil -convert xml1 PLIST_FILE
>>>> 
>>>> skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
>>>> 
>>>> If you want to go in the reverse direction, and write the xml plist data 
>>>> as skim notes, you could do:
>>>> 
>>>> plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
>>>> 
>>>> Christiaan
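
Once the notes have been exported to a plist file as described above, a 
script can read them with Python's standard plistlib module. The sketch below 
is only an illustration under assumptions: it assumes the plist is a 
top-level array with one dictionary per note, and the key name "contents" for 
the note text is a guess that should be checked against the actual file or 
the Wiki. The helper names load_notes and note_texts are hypothetical.

```python
import plistlib


def load_notes(plist_bytes):
    """Parse plist data (XML or binary) exported by skimnotes.

    Assumes the top-level object is an array of per-note dictionaries.
    """
    notes = plistlib.loads(plist_bytes)  # the plist format is auto-detected
    if not isinstance(notes, list):
        raise ValueError("expected a top-level array of notes")
    return notes


def note_texts(notes, key="contents"):
    """Yield the text of each note that has it.

    The default key name is an assumption, not a documented Skim key.
    """
    for note in notes:
        text = note.get(key)
        if isinstance(text, str):
            yield text
```

For example, after running one of the skimnotes commands above, something 
like load_notes(open("PLIST_FILE", "rb").read()) would return the list of 
note dictionaries, whose keys can then be inspected to find the text, the 
page, and the highlight geometry.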

_______________________________________________
Skim-app-users mailing list
Skim-app-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/skim-app-users
