Re: [Skim-app-users] Question about accessing Skim Notes

Christiaan Hofman Tue, 15 Mar 2022 16:38:49 -0700

BTW, I should say that some time ago we *did* try to remove hyphens and combine 
broken lines. But it was simply too unreliable based on the available 
information, and went wrong far too often.


Christiaan

> On 16 Mar 2022, at 00:35, Christiaan Hofman <cmhof...@gmail.com> wrote:
> 
> Perhaps you can get some information about the placement of the lines. But we 
> don’t even get information from the PDF about what the orientation is 
> (sometimes PDFs use rotated coordinate systems, e.g. in landscape pages). 
> Also, the selected text may not consist of full lines, so the end of the text 
> may not be the end of a line. Also, a hyphen does not need to be a line 
> break, it can also just be a hyphen in the text. Perhaps it is possible to 
> (almost) figure out how some parts of text are placed in the laid out text 
> on the page, but then you have to first figure out precisely what the lines 
> are and compare all the character ranges in the text. And even getting lines 
> can be a real mess, as there is no guarantee that the text is simply laid 
> out in nice lines. We certainly don’t get this information from the PDF.
> 
> Christiaan
> 
>> On 15 Mar 2022, at 23:11, Mark Roberts <mroberts1...@gmail.com> wrote:
>> 
>> I understand what PDF is about, so I guess I don't see what the issue is 
>> with getting lines of text.
>> 
>> Looking at the PDF level, there are postscript commands placing characters 
>> or strings on a page. These have bounding boxes. Meanwhile, the Skim notes 
>> have highlights. In this case, text highlights. Nothing fancy. 
>> QuadrilateralPoints.
>> 
>> Is it not possible to detect intersections between text on the page and the 
>> highlights?
>> 
>> This is just simple math — computing the intersection of bounding boxes — 
>> right?
>> 
>> If this can be done, then an app can compute what the lines of text are. And 
>> if we know what the lines of text are, then we can test whether a hyphen 
>> falls at the end of a line. It doesn't matter if we know what the 
>> "underlying text" was or not. If we find a hyphen at the end of a line of 
>> text, then that's a candidate for removal. We only need to have lines of 
>> text. Once the individual lines are assembled into a string, it becomes more 
>> difficult to detect this.
>> 
>> Now, of course this all depends on the internal APIs, in this case PDFKit, I 
>> guess — is that the issue?
>> 
>> Thanks again!
>> 
>> M.
>> 
>> On Tue, Mar 15, 2022 at 11:36 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>> No, it is a limitation of the PDF format. You just get a string for the 
>> characters. The hyphen is also just one of the characters. There is no 
>> information about the underlying text that was used to generate the PDF. You 
>> should realize that PDF is an output format.
>> 
>> Christiaan
>> 
>>> On 15 Mar 2022, at 14:37, Mark Roberts <mroberts1...@gmail.com> wrote:
>>> 
>>> I sort of half understand what you are explaining, but maybe an example 
>>> would help.
>>> 
>>> Let's say I have four lines of text in a PDF, e.g.:
>>> 
>>> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eius-
>>> mod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad 
>>> minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
>>> quip ex ea commodo consequat.
>>> 
>>> And let's say I select the text "consectetur adipiscing elit, sed do 
>>> eiusmod tempor incididunt ut labore et dolore magna aliqua" (on two lines), 
>>> and the word "eiusmod" is broken by a hyphen.
>>> 
>>> Inside the PDF, there are in fact sequences of characters which form lines 
>>> that can be selected.
>>> 
>>> Question: when Skim creates a note, I assume(?) it calls PDFKit and gets 
>>> some data back. Is it a single string, including the hyphen?
>>> 
>>> I.e., is the limitation in the API for PDFKit, or ... ?
>>> 
>>> Thanks again,
>>> 
>>> M.
>>> 
>>> 
>>> On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>> I don’t know. It is just not possible to work with information that does 
>>> not exist. All of this is just trying to be smart in interpreting whatever 
>>> data exists. In this case, the information does not exist, and never 
>>> existed. Again, the highlighted text is never part of the data of the note, 
>>> it is data in the PDF that you may associate to it because of geometry. And 
>>> when we set the text by default, the PDF does not provide sufficient 
>>> information to tell us about the exact text and the flow of it, because PDF 
>>> is primarily a graphic format. So we have no way of knowing when there is a 
>>> hyphen, and whether it is a line break or not. You could try to parse the text 
>>> and look for hyphens followed by spaces, and remove that from the text. We 
>>> don’t do that automatically, as we cannot know whether that is correct.
>>> 
>>> Christiaan
>>> 
>>>> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>> 
>>>> Thanks for clarifying.
>>>> 
>>>> I guess my question remains: how can I fix up these hyphenated lines in my 
>>>> notes? I can parse and process the XML output from skimnotes, but it seems 
>>>> there isn't enough data to identify lines.
>>>> 
>>>> The issue is that full-text search of the notes won't work if words are 
>>>> broken up with hyphens.
>>>> 
>>>> Whatever Skim is doing to handle line breaks isn't working for me — I 
>>>> still see words broken up by hyphens everywhere.
>>>> 
>>>> Any ideas?
>>>> 
>>>> Thanks,
>>>> 
>>>> M.
>>>> 
>>>> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>> You should realize that the text of the note is a completely separate data 
>>>> element from the highlighted text. The highlighted text is not part of the 
>>>> note, it is just the text that happens to lie behind the highlight in the 
>>>> PDF. We just set the text of the note to the text you highlight by 
>>>> default, and we already do some cleaning, including trying to handle 
>>>> line-breaks, before we set the text. And you can set it to whatever you 
>>>> want. So there is no way to relate the geometry of the highlight in any 
>>>> way to the text, as there does not exist a relation.
>>>> 
>>>> Christiaan
>>>> 
>>>>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> This is very helpful — thanks !!
>>>>> 
>>>>> I just tried your suggestion and got an XML file as expected. I more or 
>>>>> less understand all the elements of the XML, but it seems the entire note 
>>>>> is in a <string> element, while the quadrilateralPoints for the 
>>>>> highlighting boxes are separate.
>>>>> 
>>>>> What I was hoping to do is somehow get each line of my note and then look 
>>>>> for a hyphen at the end of each line, and then trim that hyphen, as 
>>>>> necessary. The objective is to try and clean up the skim note to 
>>>>> eliminate line-break hyphens in the source text.
>>>>> 
>>>>> Any ideas about how I could do this?
>>>>> 
>>>>> Thanks again,
>>>>> 
>>>>> M.
>>>>> 
>>>>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Is there some way to get more detailed information about skim notes, 
>>>>>>>>> i.e., other than the code framework?
>>>>>>>>> 
>>>>>>>>> I have tried the skimnotes command line tool (e.g., the 'get' and 
>>>>>>>>> 'format' commands), but it seems to only output the basic information 
>>>>>>>>> about notes, such as the note type, page number, and note text.
>>>>>>>>> 
>>>>>>>>> Perhaps(?) there's another mode for the skimnotes tool, but I 
>>>>>>>>> couldn't find it from reading the documentation.
>>>>>>>>> 
>>>>>>>>> I'd like to get more complete data on each note, such as a timestamp, 
>>>>>>>>> the coordinates of the boxes that are highlighted in the PDF file, 
>>>>>>>>> the highlight color, and the text contained in each box.
>>>>>>>>> 
>>>>>>>>> I assume(?) this data is in the notes file, but the skimnotes app 
>>>>>>>>> ignores it for now.
>>>>>>>>> 
>>>>>>>>> I'm wondering about this because if possible I'd like to make a 
>>>>>>>>> script that gathers my notes for a PDF file, and tries to fix words 
>>>>>>>>> that were broken by hyphenation in the original PDF. If I can get the 
>>>>>>>>> highlight boxes in the notes file, and the text in each box, then it 
>>>>>>>>> should be possible to check for a hyphen character at the end of each 
>>>>>>>>> line, and then stitch together the words that were split across lines.
>>>>>>>>> 
>>>>>>>>> Any suggestions?
>>>>>>>>> 
>>>>>>>>> Thanks in advance,
>>>>>>>>> 
>>>>>>>>> M.
>>>>>>>> 
>>>>>>>> The skimnotes tool is not a tool that can interpret the data. It only 
>>>>>>>> copies the data around to various locations that are supported (such 
>>>>>>>> as between extended attributes, .skim files, or within a .pdfd 
>>>>>>>> bundle). There is no tool to interpret the data. The Wiki has 
>>>>>>>> information about how the data is formatted. You could try to build 
>>>>>>>> your own tool to unarchive the data from that, but that would be quite 
>>>>>>>> a bit of work.
>>>>>>>> 
>>>>>>>> Christiaan
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I can also note that in the near future the skim notes will be saved in 
>>>>>>> a plist format, which can be read by various tools and apps, including 
>>>>>>> AppleScript. You can already have Skim do that by activating a hidden 
>>>>>>> preference, see the Wiki for details. 
>>>>>>> 
>>>>>>> Christiaan
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I just remembered that the skimnotes tool *can* convert to the plist 
>>>>>> format, which you may be able to read, using the 'skimnotes format' 
>>>>>> command. 'skimnotes format plist SKIM_FILE' can do that. The help for 
>>>>>> skimnotes does not say so, but you can immediately also get the skim 
>>>>>> notes plist format from the skimnotes tool as follows:
>>>>>> 
>>>>>> skimnotes get plist PDF_FILE SKIM_FILE
>>>>>> 
>>>>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools to 
>>>>>> read it you have to change the extension to .plist. You could also then 
>>>>>> pass it through plutil to convert the binary plist to xml plist (plutil 
>>>>>> -convert xml1 PLIST_FILE), which would even be human readable. You could 
>>>>>> combine that to get the skimnotes in xml format as follows:
>>>>>> 
>>>>>> skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>>>> 
>>>>>> Christiaan
>>>>>> 
>>>>> 
>>>>> 
>>>>> Small correction, I messed up ‘-format’ arguments to the commands. It 
>>>>> should be added in skimnotes, and in plutil it is -convert:
>>>>> 
>>>>> skimnotes get -format plist PDF_FILE SKIM_FILE
>>>>> 
>>>>> plutil -convert xml1 PLIST_FILE
>>>>> 
>>>>> skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
>>>>> 
>>>>> If you want to go in reverse, and write the xml plist data as skim 
>>>>> notes, you could do:
>>>>> 
>>>>> plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
>>>>> 
>>>>> Christiaan

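The workflow discussed in this thread (export the notes as a plist, then look for hyphens followed by whitespace in the note text) can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not what Skim does internally: the function names are hypothetical, the plist is assumed to be an array of dictionaries, and the note text is assumed to live under a "contents" key, which should be checked against a real export. The heuristic is deliberately naive and shows the ambiguity Christiaan describes: it cannot tell a line-break hyphen from a hyphen that belongs in the text.

```python
import plistlib
import re

# Hypothetical heuristic: a lowercase letter, a hyphen, some whitespace, and
# another lowercase letter are treated as one word broken across a line.
_BREAK = re.compile(r"(?<=[a-z])-\s+(?=[a-z])")


def join_hyphen_breaks(text):
    """Remove hyphen-plus-whitespace found between two lowercase letters."""
    return _BREAK.sub("", text)


def load_note_texts(path, key="contents"):
    """Return the text of each note in a Skim notes plist file.

    Assumption: the file is an array of dictionaries and the note text is
    stored under `key`. plistlib.load reads both binary and XML plists.
    """
    with open(path, "rb") as fh:
        notes = plistlib.load(fh)
    return [note.get(key, "") for note in notes if isinstance(note, dict)]


if __name__ == "__main__":
    # A genuine line-break hyphen is repaired:
    print(join_hyphen_breaks("sed do eius- mod tempor"))   # sed do eiusmod tempor
    # A hyphen that belongs in the text is wrongly removed:
    print(join_hyphen_breaks("pre- and post-processing"))  # preand post-processing
```

The second example is why a rule like this is best treated as producing candidates for review (or checked against a word list) rather than applied blindly to every note.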