Thanks for clarifying. I think I understand now. I looked at the API for PDFKit and found several methods in class PDFPage that might make it possible to fetch lines of text, e.g., selection(). I guess(?) I could try to take the selection rectangles from the skim notes, and then try using PDFKit to get lines of text from the PDF file, though it seems a bit circuitous.
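A minimal sketch of the first half of that idea, grouping highlight rectangles into lines before asking PDFKit for the text behind each one. It assumes the rectangles have already been parsed into (x, y, width, height) tuples in unrotated page coordinates with the origin at the bottom left; the names Rect and group_into_lines are made up for illustration and are not part of Skim or PDFKit. Rotated pages and multi-column layouts would defeat it.

```python
from typing import List, Tuple

# Hypothetical layout: (x, y, width, height) in page coordinates.
Rect = Tuple[float, float, float, float]


def group_into_lines(rects: List[Rect], min_overlap: float = 0.5) -> List[List[Rect]]:
    """Group rectangles whose vertical extents overlap into lines.

    Lines are returned top to bottom, and the rectangles inside each
    line are ordered left to right.
    """
    lines: List[List[Rect]] = []
    # With a bottom-left origin, a larger y is higher on the page.
    for rect in sorted(rects, key=lambda r: -r[1]):
        _, y, _, h = rect
        for line in lines:
            _, ly, _, lh = line[0]
            # Height of the vertical overlap with the line's first rectangle.
            overlap = min(y + h, ly + lh) - max(y, ly)
            if overlap >= min_overlap * min(h, lh):
                line.append(rect)
                break
        else:
            # No existing line overlaps enough, so start a new one.
            lines.append([rect])
    return [sorted(line, key=lambda r: r[0]) for line in lines]
```

Each resulting line could then be handed to whatever text-extraction call turns out to work, one line at a time, so that a trailing hyphen can be checked per line.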
In any case, if I want to get PDF page numbers for the notes, it looks like I may need to invoke PDFKit anyway. It appears that the 'pageIndex' key in a Skim note is the sequential page number in the PDF file, and I would need to call PDFKit functions to convert that to a page label (e.g., pageIndex of 9 -> page ix). Am I understanding this correctly?

Thanks,

M.

On Wed, Mar 16, 2022 at 6:37 PM Christiaan Hofman <cmhof...@gmail.com> wrote:

> No, there is no way to extract that data, as that data does not exist. The
> data of the notes is data from the notes, and what the notes know about
> themselves is prescribed by Adobe in the PDF specifications. And that
> contains just a text string.
>
> Christiaan
>
> On 16 Mar 2022, at 02:23, Mark Roberts <mroberts1...@gmail.com> wrote:
>
> Thanks for going into more detail.
>
> Sorry if I wasn't clear, but I'm not trying to ask for changes to the
> behavior of the Skim app itself.
>
> I'm mainly wondering if there is a way to somehow access or export this
> information about notes, based on the data that is saved plus what's in the
> PDF.
>
> If this information could be exported somehow (e.g., via skimnotes or some
> other tool), then I could write my own app to read it and help fix up the
> notes. I can easily code something to parse XML and manipulate text, but
> digging into the PDF data structures is more difficult. I have investigated
> doing this and indeed it's non-trivial.
>
> Skim is one of the very few PDF readers aside from Adobe Acrobat that
> actually handles page numbers correctly, AND has lots more flexibility for
> the export of annotations, so this is why I'm inquiring here.
>
> As you explain, internally PDF is rather messy and there can be
> various degenerate cases, but I assume that for scholarly research 99.9% of
> the time the PDF media will be journal articles or books in portrait
> orientation.
> Indeed, there will be degenerate cases, and if the selected
> text does not include full lines, then I assume hyphens could fall outside
> of the selected text and the detection would of course fail. Still, what I
> see on the screen is that lines of text are being recognized and
> highlighted correctly 99% of the time, so it seems clear that somewhere
> (either in Skim or in PDFKit, I cannot say), all of this machinery for
> capturing notes text is working pretty well.
>
> I don't know the details of the lower level APIs, but if they could
> provide lines of text instead of just a single string of all the selected
> lines concatenated, then it might be possible.
>
> Thanks again for explaining this!
>
> M.
>
> On Wed, Mar 16, 2022 at 8:38 AM Christiaan Hofman <cmhof...@gmail.com> wrote:
>
>> BTW, I should say that some time ago we *did* try to remove hyphens and
>> combine broken lines. But it was simply too unreliable based on the
>> available information, and went wrong far too often.
>>
>> Christiaan
>>
>> On 16 Mar 2022, at 00:35, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>
>> Perhaps you can get some information about the placement of the lines.
>> But we don’t even get information from the PDF about what the orientation
>> is (sometimes PDFs use rotated coordinate systems, e.g. in landscape
>> pages). Also, the selected text may not consist of full lines, so the end
>> of the text may not be the end of a line. Also, a hyphen does not need to
>> be a line break, it can also just be a hyphen in the text. Perhaps it is
>> possible to (almost) figure out how some parts of text are placed in the
>> laid out text on the page, but then you have to first figure out precisely
>> what the lines are and compare all the character ranges in the text. And
>> even getting lines can be a real mess, as there is no guarantee that the
>> text is simply laid out in nice lines. We certainly don’t get this
>> information from the PDF.
>>
>> Christiaan
>>
>> On 15 Mar 2022, at 23:11, Mark Roberts <mroberts1...@gmail.com> wrote:
>>
>> I understand what PDF is about, so I guess I don't see what the issue is
>> with getting lines of text.
>>
>> Looking at the PDF level, there are postscript commands placing
>> characters or strings on a page. These have bounding boxes. Meanwhile, the
>> Skim notes have highlights. In this case, text highlights. Nothing fancy.
>> QuadrilateralPoints.
>>
>> Is it not possible to detect intersections between text on the page and
>> the highlights?
>>
>> This is just simple math (computing the intersection of bounding boxes),
>> right?
>>
>> If this can be done, then an app can compute what the lines of text are.
>> And if we know what the lines of text are, then we can test whether a
>> hyphen falls at the end of a line. It doesn't matter if we know what the
>> "underlying text" was or not. If we find a hyphen at the end of a line of
>> text, then that's a candidate for removal. We only need to have lines of
>> text. Once the individual lines are assembled into a string, it becomes
>> more difficult to detect this.
>>
>> Now, of course this all depends on the internal APIs, in this case
>> PDFKit, I guess. Is that the issue?
>>
>> Thanks again!
>>
>> M.
>>
>> On Tue, Mar 15, 2022 at 11:36 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>
>>> No, it is a limitation of the PDF format. You just get a string for the
>>> characters. The hyphen is also just one of the characters. There is no
>>> information about the underlying text that was used to generate the PDF.
>>> You should realize that PDF is an output format.
>>>
>>> Christiaan
>>>
>>> On 15 Mar 2022, at 14:37, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>
>>> I sort of half understand what you are explaining, but maybe an example
>>> would help.
>>>
>>> Let's say I have four lines of text in a PDF, e.g.:
>>>
>>> Lorem ipsum dolor sit amet, *consectetur adipiscing elit, sed do eius-*
>>> *mod tempor incididunt ut labore et dolore magna aliqua.* Ut enim ad
>>> minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
>>> quip ex ea commodo consequat.
>>>
>>> And let's say I select the text "consectetur adipiscing elit, sed do
>>> eiusmod tempor incididunt ut labore et dolore magna aliqua" (on two lines),
>>> and the word "eiusmod" is broken by a hyphen.
>>>
>>> Inside the PDF, there are in fact sequences of characters which form
>>> lines that can be selected.
>>>
>>> Question: when Skim creates a note, I assume(?) it calls PDFKit and gets
>>> some data back. Is it a single string, including the hyphen?
>>>
>>> I.e., is the limitation in the API for PDFKit, or ... ?
>>>
>>> Thanks again,
>>>
>>> M.
>>>
>>>
>>> On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>
>>>> I don’t know. It is just not possible to work with information that
>>>> does not exist. All of this is just trying to be smart in interpreting
>>>> whatever data exists. In this case, the information does not exist, and
>>>> never existed. Again, the highlighted text is never part of the data of the
>>>> note, it is data in the PDF that you may associate to it because of
>>>> geometry. And when we set the text by default, the PDF does not provide
>>>> sufficient information to tell us about the exact text and the flow of it,
>>>> because PDF is primarily a graphic format. So we have no way of knowing
>>>> when there is a hyphen, and whether it is a line break or not. You could
>>>> try to parse the text and look for hyphens followed by spaces, and remove
>>>> that from the text. We don’t do that automatically, as we cannot know
>>>> whether that is correct.
>>>>
>>>> Christiaan
>>>>
>>>> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>
>>>> Thanks for clarifying.
>>>>
>>>> I guess my question remains: how can I fix up these hyphenated lines in
>>>> my notes? I can parse and process the XML output from skimnotes, but it
>>>> seems there isn't enough data to identify lines.
>>>>
>>>> The issue is that full-text search of the notes won't work if words are
>>>> broken up with hyphens.
>>>>
>>>> Whatever Skim is doing to handle line breaks isn't working for me: I
>>>> still see words broken up by hyphens everywhere.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks,
>>>>
>>>> M.
>>>>
>>>> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>
>>>>> You should realize that the text of the note is a completely separate
>>>>> data element from the highlighted text. The highlighted text is not part
>>>>> of the note, it is just the text that happens to lie behind the highlight
>>>>> in the PDF. We just set the text of the note to the text you highlight by
>>>>> default, and we already do some cleaning, including trying to handle
>>>>> line-breaks, before we set the text. And you can set it to whatever you
>>>>> want. So there is no way to relate the geometry of the highlight in any
>>>>> way to the text, as there does not exist a relation.
>>>>>
>>>>> Christiaan
>>>>>
>>>>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> This is very helpful, thanks!!
>>>>>
>>>>> I just tried your suggestion and got an XML file as expected. I more
>>>>> or less understand all the elements of the XML, but it seems the entire
>>>>> note is in a <string> element, while the quadrilateralPoints for the
>>>>> highlighting boxes are separate.
>>>>>
>>>>> What I was hoping to do is somehow get each line of my note and then
>>>>> look for a hyphen at the end of each line, and then trim that hyphen, as
>>>>> necessary. The objective is to try and clean up the Skim note to eliminate
>>>>> line-break hyphens in the source text.
>>>>>
>>>>> Any ideas about how I could do this?
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> M.
>>>>>
>>>>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>>
>>>>>> Is there some way to get more detailed information about Skim notes,
>>>>>> i.e., other than the code framework?
>>>>>>
>>>>>> I have tried the skimnotes command line tool (e.g., the 'get' and
>>>>>> 'format' commands), but it seems to only output the basic information
>>>>>> about notes, such as the note type, page number, and note text.
>>>>>>
>>>>>> Perhaps(?) there's another mode for the skimnotes tool, but I
>>>>>> couldn't find it from reading the documentation.
>>>>>>
>>>>>> I'd like to get more complete data on each note, such as a timestamp,
>>>>>> the coordinates of the boxes that are highlighted in the PDF file, the
>>>>>> highlight color, and the text contained in each box.
>>>>>>
>>>>>> I assume(?) this data is in the notes file, but the skimnotes app
>>>>>> ignores it for now.
>>>>>>
>>>>>> I'm wondering about this because if possible I'd like to make a
>>>>>> script that gathers my notes for a PDF file, and tries to fix words that
>>>>>> were broken by hyphenation in the original PDF. If I can get the
>>>>>> highlight boxes in the notes file, and the text in each box, then it
>>>>>> should be possible to check for a hyphen character at the end of each
>>>>>> line, and then stitch together the words that were split across lines.
>>>>>>
>>>>>> Any suggestions?
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> M.
>>>>>>
>>>>>>
>>>>>> The skimnotes tool is not a tool that can interpret the data. It only
>>>>>> copies the data around to various locations that are supported (such as
>>>>>> between extended attributes, .skim files, or within a .pdfd bundle).
>>>>>> There is no tool to interpret the data. The Wiki has information about
>>>>>> how the data is formatted. You could try to build your own tool to
>>>>>> unarchive the data from that, but that would be quite a bit of work.
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>>
>>>>>> I can also note that in the near future the Skim notes will be saved
>>>>>> in a plist format, which can be read by various tools and apps, including
>>>>>> AppleScript. You can already have Skim do that by activating a hidden
>>>>>> preference, see the Wiki for details.
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>>
>>>>>> I just remembered that the skimnotes tool *can* convert to the plist
>>>>>> format, which you may be able to read, using the 'skimnotes format'
>>>>>> command. 'skimnotes format plist SKIM_FILE' can do that. The help for
>>>>>> skimnotes does not say so, but you can immediately also get the Skim
>>>>>> notes plist format from the skimnotes tool as follows:
>>>>>>
>>>>>> skimnotes get plist PDF_FILE SKIM_FILE
>>>>>>
>>>>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools
>>>>>> to read it you have to change the extension to .plist. You could also
>>>>>> then pass it through plutil to convert the binary plist to xml plist
>>>>>> (plutil -convert xml1 PLIST_FILE), which would even be human readable.
>>>>>> You could combine that to get the skimnotes in xml format as follows:
>>>>>>
>>>>>> skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>>
>>>>>> Small correction, I messed up ‘-format’ arguments to the commands.
>>>>>> It should be added in skimnotes, and in plutil it is -convert:
>>>>>>
>>>>>> skimnotes get -format plist PDF_FILE SKIM_FILE
>>>>>>
>>>>>> plutil -convert xml1 PLIST_FILE
>>>>>>
>>>>>> skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
>>>>>>
>>>>>> If you want to go in reverse, and write the xml plist data as
>>>>>> Skim notes, you could do:
>>>>>>
>>>>>> plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>
> _______________________________________________
> Skim-app-users mailing list
> Skim-app-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/skim-app-users
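
The plist file produced by the commands quoted above can be read with Python's standard plistlib module, and the "hyphen followed by spaces" heuristic suggested earlier in the thread can then be applied to each note's text. This is only a sketch under assumptions that should be checked against a real export: that the top level of the plist is an array of dictionaries, and that the note text is stored under a key named "contents". The function names are hypothetical. As noted in the thread, the heuristic cannot tell a line-break hyphen from a genuine one, so it is deliberately limited to lowercase letters on both sides.

```python
import plistlib
import re
from typing import Any, Dict, List

# A hyphen between two lowercase letters, followed by whitespace
# (a space or a newline), is treated as a line-break hyphen.
_HYPHEN_BREAK = re.compile(r"(?<=[a-z])-\s+(?=[a-z])")


def join_hyphen_breaks(text: str) -> str:
    """Remove hyphen-plus-whitespace sequences that split a word."""
    return _HYPHEN_BREAK.sub("", text)


def load_notes(path: str) -> List[Dict[str, Any]]:
    """Load a notes plist; plistlib detects XML or binary automatically."""
    with open(path, "rb") as fh:
        return plistlib.load(fh)


def cleaned_texts(notes: List[Dict[str, Any]], text_key: str = "contents") -> List[str]:
    """Return the cleaned text of every note that has a string under text_key.

    text_key is an assumption about the export format, not a documented name.
    """
    return [
        join_hyphen_breaks(note[text_key])
        for note in notes
        if isinstance(note.get(text_key), str)
    ]
```

For example, cleaned_texts(load_notes("notes.plist")) would return one cleaned string per note, where "notes.plist" stands for whatever PLIST_FILE was written. Text such as "well-known" is left alone because no whitespace follows the hyphen, but a genuine hyphen that happens to fall at a line end would still be removed, so the results need a manual check.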