Yes. But you should realize that the PDFKit methods to get selections and lines of text may not work that well, partly because of the way text is laid out in the PDF. For instance, PDFKit may decide that some characters do not lie inside a rectangle due to rounding errors, or sometimes the opposite, where it returns characters that should lie outside of it. This also depends on the OS version.

And yes, the pageIndex is the sequential page index. The page label is just a label for display, so it is not suitable for data.

Christiaan

> On 17 Mar 2022, at 02:24, Mark Roberts <mroberts1...@gmail.com> wrote:
>
> Thanks for clarifying. I think I understand now.
>
> I looked at the API for PDFKit and found several methods in class PDFPage
> that might make it possible to fetch lines of text, e.g., selection(). I
> guess(?) I could try to take the selection rectangles from the skim notes,
> and then try using PDFKit to get lines of text from the PDF file, though it
> seems a bit circuitous.
>
> In any case, if I want to get PDF page numbers for the notes, it looks like I
> may need to invoke PDFKit anyway. It appears that the 'pageIndex' key in a
> skim note is the sequential page number in the PDF file, and I would need to
> call PDFKit functions to convert that to a page label (e.g., pageIndex of 9
> -> page ix).
>
> Am I understanding this correctly?
>
> Thanks,
>
> M.
>
> On Wed, Mar 16, 2022 at 6:37 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
> No, there is no way to extract that data, as that data does not exist. The data
> of the notes is data from the notes, and what the notes know about themselves
> is prescribed by Adobe in the PDF specifications. And that contains just a
> text string.
>
> Christiaan
>
>> On 16 Mar 2022, at 02:23, Mark Roberts <mroberts1...@gmail.com> wrote:
>>
>> Thanks for going into more detail.
>>
>> Sorry if I wasn't clear, but I'm not trying to ask for changes to the
>> behavior of the Skim app itself.
>>
>> I'm mainly wondering if there is a way to somehow access or export this
>> information about notes, based on the data that is saved plus what's in the
>> PDF.
>>
>> If this information could be exported somehow (e.g., via skimnotes or some
>> other tool), then I could write my own app to read it and help fix up the
>> notes. I can easily code something to parse XML and manipulate text, but
>> digging into the PDF data structures is more difficult.
>> I have investigated doing this and indeed it's non-trivial.
>>
>> Skim is one of the very few PDF readers aside from Adobe Acrobat that
>> actually handles page numbers correctly, AND has lots more flexibility for
>> the export of annotations, so this is why I'm inquiring here.
>>
>> As you explain, internally PDF is rather messy and there can be various
>> degenerate cases, but I assume that for scholarly research 99.9% of the time
>> the PDF media will be journal articles or books in portrait orientation.
>> Indeed, there will be degenerate cases, and if the selected text does not
>> include full lines, then I assume hyphens could fall outside of the selected
>> text and the detection would of course fail. Still, what I see on the screen
>> is that lines of text are being recognized and highlighted correctly 99% of
>> the time, so it seems clear that somewhere (either in Skim or in PDFKit, I
>> cannot say), all of this machinery for capturing notes text is working
>> pretty well.
>>
>> I don't know the details of the lower level APIs, but if they could provide
>> lines of text instead of just a single string of all the selected lines
>> concatenated, then it might be possible.
>>
>> Thanks again for explaining this!
>>
>> M.
>>
>> On Wed, Mar 16, 2022 at 8:38 AM Christiaan Hofman <cmhof...@gmail.com> wrote:
>> BTW, I should say that some time ago we *did* try to remove hyphens and
>> combine broken lines. But it was simply too unreliable based on the
>> available information, and went wrong far too often.
>>
>> Christiaan
>>
>>> On 16 Mar 2022, at 00:35, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>
>>> Perhaps you can get some information about the placement of the lines. But
>>> we don’t even get information from the PDF about what the orientation is
>>> (sometimes PDFs use rotated coordinate systems, e.g. in landscape pages).
>>> Also, the selected text may not consist of full lines, so the end of the
>>> text may not be the end of a line. Also, a hyphen does not need to be a
>>> line break, it can also just be a hyphen in the text. Perhaps it is
>>> possible to (almost) figure out how some parts of text are placed in the
>>> laid out text on the page, but then you have to first figure out precisely
>>> what the lines are and compare all the character ranges in the text. And
>>> even getting lines can be a real mess, as there is no guarantee that the
>>> text is simply laid out in nice lines. We certainly don’t get this
>>> information from the PDF.
>>>
>>> Christiaan
>>>
>>>> On 15 Mar 2022, at 23:11, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>
>>>> I understand what PDF is about, so I guess I don't see what the issue is
>>>> with getting lines of text.
>>>>
>>>> Looking at the PDF level, there are postscript commands placing characters
>>>> or strings on a page. These have bounding boxes. Meanwhile, the Skim notes
>>>> have highlights. In this case, text highlights. Nothing fancy.
>>>> QuadrilateralPoints.
>>>>
>>>> Is it not possible to detect intersections between text on the page and
>>>> the highlights?
>>>>
>>>> This is just simple math (computing the intersection of bounding boxes),
>>>> right?
>>>>
>>>> If this can be done, then an app can compute what the lines of text are.
>>>> And if we know what the lines of text are, then we can test whether a
>>>> hyphen falls at the end of a line. It doesn't matter if we know what the
>>>> "underlying text" was or not. If we find a hyphen at the end of a line of
>>>> text, then that's a candidate for removal. We only need to have lines of
>>>> text. Once the individual lines are assembled into a string, it becomes
>>>> more difficult to detect this.
>>>>
>>>> Now, of course this all depends on the internal APIs, in this case PDFKit,
>>>> I guess. Is that the issue?
>>>>
>>>> Thanks again!
>>>>
>>>> M.
>>>>
>>>> On Tue, Mar 15, 2022 at 11:36 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>> No, it is a limitation of the PDF format. You just get a string for the
>>>> characters. The hyphen is also just one of the characters. There is no
>>>> information about the underlying text that was used to generate the PDF.
>>>> You should realize that PDF is an output format.
>>>>
>>>> Christiaan
>>>>
>>>>> On 15 Mar 2022, at 14:37, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>
>>>>> I sort of half understand what you are explaining, but maybe an example
>>>>> would help.
>>>>>
>>>>> Let's say I have four lines of text in a PDF, e.g.:
>>>>>
>>>>> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eius-
>>>>> mod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
>>>>> minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
>>>>> quip ex ea commodo consequat.
>>>>>
>>>>> And let's say I select the text "consectetur adipiscing elit, sed do
>>>>> eiusmod tempor incididunt ut labore et dolore magna aliqua" (on two
>>>>> lines), and the word "eiusmod" is broken by a hyphen.
>>>>>
>>>>> Inside the PDF, there are in fact sequences of characters which form
>>>>> lines that can be selected.
>>>>>
>>>>> Question: when Skim creates a note, I assume(?) it calls PDFKit and gets
>>>>> some data back. Is it a single string, including the hyphen?
>>>>>
>>>>> I.e., is the limitation in the API for PDFKit, or ... ?
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> M.
>>>>>
>>>>>
>>>>> On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>> I don’t know. It is just not possible to work with information that does
>>>>> not exist. All of this is just trying to be smart in interpreting whatever
>>>>> data exists. In this case, the information does not exist, and never
>>>>> existed.
>>>>> Again, the highlighted text is never part of the data of the
>>>>> note, it is data in the PDF that you may associate with it because of
>>>>> geometry. And when we set the text by default, the PDF does not provide
>>>>> sufficient information to tell us about the exact text and the flow of
>>>>> it, because PDF is primarily a graphic format. So we have no way of
>>>>> knowing when there is a hyphen, and whether it is a line break or not. You
>>>>> could try to parse the text and look for hyphens followed by spaces, and
>>>>> remove those from the text. We don’t do that automatically, as we cannot
>>>>> know whether that is correct.
>>>>>
>>>>> Christiaan
>>>>>
>>>>>> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks for clarifying.
>>>>>>
>>>>>> I guess my question remains: how can I fix up these hyphenated lines in
>>>>>> my notes? I can parse and process the XML output from skimnotes, but it
>>>>>> seems there isn't enough data to identify lines.
>>>>>>
>>>>>> The issue is that full-text search of the notes won't work if words are
>>>>>> broken up with hyphens.
>>>>>>
>>>>>> Whatever Skim is doing to handle line breaks isn't working for me. I
>>>>>> still see words broken up by hyphens everywhere.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> M.
>>>>>>
>>>>>> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>> You should realize that the text of the note is a completely separate
>>>>>> data element from the highlighted text. The highlighted text is not part
>>>>>> of the note, it is just the text that happens to lie behind the highlight
>>>>>> in the PDF. We just set the text of the note to the text you highlight
>>>>>> by default, and we already do some cleaning, including trying to handle
>>>>>> line-breaks, before we set the text. And you can set it to whatever you
>>>>>> want.
>>>>>> So there is no way to relate the geometry of the highlight in any
>>>>>> way to the text, as there does not exist a relation.
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is very helpful, thanks!!
>>>>>>>
>>>>>>> I just tried your suggestion and got an XML file as expected. I more or
>>>>>>> less understand all the elements of the XML, but it seems the entire
>>>>>>> note is in a <string> element, while the quadrilateralPoints for the
>>>>>>> highlighting boxes are separate.
>>>>>>>
>>>>>>> What I was hoping to do is somehow get each line of my note and then
>>>>>>> look for a hyphen at the end of each line, and then trim that hyphen,
>>>>>>> as necessary. The objective is to try and clean up the skim note to
>>>>>>> eliminate line-break hyphens in the source text.
>>>>>>>
>>>>>>> Any ideas about how I could do this?
>>>>>>>
>>>>>>> Thanks again,
>>>>>>>
>>>>>>> M.
>>>>>>>
>>>>>>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Is there some way to get more detailed information about skim
>>>>>>>>>>> notes, i.e., other than the code framework?
>>>>>>>>>>>
>>>>>>>>>>> I have tried the skimnotes command line tool (e.g., the 'get' and
>>>>>>>>>>> 'format' commands), but it seems to only output the basic
>>>>>>>>>>> information about notes, such as the note type, page number, and
>>>>>>>>>>> note text.
>>>>>>>>>>>
>>>>>>>>>>> Perhaps(?) there's another mode for the skimnotes tool, but I
>>>>>>>>>>> couldn't find it from reading the documentation.
>>>>>>>>>>>
>>>>>>>>>>> I'd like to get more complete data on each note, such as a
>>>>>>>>>>> timestamp, the coordinates of the boxes that are highlighted in the
>>>>>>>>>>> PDF file, the highlight color, and the text contained in each box.
>>>>>>>>>>>
>>>>>>>>>>> I assume(?) this data is in the notes file, but the skimnotes app
>>>>>>>>>>> ignores it for now.
>>>>>>>>>>>
>>>>>>>>>>> I'm wondering about this because if possible I'd like to make a
>>>>>>>>>>> script that gathers my notes for a PDF file, and tries to fix words
>>>>>>>>>>> that were broken by hyphenation in the original PDF. If I can get
>>>>>>>>>>> the highlight boxes in the notes file, and the text in each box,
>>>>>>>>>>> then it should be possible to check for a hyphen character at the
>>>>>>>>>>> end of each line, and then stitch together the words that were
>>>>>>>>>>> split across lines.
>>>>>>>>>>>
>>>>>>>>>>> Any suggestions?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>>
>>>>>>>>>>> M.
>>>>>>>>>>
>>>>>>>>>> The skimnotes tool is not a tool that can interpret the data. It
>>>>>>>>>> only copies the data around to various locations that are supported
>>>>>>>>>> (such as between extended attributes, .skim files, or within a .pdfd
>>>>>>>>>> bundle). There is no tool to interpret the data. The Wiki has
>>>>>>>>>> information about how the data is formatted. You could try to build
>>>>>>>>>> your own tool to unarchive the data from that, but that would be
>>>>>>>>>> quite a bit of work.
>>>>>>>>>>
>>>>>>>>>> Christiaan
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I can also note that in the near future the skim notes will be saved
>>>>>>>>> in a plist format, which can be read by various tools and apps,
>>>>>>>>> including AppleScript. You can already have Skim do that by
>>>>>>>>> activating a hidden preference, see the Wiki for details.
>>>>>>>>>
>>>>>>>>> Christiaan
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I just remembered that the skimnotes tool *can* convert to the plist
>>>>>>>> format, which you may be able to read, using the 'skimnotes format'
>>>>>>>> command. 'skimnotes format plist SKIM_FILE' can do that. The help for
>>>>>>>> skimnotes does not say so, but you can also get the skim notes in
>>>>>>>> plist format directly from the skimnotes tool as follows:
>>>>>>>>
>>>>>>>>     skimnotes get plist PDF_FILE SKIM_FILE
>>>>>>>>
>>>>>>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools
>>>>>>>> to read it you have to change the extension to .plist. You could also
>>>>>>>> then pass it through plutil to convert the binary plist to xml plist
>>>>>>>> (plutil -convert xml1 PLIST_FILE), which would even be human readable.
>>>>>>>> You could combine that to get the skimnotes in xml format as follows:
>>>>>>>>
>>>>>>>>     skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>>>>>>
>>>>>>>> Christiaan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Small correction, I messed up the '-format' arguments to the commands. It
>>>>>>> should be added in skimnotes, and in plutil it is -convert:
>>>>>>>
>>>>>>>     skimnotes get -format plist PDF_FILE SKIM_FILE
>>>>>>>
>>>>>>>     plutil -convert xml1 PLIST_FILE
>>>>>>>
>>>>>>>     skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
>>>>>>>
>>>>>>> If you want to go in the reverse direction, and write the xml plist data
>>>>>>> as skim notes, you could do:
>>>>>>>
>>>>>>>     plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
>>>>>>>
>>>>>>> Christiaan
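
For anyone who wants to try the approach discussed in this thread, here is a rough Python sketch, not something Skim or skimnotes provides. It reads the plist produced by the corrected commands above and applies the "hyphen followed by spaces" heuristic to the note text. The function names are made up for illustration. It assumes the plist is an array of note dictionaries and that the note text is stored under a "contents" key; only 'pageIndex' and quadrilateralPoints are confirmed in the thread, so check your own plist output first.

```python
import plistlib
import re

# A hyphen between two lowercase letters that is followed by whitespace is
# treated as a probable line-break hyphen. This is a guess, not a certainty.
BREAK_HYPHEN = re.compile(r"(?<=[a-z])-\s+(?=[a-z])")


def dehyphenate(text):
    """Join words that look like they were split by a line-break hyphen."""
    return BREAK_HYPHEN.sub("", text)


def clean_notes(plist_bytes, text_key="contents"):
    """Return (pageIndex, cleaned text) for each note that has text.

    Assumes the plist is an array of dictionaries and that the note text is
    stored under text_key; both are assumptions, so check your own output.
    """
    notes = plistlib.loads(plist_bytes)  # accepts XML and binary plists
    result = []
    for note in notes:
        text = note.get(text_key)
        if isinstance(text, str):
            result.append((note.get("pageIndex"), dehyphenate(text)))
    return result
```

You would call it as clean_notes(open(PLIST_FILE, "rb").read()). As Christiaan points out, this cannot know whether a hyphen really was a line break: a phrase like "pre- and post-processing" would be joined incorrectly, so the results need a manual check.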
_______________________________________________ Skim-app-users mailing list Skim-app-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/skim-app-users