BTW, I should say that some time ago we *did* try to remove hyphens and combine broken lines. But it was simply too unreliable based on the available information, and went wrong far too often.

Christiaan

> On 16 Mar 2022, at 00:35, Christiaan Hofman <cmhof...@gmail.com> wrote:
>
> Perhaps you can get some information about the placement of the lines. But we
> don’t even get information from the PDF about what the orientation is
> (sometimes PDFs use rotated coordinate systems, e.g. in landscape pages).
> Also, the selected text may not consist of full lines, so the end of the text
> may not be the end of a line. Also, a hyphen does not need to be a line
> break, it can also just be a hyphen in the text. Perhaps it is possible to
> (almost) figure out how some parts of text are placed in the laid out text
> on the page, but then you have to first figure out precisely what the lines
> are and compare all the character ranges in the text. And even getting lines
> can be a real mess, as there is no guarantee that the text is simply laid
> out in nice lines. We certainly don’t get this information from the PDF.
>
> Christiaan
>
>> On 15 Mar 2022, at 23:11, Mark Roberts <mroberts1...@gmail.com> wrote:
>>
>> I understand what PDF is about, so I guess I don't see what the issue is
>> with getting lines of text.
>>
>> Looking at the PDF level, there are PostScript commands placing characters
>> or strings on a page. These have bounding boxes. Meanwhile, the Skim notes
>> have highlights. In this case, text highlights. Nothing fancy.
>> QuadrilateralPoints.
>>
>> Is it not possible to detect intersections between text on the page and the
>> highlights?
>>
>> This is just simple math (computing the intersection of bounding boxes),
>> right?
>>
>> If this can be done, then an app can compute what the lines of text are. And
>> if we know what the lines of text are, then we can test whether a hyphen
>> falls at the end of a line. It doesn't matter if we know what the
>> "underlying text" was or not. If we find a hyphen at the end of a line of
>> text, then that's a candidate for removal. We only need to have lines of
>> text. Once the individual lines are assembled into a string, it becomes more
>> difficult to detect this.
>>
>> Now, of course this all depends on the internal APIs, in this case PDFKit, I
>> guess. Is that the issue?
>>
>> Thanks again!
>>
>> M.
>>
>> On Tue, Mar 15, 2022 at 11:36 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>> No, it is a limitation of the PDF format. You just get a string for the
>> characters. The hyphen is also just one of the characters. There is no
>> information about the underlying text that was used to generate the PDF. You
>> should realize that PDF is an output format.
>>
>> Christiaan
>>
>>> On 15 Mar 2022, at 14:37, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>
>>> I sort of half understand what you are explaining, but maybe an example
>>> would help.
>>>
>>> Let's say I have four lines of text in a PDF, e.g.:
>>>
>>> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eius-
>>> mod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
>>> minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
>>> quip ex ea commodo consequat.
>>>
>>> And let's say I select the text "consectetur adipiscing elit, sed do
>>> eiusmod tempor incididunt ut labore et dolore magna aliqua" (on two lines),
>>> and the word "eiusmod" is broken by a hyphen.
>>>
>>> Inside the PDF, there are in fact sequences of characters which form lines
>>> that can be selected.
>>>
>>> Question: when Skim creates a note, I assume(?) it calls PDFKit and gets
>>> some data back. Is it a single string, including the hyphen?
>>>
>>> I.e., is the limitation in the API for PDFKit, or ... ?
>>>
>>> Thanks again,
>>>
>>> M.
>>>
>>>
>>> On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>> I don’t know. It is just not possible to work with information that does
>>> not exist. All of this is just trying to be smart in interpreting whatever
>>> data exists. In this case, the information does not exist, and never
>>> existed. Again, the highlighted text is never part of the data of the note,
>>> it is data in the PDF that you may associate to it because of geometry. And
>>> when we set the text by default, the PDF does not provide sufficient
>>> information to tell us about the exact text and the flow of it, because PDF
>>> is primarily a graphic format. So we have no way of knowing when there is a
>>> hyphen, and whether it is a line break or not. You could try to parse the
>>> text and look for hyphens followed by spaces, and remove that from the text.
>>> We don’t do that automatically, as we cannot know whether that is correct.
>>>
>>> Christiaan
>>>
>>>> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>
>>>> Thanks for clarifying.
>>>>
>>>> I guess my question remains: how can I fix up these hyphenated lines in my
>>>> notes? I can parse and process the XML output from skimnotes, but it seems
>>>> there isn't enough data to identify lines.
>>>>
>>>> The issue is that full-text search of the notes won't work if words are
>>>> broken up with hyphens.
>>>>
>>>> Whatever Skim is doing to handle line breaks isn't working for me: I
>>>> still see words broken up by hyphens everywhere.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks,
>>>>
>>>> M.
>>>>
>>>> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>> You should realize that the text of the note is a completely separate data
>>>> element from the highlighted text. The highlighted text is not part of the
>>>> note, it is just the text that happens to lie behind the highlight in the
>>>> PDF. We just set the text of the note to the text you highlight by
>>>> default, and we already do some cleaning, including trying to handle
>>>> line-breaks, before we set the text. And you can set it to whatever you
>>>> want. So there is no way to relate the geometry of the highlight in any
>>>> way to the text, as there does not exist a relation.
>>>>
>>>> Christiaan
>>>>
>>>>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> This is very helpful, thanks!!
>>>>>
>>>>> I just tried your suggestion and got an XML file as expected. I more or
>>>>> less understand all the elements of the XML, but it seems the entire note
>>>>> is in a <string> element, while the quadrilateralPoints for the
>>>>> highlighting boxes are separate.
>>>>>
>>>>> What I was hoping to do is somehow get each line of my note and then look
>>>>> for a hyphen at the end of each line, and then trim that hyphen, as
>>>>> necessary. The objective is to try and clean up the skim note to
>>>>> eliminate line-break hyphens in the source text.
>>>>>
>>>>> Any ideas about how I could do this?
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> M.
>>>>>
>>>>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>
>>>>>
>>>>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Is there some way to get more detailed information about skim notes,
>>>>>>>>> i.e., other than the code framework?
>>>>>>>>>
>>>>>>>>> I have tried the skimnotes command line tool (e.g., the 'get' and
>>>>>>>>> 'format' commands), but it seems to only output the basic information
>>>>>>>>> about notes, such as the note type, page number, and note text.
>>>>>>>>>
>>>>>>>>> Perhaps(?) there's another mode for the skimnotes tool, but I
>>>>>>>>> couldn't find it from reading the documentation.
>>>>>>>>>
>>>>>>>>> I'd like to get more complete data on each note, such as a timestamp,
>>>>>>>>> the coordinates of the boxes that are highlighted in the PDF file,
>>>>>>>>> the highlight color, and the text contained in each box.
>>>>>>>>>
>>>>>>>>> I assume(?) this data is in the notes file, but the skimnotes app
>>>>>>>>> ignores it for now.
>>>>>>>>>
>>>>>>>>> I'm wondering about this because if possible I'd like to make a
>>>>>>>>> script that gathers my notes for a PDF file, and tries to fix words
>>>>>>>>> that were broken by hyphenation in the original PDF. If I can get the
>>>>>>>>> highlight boxes in the notes file, and the text in each box, then it
>>>>>>>>> should be possible to check for a hyphen character at the end of each
>>>>>>>>> line, and then stitch together the words that were split across lines.
>>>>>>>>>
>>>>>>>>> Any suggestions?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>>
>>>>>>>>> M.
>>>>>>>>
>>>>>>>> The skimnotes tool is not a tool that can interpret the data. It only
>>>>>>>> copies the data around to various locations that are supported (such
>>>>>>>> as between extended attributes, .skim files, or within a .pdfd
>>>>>>>> bundle). There is no tool to interpret the data. The Wiki has
>>>>>>>> information about how the data is formatted. You could try to build
>>>>>>>> your own tool to unarchive the data from that, but that would be quite
>>>>>>>> a bit of work.
>>>>>>>>
>>>>>>>> Christiaan
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I can also note that in the near future the skim notes will be saved in
>>>>>>> a plist format, which can be read by various tools and apps, including
>>>>>>> AppleScript. You can already have Skim do that by activating a hidden
>>>>>>> preference, see the Wiki for details.
>>>>>>>
>>>>>>> Christiaan
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I just remembered that the skimnotes tool *can* convert to the plist
>>>>>> format, which you may be able to read, using the 'skimnotes format'
>>>>>> command. 'skimnotes format plist SKIM_FILE' can do that. The help for
>>>>>> skimnotes does not say so, but you can immediately also get the skim
>>>>>> notes plist format from the skimnotes tool as follows:
>>>>>>
>>>>>> skimnotes get plist PDF_FILE SKIM_FILE
>>>>>>
>>>>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools to
>>>>>> read it you have to change the extension to .plist. You could also then
>>>>>> pass it through plutil to convert the binary plist to xml plist (plutil
>>>>>> -convert xml1 PLIST_FILE), which would even be human readable. You could
>>>>>> combine that to get the skimnotes in xml format as follows:
>>>>>>
>>>>>> skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>>>>
>>>>>> Christiaan
>>>>>>
>>>>>
>>>>>
>>>>> Small correction: I messed up the ‘-format’ arguments to the commands. It
>>>>> should be added in skimnotes, and in plutil it is -convert:
>>>>>
>>>>> skimnotes get -format plist PDF_FILE SKIM_FILE
>>>>>
>>>>> plutil -convert xml1 PLIST_FILE
>>>>>
>>>>> skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
>>>>>
>>>>> If you want to go in the reverse direction, and write the xml plist data
>>>>> as skim notes, you could do:
>>>>>
>>>>> plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
>>>>>
>>>>> Christiaan
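
For anyone who wants to try the "look for hyphens followed by spaces" idea from the thread on the plist output, here is a minimal Python sketch. It is only a heuristic, not what Skim does. It assumes the plist is an array of note dictionaries and that the note text sits under a key named "contents"; that key name is an assumption and should be checked against the Wiki or your own plist. The names join_hyphenated and clean_notes are made up for this example.

```python
import plistlib
import re

# Heuristic only: a hyphen that directly follows a letter and is followed by
# whitespace and then a lowercase ASCII letter is treated as a line-break hyphen.
BROKEN_WORD = re.compile(r"(?<=[^\W\d_])-\s+(?=[a-z])")


def join_hyphenated(text):
    """Remove hyphens that look like line-break hyphens, joining the word."""
    return BROKEN_WORD.sub("", text)


def clean_notes(plist_bytes, text_key="contents"):
    """Parse a notes plist (XML or binary) and de-hyphenate each note's text.

    text_key is an assumed key name for the note text; adjust as needed.
    """
    notes = plistlib.loads(plist_bytes)
    for note in notes:
        value = note.get(text_key)
        if isinstance(value, str):
            note[text_key] = join_hyphenated(value)
    return notes
```

The input could be the file written by 'skimnotes get -format plist PDF_FILE SKIM_FILE', read in binary mode and passed to clean_notes. As discussed above, the result cannot be trusted blindly: a genuine hyphen followed by a space, as in "pre- and post-processing", would be joined to "preand", while "well-known" is left alone because no whitespace follows the hyphen.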
_______________________________________________
Skim-app-users mailing list
Skim-app-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/skim-app-users