I sort of half understand what you are explaining, but maybe an example would help.
Let's say I have four lines of text in a PDF, e.g.:

Lorem ipsum dolor sit amet, *consectetur adipiscing elit, sed do eius-*
*mod tempor incididunt ut labore et dolore magna aliqua.* Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
quip ex ea commodo consequat.

And let's say I select the text "consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua" (on two lines), and the word "eiusmod" is broken by a hyphen.

Inside the PDF, there are in fact sequences of characters which form lines that can be selected.

Question: when Skim creates a note, I assume(?) it calls PDFKit and gets some data back. Is it a single string, including the hyphen? I.e., is the limitation in the API for PDFKit, or ... ?

Thanks again,

M.

On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com> wrote:

> I don’t know. It is just not possible to work with information that does
> not exist. All of this is just trying to be smart in interpreting whatever
> data exists. In this case, the information does not exist, and never
> existed. Again, the highlighted text is never part of the data of the note,
> it is data in the PDF that you may associate to it because of geometry. And
> when we set the text by default, the PDF does not provide sufficient
> information to tell us about the exact text and the flow of it, because PDF
> is primarily a graphic format. So we have no way of knowing when there is a
> hyphen, and whether it marks a line break or not. You could try to parse the
> text and look for hyphens followed by spaces, and remove that from the text.
> We don’t do that automatically, as we cannot know whether that is correct.
>
> Christiaan
>
> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com> wrote:
>
> Thanks for clarifying.
>
> I guess my question remains: how can I fix up these hyphenated lines in my
> notes?
> I can parse and process the XML output from skimnotes, but it seems
> there isn't enough data to identify lines.
>
> The issue is that full-text search of the notes won't work if words are
> broken up with hyphens.
>
> Whatever Skim is doing to handle line breaks isn't working for me; I
> still see words broken up by hyphens everywhere.
>
> Any ideas?
>
> Thanks,
>
> M.
>
> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com>
> wrote:
>
>> You should realize that the text of the note is a completely separate
>> data element from the highlighted text. The highlighted text is not part of
>> the note, it is just the text that happens to lie behind the highlight in
>> the PDF. We just set the text of the note to the text you highlight by
>> default, and we already do some cleaning, including trying to handle
>> line-breaks, before we set the text. And you can set it to whatever you
>> want. So there is no way to relate the geometry of the highlight in any way
>> to the text, as there does not exist a relation.
>>
>> Christiaan
>>
>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com> wrote:
>>
>> Hi,
>>
>> This is very helpful, thanks!!
>>
>> I just tried your suggestion and got an XML file as expected. I more or
>> less understand all the elements of the XML, but it seems the entire note
>> is in a <string> element, while the quadrilateralPoints for the
>> highlighting boxes are separate.
>>
>> What I was hoping to do is somehow get each line of my note and then look
>> for a hyphen at the end of each line, and then trim that hyphen, as
>> necessary. The objective is to try and clean up the skim note to eliminate
>> line-break hyphens in the source text.
>>
>> Any ideas about how I could do this?
>>
>> Thanks again,
>>
>> M.
>>
>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com>
>> wrote:
>>
>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>
>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>
>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>
>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>
>>> Is there some way to get more detailed information about skim notes,
>>> i.e., other than the code framework?
>>>
>>> I have tried the skimnotes command line tool (e.g., the 'get' and
>>> 'format' commands), but it seems to only output the basic information about
>>> notes, such as the note type, page number, and note text.
>>>
>>> Perhaps(?) there's another mode for the skimnotes tool, but I couldn't
>>> find it from reading the documentation.
>>>
>>> I'd like to get more complete data on each note, such as a timestamp,
>>> the coordinates of the boxes that are highlighted in the PDF file, the
>>> highlight color, and the text contained in each box.
>>>
>>> I assume(?) this data is in the notes file, but the skimnotes app
>>> ignores it for now.
>>>
>>> I'm wondering about this because if possible I'd like to make a script
>>> that gathers my notes for a PDF file, and tries to fix words that were
>>> broken by hyphenation in the original PDF. If I can get the highlight boxes
>>> in the notes file, and the text in each box, then it should be possible to
>>> check for a hyphen character at the end of each line, and then stitch
>>> together the words that were split across lines.
>>>
>>> Any suggestions?
>>>
>>> Thanks in advance,
>>>
>>> M.
>>>
>>>
>>> The skimnotes tool is not a tool that can interpret the data. It only
>>> copies the data around to various locations that are supported (such as
>>> between extended attributes, .skim files, or within a .pdfd bundle). There
>>> is no tool to interpret the data.
>>> The Wiki has information about how the
>>> data is formatted. You could try to build your own tool to unarchive the
>>> data from that, but that would be quite a bit of work.
>>>
>>> Christiaan
>>>
>>>
>>> I can also note that in the near future the skim notes will be saved in
>>> a plist format, which can be read by various tools and apps, including
>>> AppleScript. You can already have Skim do that by activating a hidden
>>> preference, see the Wiki for details.
>>>
>>> Christiaan
>>>
>>>
>>> I just remembered that the skimnotes tool *can* convert to the plist
>>> format, which you may be able to read, using the ‘skimnotes format’
>>> command. ‘skimnotes format plist SKIM_FILE’ can do that. The help for
>>> skimnotes does not say so, but you can immediately also get the skim notes
>>> plist format from the skimnotes tool as follows:
>>>
>>> skimnotes get plist PDF_FILE SKIM_FILE
>>>
>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools to
>>> read it you have to change the extension to .plist. You could also then
>>> pass it through plutil to convert the binary plist to xml plist (plutil
>>> -convert xml1 PLIST_FILE), which would even be human readable. You could
>>> combine that to get the skimnotes in xml format as follows:
>>>
>>> skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>
>>> Christiaan
>>>
>>>
>>> Small correction, I messed up the ‘-format’ arguments to the commands.
>>> It should be added in skimnotes, and in plutil it is -convert:
>>>
>>> skimnotes get -format plist PDF_FILE SKIM_FILE
>>>
>>> plutil -convert xml1 PLIST_FILE
>>>
>>> skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
>>>
>>> If you want to go in the reverse direction, and write the xml plist data
>>> as skim notes, you could do:
>>>
>>> plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
>>>
>>> Christiaan
>>>
>>
> _______________________________________________
> Skim-app-users mailing list
> Skim-app-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/skim-app-users
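
For reference, the suggestion in the thread (look for hyphens followed by spaces and remove them from the text) could be sketched in Python roughly as below. This is only a minimal sketch, not anything Skim does: the function name dehyphenate and the exact pattern are assumptions made for illustration. The pattern assumes a broken word appears in the note text as a letter, a hyphen, some whitespace, and then a lowercase letter. As noted in the thread, there is no way to know whether removing the hyphen is correct, so a genuinely hyphenated compound that happened to be split at a line end (e.g. "well- known") would be joined incorrectly.

```python
import re

# Assumption: a line-break hyphen shows up in the note text as a letter,
# a hyphen, whitespace (space or newline), and then a lowercase letter,
# e.g. "eius- mod". This is a heuristic, not something the PDF guarantees.
_BROKEN_WORD = re.compile(r"(?<=[A-Za-z])-\s+(?=[a-z])")


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen followed by whitespace."""
    return _BROKEN_WORD.sub("", text)


if __name__ == "__main__":
    sample = ("consectetur adipiscing elit, sed do eius- mod tempor "
              "incididunt ut labore et dolore magna aliqua")
    print(dehyphenate(sample))
```

Hyphens that are not followed by whitespace (as in "line-breaks") are left alone, since the pattern requires whitespace after the hyphen.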
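
As a rough illustration of the workflow discussed in the thread, the plist produced by "skimnotes get -format plist" could be read with Python's standard plistlib module and the note text cleaned in place. This is a sketch under stated assumptions, not a description of the Skim format: it assumes the plist is an array of per-note dictionaries and that the note text is stored under a key named "contents" (a hypothetical default here; check the Wiki or your own XML output for the actual key). The function name clean_notes and the hyphen pattern are likewise made up for this example.

```python
import plistlib
import re

# Heuristic: letter, hyphen, whitespace, lowercase letter (e.g. "eius- mod").
_BROKEN_WORD = re.compile(r"(?<=[A-Za-z])-\s+(?=[a-z])")


def clean_notes(plist_bytes: bytes, text_key: str = "contents") -> list:
    """Load a notes plist and remove line-break hyphens from the note text.

    Assumptions: the top-level object is a list of dictionaries, and the
    note text lives under `text_key` (hypothetical default "contents").
    """
    notes = plistlib.loads(plist_bytes)  # accepts XML and binary plists
    for note in notes:
        if not isinstance(note, dict):
            continue
        value = note.get(text_key)
        if isinstance(value, str):
            note[text_key] = _BROKEN_WORD.sub("", value)
    return notes


if __name__ == "__main__":
    # Made-up sample data standing in for real skimnotes output.
    sample = plistlib.dumps(
        [{"type": "Highlight", "contents": "sed do eius- mod tempor"}]
    )
    for note in clean_notes(sample):
        print(note["contents"])
```

If the cleaned notes should be written back, plistlib.dumps(notes, fmt=plistlib.FMT_BINARY) produces a binary plist, which is the form the thread pipes into "skimnotes set PDF_FILE -"; whether Skim accepts the result unchanged is untested here.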