I sort of half understand what you are explaining, but maybe an example
would help.

Let's say I have four lines of text in a PDF, e.g.:

Lorem ipsum dolor sit amet, *consectetur adipiscing elit, sed do eius-*
*mod tempor incididunt ut labore et dolore magna aliqua.* Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut ali-
quip ex ea commodo consequat.

And let's say I select the text "consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua" (on two lines),
and the word "eiusmod" is broken by a hyphen.

Inside the PDF, there are in fact sequences of characters which form lines
that can be selected.

Question: when Skim creates a note, I assume(?) it calls PDFKit and gets
some data back. Is it a single string, including the hyphen?

I.e., is the limitation in the API for PDFKit, or ... ?
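
If it is a single string, the cleanup suggested in the quoted reply below
(look for hyphens followed by spaces and remove them) might look roughly
like this in Python. This is only a sketch: the name `dehyphenate` and the
pattern are my own, and it assumes the line break shows up in the note text
as a hyphen followed by whitespace, which I have not confirmed.

```python
import re

# Assumption: a line-break hyphen appears in the note text as a hyphen
# followed by whitespace, with a letter on either side ("eius- mod").
BREAK_HYPHEN = re.compile(r"(?<=[^\W\d_])-\s+(?=[^\W\d_])")


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at a line break."""
    return BREAK_HYPHEN.sub("", text)


selected = (
    "consectetur adipiscing elit, sed do eius- mod tempor "
    "incididunt ut labore et dolore magna aliqua."
)
print(dehyphenate(selected))
```

It cannot tell a line-break hyphen from a real one, so "pre- and post-war"
would wrongly become "preand post-war". Ordinary hyphenated words such as
"well-known" are left alone because no whitespace follows the hyphen.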

Thanks again,

M.


On Tue, Mar 15, 2022 at 6:58 PM Christiaan Hofman <cmhof...@gmail.com>
wrote:

> I don’t know. It is just not possible to work with information that does
> not exist. All of this is just trying to be smart in interpreting whatever
> data exists. In this case, the information does not exist, and never
> existed. Again, the highlighted text is never part of the data of the note,
> it is data in the PDF that you may associate to it because of geometry. And
> when we set the text by default, the PDF does not provide sufficient
> information to tell us about the exact text and the flow of it, because PDF
> is primarily a graphic format. So we have no way of knowing when there is a
> hyphen, and whether it marks a line break or not. You could try to parse the
> text and look for hyphens followed by spaces, and remove them from the text. We
> don’t do that automatically, as we cannot know whether that is correct.
>
> Christiaan
>
> On 15 Mar 2022, at 08:55, Mark Roberts <mroberts1...@gmail.com> wrote:
>
> Thanks for clarifying.
>
> I guess my question remains: how can I fix up these hyphenated lines in my
> notes? I can parse and process the XML output from skimnotes, but it seems
> there isn't enough data to identify lines.
>
> The issue is that full-text search of the notes won't work if words are
> broken up with hyphens.
>
> Whatever Skim is doing to handle line breaks isn't working for me — I
> still see words broken up by hyphens everywhere.
>
> Any ideas?
>
> Thanks,
>
> M.
>
> On Mon, Mar 14, 2022 at 11:53 PM Christiaan Hofman <cmhof...@gmail.com>
> wrote:
>
>> You should realize that the text of the note is a completely separate
>> data element from the highlighted text. The highlighted text is not part of
>> the note, it is just the text that happens to lie behind the highlight in
>> the PDF. We just set the text of the note to the text you highlight by
>> default, and we already do some cleaning, including trying to handle
>> line-breaks, before we set the text. And you can set it to whatever you
>> want. So there is no way to relate the geometry of the highlight in any way
>> to the text, as there does not exist a relation.
>>
>> Christiaan
>>
>> On 14 Mar 2022, at 12:27, Mark Roberts <mroberts1...@gmail.com> wrote:
>>
>> Hi,
>>
>> This is very helpful — thanks !!
>>
>> I just tried your suggestion and got an XML file as expected. I more or
>> less understand all the elements of the XML, but it seems the entire note
>> is in a <string> element, while the quadrilateralPoints for the
>> highlighting boxes are separate.
>>
>> What I was hoping to do is somehow get each line of my note and then look
>> for a hyphen at the end of each line, and then trim that hyphen, as
>> necessary. The objective is to try and clean up the skim note to eliminate
>> line-break hyphens in the source text.
>>
>> Any ideas about how I could do this?
>>
>> Thanks again,
>>
>> M.
>>
>> On Mon, Mar 14, 2022 at 7:27 PM Christiaan Hofman <cmhof...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On 14 Mar 2022, at 11:13, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>
>>>
>>>
>>> On 14 Mar 2022, at 10:56, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>
>>>
>>>
>>> On 14 Mar 2022, at 10:50, Christiaan Hofman <cmhof...@gmail.com> wrote:
>>>
>>>
>>>
>>> On 14 Mar 2022, at 04:49, Mark Roberts <mroberts1...@gmail.com> wrote:
>>>
>>> Is there some way to get more detailed information about skim notes,
>>> i.e., other than the code framework?
>>>
>>> I have tried the skimnotes command line tool (e.g., the 'get' and
>>> 'format' commands), but it seems to only output the basic information about
>>> notes, such as the note type, page number, and note text.
>>>
>>> Perhaps(?) there's another mode for the skimnotes tool, but I couldn't
>>> find it from reading the documentation.
>>>
>>> I'd like to get more complete data on each note, such as a timestamp,
>>> the coordinates of the boxes that are highlighted in the PDF file, the
>>> highlight color, and the text contained in each box.
>>>
>>> I assume(?) this data is in the notes file, but the skimnotes app
>>> ignores it for now.
>>>
>>> I'm wondering about this because if possible I'd like to make a script
>>> that gathers my notes for a PDF file, and tries to fix words that were
>>> broken by hyphenation in the original PDF. If I can get the highlight boxes
>>> in the notes file, and the text in each box, then it should be possible to
>>> check for a hyphen character at the end of each line, and then stitch
>>> together the words that were split across lines.
>>>
>>> Any suggestions?
>>>
>>> Thanks in advance,
>>>
>>> M.
>>>
>>>
>>> The skimnotes tool is not a tool that can interpret the data. It only
>>> copies the data around to various locations that are supported (such as
>>> between extended attributes, .skim files, or within a .pdfd bundle). There
>>> is no tool to interpret the data. The Wiki has information about how the
>>> data is formatted. You could try to build your own tool to unarchive the
>>> data from that, but that would be quite a bit of work.
>>>
>>> Christiaan
>>>
>>>
>>> I can also note that in the near future the skim notes will be saved in
>>> a plist format, which can be read by various tools and apps, including
>>> AppleScript. You can already have Skim do that by activating a hidden
>>> preference, see the Wiki for details.
>>>
>>> Christiaan
>>>
>>>
>>> I just remembered that the skimnotes tool *can* convert to the plist
>>> format, which you may be able to read, using the 'skimnotes format'
>>> command: 'skimnotes format plist SKIM_FILE' can do that. The help for
>>> skimnotes does not say so, but you can also get the skim notes plist
>>> format directly from the skimnotes tool as follows:
>>>
>>> skimnotes get plist PDF_FILE SKIM_FILE
>>>
>>> This will get you a plist file in SKIM_FILE. Perhaps for other tools to
>>> read it you have to change the extension to .plist. You could also then
>>> pass it through plutil to convert the binary plist to xml plist (plutil
>>> -convert xml1 PLIST_FILE), which would even be human readable. You could
>>> combine that to get the skimnotes in xml format as follows:
>>>
>>> skimnotes get plist PDF_FILE - | plutil -format xml1 -o PLIST_FILE -
>>>
>>> Christiaan
>>>
>>>
>>> Small correction, I messed up ‘-format’ arguments to the commands. It
>>> should be added in skimnotes, and in plutil it is -convert:
>>>
>>> skimnotes get -format plist PDF_FILE SKIM_FILE
>>>
>>> plutil -convert xml1 PLIST_FILE
>>>
>>> skimnotes get -format plist PDF_FILE - | plutil -convert xml1 -o PLIST_FILE -
>>>
>>> If you want to go the other way, and write the xml plist data as skim
>>> notes, you could do:
>>>
>>> plutil -convert binary1 -o - PLIST_FILE | skimnotes set PDF_FILE -
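
A script could sit between those two pipelines. The sketch below uses
Python's standard plistlib to read the plist, strip line-break hyphens from
each note's text, and return a binary plist. It assumes (not confirmed in
this thread) that the plist is an array of note dictionaries and that the
note text sits under a "contents" key. `clean_notes` and `text_key` are
hypothetical names, and the pattern is the hyphen-followed-by-whitespace
heuristic mentioned earlier in the thread.

```python
import plistlib
import re

# Heuristic only: a hyphen followed by whitespace, between two letters.
BREAK_HYPHEN = re.compile(r"(?<=[^\W\d_])-\s+(?=[^\W\d_])")


def clean_notes(plist_data: bytes, text_key: str = "contents") -> bytes:
    """Return a binary plist with line-break hyphens removed from note text.

    Assumptions: the plist is an array of note dictionaries, and each
    note's text is a string stored under `text_key`.
    """
    notes = plistlib.loads(plist_data)  # accepts XML or binary plists
    for note in notes:
        value = note.get(text_key)
        if isinstance(value, str):
            note[text_key] = BREAK_HYPHEN.sub("", value)
    return plistlib.dumps(notes, fmt=plistlib.FMT_BINARY)
```

Fed the bytes from 'skimnotes get -format plist PDF_FILE -', the result
could be piped to 'skimnotes set PDF_FILE -'. Notes without the assumed key
are passed through unchanged, and it is worth trying on a copy of the PDF
first.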
>>>
>>> Christiaan
>>>
>>
> _______________________________________________
> Skim-app-users mailing list
> Skim-app-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/skim-app-users
>