2017-07-09 23:58 GMT+02:00 Jean-Francois Nifenecker <
jean-francois.nifenec...@laposte.net>:

> Hello Gilles,
>
> On 09/07/2017 at 19:20, Gilles wrote:
>
>> Hello,
>>
>> This PDF file
>> <https://www.legifrance.gouv.fr/download_code_pdf.do?cidTexte=LEGITEXT000006074228&dlType=pdf>
>> has no Table of Contents, and I was wondering if LO could grab all the
>> headers and build a TOC.
>>
>
> In order to create a PDF with a TOC/index you'll have to set heading
> styles to the appropriate paragraphs.
>
> Opening the PDF with LibO won't get you anywhere, as the tool it opens in
> is Draw, which can't apply the paragraph styles a text processor uses.
>
> I can't see a way to do that quickly, I'm afraid: a copy/paste from the
> PDF document to Writer is possible but you'll have to fix a lot of things
> (e.g. useless carriage returns) and apply heading styles by hand. On a 400+
> page document this is a big PITA.
>
> Hopefully someone else will come up with brighter ideas.
>

You want brighter ideas? Say no more!

So... hmm... I'm afraid there aren't many fully automated tools that can
build a TOC for you. A PDF basically contains a lot of individual elements
that are arranged to look like something coherent.
From the document you linked, it would theoretically be possible to write a
tool that splits the document into pages, grabs the raw text, uses a regex to
find the actual titles, builds a TOC, and injects it into the PDF. This would
assume:
- Text extraction works correctly (it's not always the case with PDF)
- Titles always follow the same format

But on this kind of document, you could definitely get some acceptable
results. I experimented a bit. The output is here:
http://www.cjoint.com/c/GGjw0OtPkGc
And for the curious, the "script" I used is here:
https://pastebin.com/icQSZxQr

As you'll see, it is VERY specific to this document, but it is possible to
do something.
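
To give an idea of the regex step without opening the pastebin, here is a
minimal Python sketch. It is not the script linked above: the heading
pattern (lines starting with "Livre", "Titre", "Chapitre" or "Section"
followed by a number), the level mapping and the function names are all
assumptions made for illustration, and the real document may well need a
different pattern. It also assumes the text of each page has already been
extracted into a list of strings; text extraction and injecting the TOC back
into the PDF are left out.

```python
import re

# Hypothetical heading pattern, e.g. "Titre Ier : ...", "Chapitre II", "Section 3".
# Adjust it to whatever the titles in the real document look like.
HEADING_RE = re.compile(
    r"^[ \t]*(Livre|Titre|Chapitre|Section)[ \t]+"
    r"(?:[IVXLC]+(?:er)?|\d+(?:er)?)\b.*$",
    re.MULTILINE,
)

# Assumed nesting depth for each kind of heading.
LEVELS = {"Livre": 1, "Titre": 2, "Chapitre": 3, "Section": 4}


def build_toc(pages):
    """Return a list of (level, title, page_number) from a list of page texts."""
    toc = []
    for page_number, text in enumerate(pages, start=1):
        for match in HEADING_RE.finditer(text):
            # Collapse runs of whitespace left over from text extraction.
            title = " ".join(match.group(0).split())
            toc.append((LEVELS[match.group(1)], title, page_number))
    return toc


def format_toc(toc):
    """Render the TOC as indented plain text, one entry per line."""
    return "\n".join(
        f"{'  ' * (level - 1)}{title} ... {page}" for level, title, page in toc
    )


if __name__ == "__main__":
    sample_pages = [
        "Titre Ier : Dispositions générales\nSome body text\nChapitre II\nMore text",
        "A sentence mentioning Section civile\nSection 1 : Objet",
    ]
    print(format_toc(build_toc(sample_pages)))
```

Anchoring the pattern to the start of a line is what keeps a mention of
"Section" in the middle of a sentence from being picked up as a title, which
is the "titles always follow the same format" assumption in practice.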

