Re: Title extract logic

Markus Jelsma Fri, 23 Apr 2021 05:23:23 -0700

If you are doing web crawling, you can collect the anchor texts of the
hyperlinks that point to the PDF. That text is usually very descriptive and
can be used as the title for the PDF.
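
As a minimal sketch of that idea, assuming the crawler has already collected
the anchor texts of all inbound links for one PDF: the function name
`most_common_anchor` and the whitespace/case normalisation are illustrative
assumptions, not part of Tika or of any particular crawler.

```python
from collections import Counter


def most_common_anchor(anchors):
    """Return the most frequent non-empty anchor text, or None if there is none.

    `anchors` is assumed to be a list of anchor strings gathered by the
    crawler for a single target URL (hypothetical input format).
    """
    # Collapse runs of whitespace and drop empty anchors.
    cleaned = [" ".join(a.split()) for a in anchors if a and a.strip()]
    if not cleaned:
        return None
    # Count case-insensitively so "Annual Report" and "annual report" agree.
    counts = Counter(text.lower() for text in cleaned)
    winner, _ = counts.most_common(1)[0]
    # Return the first original spelling of the winning anchor.
    return next(text for text in cleaned if text.lower() == winner)
```

In practice you would probably also want to skip generic anchors such as
"click here" or "download" before counting, but which ones to skip depends on
the crawl.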


On Thu, 22 Apr 2021 at 16:36, Nicholas DiPiazza <
[email protected]> wrote:

> Yeah, you're right! Thanks for pointing that out; I sent a bad example.
>
> So in my results after parsing, I try to show *Title*
> *(filename)*
>
> It makes for a much better document in a search result. But unfortunately,
> it's all too often set to something like
>
> *- 4 - (myfilename.pdf)*
>
> > Another trick is to use the most common hyperlink anchor,
>
> Can you elaborate on this one?
>
>
> On Thu, Apr 22, 2021 at 5:44 AM Markus Jelsma <[email protected]>
> wrote:
>
>> Hello Nicholas,
>>
>> The PDF you link to has a decent title in its metadata, but if it isn't
>> there, I would not rely on the first N characters of the content, as that
>> is very unreliable. You can find all kinds of bad markup right at the start
>> of PDFs.
>>
>> But there is a choice: you can still use the raw filename, which is fine
>> in most cases, and usually prettier to read than the first N characters.
>> Another trick is to use the most common hyperlink anchor, which is very
>> readable and descriptive most of the time.
>>
>> Regards,
>> Markus
>>
>> On Wed, 21 Apr 2021 at 18:02, Nicholas DiPiazza <
>> [email protected]> wrote:
>>
>>> Hi Tika Users:
>>>
>>> Does Tika have any built-in Title extract logic?
>>>
>>> I am currently using a simple algorithm that:
>>>
>>> 1) Checks metadata for a title. Use that if there.
>>> 2) If no title metadata, then use the body text. Extract the first line
>>> of the body text and use that as the title.
>>>
>>> Let's take this PDF for example:
>>> https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf
>>>
>>> That results in
>>>
>>> - 4 -
>>>
>>> as a title. Not great, right? Ha!
>>>
>>> So then I add something like:
>>>
>>> 3) If the first line has < 5 alphanumeric characters, go to the next line
>>> until you find a title.
>>>
>>> That works in this case but doesn't work for many other cases.
>>>
>>> What are others doing for title extraction? I would imagine there's no
>>> perfect solution here. Just curious what y'all are doing to troubleshoot
>>> this stuff.
>>>
>>
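
For reference, the three-step heuristic described in the original message
could be sketched roughly as below. This is not Tika functionality; the
function name `pick_title` and the `min_alnum` parameter are assumptions made
for illustration, and the metadata title and body text are assumed to have
been extracted already.

```python
def pick_title(metadata_title, body_text, min_alnum=5):
    """Pick a title from metadata, else from the first usable body line.

    Returns None when neither source yields a candidate, so the caller can
    fall back to something else, such as the filename.
    """
    # Step 1: prefer the metadata title when it is present and non-blank.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    # Steps 2 and 3: walk the body line by line and skip lines with fewer
    # than `min_alnum` alphanumeric characters (for example "- 4 -").
    for line in (body_text or "").splitlines():
        if sum(ch.isalnum() for ch in line) >= min_alnum:
            return line.strip()
    return None
```

As the thread points out, this remains fragile: a line with enough
alphanumeric characters can still be bad markup, so a None result or an ugly
candidate is a reasonable cue to fall back to the filename or an anchor text.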
