Yeah, you're right! Thanks for pointing that out. I sent a bad example.

So in my results, after parsing, I try to show *Title (filename)*.

It makes for a much better document in a search result. Unfortunately,
the title is all too often set to something like

*- 4 - (myfilename.pdf)*
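
For what it's worth, here is roughly the fallback chain I mean, as a minimal Python sketch rather than my actual code. The function names (`looks_like_title`, `pick_title`, `display_title`) are made up for illustration, nothing here is a Tika API, and applying the 5-alphanumeric check to the metadata title as well is an assumption on my part. Falling back to the raw filename is the suggestion from the quoted reply.

```python
def looks_like_title(line, min_alnum=5):
    """Return True if the line has at least min_alnum letters or digits."""
    return sum(ch.isalnum() for ch in line) >= min_alnum


def pick_title(metadata_title, body_text, filename):
    """Hypothetical fallback chain: metadata title, then body lines, then filename."""
    # 1) Prefer the metadata title when it looks usable (assumed extra check).
    if metadata_title and looks_like_title(metadata_title.strip()):
        return metadata_title.strip()
    # 2) and 3) Otherwise take the first body line with enough alphanumerics.
    for line in (body_text or "").splitlines():
        line = line.strip()
        if looks_like_title(line):
            return line
    # Last resort: the raw filename.
    return filename


def display_title(metadata_title, body_text, filename):
    """Format the search-result label as 'Title (filename)'."""
    title = pick_title(metadata_title, body_text, filename)
    # Avoid showing 'myfilename.pdf (myfilename.pdf)' when we fell back.
    return title if title == filename else f"{title} ({filename})"
```

With this, a title of `- 4 -` is rejected and the result is shown as just `myfilename.pdf`. It still doesn't help with the many other kinds of junk that show up at the start of a PDF.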

> Another trick is to use the most common hyperlink anchor,

Can you elaborate on this one?
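
My guess at what that might look like is below, so please correct me if I have it wrong. It assumes I already have the anchor texts of the links that point at the document, which would have to come from the crawler rather than from Tika. `most_common_anchor` is a hypothetical helper, and the case and whitespace normalization is my own assumption.

```python
from collections import Counter


def most_common_anchor(anchor_texts):
    """Return the most frequent anchor text, or None if there are none."""
    counts = Counter()
    first_seen = {}
    for text in anchor_texts:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if not cleaned:
            continue
        key = cleaned.lower()  # count case-insensitively
        counts[key] += 1
        first_seen.setdefault(key, cleaned)  # keep the first original spelling
    if not counts:
        return None
    key, _ = counts.most_common(1)[0]
    return first_seen[key]
```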


On Thu, Apr 22, 2021 at 5:44 AM Markus Jelsma <[email protected]>
wrote:

> Hello Nicholas,
>
> The PDF you link to has a decent title in its metadata, but if it isn't
> there, I would not rely on the first N characters of the content, as it is
> very unreliable. You can find all kinds of bad markup right at the start of
> PDFs.
>
> But there is a choice: you can still use the raw filename, which is fine
> in most cases, and usually prettier to read than the first N characters.
> Another trick is to use the most common hyperlink anchor, which is most of
> the time very readable and descriptive.
>
> Regards,
> Markus
>
> On Wed, Apr 21, 2021 at 18:02, Nicholas DiPiazza <
> [email protected]> wrote:
>
>> Hi Tika Users:
>>
>> Does Tika have any built-in Title extract logic?
>>
>> I am currently using a simple algorithm that:
>>
>> 1) Checks metadata for a title. Use that if there.
>> 2) If no title metadata, then use the body text. Extract the first line
>> of the body text and use that as the title.
>>
>> Let's take this PDF for example:
>> https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf
>>
>> That results in
>>
>> - 4 -
>>
>> as a title. Not great, right? Ha!
>>
>> So then I add something like:
>>
>> 3) If the first line has fewer than 5 alphanumeric characters, go to the
>> next line until you find a title.
>>
>> That works in this case but doesn't work for many other cases.
>>
>> What are others doing for title extraction? I would imagine there's no
>> perfect solution here. Just curious what y'all are doing to troubleshoot
>> this stuff.
>>
>
