Title extract logic

Nicholas DiPiazza Wed, 21 Apr 2021 09:02:38 -0700

Hi Tika Users:

Does Tika have any built-in title extraction logic?


I am currently using a simple algorithm that:

1) Checks the metadata for a title and uses it if one is present.
2) If there is no title in the metadata, falls back to the body text:
extracts the first line of the body text and uses that as the title.

Let's take this PDF for example:
https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf

That results in

- 4 -

as a title. Not great, right? Ha!

So then I add something like:

3) If the first line has fewer than 5 alphanumeric characters, go to the
next line, and keep going until you find a line that qualifies as a title.

That works in this case but doesn't work for many other cases.
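
For reference, the three steps above can be sketched in plain Java roughly
as follows. This is only an illustration under stated assumptions: it makes
no Tika calls, the names TitleHeuristic, pickTitle, countAlnum and MIN_ALNUM
are hypothetical, and the metadata title and body text are assumed to have
already been obtained from the parser by the caller.

```java
import java.util.Optional;

// Hypothetical helper illustrating the heuristic; not part of Tika.
final class TitleHeuristic {

    // Step 3 threshold: lines with fewer alphanumeric characters are skipped.
    static final int MIN_ALNUM = 5;

    // Counts letters and digits in a line, ignoring punctuation and spaces.
    static long countAlnum(String line) {
        return line.codePoints().filter(Character::isLetterOrDigit).count();
    }

    // metadataTitle: title read from the document metadata, or null if absent.
    // bodyText: extracted plain text of the document, or null if absent.
    static Optional<String> pickTitle(String metadataTitle, String bodyText) {
        // Step 1: prefer the metadata title when it is present and non-blank.
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return Optional.of(metadataTitle.trim());
        }
        if (bodyText == null) {
            return Optional.empty();
        }
        // Steps 2 and 3: take the first body line with enough alphanumerics.
        for (String line : bodyText.split("\\R")) {
            if (countAlnum(line) >= MIN_ALNUM) {
                return Optional.of(line.trim());
            }
        }
        return Optional.empty();
    }
}
```

With a body that starts with the line "- 4 -", that line has only one
alphanumeric character, so it is skipped and the next line with at least 5
is returned. The threshold is still a blunt filter: running headers, dates,
or document numbers would pass it just as easily as a real title.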

What are others doing for title extraction? I would imagine there's no
perfect solution here. Just curious what y'all are doing to troubleshoot
this stuff.
