Hi Tika Users:

Does Tika have any built-in title-extraction logic?

I am currently using a simple algorithm that:

1) Checks the metadata for a title and uses it if present.
2) If there is no title metadata, falls back to the body text: extracts the
first line of the body text and uses that as the title.

Let's take this PDF for example:
https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf

That results in

- 4 -

as a title. Not great, right? Ha!

So then I add something like:

3) If the first line has fewer than 5 alphanumeric characters, move on to the
next line, and so on, until a line passes that check.

That works in this case but doesn't work for many other cases.
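
In rough Python, the three steps above look something like the sketch below.
This is not Tika code: the function name pick_title and its plain-string
inputs are made up for illustration, and it assumes the metadata title and
body text have already been pulled out of the parser. The second body line in
the usage example is invented sample text, not the real content of the PDF.

```python
from typing import Optional

# Threshold from step 3: a line needs at least this many alphanumeric
# characters to be accepted as a title.
MIN_ALNUM = 5


def pick_title(metadata_title: Optional[str], body_text: str,
               min_alnum: int = MIN_ALNUM) -> Optional[str]:
    """Hypothetical helper: choose a title from metadata or body text."""
    # Step 1: prefer the metadata title when it is present and non-blank.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    # Steps 2 and 3: walk the body line by line and take the first line
    # that has enough alphanumeric characters.
    for line in body_text.splitlines():
        candidate = line.strip()
        alnum_count = sum(ch.isalnum() for ch in candidate)
        if alnum_count >= min_alnum:
            return candidate

    # No line qualified; the caller decides what to fall back to.
    return None


# Usage: "- 4 -" is skipped because it has only one alphanumeric character.
sample_body = "- 4 -\n\nSample Document Heading\nBody text follows here."
print(pick_title(None, sample_body))
```

Even as a sketch it shows the weak spot: any junk line with 5 or more
alphanumeric characters (a running header, a date, a file path) still wins.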

What are others doing for title extraction? I would imagine there's no
perfect solution here. Just curious what y'all are doing to handle cases
like this.
