Yeah, you're right! Thanks for pointing that out; I sent a bad example. After parsing, I try to show each result as *Title* *(filename)*.
It makes for a much better document in a search result. Unfortunately, the title is all too often set to something like *- 4 - (myfilename.pdf)*.

> Another trick is to use the most common hyperlink anchor,

Can you elaborate on this one?

On Thu, Apr 22, 2021 at 5:44 AM Markus Jelsma <[email protected]> wrote:

> Hello Nicholas,
>
> The PDF you link to has a decent title in its metadata, but if it isn't
> there, I would not rely on the first N characters of the content, as it is
> very unreliable. You can find all kinds of bad markup right at the start of
> PDFs.
>
> But there is a choice: you can still use the raw filename, which is fine
> in most cases, and usually prettier to read than the first N characters.
> Another trick is to use the most common hyperlink anchor, which is most of
> the time very readable and descriptive.
>
> Regards,
> Markus
>
> On Wed, Apr 21, 2021 at 18:02, Nicholas DiPiazza <
> [email protected]> wrote:
>
>> Hi Tika Users:
>>
>> Does Tika have any built-in title extraction logic?
>>
>> I am currently using a simple algorithm that:
>>
>> 1) Checks metadata for a title. Use that if there.
>> 2) If no title metadata, then use the body text. Extract the first line
>> of the body text and use that as the title.
>>
>> Let's take this PDF for example:
>> https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf
>>
>> That results in
>>
>> - 4 -
>>
>> as a title. Not great, right? Ha!
>>
>> So then I add something like:
>>
>> 3) If the first line has < 5 alphanumeric characters, go to the next line
>> until you find a title.
>>
>> That works in this case but doesn't work for many other cases.
>>
>> What are others doing for title extraction? I would imagine there's no
>> perfect solution here. Just curious what y'all are doing to troubleshoot
>> this stuff.
>>
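For reference, here is a rough sketch of how the fallback chain discussed in this thread might fit together. It is plain Python and does not call Tika; every function name here is hypothetical, the inputs (metadata title, body text, URL, anchor texts) are assumed to have been extracted already, and the candidate order is only one possible reading of the suggestions above. The `anchors` argument assumes "most common hyperlink anchor" means the link text other pages use when linking to the document, which the thread does not confirm.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

MIN_ALNUM = 5  # threshold from step 3 in the thread


def alnum_count(text):
    """Count alphanumeric characters in a string."""
    return sum(ch.isalnum() for ch in text)


def first_usable_line(body, min_alnum=MIN_ALNUM):
    """Return the first body line with at least min_alnum alphanumerics."""
    for line in body.splitlines():
        line = line.strip()
        if alnum_count(line) >= min_alnum:
            return line
    return None


def filename_from_url(url):
    """Return the last path segment of a URL, or None if there is none."""
    return PurePosixPath(unquote(urlparse(url).path)).name or None


def most_common_anchor(anchors):
    """Return the most frequent non-empty anchor text, or None."""
    cleaned = [a.strip() for a in anchors if a and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


def pick_title(metadata_title, body, url, anchors=(), trust_body=False):
    """Try candidates in order and return the first non-empty one.

    Assumed order: metadata title, most common anchor, body line
    (only when trust_body is set), raw filename.
    """
    candidates = [
        (metadata_title or "").strip() or None,
        most_common_anchor(anchors),
        first_usable_line(body or "") if trust_body else None,
        filename_from_url(url),
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return None


def display_label(title, filename):
    """Format a result as 'Title (filename)', avoiding a doubled filename."""
    if not title or title == filename:
        return filename
    return f"{title} ({filename})"
```

With `trust_body` left off, a document with no metadata title and no anchors falls straight back to the filename, so a first line such as "- 4 -" is never used as the title.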
