When you say underline, are you talking about the visual breaks, like the one between "[email protected]" and "Experience"? How were they created?
Is it because they are images in the PDF, not text? I downloaded the PDF, opened it on my Mac, and searched for _ and -; I found only 4 matches for -.

Personally, I would say Tika is returning what I would expect it to return, if the visual breaks mentioned in my opening sentence are what you mean by underscores (i.e. _, not the hyphen -). If you mean something else by underscores, are you able to identify where in the PDF you're talking about?

Cheers,
John

On 5 January 2017 at 08:51, Kamesh Joshi <[email protected]> wrote:
> I already tried that, but it does not give me any indication of the
> underline present in the line. It just gives me the text data in <p></p>
> tags.
>
> On Thu, Jan 5, 2017 at 1:27 PM, Nick Burch <[email protected]> wrote:
>>
>> On Thu, 5 Jan 2017, Kamesh Joshi wrote:
>>>
>>> I am trying to parse the attached PDF, but it does not give me the
>>> places where the underline is present; it just returns plain text.
>>> Please help me: how can I also get the underline present in the PDF, or
>>> some way to split the text based on that?
>>>
>>> I am using curl -T Downloads/kameshjoshi.pdf http://localhost:9998/tika
>>> --header "Accept: text/plain" in my command line.
>>
>>
>> You need to ask Tika to give you the HTML version to be able to spot
>> markup like underlines. Swap that accept header to text/html and you should
>> then be able to see them
>>
>> Nick
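
For anyone wanting to check this on their own output: a minimal Python sketch, assuming you have saved both the text/plain and the text/html responses from the Tika server as strings. It counts which tags appear in the HTML response and how many literal _ and - characters appear in the plain text. The names count_tags and summarize and the sample strings are made up for illustration; they are not taken from the attached PDF or from Tika itself.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag seen in an HTML/XHTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta/>.
        self.tags[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.tags


def summarize(html_text, plain_text):
    """Report the tags in the HTML output and the literal _ and - counts
    in the plain-text output, so the two can be compared side by side."""
    return {
        "tags": dict(count_tags(html_text)),
        "underscores": plain_text.count("_"),
        "hyphens": plain_text.count("-"),
    }


# Hypothetical sample data standing in for real server responses.
sample_html = (
    '<html><body><div class="page">'
    "<p>Kamesh Joshi</p><p>Experience</p>"
    "</div></body></html>"
)
sample_text = "Kamesh Joshi\nExperience - 2016\n"

if __name__ == "__main__":
    print(summarize(sample_html, sample_text))
```

If the report shows only structural tags such as p and div, and no underscores in the text, that would be consistent with the lines being drawn as graphics rather than stored as text, though it does not prove it.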
