The breaks I am trying to parse are the lines that appear before *Experience* or *Skills & Expertise* (in the attached PDF), but there is no indication of these lines when I parse the PDF through Tika.
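
If the breaks are drawn graphics rather than text, as John suggests below, one possible workaround is to split the extracted text on the section headings themselves instead of on the lines. The following is only a minimal sketch under that assumption: `split_on_headings` and `SECTION_HEADINGS` are hypothetical names and are not part of Tika, and the heading list would need to match the headings in the actual PDF.

```python
import re

# Hypothetical heading list (an assumption); adjust to the headings in your PDF.
SECTION_HEADINGS = ["Experience", "Skills & Expertise"]


def split_on_headings(text, headings=SECTION_HEADINGS):
    """Split plain text (e.g. Tika's text/plain output) into sections.

    Returns a dict mapping each heading to the text that follows it.
    Text before the first heading is stored under the empty-string key.
    A heading only counts when it stands alone on its line.
    """
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(re.escape(h) for h in headings) + r")[ \t]*\r?$",
        re.MULTILINE,
    )
    sections = {}
    current = ""   # key for the text that precedes the first heading
    last_end = 0   # offset where the current section's body starts
    for match in pattern.finditer(text):
        sections[current] = text[last_end:match.start()].strip()
        current = match.group(1)
        last_end = match.end()
    sections[current] = text[last_end:].strip()
    return sections
```

To use it, pass in the body returned by the `curl` command quoted below, for example `split_on_headings(open("out.txt").read())` after redirecting the output to a file (`out.txt` is a placeholder name). If a heading appears more than once, this sketch keeps only the last occurrence.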

On Thu, Jan 5, 2017 at 4:50 PM, John Patrick <[email protected]> wrote:

> When you say underline, are you talking about the visual breaks like the one
> between "[email protected]" and "Experience"? How were they created?
>
> Is it because they are images in the PDF, not text?
>
> I downloaded the PDF and opened it on my Mac. I tried searching for _ and -
> and only found 4 matches for -.
>
> Personally I would say Tika is returning what I would expect it to
> return, if the visual breaks mentioned in my opening sentence are
> what you mean by underscores, i.e. _ not hyphen -.
>
> If you mean something else by underscores, are you able to identify
> where in the PDF you're talking about?
>
> Cheers,
> John
>
>
> On 5 January 2017 at 08:51, Kamesh Joshi <[email protected]> wrote:
> > I already tried that, but it does not give me any indication of the
> > underline present in the line. It just gives me the text data in <p></p>
> > tags.
> >
> > On Thu, Jan 5, 2017 at 1:27 PM, Nick Burch <[email protected]> wrote:
> >>
> >> On Thu, 5 Jan 2017, Kamesh Joshi wrote:
> >>>
> >>> I am trying to parse the attached PDF, but it does not give me the
> >>> places where the underline is present; it just returns plain text.
> >>> Please help me: how can I also get the underline present in the PDF, or
> >>> some way to split the text based on that?
> >>>
> >>> I am using
> >>> curl -T Downloads/kameshjoshi.pdf http://localhost:9998/tika --header "Accept: text/plain"
> >>> in my command line.
> >>
> >>
> >> You need to ask Tika to give you the HTML version to be able to spot
> >> markup like underlines. Swap that Accept header to text/html and you
> >> should then be able to see them.
> >>
> >> Nick
