Re: Fwd: Tika not parsing underlines

Kamesh Joshi Thu, 05 Jan 2017 04:38:26 -0800

The breaks I am trying to parse are the lines that appear before
*Experience* or *Skills & Expertise* (in the attached PDF), but there is no
indication of these lines when I parse the PDF through Tika.


On Thu, Jan 5, 2017 at 4:50 PM, John Patrick <[email protected]> wrote:

> When you say underline, are you talking about the visual breaks, like
> the one between "[email protected]" and "Experience"? How were they
> created?
>
> Is it because they are images in the PDF, not text?
>
> I downloaded the PDF, opened it on my Mac, and searched for _ and -,
> but only found 4 matches for -.
>
> Personally I would say Tika is returning what I would expect it to
> return, if the visual breaks mentioned in my opening sentence are
> what you mean by underscores, i.e. _ and not the hyphen -.
>
> If you mean something else by underscores, are you able to identify
> where in the PDF you're talking about?
>
> Cheers,
> John
>
>
> On 5 January 2017 at 08:51, Kamesh Joshi <[email protected]> wrote:
> > I already tried that, but it does not give me any indication of the
> > underline present in the line; it just gives me the text data in
> > <p></p> tags.
> >
> > On Thu, Jan 5, 2017 at 1:27 PM, Nick Burch <[email protected]> wrote:
> >>
> >> On Thu, 5 Jan 2017, Kamesh Joshi wrote:
> >>>
> >>> I am trying to parse the attached PDF, but it does not give me the
> >>> places where the underline is present; it just returns plain text.
> >>> Please help: how can I also get the underlines present in the PDF, or
> >>> some way to split the text based on them?
> >>>
> >>> I am using curl -T Downloads/kameshjoshi.pdf http://localhost:9998/tika
> >>> --header "Accept: text/plain" in my command line.
> >>
> >>
> >> You need to ask Tika to give you the HTML version to be able to spot
> >> markup like underlines. Swap that accept header to text/html and you
> >> should then be able to see them
> >>
> >> Nick
> >
> >
>
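
If the rules before the headings turn out to be drawn graphics rather than
text, Tika has nothing to return for them, in plain text or in HTML. One
possible workaround for the "split text based on that" part of the question
is to split on the heading text itself. Below is a minimal sketch in Python,
assuming each heading sits on its own line in the text Tika returns. The
function name split_by_headings and the HEADINGS list are hypothetical; only
the two heading names come from this thread.

```python
import re

# Hypothetical heading list: the two names are the ones mentioned in the
# thread; extend it with whatever other headings the document uses.
HEADINGS = ["Experience", "Skills & Expertise"]


def split_by_headings(text, headings=HEADINGS):
    """Split extracted text into {section name: body}.

    A line consisting only of one of the headings starts a new section.
    Text before the first heading is stored under "preamble".
    """
    alternatives = "|".join(re.escape(h) for h in headings)
    pattern = re.compile(r"^[ \t]*(" + alternatives + r")[ \t]*$", re.MULTILINE)

    sections = {}
    name = "preamble"
    start = 0
    for match in pattern.finditer(text):
        # Close the section that was open until this heading line.
        sections[name] = text[start:match.start()].strip()
        name = match.group(1)
        start = match.end()
    # Whatever follows the last heading belongs to the last section.
    sections[name] = text[start:].strip()
    return sections
```

The input would be the body returned by the text/plain request quoted above.
This relies on the heading wording only, so it will miss a break that has no
heading after it, and it assumes heading names are not repeated.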
