Re: Fwd: Tika not parsing underlines

John Patrick Thu, 05 Jan 2017 03:20:56 -0800

When you say underline, are you talking about the visual breaks, like the
one between "[email protected]" and "Experience"? How were they
created?


Is it because they are images in the PDF, not text?

I downloaded the PDF and opened it on my Mac. I tried searching for _ and -
and only found 4 matches for -.

Personally, I would say Tika is returning what I would expect it to
return, if the visual breaks mentioned in my opening sentence are
what you mean by underscores, i.e. _ and not the hyphen -.

If you mean something else by underscores, are you able to identify
where in the PDF you're talking about?
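
To check whether the HTML output carries any underline markup at all,
here is a minimal sketch. The helper name count_u_tags is hypothetical,
and it assumes an underline would show up as a <u> element. I have not
confirmed that Tika emits that for this PDF. A count of 0 would be
consistent with the breaks being drawn graphics rather than text.

```shell
# Hypothetical helper: count opening <u> elements in HTML read from stdin.
# Assumption: underlined text, if reported at all, appears as <u>...</u>.
count_u_tags() {
  # -o prints each match on its own line, -i ignores case.
  # "|| true" keeps the pipeline going when grep finds no match.
  { grep -o -i '<u[ >]' || true; } | wc -l | tr -d ' '
}

# Usage against the server from this thread (not run here):
# curl -s -T Downloads/kameshjoshi.pdf http://localhost:9998/tika \
#   --header "Accept: text/html" | count_u_tags
```

The pattern only matches an opening <u> or <u ...> tag, so </u> and
<ul> are not counted.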

Cheers,
John


On 5 January 2017 at 08:51, Kamesh Joshi <[email protected]> wrote:
> I already tried that, but it does not give me any indication of the
> underline present in the line; it just gives me the text data in <p></p>
> tags.
>
> On Thu, Jan 5, 2017 at 1:27 PM, Nick Burch <[email protected]> wrote:
>>
>> On Thu, 5 Jan 2017, Kamesh Joshi wrote:
>>>
>>> I am trying to parse the attached PDF, but it does not give me the
>>> places where the underline is present; it just returns plain text.
>>> Please help me: how can I also get the underlines present in the PDF,
>>> or some way to split the text based on them?
>>>
>>> I am using curl -T Downloads/kameshjoshi.pdf  http://localhost:9998/tika
>>> --header "Accept: text/plain" in my command line.
>>
>>
>> You need to ask Tika to give you the HTML version to be able to spot
>> markup like underlines. Swap that accept header to text/html and you should
>> then be able to see them.
>>
>> Nick
>
>
