RE: Fwd: Tika not parsing underlines

Allison, Timothy B. Thu, 05 Jan 2017 05:54:08 -0800

+1 to John's feedback

Another option, if you want to get into the weeds, is to subclass PDFBox's 
PDFTextStripper and use the TextPosition objects (x/y coordinates on the page) 
to do your own custom zoning.  This will be specific to your application and 
document stream, though.
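
To give a rough idea of what custom zoning could look like, here is a minimal 
sketch of just the grouping step, assuming you have already collected one y 
coordinate per line of text (in a PDFTextStripper subclass that would 
typically come from TextPosition.getYDirAdj() inside an overridden 
writeString). GapZoner, Line, zones and the gap threshold are made-up names 
and values for illustration, not part of PDFBox or Tika.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups text lines into zones by vertical gaps. */
final class GapZoner {

    /** One extracted line of text plus its y coordinate (0 = top of page). */
    static final class Line {
        final String text;
        final float y;

        Line(String text, float y) {
            this.text = text;
            this.y = y;
        }
    }

    /**
     * Starts a new zone whenever the vertical distance to the previous
     * line exceeds maxGap. Lines must already be sorted top to bottom.
     */
    static List<List<String>> zones(List<Line> lines, float maxGap) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Float previousY = null;
        for (Line line : lines) {
            if (previousY != null && line.y - previousY > maxGap) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(line.text);
            previousY = line.y;
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates: a large gap separates the two sections.
        List<Line> lines = List.of(
                new Line("Name", 50f),
                new Line("Contact line", 64f),
                new Line("Experience", 110f),
                new Line("Job entry", 124f));
        System.out.println(zones(lines, 20f));
    }
}
```

A gap threshold only works if the visual breaks also leave extra vertical 
space, so treat it as a starting point to tune against your own documents.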

-----Original Message-----
From: John Patrick [mailto:[email protected]] 
Sent: Thursday, January 5, 2017 8:15 AM
To: [email protected]
Subject: Re: Fwd: Tika not parsing underlines

Okay, so I'm looking at the right part of the PDF. As I previously said, those 
visual elements might have started life as underscores, but in the PDF they are 
some form of image, so I would not expect them to be returned when you ask for 
text.

So the Tika server's text/plain output is, in my view, working correctly.

Are you able to go back to the original, change the images back to underscores 
without letting your word processor make them look pretty, and then save it as 
a PDF?

You could potentially write your own PDF parser, or extend an existing one, and 
work out how those images are represented in the PDF. But this can be done in 
multiple ways: the images might actually be a background image, they might be 
images with absolute page coordinates given, or they might be embedded in the 
right location. Depending on the PDF version and what extra metadata was put 
into the PDF, you might be able to write code to correctly detect the images 
and replace them with underscores.
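
As a rough illustration of the replace-with-underscores step only: assume you 
have somehow recovered the y coordinate of each visual break (how to do that 
depends entirely on how the PDF encodes them, and is not shown here) along 
with the y coordinate of each text line. RuleMerger, withRuleMarkers and 
RULE_MARKER are made-up names for this sketch, not part of Tika or any PDF 
library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: re-inserts underscore lines where rules were drawn. */
final class RuleMerger {

    /** Placeholder text emitted for each detected visual break. */
    static final String RULE_MARKER = "__________";

    /**
     * Merges text lines and rule positions into one top-to-bottom list.
     * Both coordinate arrays must be sorted ascending (0 = top of page).
     */
    static List<String> withRuleMarkers(String[] texts, float[] textYs, float[] ruleYs) {
        if (texts.length != textYs.length) {
            throw new IllegalArgumentException("texts and textYs must have the same length");
        }
        List<String> out = new ArrayList<>();
        int r = 0;
        for (int t = 0; t < texts.length; t++) {
            // Emit every rule that sits above the current text line.
            while (r < ruleYs.length && ruleYs[r] < textYs[t]) {
                out.add(RULE_MARKER);
                r++;
            }
            out.add(texts[t]);
        }
        // Emit any rules that sit below the last text line.
        while (r < ruleYs.length) {
            out.add(RULE_MARKER);
            r++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up coordinates: one rule between the contact line and "Experience".
        String[] texts = {"Contact line", "Experience", "Job entry"};
        float[] textYs = {50f, 100f, 115f};
        float[] ruleYs = {80f};
        withRuleMarkers(texts, textYs, ruleYs).forEach(System.out::println);
    }
}
```

The resulting text can then be split on the marker line, which is what the 
original underscores would have allowed.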

I've done PDF processing several times with Tika, and the source PDF can be 
your biggest issue, as there are several ways of doing the same thing and 
several versions of the PDF spec.

On 5 January 2017 at 12:37, Kamesh Joshi <[email protected]> wrote:
> The breaks which I am trying to parse are those lines present before 
> Experience or Skills & Expertise (in the attached PDF), but there is no 
> indication of these lines when I am parsing the PDF through Tika.
>
> On Thu, Jan 5, 2017 at 4:50 PM, John Patrick <[email protected]> wrote:
>>
>> When you say underline, are you talking about the visual breaks like 
>> between "[email protected]" and "Experience"? How were they 
>> created?
>>
>> Is it because they are images in the PDF, not text?
>>
>> I downloaded the PDF, opened it on my Mac, and tried searching for _ 
>> and -, and only found 4 matches for -.
>>
>> Personally I would say Tika is returning what I would expect it to 
>> return, if the visual breaks mentioned in my opening sentence are 
>> what you mean by underscores, i.e. _ not hyphen -
>>
>> If you mean something else by underscores, are you able to identify 
>> where in the PDF you're talking about?
>>
>> Cheers,
>> John
>>
>>
>> On 5 January 2017 at 08:51, Kamesh Joshi <[email protected]> wrote:
>> > I already tried that, but it does not give me any indication of the 
>> > underline present in the line; it just gives me the text data in 
>> > <p></p> tags.
>> >
>> > On Thu, Jan 5, 2017 at 1:27 PM, Nick Burch <[email protected]> wrote:
>> >>
>> >> On Thu, 5 Jan 2017, Kamesh Joshi wrote:
>> >>>
>> >>> I am trying to parse the attached PDF, but it does not give me 
>> >>> the places where the underline is present; it just returns plain text.
>> >>> Please help me: how can I also get the underlines present in the PDF, 
>> >>> or some way to split the text based on them?
>> >>>
>> >>> I am using curl -T Downloads/kameshjoshi.pdf 
>> >>> http://localhost:9998/tika --header "Accept: text/plain" in my 
>> >>> command line.
>> >>
>> >>
>> >> You need to ask Tika to give you the HTML version to be able to 
>> >> spot markup like underlines. Swap that accept header to text/html 
>> >> and you should then be able to see them
>> >>
>> >> Nick
>> >
>> >
>
>
