RE: Questions about using the Tika

Allison, Timothy B. Wed, 21 Oct 2015 06:22:51 -0700

Bouncing to user@tika...

If the PDFs have fixed fields (AcroForm), then those should be easy enough to 
parse out of the XHTML that Tika produces, or you could go straight to PDFBox.
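
As a rough illustration of the XHTML route, here is a minimal sketch that uses
only the JDK's XML parser. It assumes the form fields appear in the XHTML as
<li> elements carrying a fieldName attribute, with text of the form
"name: value". That layout is an assumption, so check it against the output of
your own Tika version. The class name AcroFormFields and the sample document
are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads form fields out of an XHTML string.
final class AcroFormFields {

    // Returns fieldName -> value for every <li fieldName="..."> element.
    static Map<String, String> extract(String xhtml) {
        Document doc;
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("li");
        for (int i = 0; i < items.getLength(); i++) {
            Element li = (Element) items.item(i);
            if (!li.hasAttribute("fieldName")) {
                continue; // an ordinary list item, not a form field
            }
            String name = li.getAttribute("fieldName");
            String text = li.getTextContent().trim();
            // Assumed layout "name: value"; drop the leading "name:" if present.
            String prefix = name + ":";
            if (text.startsWith(prefix)) {
                text = text.substring(prefix.length()).trim();
            }
            fields.put(name, text);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample, not real Tika output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"acroform\"><ol>"
                + "<li fieldName=\"name\">name: Jane Doe</li>"
                + "<li fieldName=\"city\">city: Columbia</li>"
                + "</ol></div></body></html>";
        System.out.println(extract(sample));
    }
}
```

Nested fields are not handled separately here: the text of a parent item would
include the text of its children.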


If (as I suspect) these are free-text resumes, then Tika can help pull out the 
text, but after that you're on your own and off into the land of natural 
language processing (or some great regexes) to do the slot filling that you're 
looking for.
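
To give a feel for the regex route, here is a minimal sketch that fills a few
slots from text that has already been extracted. It does not call Tika. The
class name ResumeSlots, the patterns, and the sample resume are all
hypothetical, and the patterns assume US-style phone numbers and a labeled
"Skills:" line. Names and locations have no such regular shape, which is where
the NLP comes in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-based slot filling over plain resume text.
final class ResumeSlots {

    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Assumes US-style numbers such as 573-555-0100 or (573) 555-0100.
    private static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}");

    // Assumes a line that starts with "Skill:" or "Skills:" (any case).
    private static final Pattern SKILLS =
            Pattern.compile("(?im)^\\s*skills?\\s*:\\s*(.+)$");

    // Returns only the slots whose pattern matched; the first match wins.
    static Map<String, String> fill(String text) {
        Map<String, String> slots = new LinkedHashMap<>();
        put(slots, "email", EMAIL, 0, text);
        put(slots, "phone", PHONE, 0, text);
        put(slots, "skills", SKILLS, 1, text);
        return slots;
    }

    private static void put(Map<String, String> slots, String key,
                            Pattern pattern, int group, String text) {
        Matcher m = pattern.matcher(text);
        if (m.find()) {
            slots.put(key, m.group(group).trim());
        }
    }

    public static void main(String[] args) {
        // Hypothetical resume text, as it might look after extraction.
        String text = "Jane Doe\n"
                + "Cell: 573-555-0100\n"
                + "Email: jane.doe@example.com\n"
                + "Skills: Java, Python, SQL\n";
        fill(text).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

A slot that is missing from the map means no pattern matched, so callers can
tell "not found" apart from an empty value.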

Oh, wait, don't forget that there's a chance you might find useful 
information in the metadata of the PDF (author, company, etc.), but I have no 
idea how reliable that would be.

-----Original Message-----
From: Cao, Renzhi (MU-Student) [mailto:[email protected]] 
Sent: Wednesday, October 21, 2015 8:45 AM
To: Mattmann, Chris A (3980) <[email protected]>; 
[email protected]
Cc: [email protected]
Subject: Re: Questions about using the Tika

Dear all,
     I am interested in parsing information (such as name, skills, location, 
etc.) from PDF resumes, and it seems that Tika can do that. Could you please 
let me know whether this is possible, or point me to an example of how to use 
Tika to parse a resume? Thank you very much for your help!

Renzhi Cao
Graduate Research Assistant
Department of Computer Science
University of Missouri-Columbia
Columbia, MO 65211
Cell: 573-825-8874
Email : [email protected]
http://web.missouri.edu/~rcrg4/

________________________________________
From: Mattmann, Chris A (3980) <[email protected]>
Sent: Wednesday, October 21, 2015 12:14 AM
To: Cao, Renzhi (MU-Student); [email protected]
Subject: Re: Questions about using the Tika

Please subscribe by sending an email to [email protected]; once you are 
subscribed, post the message below to [email protected].

Cheers!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: "Cao, Renzhi (MU-Student)" <[email protected]>
Date: Tuesday, October 20, 2015 at 9:45 PM
To: "[email protected]" <[email protected]>
Subject: Questions about using the Tika

>Dear editor of the Tika project,
>     I am interested in parsing information (such as name, skills, 
>location, etc.) from PDF resumes, and it seems that Tika can do that. 
>Could you please let me know whether this is possible, or point me to an 
>example of how to use Tika to parse a resume? Thank you very much for 
>your help!
>
>Renzhi Cao
>Graduate Research Assistant
>Department of Computer Science
>University of Missouri-Columbia
>Columbia, MO 65211
>Cell: 573-825-8874
>Email : [email protected]
>http://web.missouri.edu/~rcrg4/
