Bouncing to user@tika... If the PDFs have fixed fields (AcroForm), those should be easy enough to parse out of the XHTML that Tika produces, or you could go straight to PDFBox.
If, as I suspect, these are free-text resumes, then Tika can help pull out the text, but after that you're on your own and off into the land of natural language processing (or some great regexes) to do the slot filling that you're looking for.

Oh, wait, don't forget that there's a chance you might find useful information in the metadata of the PDF (author, company, etc.), but I have no idea how reliable that would be.

-----Original Message-----
From: Cao, Renzhi (MU-Student) [mailto:[email protected]]
Sent: Wednesday, October 21, 2015 8:45 AM
To: Mattmann, Chris A (3980) <[email protected]>; [email protected]
Cc: [email protected]
Subject: Re: Questions about using the Tika

Dear all,

I am interested in parsing information (such as name, skills, location, etc.) from PDF resumes, and it seems that Tika can do that. Could you please let me know whether this is possible, or point me to an example of how to use Tika to parse a resume? Thank you very much for your help!

Renzhi Cao
Graduate Research Assistant
Department of Computer Science
University of Missouri-Columbia
Columbia, MO 65211
Cell: 573-825-8874
Email : [email protected]
http://web.missouri.edu/~rcrg4/

________________________________________
From: Mattmann, Chris A (3980) <[email protected]>
Sent: Wednesday, October 21, 2015 12:14 AM
To: Cao, Renzhi (MU-Student); [email protected]
Subject: Re: Questions about using the Tika

Please subscribe by sending email to [email protected] and then, once you are subscribed, post the below to [email protected].

Cheers!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "Cao, Renzhi (MU-Student)" <[email protected]>
Date: Tuesday, October 20, 2015 at 9:45 PM
To: "[email protected]" <[email protected]>
Subject: Questions about using the Tika

>Dear editor of Tika project,
>
>I am interested in parsing information (such as name, skills, location,
>etc.) from PDF resumes, and it seems that Tika can do that. Could you
>please let me know whether this is possible, or point me to an example
>of how to use Tika to parse a resume? Thank you very much for your help!
>
>Renzhi Cao
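
For the free-text case raised in the reply at the top of this thread, here is a rough sketch of what regex-based slot filling might look like once the text has already been pulled out of the PDF (for example by Tika). Everything in it is an assumption made for illustration: the `fill_slots` function, the three patterns, the "first non-empty line is the name" guess, and the sample resume are all hypothetical, and real resumes vary far more than these patterns allow for.

```python
import re

# Hypothetical patterns for illustration only; they are deliberately simple.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
SKILLS_RE = re.compile(r"^skills?\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def fill_slots(text):
    """Return a dict of slots found in plain resume text.

    Missing slots come back as None (or an empty list for skills).
    """
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    skills = SKILLS_RE.search(text)

    # Naive assumption: the first non-empty line is the candidate's name.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    skill_list = []
    if skills:
        # Split a "Skills: a, b, c" line on commas and drop empty entries.
        skill_list = [s.strip() for s in skills.group(1).split(",") if s.strip()]

    return {
        "name": lines[0] if lines else None,
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
        "skills": skill_list,
    }


if __name__ == "__main__":
    # Made-up sample text standing in for the output of a text extractor.
    sample = (
        "Jane Doe\n"
        "Columbia, MO\n"
        "jane.doe@example.com | 555-010-0199\n"
        "Skills: Java, Python, text mining\n"
    )
    print(fill_slots(sample))
```

Patterns like these tend to hold up for emails and phone numbers, but names, locations, and skills without an explicit label are where the natural language processing route mentioned above is likely to be needed.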
