I think you can take a look at GROBID[1] to see one approach to this kind of tagging problem. GROBID is designed to extract bibliographic data from scientific publications (authors, their affiliations, abstracts, bibliographic references, etc.).
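
For the simpler regex route Tim mentions in the quoted message, here is a minimal Python sketch of slot filling over text that Tika (or PDFBox) has already extracted. The patterns, the function name extract_contact_fields and the sample text are made up for illustration; they are not part of Tika or GROBID, and real resumes will need far more robust handling. Fields like name or skills generally need NLP or a trained model rather than regexes.

```python
import re

# Hypothetical patterns for illustration only; real resumes vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches US-style numbers such as 573-555-0100 or 573.555.0100.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def extract_contact_fields(text):
    """Return the first email address and phone number found in plain text.

    `text` is assumed to be the plain text already pulled out of the PDF.
    Missing fields are returned as None.
    """
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }


if __name__ == "__main__":
    # Made-up sample standing in for text extracted from a resume.
    sample = "Jane Doe\nCell: 573-555-0100\nEmail: jane.doe@example.edu\n"
    print(extract_contact_fields(sample))
```

This only covers fields with a rigid surface form. Anything free-form is where a trained sequence tagger, which is the approach GROBID takes for bibliographic data, is more likely to pay off.
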

[1]: https://github.com/kermitt2/grobid

Wed, 21 Oct 2015 at 16:21, Allison, Timothy B. <[email protected]>:

> Bouncing to user@tika...
>
> If the PDFs have fixed fields (AcroForm), then that should be easy enough
> to parse out of the xhtml that Tika produces, or you could go with straight
> PDFBox.
>
> If (as I suspect), these are free text resumes, then Tika can help pull
> out the text, but then you're on your own and off into the land of natural
> language processing (or some great regexes) to do the slot filling that
> you're looking for.
>
> Oh, wait, don't forget that there's a chance that you might find useful
> information in the metadata of the PDF: author, company etc., but I have no
> idea how reliable that would be.
>
> -----Original Message-----
> From: Cao, Renzhi (MU-Student) [mailto:[email protected]]
> Sent: Wednesday, October 21, 2015 8:45 AM
> To: Mattmann, Chris A (3980) <[email protected]>;
> [email protected]
> Cc: [email protected]
> Subject: Re: Questions about using the Tika
>
> Dear all,
> I am interested in parsing the information (like name, skill,
> location and etc) from the PDF resume, and I see that it seems Tika can do
> that. Could you please let me know if it is possible or any example of how
> to use Tika to parse the resume? Thank you very much for your help!
>
> Renzhi Cao
> Graduate Research Assistant
> Department of Computer Science
> University of Missouri-Columbia
> Columbia, MO 65211
> Cell: 573-825-8874
> Email : [email protected]
> http://web.missouri.edu/~rcrg4/
>
> ________________________________________
> From: Mattmann, Chris A (3980) <[email protected]>
> Sent: Wednesday, October 21, 2015 12:14 AM
> To: Cao, Renzhi (MU-Student); [email protected]
> Subject: Re: Questions about using the Tika
>
> Please subscribe by sending email to [email protected] and
> then once you are subscribed post the below to [email protected].
>
> Cheers!
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398) NASA Jet
> Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department University of
> Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: "Cao, Renzhi (MU-Student)" <[email protected]>
> Date: Tuesday, October 20, 2015 at 9:45 PM
> To: "[email protected]" <[email protected]>
> Subject: Questions about using the Tika
>
> > Dear editor of Tika project,
> > I am interested in parsing the information (like name, skill,
> > location and etc) from the PDF resume, and I see that it seems Tika can
> > do that. Could you please let me know if it is possible or any example
> > of how to use Tika to parse the resume? Thank you very much for your
> > help!
> >
> > Renzhi Cao
> > Graduate Research Assistant
> > Department of Computer Science
> > University of Missouri-Columbia
> > Columbia, MO 65211
> > Cell: 573-825-8874
> > Email : [email protected]
> > http://web.missouri.edu/~rcrg4/

--
Best regards,
Konstantin Gribov
