You could take a look at GROBID[1] to see one approach to this kind of
tagging problem. GROBID is designed to extract bibliographic data from
scientific publications (authors, their affiliations, abstracts,
bibliographic references, etc.).

[1]: https://github.com/kermitt2/grobid

On Wed, Oct 21, 2015 at 16:21, Allison, Timothy B. <[email protected]> wrote:

> Bouncing to user@tika...
>
> If the PDFs have fixed fields (AcroForm), then that should be easy enough
> to parse out of the xhtml that Tika produces, or you could go with straight
> PDFBox.
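
For the fixed-field case, here is a minimal sketch in Python (standard
library only) of pulling form fields out of XHTML that has already been
produced by Tika. It assumes the fields appear as list items of the form
"name: value" inside a div with class "acroform"; that layout, the sample
document, and the helper name acroform_fields are assumptions made for
illustration, so check the actual output of your Tika version first.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def acroform_fields(xhtml):
    """Collect 'name: value' list items found under <div class="acroform">.

    Hypothetical helper: the div/li layout is an assumption about the
    XHTML, not a documented contract.
    """
    root = ET.fromstring(xhtml)
    fields = {}
    for div in root.iter(XHTML_NS + "div"):
        if div.get("class") != "acroform":
            continue
        for li in div.iter(XHTML_NS + "li"):
            text = "".join(li.itertext()).strip()
            # Split on the first colon only, so values may contain colons.
            name, sep, value = text.partition(":")
            if sep:
                fields[name.strip()] = value.strip()
    return fields


# Hand-written sample standing in for real Tika output.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="page"><p>Application form</p></div>
<div class="acroform"><ol>
<li>name: Jane Doe</li>
<li>location: Columbia, MO</li>
</ol></div>
</body>
</html>"""

fields = acroform_fields(sample)
print(fields)
```

Text outside the acroform div (the page content) is ignored, so the
same function can be run on the full XHTML of a document.
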
>
> If (as I suspect), these are free text resumes, then Tika can help pull
> out the text, but then you're on your own and off into the land of natural
> language processing (or some great regexes) to do the slot filling that
> you're looking for.
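
To illustrate the "some great regexes" route, here is a minimal sketch in
Python (standard library only) that fills a few slots from plain text
which has already been extracted, for example by Tika. The patterns, the
sample resume, and the function name fill_slots are all made up for
illustration; real resumes vary far too much for three regexes, and names
or locations usually need proper NLP.

```python
import re

# Illustrative patterns only; tune or replace them for real data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
SKILLS_RE = re.compile(r"^Skills?\s*:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def fill_slots(text):
    """Return the slots found in plain resume text.

    Missing email/phone are None; missing skills are an empty list.
    """
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    skills = SKILLS_RE.search(text)
    skill_list = []
    if skills:
        # Assumes a comma-separated "Skills:" line.
        skill_list = [s.strip() for s in skills.group(1).split(",") if s.strip()]
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
        "skills": skill_list,
    }


# Hypothetical extracted text.
sample = """Jane Doe
Cell: 555-010-0199
Email: jane.doe@example.org
Skills: Java, Python, NLP
"""

slots = fill_slots(sample)
print(slots)
```

Anything the patterns miss simply comes back empty, which makes it easy
to see where a regex approach stops working and a trained tagger (as in
GROBID) becomes the better fit.
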
>
> Oh, wait, don't forget that there's a chance that you might find useful
> information in the metadata of the PDF: author, company etc., but I have no
> idea how reliable that would be.
>
> -----Original Message-----
> From: Cao, Renzhi (MU-Student) [mailto:[email protected]]
> Sent: Wednesday, October 21, 2015 8:45 AM
> To: Mattmann, Chris A (3980) <[email protected]>;
> [email protected]
> Cc: [email protected]
> Subject: Re: Questions about using the Tika
>
> Dear all,
>      I am interested in parsing information (such as name, skills,
> location, etc.) from PDF resumes, and it seems Tika can do that.
> Could you please let me know whether this is possible, or point me to
> an example of how to use Tika to parse a resume? Thank you very much
> for your help!
>
> Renzhi Cao
> Graduate Research Assistant
> Department of Computer Science
> University of Missouri-Columbia
> Columbia, MO 65211
> Cell: 573-825-8874
> Email : [email protected]
> http://web.missouri.edu/~rcrg4/
>
> ________________________________________
> From: Mattmann, Chris A (3980) <[email protected]>
> Sent: Wednesday, October 21, 2015 12:14 AM
> To: Cao, Renzhi (MU-Student); [email protected]
> Subject: Re: Questions about using the Tika
>
> Please subscribe by sending email to [email protected] and
> then once you are subscribed post the below to [email protected].
>
> Cheers!
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398) NASA Jet
> Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department University of
> Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: "Cao, Renzhi (MU-Student)" <[email protected]>
> Date: Tuesday, October 20, 2015 at 9:45 PM
> To: "[email protected]" <[email protected]>
> Subject: Questions about using the Tika
>
> >Dear editor of Tika project,
> >     I am interested in parsing information (such as name, skills,
> >location, etc.) from PDF resumes, and it seems Tika can do that.
> >Could you please let me know whether this is possible, or point me to
> >an example of how to use Tika to parse a resume? Thank you very much
> >for your help!
> >
> >
> >
> >
> >
> >
> >Renzhi Cao
> >Graduate Research Assistant
> >Department of Computer Science
> >University of Missouri-Columbia
> >Columbia, MO 65211
> >Cell: 573-825-8874
> >Email : [email protected]
> >http://web.missouri.edu/~rcrg4/
> >
> >
> >
> >
>
-- 
Best regards,
Konstantin Gribov
