Hello,

There is a nice project called Grobid [1] that does most of what you are describing. Tika has a Grobid parser built in (it calls Grobid over a REST API); check out [2] for details.
I have a project that uses Tika with Grobid and NER support. It also builds a search index using Solr. Check out [3] for setup and [4] for parsing and indexing to Solr if you'd like to try my Python project. With it I am able to extract the title, author names, affiliations, and the whole text of articles. I did not extract sections within the main body of research articles, but I assume there should be a way to configure that in Grobid.

Alternatively, if Grobid can't detect sections, you can try the XHTML content handler, which preserves the basic structure of the PDF file using <p>, <br>, and heading tags. So technically it should be possible to write a wrapper that breaks the XHTML output from Tika into sections. To get the XHTML output:

    # In bash: run `pip install tika` if tika isn't already installed
    import tika
    tika.initVM()
    from tika import parser

    file_path = "<pdf_dir>/2538.pdf"
    data = parser.from_file(file_path, xmlContent=True)
    print(data['content'])

Best,
Thamme

[1] http://grobid.readthedocs.io/en/latest/Introduction/
[2] https://wiki.apache.org/tika/GrobidJournalParser
[3] https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
[4] https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md

--
Thamme Gowda | @thammegowda <https://twitter.com/thammegowda>

On Wed, May 3, 2017 at 9:34 AM, [email protected] <[email protected]> wrote:
> Hi,
>
> I am working with published research articles using Apache Tika. These
> articles have distinct sections like abstract, introduction, literature
> review, methodology, experimental setup, discussion and conclusions. Is
> there some way to extract document sections with Apache Tika?
>
> Regards,
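P.S. The wrapper idea above (splitting Tika's XHTML output into sections) could be sketched roughly like this. This is a minimal illustration using Python's standard html.parser; it assumes the XHTML marks section titles with <h1>..<h6> tags, which is not guaranteed for every PDF, and the sample input below is made up:

```python
# Hypothetical sketch: split XHTML (as produced by Tika's XHTML content
# handler) into sections keyed by heading text. Assumes <h1>..<h6> mark
# section titles; the sample document below is illustrative only.
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

class SectionSplitter(HTMLParser):
    """Collect the text under each heading into a {title: text} dict."""

    def __init__(self):
        super().__init__()
        self.sections = {}
        self.current = "preamble"   # label for text before the first heading
        self.in_heading = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # Flush the text accumulated for the previous section.
            self.sections[self.current] = " ".join(self.buffer).strip()
            self.buffer = []
            self.in_heading = True
            self.current = ""

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.current += data.strip()
        else:
            self.buffer.append(data.strip())

    def close(self):
        super().close()
        # Flush the final section.
        self.sections[self.current] = " ".join(self.buffer).strip()

# Made-up XHTML standing in for `data['content']` from parser.from_file(...)
xhtml = """<html><body>
<h2>Abstract</h2><p>We study parsing.</p>
<h2>Introduction</h2><p>PDFs are hard.</p>
</body></html>"""

splitter = SectionSplitter()
splitter.feed(xhtml)
splitter.close()
print(sorted(splitter.sections))  # section titles found in the document
```

In practice you would feed it `data['content']` from the snippet above instead of the hard-coded string, and you may need to relax the heading assumption (e.g. also treat bold paragraphs as titles) depending on how Tika renders a given PDF.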
