Hello,

There is a nice project called Grobid [1] that does most of what you are
describing.
Tika has a Grobid parser built in (it calls Grobid over a REST API); check
out [2] for details.

I have a project that uses Tika with Grobid and NER support. It also
builds a search index using Solr.
Check out [3] for setup and [4] for parsing and indexing to Solr if you'd
like to try my Python project.
With it I am able to extract the title, author names, affiliations, and
the full text of articles.
I did not extract sections within the main body of research articles; I
assume there should be a way to configure that in Grobid.
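Grobid's full-text model does emit section structure in its TEI output (`<div>` elements with `<head>` titles inside the body), so if you can get the TEI XML from a Grobid server (the REST endpoint name `processFulltextDocument` is from memory; please verify against your Grobid version), a small parser can turn it into a section map. A rough sketch, using a made-up TEI fragment rather than real Grobid output:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def parse_tei_sections(tei_xml):
    """Map section heading -> section text from Grobid-style TEI XML."""
    root = ET.fromstring(tei_xml)
    sections = {}
    for div in root.iter(TEI_NS + "div"):
        head = div.find(TEI_NS + "head")
        if head is None or head.text is None:
            continue  # skip divs without a heading
        paras = [p.text or "" for p in div.findall(TEI_NS + "p")]
        sections[head.text.strip()] = " ".join(paras).strip()
    return sections

# Tiny illustrative fragment (not actual Grobid output)
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Introduction</head><p>We study X.</p></div>
    <div><head>Conclusions</head><p>X works.</p></div>
  </body></text>
</TEI>"""

print(parse_tei_sections(sample))
```

Real Grobid TEI nests paragraphs and references more deeply than this, so you would likely need `itertext()` rather than `p.text`, but the shape of the approach is the same.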

Alternatively, if Grobid can't detect sections, you can try the XHTML
content handler, which preserves the basic structure of the PDF file using
<p>, <br>, and heading tags. So technically it should be possible to write
a wrapper that breaks Tika's XHTML output into sections.

To get the XHTML output:

# In bash, run `pip install tika` if Tika isn't already installed
import tika
tika.initVM()
from tika import parser

# xmlContent=True returns XHTML markup instead of plain text
file_path = "<pdf_dir>/2538.pdf"
data = parser.from_file(file_path, xmlContent=True)
print(data['content'])
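The wrapper idea above could be sketched with the standard-library HTMLParser. This is purely illustrative: real Tika PDF output may contain only <p> tags and no heading tags at all, in which case you'd need your own heuristics (font size, numbering patterns) to spot section titles.

```python
from html.parser import HTMLParser

class SectionSplitter(HTMLParser):
    """Group body text under the most recent h1-h3 heading."""
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = {}          # heading -> list of text chunks
        self._current = "preamble"  # bucket for text before any heading
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading_text = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
            self._current = "".join(self._heading_text).strip()
            self.sections.setdefault(self._current, [])

    def handle_data(self, data):
        if self._in_heading:
            self._heading_text.append(data)
        elif data.strip():
            self.sections.setdefault(self._current, []).append(data.strip())

# Feed it a toy XHTML fragment shaped like Tika's output
splitter = SectionSplitter()
splitter.feed("""<html><body>
<h1>Abstract</h1><p>Short summary.</p>
<h1>Introduction</h1><p>Context here.</p>
</body></html>""")
print({k: " ".join(v) for k, v in splitter.sections.items()})
```

Feeding it `data['content']` from the snippet above would give you one bucket of text per detected heading.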




Best,
Thamme

[1] http://grobid.readthedocs.io/en/latest/Introduction/
[2] https://wiki.apache.org/tika/GrobidJournalParser
[3]
https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
[4]
https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md


--
Thamme Gowda
TG | @thammegowda <https://twitter.com/thammegowda>

On Wed, May 3, 2017 at 9:34 AM, [email protected] <[email protected]> wrote:

> Hi,
>
> I am working with published research articles using Apache Tika. These
> articles have distinct sections like abstract, introduction, literature
> review, methodology, experimental setup, discussion and conclusions. Is
> there some way to extract document sections with Apache Tika?
>
> Regards,
>
