Re: Analysing a document sections with Apache Tika

Thamme Gowda Thu, 04 May 2017 15:26:47 -0700

Hi,

You CAN use Tika in Java code.
Tika is primarily written in Java, so you will have no issues using it from Java.
It may be a lot easier to use Tika with Grobid than to use
Grobid directly.


Check out which resources are added to the classpath of "Tika-App":
https://wiki.apache.org/tika/GrobidJournalParser

Check out these examples:
https://tika.apache.org/1.14/gettingstarted.html
https://tika.apache.org/1.14/examples.html
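
On the XHTML route from my earlier message (quoted below), here is a rough sketch of what such a wrapper could look like. It is in Python because that is what my earlier snippet used; the same logic could be written as a SAX handler in Java. The names SectionSplitter and split_sections are made up for illustration, and the sketch assumes the XHTML really contains h1-h6 heading tags, which depends on the parser and on the document. Treat it as a starting point, not a tested solution.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionSplitter(HTMLParser):
    """Hypothetical helper: collects (heading, text) pairs, splitting at h1-h6."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.sections = []         # finished (heading, body_text) tuples
        self._heading = None       # None until the first heading is seen
        self._heading_parts = []
        self._body_parts = []
        self._in_heading = False
        self._in_head = False      # text inside <head> (e.g. <title>) is skipped

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag in HEADING_TAGS:
            self._flush()          # a new heading closes the previous section
            self._in_heading = True
            self._heading_parts = []
        elif tag in ("p", "br"):
            self._body_parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False
        elif tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self._heading = " ".join("".join(self._heading_parts).split())

    def handle_data(self, data):
        if self._in_head:
            return
        if self._in_heading:
            self._heading_parts.append(data)
        else:
            self._body_parts.append(data)

    def _flush(self):
        # Collapse whitespace inside each line and drop empty lines.
        raw_lines = "".join(self._body_parts).splitlines()
        lines = [" ".join(line.split()) for line in raw_lines]
        body = "\n".join(line for line in lines if line)
        if self._heading is not None or body:
            self.sections.append((self._heading, body))
        self._heading = None
        self._body_parts = []

    def close(self):
        super().close()
        self._flush()              # emit the last open section


def split_sections(xhtml):
    """Return a list of (heading or None, text) tuples for an XHTML string."""
    splitter = SectionSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return splitter.sections


# Intended use with the earlier snippet (assumes a working tika install):
#   data = parser.from_file(file_path, xmlContent=True)
#   for heading, text in split_sections(data['content']):
#       print(heading, len(text))
```

Text that appears before the first heading comes back with None as its heading. If the XHTML for your PDFs has no heading tags at all, you would get a single section back, and you would need another signal (for example, short paragraphs that match known section names) to split on.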


*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Thu, May 4, 2017 at 10:28 AM, [email protected] <[email protected]>
wrote:

> Hi,
>
> Thanks for sharing the link.
>
> I need to integrate this feature into my Java code.
>
>
> Regards,
>
>
> On Thu, May 4, 2017 at 4:47 PM, Chris Mattmann <[email protected]>
> wrote:
>
>> FYI here:
>>
>>
>>
>> http://wiki.apache.org/tika/GrobidJournalParser
>>
>>
>>
>>
>>
>>
>>
>> *From: *"[email protected]" <[email protected]>
>> *Reply-To: *"[email protected]" <[email protected]>
>> *Date: *Thursday, May 4, 2017 at 8:38 AM
>> *To: *"[email protected]" <[email protected]>
>> *Cc: *"[email protected]" <[email protected]>
>> *Subject: *Re: Analysing a document sections with Apache Tika
>>
>>
>>
>> Dear Thamme,
>>
>>
>>
>> Thanks for your reply and the suggestions.
>>
>>
>>
>> I built Grobid using the instructions from
>> http://grobid.readthedocs.io/en/latest/Install-Grobid/
>>
>> I am trying to run the following example code from the GitHub repository
>> (https://github.com/kermitt2/grobid-example):
>>
>> =================
>>
>>
>>
>> import org.grobid.core.*;
>> import org.grobid.core.data.*;
>> import org.grobid.core.factory.*;
>> import org.grobid.core.mock.*;
>> import org.grobid.core.utilities.*;
>> import org.grobid.core.engines.Engine;
>>
>> public class GrobidTest {
>>
>>     public GrobidTest() {
>>         // TODO Auto-generated constructor stub
>>     }
>>
>>     public static void main(String[] args) {
>>         run("D:/Eclipse-Workspace/PDFs/Train/6.pdf");
>>     }
>>
>>     public static void run(String faFileName) {
>>         String pdfPath = faFileName;
>>
>>         try {
>>             String pGrobidHome = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home";
>>             String pGrobidProperties = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home/config/grobid.properties";
>>
>>             MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
>>             GrobidProperties.getInstance();
>>
>>             System.out.println(">>>>>>>> GROBID_HOME=" + GrobidProperties.get_GROBID_HOME_PATH());
>>
>>             Engine engine = GrobidFactory.getInstance().createEngine();
>>
>>             // Biblio object for the result
>>             BiblioItem resHeader = new BiblioItem();
>>             String tei = engine.processHeader(pdfPath, false, resHeader);
>>         }
>>         catch (Exception e) {
>>             // If an exception is generated, print a stack trace
>>             e.printStackTrace();
>>         }
>>         finally {
>>             try {
>>                 MockContext.destroyInitialContext();
>>             }
>>             catch (Exception e) {
>>                 e.printStackTrace();
>>             }
>>         }
>>     }
>>
>> }
>>
>>
>>
>> ================
>>
>>
>>
>> Getting the following exception:
>>
>>
>>
>> javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
>>     at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
>>     at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
>>     at javax.naming.InitialContext.init(Unknown Source)
>>     at javax.naming.InitialContext.<init>(Unknown Source)
>>     at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:36)
>>     at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
>>     at GrobidTest.run(GrobidTest.java:28)
>>     at GrobidTest.main(GrobidTest.java:17)
>> Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
>>     at java.net.URLClassLoader.findClass(Unknown Source)
>>     at java.lang.ClassLoader.loadClass(Unknown Source)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
>>     at java.lang.ClassLoader.loadClass(Unknown Source)
>>     at java.lang.Class.forName0(Native Method)
>>     at java.lang.Class.forName(Unknown Source)
>>     at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
>>     at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
>>     ... 8 more
>>
>> javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
>>     at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
>>     at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
>>     at javax.naming.InitialContext.init(Unknown Source)
>>     at javax.naming.InitialContext.<init>(Unknown Source)
>>     at org.grobid.core.mock.MockContext.destroyInitialContext(MockContext.java:105)
>>     at GrobidTest.run(GrobidTest.java:45)
>>     at GrobidTest.main(GrobidTest.java:17)
>> Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
>>     at java.net.URLClassLoader.findClass(Unknown Source)
>>     at java.lang.ClassLoader.loadClass(Unknown Source)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
>>     at java.lang.ClassLoader.loadClass(Unknown Source)
>>     at java.lang.Class.forName0(Native Method)
>>     at java.lang.Class.forName(Unknown Source)
>>     at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
>>     at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
>>     ... 7 more
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <[email protected]>
>> wrote:
>>
>> Hello,
>>
>>
>>
>> There is a nice project called Grobid [1] that does most of what you are
>> describing.
>>
>> Tika has a Grobid parser built in (it calls Grobid over its REST API).
>> Check out [2] for details.
>>
>>
>>
>> I have a project that makes use of Tika with Grobid and NER support. It
>> also builds a search index using Solr.
>>
>> Check out [3] for setting up and [4] for parsing and indexing to Solr if
>> you would like to try out my Python project.
>>
>> Here I am able to extract title, author names, affiliations, and the
>> whole text of articles.
>>
>> I did not extract sections within the main body of research articles.  I
>> assume there should be a way to configure it in Grobid.
>>
>>
>>
>> Alternatively, if Grobid can't detect sections, you can try the XHTML content
>> handler, which preserves the basic structure of the PDF file using <p>, <br>, and
>> heading tags. So technically it should be possible to write a wrapper that
>> breaks the XHTML output from Tika into sections.
>>
>>
>>
>> To get it:
>>
>> # In bash do `pip install tika` if tika isn't already installed
>> import tika
>> tika.initVM()
>> from tika import parser
>>
>> file_path = "<pdf_dir>/2538.pdf"
>> data = parser.from_file(file_path, xmlContent=True)
>> print(data['content'])
>>
>>
>>
>>
>>
>>
>>
>> Best,
>>
>> Thamme
>>
>>
>>
>> [1] http://grobid.readthedocs.io/en/latest/Introduction/
>>
>> [2] https://wiki.apache.org/tika/GrobidJournalParser
>>
>> [3] https://github.com/USCDataScience/parser-indexer-py/
>> tree/master/parser-server
>>
>> [4] https://github.com/USCDataScience/parser-indexer-py/
>> blob/master/docs/parser-index-journals.md
>>
>>
>> *--*
>>
>> *Thamme Gowda*
>>
>> TG | @thammegowda <https://twitter.com/thammegowda>
>>
>> ~Sent via somebody's Webmail server!
>>
>>
>>
>> On Wed, May 3, 2017 at 9:34 AM, [email protected] <[email protected]>
>> wrote:
>>
>> Hi,
>>
>>
>>
>> I am working with published research articles using Apache Tika. These
>> articles have distinct sections like abstract, introduction, literature
>> review, methodology, experimental setup, discussion and conclusions. Is
>> there some way to extract document sections with Apache Tika?
>>
>>
>>
>> Regards,
>>
>>
>>
>>
>>
>
>
