Hi,

You can use Tika in Java code. Tika is written primarily in Java, so you should have no trouble calling it from Java. It may also be a lot easier to use Tika with Grobid than to use Grobid directly.
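If Grobid does not give you sections, the XHTML route mentioned earlier in this thread can be tried from Java as well. Below is only a rough sketch using the JDK's SAX parser: `SectionSplitter` and `splitByHeadings` are names I made up for illustration (they are not part of Tika or Grobid), and it assumes the XHTML is well-formed and that section titles actually come through as h1..h6 tags, which depends on the document and the parser. The XHTML string itself would come from Tika, for example by passing a `ToXMLContentHandler` to `AutoDetectParser.parse(...)` as in the examples page linked below.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or Grobid.
class SectionSplitter {

    // Returns heading text -> section text, in document order.
    // Text that appears in <body> before the first heading is stored under "(preamble)".
    static Map<String, String> splitByHeadings(String xhtml) {
        final Map<String, String> sections = new LinkedHashMap<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder heading = new StringBuilder();
            private final StringBuilder body = new StringBuilder();
            private String current = "(preamble)";
            private boolean inBody = false;
            private boolean inHeading = false;

            private boolean isHeading(String qName) {
                return qName.matches("(?i)h[1-6]");
            }

            // Trim and collapse runs of whitespace to single spaces.
            private String squeeze(CharSequence s) {
                return s.toString().trim().replaceAll("\\s+", " ");
            }

            // Store the text collected so far under the current heading.
            private void flush() {
                String text = squeeze(body);
                if (!text.isEmpty()) {
                    sections.merge(current, text, (a, b) -> a + " " + b);
                }
                body.setLength(0);
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (isHeading(qName)) {
                    flush();                 // a new heading closes the previous section
                    heading.setLength(0);
                    inHeading = true;
                } else if ("body".equalsIgnoreCase(qName)) {
                    inBody = true;           // ignore <head> content such as <title>
                } else if (inBody) {
                    (inHeading ? heading : body).append(' '); // keep adjacent elements apart
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (isHeading(qName)) {
                    current = squeeze(heading);
                    inHeading = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (!inBody) {
                    return;
                }
                (inHeading ? heading : body).append(ch, start, length);
            }

            @Override
            public void endDocument() {
                flush();
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return sections;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Paper</title></head><body>"
                + "<h1>Abstract</h1><p>We study X.</p>"
                + "<h1>Introduction</h1><p>Background text.</p>"
                + "</body></html>";
        splitByHeadings(xhtml).forEach((h, t) -> System.out.println(h + " -> " + t));
    }
}
```

Sections with the same heading text are merged into one entry here; if that matters for your articles, a list of (heading, text) pairs would be the safer return type.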
Check out which resources are added to the classpath of "Tika-App":
https://wiki.apache.org/tika/GrobidJournalParser

Check out these examples:
https://tika.apache.org/1.14/gettingstarted.html
https://tika.apache.org/1.14/examples.html

--
Thamme Gowda
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!

On Thu, May 4, 2017 at 10:28 AM, [email protected] <[email protected]> wrote:

> Hi,
>
> Thanks for sharing the link.
>
> I need to integrate this feature into my Java code.
>
> Regards,
>
> On Thu, May 4, 2017 at 4:47 PM, Chris Mattmann <[email protected]> wrote:
>
>> FYI here:
>>
>> http://wiki.apache.org/tika/GrobidJournalParser
>>
>> From: "[email protected]" <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Thursday, May 4, 2017 at 8:38 AM
>> To: "[email protected]" <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Subject: Re: Analysing a document sections with Apache Tika
>>
>> Dear Thamme,
>>
>> Thanks for your reply and the suggestions.
>>
>> I built Grobid using the instructions from
>> http://grobid.readthedocs.io/en/latest/Install-Grobid/
>>
>> I am trying to run the following example code from the GitHub repository
>> (https://github.com/kermitt2/grobid-example):
>>
>> =================
>>
>> import org.grobid.core.*;
>> import org.grobid.core.data.*;
>> import org.grobid.core.factory.*;
>> import org.grobid.core.mock.*;
>> import org.grobid.core.utilities.*;
>> import org.grobid.core.engines.Engine;
>>
>> public class GrobidTest {
>>
>>     public GrobidTest() {
>>         // TODO Auto-generated constructor stub
>>     }
>>
>>     public static void main(String[] args)
>>     {
>>         run("D:/Eclipse-Workspace/PDFs/Train/6.pdf");
>>     }
>>
>>     public static void run(String faFileName)
>>     {
>>         String pdfPath = faFileName;
>>
>>         try {
>>             String pGrobidHome = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home";
>>             String pGrobidProperties = "D:/Eclipse-Workspace/Libraries/Grobid/grobid-home/config/grobid.properties";
>>
>>             MockContext.setInitialContext(pGrobidHome, pGrobidProperties);
>>             GrobidProperties.getInstance();
>>
>>             System.out.println(">>>>>>>> GROBID_HOME=" + GrobidProperties.get_GROBID_HOME_PATH());
>>
>>             Engine engine = GrobidFactory.getInstance().createEngine();
>>
>>             // Biblio object for the result
>>             BiblioItem resHeader = new BiblioItem();
>>             String tei = engine.processHeader(pdfPath, false, resHeader);
>>         }
>>         catch (Exception e) {
>>             // If an exception is generated, print a stack trace
>>             e.printStackTrace();
>>         }
>>         finally {
>>             try {
>>                 MockContext.destroyInitialContext();
>>             }
>>             catch (Exception e) {
>>                 e.printStackTrace();
>>             }
>>         }
>>     }
>>
>> }
>>
>> ================
>>
>> I am getting the following exception:
>>
>> javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
>>     at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
>>     at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
>>     at javax.naming.InitialContext.init(Unknown Source)
>>     at javax.naming.InitialContext.<init>(Unknown Source)
>>     at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:36)
>>     at org.grobid.core.mock.MockContext.setInitialContext(MockContext.java:76)
>>     at GrobidTest.run(GrobidTest.java:28)
>>     at GrobidTest.main(GrobidTest.java:17)
>> Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
>>     at java.net.URLClassLoader.findClass(Unknown Source)
>>     at java.lang.ClassLoader.loadClass(Unknown Source)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
>>     at java.lang.ClassLoader.loadClass(Unknown Source)
>>     at java.lang.Class.forName0(Native Method)
>>     at java.lang.Class.forName(Unknown Source)
>>     at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
>>     at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
>>     ... 8 more
>> javax.naming.NoInitialContextException: Cannot instantiate class: org.apache.naming.java.javaURLContextFactory [Root exception is java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory]
>>     at javax.naming.spi.NamingManager.getInitialContext(Unknown Source)
>>     at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source)
>>     at javax.naming.InitialContext.init(Unknown Source)
>>     at javax.naming.InitialContext.<init>(Unknown Source)
>>     at org.grobid.core.mock.MockContext.destroyInitialContext(MockContext.java:105)
>>     at GrobidTest.run(GrobidTest.java:45)
>>     at GrobidTest.main(GrobidTest.java:17)
>> Caused by: java.lang.ClassNotFoundException: org.apache.naming.java.javaURLContextFactory
>>     at java.net.URLClassLoader.findClass(Unknown Source)
>>     at java.lang.ClassLoader.loadClass(Unknown Source)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
>>     at java.lang.ClassLoader.loadClass(Unknown Source)
>>     at java.lang.Class.forName0(Native Method)
>>     at java.lang.Class.forName(Unknown Source)
>>     at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
>>     at com.sun.naming.internal.VersionHelper12.loadClass(Unknown Source)
>>     ... 7 more
>>
>> On Wed, May 3, 2017 at 6:16 PM, Thamme Gowda <[email protected]> wrote:
>>
>> Hello,
>>
>> There is a nice project called Grobid [1] that does most of what you are
>> describing.
>>
>> Tika has a Grobid parser built in (it calls Grobid over its REST API).
>> Check out [2] for details.
>>
>> I have a project that makes use of Tika with Grobid and NER support. It
>> also builds a search index using Solr.
>>
>> Check out [3] for setting up and [4] for parsing and indexing to Solr if
>> you would like to try out my Python project.
>>
>> Here I am able to extract the title, author names, affiliations, and the
>> whole text of articles.
>>
>> I did not extract sections within the main body of research articles. I
>> assume there should be a way to configure that in Grobid.
>>
>> Alternatively, if Grobid can't detect sections, you can try the XHTML
>> content handler, which preserves the basic structure of the PDF file using
>> <p>, <br> and heading tags. So technically it should be possible to write
>> a wrapper that breaks the XHTML output from Tika into sections.
>>
>> To get it:
>>
>> # In bash, run `pip install tika` if tika isn't already installed
>> import tika
>> tika.initVM()
>> from tika import parser
>>
>> file_path = "<pdf_dir>/2538.pdf"
>> data = parser.from_file(file_path, xmlContent=True)
>> print(data['content'])
>>
>> Best,
>> Thamme
>>
>> [1] http://grobid.readthedocs.io/en/latest/Introduction/
>> [2] https://wiki.apache.org/tika/GrobidJournalParser
>> [3] https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server
>> [4] https://github.com/USCDataScience/parser-indexer-py/blob/master/docs/parser-index-journals.md
>>
>> --
>> Thamme Gowda
>> TG | @thammegowda <https://twitter.com/thammegowda>
>> ~Sent via somebody's Webmail server!
>>
>> On Wed, May 3, 2017 at 9:34 AM, [email protected] <[email protected]> wrote:
>>
>> Hi,
>>
>> I am working with published research articles using Apache Tika. These
>> articles have distinct sections like abstract, introduction, literature
>> review, methodology, experimental setup, discussion and conclusions. Is
>> there some way to extract document sections with Apache Tika?
>>
>> Regards,
