Re: Analysing archive PDFs

Richard Eckart de Castilho Thu, 19 Feb 2015 12:51:43 -0800

On 19.02.2015, at 21:28, Philippe de Rochambeau <[email protected]> wrote:


> Hello,
> 
> In the past few months, I have indexed tens of thousands of PDFs containing 
> newspaper articles from 1887 until 1940 using SOLR for my company.
> 
> Every day, my colleagues in the Archive Department spend hours searching 
> through the archives using SOLR, looking for potentially-interesting articles 
> from a social and historical point of view.
> 
> Can UIMA or OpenNLP be used to automate their work and/or to analyze patterns 
> in the data?

I'd say that depends quite a bit on what kind of information your colleagues
search for. UIMA itself is just a framework to support unstructured information
analysis. It does not actually analyze text - that is the job of UIMA
components. There are many UIMA components for various kinds of tasks, in
particular for natural language processing tasks.
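
To illustrate the split between framework and components, here is a minimal
sketch in plain Python. It does not use the UIMA API; the names Document,
Annotation, token_annotator, year_annotator and run_pipeline are all made up
for this example. The idea is only that the "framework" passes a shared
document through a chain of components, and each component adds its own
annotations.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "Token" or "Year" (hypothetical type names)
    begin: int  # start offset in the document text
    end: int    # end offset (exclusive)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        """Return the text span an annotation points to."""
        return self.text[ann.begin:ann.end]


def token_annotator(doc):
    # Component 1: marks every run of word characters as a Token.
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", m.start(), m.end()))


def year_annotator(doc):
    # Component 2: marks four-digit years from 1800 to 1999.
    for m in re.finditer(r"\b1[89]\d{2}\b", doc.text):
        doc.annotations.append(Annotation("Year", m.start(), m.end()))


def run_pipeline(text, components):
    # The "framework" part: it knows nothing about language, it only
    # hands the shared document to each component in turn.
    doc = Document(text)
    for component in components:
        component(doc)
    return doc


if __name__ == "__main__":
    doc = run_pipeline("Paris, 1887: the tower rises.",
                       [token_annotator, year_annotator])
    for ann in doc.annotations:
        print(ann.kind, doc.covered_text(ann))
```

Swapping a component in or out changes what gets annotated without touching
run_pipeline, which is roughly the role UIMA plays for real components.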

OpenNLP provides tools for basic linguistic analysis of texts such as
part-of-speech tagging, parsing, and named entity recognition. OpenNLP also
provides some UIMA components. However, to use OpenNLP effectively, you need to
train models for it. Most models available for download from the OpenNLP
website give suboptimal results because they are trained only on small data
sets.

If you are looking for patterns, then UIMA Ruta might help. You can implement
patterns to detect and analyze certain kinds of information, e.g. bibliographic
records or information from a CV.
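
To give a rough idea of what rule-based pattern detection looks like, here is
a minimal sketch in plain Python using a regular expression. This is not Ruta
syntax and it does not use UIMA. The record layout "Surname, I. (YYYY). Title."
and the names CITATION and find_citations are assumptions made up for this
example.

```python
import re

# Hypothetical pattern for records shaped like "Surname, I. (YYYY). Title."
CITATION = re.compile(
    r"(?P<author>[A-Z][a-z]+, [A-Z]\.)\s+"   # "Smith, J."
    r"\((?P<year>1[89]\d{2}|20\d{2})\)\.\s+"  # "(1923)."
    r"(?P<title>[^.]+)\."                     # title up to the next full stop
)


def find_citations(text):
    """Return one dict (author, year, title) per matching record."""
    return [m.groupdict() for m in CITATION.finditer(text)]


if __name__ == "__main__":
    sample = ("See Smith, J. (1923). A History of the Press. "
              "Also Durand, P. (1931). Le Journal.")
    for record in find_citations(sample):
        print(record)
```

A rule language like Ruta works on annotations rather than raw characters, so
real rules can build on tokens, sentences or entities found by other
components. The sketch only shows the general idea of writing a pattern and
extracting structured fields from it.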

Apart from what Apache UIMA has to offer, I think these pointers might also be
interesting to you:

Topic modelling is a trending technology with respect to sieving through data
and detecting interesting things. There are many recent research publications
on this topic.

I recently tweeted this video [1], so I might as well share it here.

A colleague of mine uses topic models to analyze historical school books [2].
As part of this, we also built UIMA components in DKPro Core [3] to generate
topic models using the Mallet library [4].
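
In case topic models are new to you, here is a minimal sketch of one common
variant, LDA fitted with collapsed Gibbs sampling, in plain Python. It is not
Mallet and not the DKPro Core components mentioned above, and it is far too
slow for tens of thousands of documents. The function names lda_gibbs and
top_words, the hyperparameter values and the toy documents are assumptions for
illustration only.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts ...
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ... and resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return the n most frequent words of each topic."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


if __name__ == "__main__":
    docs = [
        "strike workers factory wages strike".split(),
        "election vote parliament minister vote".split(),
        "factory wages workers union".split(),
        "minister parliament election law".split(),
    ]
    topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
    print(top_words(topic_word))
    print(doc_topic)
```

The per-document topic counts in doc_topic are what makes this useful for
browsing an archive: documents can be grouped or ranked by the topics they
contain instead of by exact search terms.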

Cheers,

-- Richard

[1] http://nycdatascience.com/news/using-machine-learning-to-aid-journalism-at-the-new-york-times/
[2] https://www.ukp.tu-darmstadt.de/research/current-projects/welt-der-kinder/
[3] https://dkpro-core-asl.googlecode.com
[4] http://mallet.cs.umass.edu
