Hi Nicolas,

I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it
because you are more familiar with Spark, or were there other reasons?

I have been using UIMA-AS and am currently experimenting with DUCC, so I
would love to hear your thoughts on the matter.

 -John


________________________________________
From: Nicolas Paris [nipari...@gmail.com]
Sent: Thursday, September 14, 2017 5:32 PM
To: user@uima.apache.org
Subject: Re: UIMA analysis from a database

Hi Benedict

Not sure this is helpful for you, but here is some advice.
I recommend using UIMA for what it was first intended for: NLP pipelines.

When dealing with multi-threaded applications, I would go for dedicated
technologies.

I have been successfully using UIMA together with Apache Spark. This
design works well on a single computer, and I am now able to distribute
the UIMA pipeline over dozens of machines without extra work.

Then I focus on the UIMA pipeline doing its job well and, after testing,
industrialize it over Spark.
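As a rough sketch of the pattern (this assumes uimaFIT; MyPosTagger is a
placeholder for whatever annotator you build, not actual code from my repo):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.function.MapPartitionsFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.fit.factory.AnalysisEngineFactory;
    import org.apache.uima.jcas.JCas;

    public class UimaOverSpark {
      // Run a plain single-threaded UIMA pipeline on each Spark partition.
      static Dataset<String> tag(Dataset<Row> articles) {
        return articles.mapPartitions(
            (MapPartitionsFunction<Row, String>) rows -> {
              // One engine per partition: the annotator never sees threads;
              // Spark provides the parallelism across partitions.
              AnalysisEngine engine = AnalysisEngineFactory
                  .createEngine(MyPosTagger.class); // placeholder annotator
              JCas jcas = engine.newJCas();
              List<String> out = new ArrayList<>();
              while (rows.hasNext()) {
                jcas.reset();
                // assumes the text is the first column of each row
                jcas.setDocumentText(rows.next().getString(0));
                engine.process(jcas);
                out.add(jcas.getDocumentText()); // collect what you need
              }
              return out.iterator();
            },
            Encoders.STRING());
      }
    }

The UIMA code stays plain single-threaded NLP; Spark decides how many
partitions run at once, and on which machines.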

Advantages of this design are:
- benefit from Spark's distribution expertise (node failure, memory
  consumption, data partitioning...)
- simplify UIMA programming (no multithreading inside, only NLP stuff)
- scale when needed (add more cheap computers, get better performance)
- gain Spark expertise and reuse it with any Java code you'd like
- Spark has JDBC connectors and can fetch database rows in parallel
  easily (see the sketch after this list)
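For the JDBC reading in particular, Spark can split the query over a
numeric column so every executor fetches its own range of rows. A minimal
sketch (the connection URL, table, and column names are invented):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JdbcRead {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("UimaOnSpark")
            .getOrCreate();

        // Spark issues numPartitions range queries on the partition
        // column and fetches the ranges in parallel.
        Dataset<Row> articles = spark.read()
            .format("jdbc")
            .option("url", "jdbc:postgresql://dbhost/corpus") // invented
            .option("dbtable", "articles")                    // invented
            .option("partitionColumn", "id")
            .option("lowerBound", "1")
            .option("upperBound", "100000")
            .option("numPartitions", "16")
            .load();
      }
    }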

You can find a working example in my repo:
https://github.com/parisni/UimaOnSpark
It was not simple to get working, but I can now say this method is
robust and optimized.


On 14 Sept 2017 at 21:24, Benedict Holland wrote:
> Hello everyone,
>
> I am trying to get my project off the ground and hit a small problem.
>
> I want to read text from a large database (let's say 100,000+ rows). Each
> row will have a text article. I want to connect to the database, request a
> single row from the database, and process this document through an NLP
> engine, and I want to do this in parallel. Each document will be, say, split
> up into sentences and each sentence will be POS tagged.
>
> After reading the documentation, I am more confused than when I started. I
> think I want something like the FileSystemCollectionReader example and to
> build a CPE. Instead of reading from the file system, it would read from the
> database.
>
> There are two problems with this approach:
>
> 1. I am not sure it is multi-threaded: CAS initializers are deprecated, and
> it appears that the getNext() method will only run in a single thread.
> 2. The FileSystemCollectionReader loads references to the file location
> into memory but not the text itself.
>
> For problem 1, the line I find very troubling is
>
> File file = (File) mFiles.get(mCurrentIndex++);
>
> I have to assume from this line that the CollectionReader_ImplBase is not
> multi-threaded but is intended to rapidly iterate over a set of documents
> in a single thread.
>
> Problem 2 is easily solved, as I can create a massive array of integers if I
> feel like it.
>
> Anyway, after deciding that this is not likely the solution, I looked into
> Multi-view Sofa annotators. I don't think these do what I want either. In
> this context, I would treat the database table as a single object with many
> "views" being chunks of rows. I don't think this works, based on the
> SofaExampleAnnotator code provided. It also appears to run in a single
> thread.
>
> This leaves me with CAS pools. I know that this is going to be
> multi-threaded. I believe I create however many CAS objects I want from the
> annotator, probably an aggregate annotator. Is this correct, and am I
> on the right track with CAS pools?
>
> Thank you so much,
> ~Ben
