Hi Nicolas, I'm curious: why did you decide to use Spark over UIMA-AS or DUCC? Is it because you are more familiar with Spark, or were there other reasons?
I have been using UIMA-AS, and I am currently experimenting with DUCC, so I would love to hear your thoughts on the matter.

-John
________________________________________
From: Nicolas Paris [nipari...@gmail.com]
Sent: Thursday, September 14, 2017 5:32 PM
To: user@uima.apache.org
Subject: Re: UIMA analysis from a database

Hi Benedict,

Not sure this is helpful for you, but here is some advice. I recommend using UIMA for what it was first intended for: NLP pipelines. When dealing with multi-threaded applications, I would go for dedicated technologies.

I have been successfully using UIMA together with Apache Spark. While this design works well on a single computer, I am now able to distribute a UIMA pipeline over dozens of machines, with no extra work. I first focus on the UIMA pipeline doing its job well, and after testing, industrialize it over Spark.

Advantages of this design are:
- benefit from Spark's distribution expertise (node failure, memory consumption, data partitioning...)
- simpler UIMA programming (no multithreading inside, only NLP stuff)
- scale when needed (add more cheap computers, get better performance)
- gain expertise with Spark, and reuse it with any Java code you'd like
- Spark has JDBC connectors and can fetch data in parallel easily (see the sketch below)

You can find a working example in my repo: https://github.com/parisni/UimaOnSpark

This was not simple to get working, but I can now say the method is robust and optimized.
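In outline, the design looks something like the sketch below. This is a minimal illustration, not code from the UimaOnSpark repo: the JDBC URL, table and column names, partition bounds, output path, and the MyNlpAnnotator class are all placeholders.

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class UimaOnSparkSketch {

  // Placeholder annotator: real NLP work (sentence splitting, POS tagging,
  // ...) would go in process().
  public static class MyNlpAnnotator extends JCasAnnotator_ImplBase {
    @Override
    public void process(JCas jcas) { /* NLP logic here */ }
  }

  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("uima-on-spark")
        .master("local[*]")   // remove when submitting to a cluster
        .getOrCreate();

    // Spark's JDBC source reads the table in parallel; partitionColumn,
    // lowerBound, upperBound and numPartitions control how rows are split
    // across executors.
    Dataset<Row> articles = spark.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://db-host/corpus")  // placeholder
        .option("dbtable", "articles")                      // placeholder
        .option("partitionColumn", "id")
        .option("lowerBound", "1")
        .option("upperBound", "100000")
        .option("numPartitions", "16")
        .load();

    Dataset<String> texts = articles.select("text").as(Encoders.STRING());

    // One single-threaded UIMA pipeline per partition: Spark supplies the
    // parallelism, the annotator stays plain NLP code with no threading.
    Dataset<String> tagged = texts.mapPartitions(
        (MapPartitionsFunction<String, String>) it -> {
          AnalysisEngine engine =
              AnalysisEngineFactory.createEngine(MyNlpAnnotator.class);
          JCas jcas = JCasFactory.createJCas();
          List<String> out = new ArrayList<>();
          while (it.hasNext()) {
            jcas.reset();
            jcas.setDocumentText(it.next());
            engine.process(jcas);
            out.add(jcas.getDocumentText()); // replace with real result extraction
          }
          return out.iterator();
        },
        Encoders.STRING());

    tagged.write().mode("overwrite").parquet("/tmp/tagged"); // placeholder output
    spark.stop();
  }
}

Each partition builds its own single-threaded UIMA engine, so the annotator code contains no threading at all; Spark decides how many partitions run concurrently.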
On Sept. 14, 2017 at 21:24, Benedict Holland wrote:
> Hello everyone,
>
> I am trying to get my project off the ground and hit a small problem.
>
> I want to read text from a large database (let's say, 100,000+ rows). Each
> row will have a text article. I want to connect to the database, request a
> single row from the database, and process that document through an NLP
> engine, and I want to do this in parallel. Each document will be, say, split
> into sentences, and each sentence will be POS tagged.
>
> After reading the documentation, I am more confused than when I started. I
> think I want something like the FileSystemCollectionReader example and to
> build a CPE. Instead of reading from the file system, it would read from the
> database.
>
> There are two problems with this approach:
>
> 1. I am not sure it is multi-threaded: CAS initializers are deprecated, and
> it appears that the getNext() method will only run in a single thread.
> 2. The FileSystemCollectionReader loads references to the file locations
> into memory but not the text itself.
>
> For problem 1, the line I find very troubling is
>
> File file = (File) mFiles.get(mCurrentIndex++);
>
> I have to assume from this line that CollectionReader_ImplBase is not
> multi-threaded but is intended to rapidly iterate over a set of documents
> in a single thread.
>
> Problem 2 is easily solved, as I can create a massive array of integers if I
> feel like it.
>
> Anyway, after deciding that this is not likely the solution, I looked into
> multi-view Sofa annotators. I don't think these do what I want either. In
> this context, I would treat the database table as a single object with many
> "views" being chunks of rows. I don't think this works, based on the
> SofaExampleAnnotator code provided. It also appears to run in a single
> thread.
>
> This leaves me with CAS pools. I know that this is going to be
> multi-threaded. I believe I create however many CAS objects from the
> annotator I want, probably an aggregate annotator. Is this correct, and am I
> on the right track with CAS pools?
>
> Thank you so much,
> ~Ben
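For reference, the CAS-pool pattern Ben asks about looks roughly like this with the plain UIMA SDK. A minimal sketch: the "MyAggregate.xml" descriptor and the fetchNextDocument() helper are placeholders, and the aggregate's descriptor must declare multipleDeploymentAllowed for concurrent process() calls.

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.CasPool;
import org.apache.uima.util.XMLInputSource;

public class CasPoolSketch {
  static final int THREADS = 8;

  public static void main(String[] args) throws Exception {
    ResourceSpecifier spec = UIMAFramework.getXMLParser()
        .parseResourceSpecifier(new XMLInputSource("MyAggregate.xml")); // placeholder
    // produceAnalysisEngine(spec, n, 0) returns a multiprocessing engine
    // whose process() may be called from n threads at once.
    AnalysisEngine engine = UIMAFramework.produceAnalysisEngine(spec, THREADS, 0);
    CasPool pool = new CasPool(THREADS, engine);

    Runnable worker = () -> {
      String text;
      while ((text = fetchNextDocument()) != null) { // hypothetical DB read
        CAS cas = pool.getCas(0);                    // block until a CAS is free
        try {
          cas.setDocumentText(text);
          engine.process(cas);
          // ... read annotations out of the CAS here ...
        } catch (Exception e) {
          e.printStackTrace();
        } finally {
          pool.releaseCas(cas);                      // always return the CAS
        }
      }
    };

    Thread[] threads = new Thread[THREADS];
    for (int i = 0; i < THREADS; i++) {
      threads[i] = new Thread(worker);
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    engine.destroy();
  }

  // Placeholder: return the next article's text from the database,
  // or null when there is nothing left to process.
  static synchronized String fetchNextDocument() { return null; }
}

The CAS pool supplies one CAS per in-flight document, while the multiprocessing engine handles the concurrent process() calls, so this matches Ben's reading: CAS pools, not the collection reader, are where the multithreading lives.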