Indeed, I sometimes confused UIMA-FIT and UIMA-AS in my previous email.
On 15 Sept 2017 at 09:28, Nicolas Paris wrote:
> - Spark is simpler to learn than UIMA-AS (at least, I don't know DUCC).
> - Spark is more general-purpose and can be used in other projects; e.g. I
>   have used the same design to transform PDF to text with Apache PDFBox.
> - Spark can benefit from the YARN or Mesos job managers, on more than
>   10K computers.
> - Spark benefits from Hadoop HDFS distributed storage.
> - Spark benefits from new optimized data formats such as Avro, a very
>   robust, distributed binary format.
> - Spark processes partitioned data and writes to disk in batches (faster
>   than one by one).
> - Spark instantiates only one UIMA pipeline per partition and passes all
>   of its text through it, with good performance.
> - Spark can use Python/Java/Scala/R/Julia for preprocessing texts and
>   then send the result to UIMA.
> - Spark has connectors for databases, and interfaces well with Apache
>   Sqoop, to fetch data from a relational database in parallel very
>   easily.
> - Spark has native machine-learning tooling, and can be extended with
>   Python or R libraries.
>
> - UIMA-AS is another way to program UIMA.
> - UIMA-FIT is complicated.
> - UIMA-FIT only works with UIMA.
> - UIMA only focuses on text annotation.
> - UIMA is not good at:
>   - text transformation
>   - reading data from a source in parallel
>   - writing data to a folder in parallel
>   - machine-learning interfaces
>
> The only difficult part has already been addressed: getting it to work.
> You can read my messy repository to get started.
>
> On 15 Sept 2017 at 04:28, Osborne, John D wrote:
> > Hi Nicolas,
> >
> > I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it
> > because you are more familiar with Spark, or were there other reasons?
> >
> > I have been using UIMA-AS, I am currently experimenting with DUCC, and
> > would love to hear your thoughts on the matter.
> >
> > -John
> >
> > ________________________________________
> > From: Nicolas Paris [[email protected]]
> > Sent: Thursday, September 14, 2017 5:32 PM
> > To: [email protected]
> > Subject: Re: UIMA analysis from a database
> >
> > Hi Benedict,
> >
> > Not sure this is helpful for you; just a piece of advice.
> > I recommend using UIMA for what it was first intended for: NLP
> > pipelines.
> >
> > When dealing with multithreaded applications, I would go for dedicated
> > technologies.
> >
> > I have been successfully using UIMA together with Apache Spark. While
> > this design works well on a single computer, I am now able to
> > distribute a UIMA pipeline over dozens of machines, with no extra work.
> >
> > So I focus on the UIMA pipeline doing its job well and, after testing,
> > industrialize it over Spark.
> >
> > The advantages of this design are:
> > - benefit from Spark's distribution expertise (node failure, memory
> >   consumption, data partitioning...)
> > - simplify UIMA programming (no multithreading inside, only NLP stuff)
> > - scale when needed (add more cheap computers, get better performance)
> > - get expertise with Spark, and use it with any Java code you'd like
> > - Spark has JDBC connectors and should be able to fetch data in
> >   parallel easily (a sketch of this follows below).
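To make two of those points concrete (the parallel JDBC read, and one UIMA
pipeline per partition), here is a minimal, untested Java sketch of the idea;
it is not the exact code from my repo. The connection string, table and
column names, and the descriptor path are all made up; adapt them to your
own setup:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.function.MapPartitionsFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    public class UimaOnSparkSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                .appName("uima-on-spark").getOrCreate();

            // Parallel JDBC read: Spark opens numPartitions connections,
            // each fetching a slice of the table split on the id column.
            Dataset<Row> docs = spark.read().format("jdbc")
                .option("url", "jdbc:postgresql://dbhost/corpus") // made up
                .option("dbtable", "articles")   // made-up table (id, text)
                .option("partitionColumn", "id")
                .option("lowerBound", "1")
                .option("upperBound", "100000")
                .option("numPartitions", "16")
                .load();

            // One UIMA pipeline per partition: the AnalysisEngine is built
            // once per partition and reused for every row.
            Dataset<String> results = docs.mapPartitions(
                (MapPartitionsFunction<Row, String>) rows -> {
                    ResourceSpecifier spec = UIMAFramework.getXMLParser()
                        .parseResourceSpecifier(
                            new XMLInputSource("desc/MyPipeline.xml")); // made up
                    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);
                    CAS cas = ae.newCAS();
                    List<String> out = new ArrayList<>();
                    while (rows.hasNext()) {
                        cas.reset();
                        cas.setDocumentText(rows.next().<String>getAs("text"));
                        ae.process(cas);
                        out.add(cas.getAnnotationIndex().size() + " annotations");
                    }
                    return out.iterator();
                }, Encoders.STRING());

            // Batch write: one output file per partition.
            results.write().text("/tmp/uima-output");
            spark.stop();
        }
    }

The key detail is that the AnalysisEngine is created inside mapPartitions,
so descriptor parsing and model loading are paid once per partition rather
than once per document.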
> >
> > You can find a working example in my repo:
> > https://github.com/parisni/UimaOnSpark
> > It was not simple to get working, but I can tell you now that this
> > method is robust and optimized.
> >
> > On 14 Sept 2017 at 21:24, Benedict Holland wrote:
> > > Hello everyone,
> > >
> > > I am trying to get my project off the ground and hit a small problem.
> > >
> > > I want to read text from a large database (let's say, 100,000+ rows).
> > > Each row will have a text article. I want to connect to the database,
> > > request a single row from the database, and process this document
> > > through an NLP engine, and I want to do this in parallel. Each
> > > document will be, say, split up into sentences, and each sentence
> > > will be POS tagged.
> > >
> > > After reading the documentation, I am more confused than when I
> > > started. I think I want something like the FileSystemCollectionReader
> > > example and to build a CPE. Instead of reading from the file system,
> > > it will read from the database.
> > >
> > > There are two problems with this approach:
> > >
> > > 1. I am not sure it is multithreaded: CAS initializers are
> > > deprecated, and it appears that the getNext() method will only run in
> > > a single thread.
> > > 2. The FileSystemCollectionReader loads references to the file
> > > locations into memory, but not the text itself.
> > >
> > > For problem 1, the line I find very troubling is
> > >
> > > File file = (File) mFiles.get(mCurrentIndex++);
> > >
> > > I have to assume from this line that the CollectionReader_ImplBase is
> > > not multithreaded but is intended to rapidly iterate over a set of
> > > documents in a single thread.
> > >
> > > Problem 2 is easily solved, as I can create a massive array of
> > > integers if I feel like it.
> > >
> > > Anyway, after deciding that this is not likely the solution, I looked
> > > into multi-view Sofa annotators. I don't think these do what I want
> > > either. In this context, I would treat the database table as a single
> > > object with many "views" being chunks of rows. I don't think this
> > > works, based on the SofaExampleAnnotator code provided. It also
> > > appears to run in a single thread.
> > >
> > > This leaves me with CAS pools. I know that this is going to be
> > > multithreaded. I believe I create however many CAS objects from the
> > > annotator I want, probably an aggregate annotator. Is this correct,
> > > and am I on the right track with CAS pools?
> > >
> > > Thank you so much,
> > > ~Ben
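Regarding Ben's last question: CAS pools are indeed the standard UIMA answer
to in-process multithreading. You pair a CasPool with an AnalysisEngine
produced with aMaxSimultaneousRequests greater than 1; each worker thread
checks a CAS out, processes it, and releases it. A rough, untested sketch,
assuming a made-up aggregate descriptor desc/MyAggregate.xml and stub input:

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.CasPool;
    import org.apache.uima.util.XMLInputSource;

    public class CasPoolSketch {
        public static void main(String[] args) throws Exception {
            final int nThreads = 8;

            ResourceSpecifier spec = UIMAFramework.getXMLParser()
                .parseResourceSpecifier(
                    new XMLInputSource("desc/MyAggregate.xml")); // made up

            // The extra arguments make process() safe to call from up to
            // nThreads threads at once.
            final AnalysisEngine ae =
                UIMAFramework.produceAnalysisEngine(spec, nThreads, 0);
            final CasPool pool = new CasPool(nThreads, ae);

            Runnable worker = () -> {
                // A real reader would pull rows from the database here.
                for (String text : new String[] {"First article.",
                                                 "Second article."}) {
                    CAS cas = pool.getCas(0); // 0 = block until a CAS is free
                    try {
                        cas.setDocumentText(text);
                        ae.process(cas);
                        // read annotations out of the CAS here
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    } finally {
                        pool.releaseCas(cas); // return the CAS; the pool resets it
                    }
                }
            };
            for (int i = 0; i < nThreads; i++) new Thread(worker).start();
        }
    }

Note that in a CPE the collection reader itself still runs in a single
thread (which matches what Ben observed about getNext()); the parallelism
comes from the processing threads sharing the pool.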
