Hello everyone, I am trying to get my project off the ground and hit a small problem.
I want to read text from a large database (let's say 100,000+ rows), where each row holds a text article. I want to connect to the database, request one row at a time, and run each document through an NLP engine, and I want to do this in parallel. Each document will be split into sentences, and each sentence will be POS tagged. After reading the documentation, I am more confused than when I started.

My first thought was something like the FileSystemCollectionReader example: build a CPE, but have the reader pull from the database instead of the file system. There are two problems with this approach:

1. I am not sure it is multi-threaded. CAS initializers are deprecated, and it appears that the getNext() method will only run in a single thread. The line I find very troubling is

       File file = (File) mFiles.get(mCurrentIndex++);

   From this line I have to assume that CollectionReader_ImplBase is not multi-threaded, but is instead intended to iterate rapidly over a set of documents in a single thread.

2. The FileSystemCollectionReader loads references to the file locations into memory, but not the text itself. This one is easily solved: instead of file references, I can keep an array of row IDs, however large it needs to be.

After deciding that this is not likely the solution, I looked into multi-view Sofa annotators. I don't think these do what I want either. In this context, I would treat the database table as a single subject of analysis, with each "view" being a chunk of rows. Based on the SofaExampleAnnotator code provided, I don't think that works, and it also appears to run in a single thread.

That leaves me with CAS pools. I know that this is going to be multi-threaded. I believe I create however many CAS objects I need from the annotator I want, probably an aggregate annotator. Is this correct, and am I on the right track with CAS pools?

Thank you so much,
~Ben
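In case it helps clarify what I mean by a database-backed reader, here is a rough sketch of what I was imagining, modeled on FileSystemCollectionReader. It is only a sketch under my assumptions: the `articles` table and its `id`/`text` columns are made up, and the JDBC URL would really come from a configuration parameter. Note it still iterates in a single thread, which is exactly my concern:

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

public class DatabaseCollectionReader extends CollectionReader_ImplBase {

  private Connection conn;
  private ResultSet rows;     // cursor over the (hypothetical) articles table
  private int current = 0;
  private int total = 0;

  @Override
  public void initialize() {
    try {
      // connection URL is a placeholder; in practice it would be a config parameter
      conn = DriverManager.getConnection("jdbc:postgresql://localhost/corpus");
      Statement countStmt = conn.createStatement();
      ResultSet count = countStmt.executeQuery("SELECT COUNT(*) FROM articles");
      count.next();
      total = count.getInt(1);
      rows = conn.createStatement().executeQuery("SELECT id, text FROM articles");
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public boolean hasNext() {
    return current < total;
  }

  @Override
  public void getNext(CAS aCAS) throws IOException {
    try {
      rows.next();
      aCAS.setDocumentText(rows.getString("text"));  // one row becomes one CAS
      current++;
    } catch (Exception e) {
      throw new IOException(e);
    }
  }

  @Override
  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(current, total, Progress.ENTITIES) };
  }

  @Override
  public void close() throws IOException {
    try {
      conn.close();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}
```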

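And here is roughly how I pictured the CAS pool approach, so you can tell me if I have the wrong mental model. This is a sketch, not working code: `AggregateTagger.xml` is a made-up descriptor name for the sentence-splitting/POS-tagging aggregate, and `fetchNextArticle()` is a stand-in for the "SELECT the next unprocessed row" step. The idea is one AnalysisEngine allowed N simultaneous process() calls, a CasPool of N CASes, and N worker threads:

```java
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.CasPool;
import org.apache.uima.util.XMLInputSource;

public class PooledPipeline {

  static final int N_THREADS = 8;

  public static void main(String[] args) throws Exception {
    // "AggregateTagger.xml" is a placeholder descriptor name
    ResourceSpecifier spec = UIMAFramework.getXMLParser()
        .parseResourceSpecifier(new XMLInputSource("AggregateTagger.xml"));

    // second argument = max simultaneous process() requests; the framework
    // replicates non-thread-safe delegates up to this count
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, N_THREADS, 0);
    CasPool pool = new CasPool(N_THREADS, ae);

    Thread[] workers = new Thread[N_THREADS];
    for (int i = 0; i < N_THREADS; i++) {
      workers[i] = new Thread(() -> {
        String text;
        while ((text = fetchNextArticle()) != null) {
          CAS cas = pool.getCas(0);          // 0 = block until a CAS is free
          try {
            cas.setDocumentText(text);
            ae.process(cas);                 // sentence split + POS tag
          } catch (Exception e) {
            e.printStackTrace();
          } finally {
            pool.releaseCas(cas);            // CAS is reset and returned to pool
          }
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      t.join();
    }
    ae.destroy();
  }

  // placeholder for the database read; synchronized so two threads
  // never hand out the same row
  static synchronized String fetchNextArticle() {
    return null;  // would return the next row's text, or null when done
  }
}
```

Is this the intended pattern, or does the CPE already do this for me if I set the processing-unit thread count?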