On Thu, Mar 28, 2013 at 12:15 PM, <[email protected]> wrote:
> Thanks! I like the idea of sending my own thread pool to the batch scanner,
> that would definitely be the better solution.

Would you like to open a ticket about this issue?

I just remembered, there is an issue w/ this approach to be aware of. I
have seen this when multiple threads share a batch scanner (more on this
below). Consider the following situation.

 1. Thread A gives a lot of work to BatchScanner1 using ThreadPool1,
    creating BatchScannerIterator1
 2. BatchScannerIterator1's internal queue fills up as a result of the work
    given by Thread A
 3. All threads in ThreadPool1 block trying to add to
    BatchScannerIterator1's queue
 4. Thread B gives a lot of work to BatchScanner2 using ThreadPool1,
    creating BatchScannerIterator2
 5. Thread B attempts to iterate over BatchScannerIterator2, but blocks
    forever because no threads service it

This problem occurs because Thread A never reads from BatchScannerIterator1.

In the current code, multiple threads can use a BatchScanner. You just need
to make configuring the BatchScanner and getting an iterator an atomic
operation. When an iterator is created by a batch scanner, it copies the
config that exists at that point in time. Changes to the BatchScanner config
after an iterator is created will not affect the iterator.

>
> Yeah I thought about creating a batch scanner with only one thread, but I was
> not sure if that is making a separate thread (outside of the current one) or
> using the current one. At the time I did not want a new thread to be created
> at all. Though, didn't realize the Scanner was also spinning up a thread at
> all, thought that was in process.

The batch scanner will create a new thread pool w/ one thread.

>
> To mitigate the separate RPC call per range, would it make more sense to do a
> "binRanges" based on the ranges at the tablets to reduce the number of ranges?

You probably do not want to combine ranges; that could bring back data in
the gaps between ranges.

>
> On Mar 28, 2013, at 11:55 AM, Keith Turner <[email protected]> wrote:
>
>> I took a quick look at the code.
>> Excluding the threading issue, a major conceptual difference is that
>> BatchScannerWithScanners seems to do an RPC round trip for each range.
>> The TabletServerBatchReader sends all of the ranges that a tablet server
>> needs to look up in one RPC.
>>
>> Instead of creating a BatchScannerWithScanners, maybe you could create
>> a batch scanner with just one thread when resources are exceeded?
>> This will be similar to what you are doing now, just one thread will
>> be doing work fetching data. The client thread would just be waiting
>> on this background thread. Although this does allow the processing
>> of results to happen concurrently with fetching of data. Using
>> BatchScannerWithScanners would not allow this.
>>
>> Something to be aware of, the regular scanner will spin up a read
>> ahead thread if you read a lot of data through it. It does not do
>> this immediately, only after fetching a few batches of key value pairs
>> from the tablet server. If this happens you could have one thread
>> fetching data while the client thread processes results.
>>
>> Do you think we should open a ticket about giving users control over
>> threads created by client code? Maybe users could pass in their own
>> thread pool to a batch scanner?
>>
>>
>> Keith
>>
>> On Thu, Mar 28, 2013 at 11:00 AM, <[email protected]> wrote:
>>> In some of my projects, we needed to control the number of threads spun up
>>> with the use of multiple batch scanners. We created a utility to control
>>> the number of threads, and if the max threads has been reached, return a
>>> batch scanner that is actually backed by Scanners. Wanted to get any
>>> feedback on the code. Seems like such a simple thing to do, I bet someone
>>> already has this. Thanks!
>>>
>>> https://github.com/calrissian/mango/tree/master/accumulo
>
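
To make the shared thread pool scenario from the reply at the top of this thread concrete, here is a rough sketch of it using only java.util.concurrent. It does not use any Accumulo classes: the class SharedPoolStarvation, the methods submitWork and firstResultForB, the queue capacity of 1, and the 500 ms timeout are all made up for illustration, and each bounded queue only stands in for a batch scanner iterator's internal queue. A timed poll is used in place of a real iterator so the program ends instead of hanging.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SharedPoolStarvation {
    static final int POOL_THREADS = 2;

    // Hypothetical stand-in for starting a batch scan: each task pushes rows
    // into a small bounded queue, and put() blocks while that queue is full.
    static BlockingQueue<String> submitWork(ExecutorService pool, int tasks) {
        BlockingQueue<String> results = new ArrayBlockingQueue<>(1);
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            pool.execute(() -> {
                try {
                    for (int row = 0; row < 10; row++) {
                        results.put("task" + id + "-row" + row);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // pool was shut down
                }
            });
        }
        return results;
    }

    // Returns the first row reader B sees, or null if nothing arrives in time.
    static String firstResultForB(boolean sharePool) {
        ExecutorService poolA = Executors.newFixedThreadPool(POOL_THREADS);
        ExecutorService poolB = sharePool ? poolA : Executors.newFixedThreadPool(POOL_THREADS);
        try {
            // Thread A's scan: one task per pool thread, and its queue is never read,
            // so every thread in poolA ends up blocked in put().
            submitWork(poolA, POOL_THREADS);
            // Thread B's scan: with a shared pool its task waits behind A's stuck tasks.
            BlockingQueue<String> queueB = submitWork(poolB, 1);
            return queueB.poll(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            poolA.shutdownNow(); // interrupt the blocked workers so the JVM can exit
            poolB.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String shared = firstResultForB(true);
        String separate = firstResultForB(false);
        System.out.println("shared pool:    " + (shared == null ? "reader B starved" : shared));
        System.out.println("separate pools: " + (separate == null ? "reader B starved" : separate));
    }
}
```

With the shared pool, reader B gets nothing because no worker is free to run its task. With separate pools, B is served even though A's queue is still never drained. If users could pass in their own pool, they would need either to make sure every iterator is read or to size the pool with this in mind.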
