Vaibhav, I have included some answers below.
Cheers, Adam On Mon, Jul 13, 2015 at 11:19 AM, vaibhav thapliyal < [email protected]> wrote: > Dear all, > > I have the following questions on intersecting iterator and partition ids > used in document sharded indexing: > > 1. Can we run a boolean and query using the current intersecting iterator > on a given range of ids. These ids are a subset of the total ids stored in > the column qualifier field as per the document sharded indexing format. > The IntersectingIterator is designed to do index intersections, which are very similar to boolean AND queries. It does require indexes to be built in a particular fashion. You should play around with the WikiSearch example ( https://accumulo.apache.org/example/wikisearch.html) to get familiar with its use. > If it's not possible with current iterator can I tweak the existing one? > If you are indexing documents similar to what the IntersectingIterator expects then you should be able to get it to work for you. More generally, any row-local logic can be implemented in an iterator. If you're not building indexes then you might want to look at the RowFilter as a starting point. > 2. Is the partitioning suggested in document sharded indexing logical or > physical. For eg if I have 30 partition ids do I have to physically > presplit the table based on the partition ids for the and query to run in > the most efficient way so that I have 30 tablets in table? > You don't have to pre-split -- Accumulo will automatically split big rows into their own tablets. However, there are some performance advantages to pre-splitting before your tablet gets big enough to split on its own. > 3. Lastly, Can anybody suggest me the number of partitions for > document sharded indexing. What should I look for when deciding it? > You have to consider a few factors for this: (a) ingest parallelization, for which you want approximately as many partitions as you have cores in your cluster, (b) size of a partition when full, which you want to be under about 20GB for compaction performance reasons, and (c) query parallelism, for which you want no more than a small factor of the number of cores in your cluster to reduce query latency. If you can't find a solution that works for all of these factors then you will be forced to make trade-offs (or do something complicated like time-based partitioning). > Thanks > Vaibhav >
