Re: Questions on intersecting iterator and partition ids

Adam Fuchs Mon, 13 Jul 2015 10:57:08 -0700

Vaibhav,

I have included some answers below.


Cheers,
Adam

On Mon, Jul 13, 2015 at 11:19 AM, vaibhav thapliyal <
[email protected]> wrote:

> Dear all,
>
> I have the following questions on intersecting iterator and partition ids
> used in document sharded indexing:
>
> 1. Can we run a boolean AND query using the current intersecting iterator
> on a given range of ids? These ids are a subset of the total ids stored in
> the column qualifier field as per the document sharded indexing format.
>
The IntersectingIterator is designed to do index intersections, which are
very similar to boolean AND queries. It does require indexes to be built in
a particular fashion. You should play around with the WikiSearch example (
https://accumulo.apache.org/example/wikisearch.html) to get familiar with
its use.
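
To make "index intersection" concrete, here is a minimal plain-Java sketch of
the idea, restricted to a range of document ids as asked in the question. It
is not the IntersectingIterator's actual code and uses no Accumulo classes.
The class and method names are hypothetical, and it assumes each term's
document ids within one partition arrive sorted, the way column qualifiers
would be:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Accumulo.
final class ShardIntersectionSketch {

    // Intersects sorted document-id lists (one list per term) from a single
    // partition, keeping only ids in [minId, maxId] (inclusive, string order).
    static List<String> intersect(List<List<String>> postings, String minId, String maxId) {
        List<String> result = new ArrayList<>();
        if (postings.isEmpty()) {
            return result;
        }
        int[] pos = new int[postings.size()];
        while (true) {
            // The largest current head is the only id that can still match everywhere.
            String candidate = null;
            for (int i = 0; i < postings.size(); i++) {
                List<String> list = postings.get(i);
                if (pos[i] >= list.size()) {
                    return result; // one term is exhausted, no more matches
                }
                String head = list.get(pos[i]);
                if (candidate == null || head.compareTo(candidate) > 0) {
                    candidate = head;
                }
            }
            if (candidate.compareTo(maxId) > 0) {
                return result; // past the requested id range
            }
            // Advance every list to the candidate (or just beyond it).
            boolean allMatch = true;
            for (int i = 0; i < postings.size(); i++) {
                List<String> list = postings.get(i);
                while (pos[i] < list.size() && list.get(pos[i]).compareTo(candidate) < 0) {
                    pos[i]++;
                }
                if (pos[i] >= list.size()) {
                    return result;
                }
                if (!list.get(pos[i]).equals(candidate)) {
                    allMatch = false;
                }
            }
            if (allMatch) {
                if (candidate.compareTo(minId) >= 0) {
                    result.add(candidate);
                }
                for (int i = 0; i < pos.length; i++) {
                    pos[i]++;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<List<String>> postings = Arrays.asList(
                Arrays.asList("d1", "d3", "d5", "d7"),        // ids for term A
                Arrays.asList("d3", "d4", "d5", "d7"),        // ids for term B
                Arrays.asList("d2", "d3", "d5", "d7", "d9")); // ids for term C
        System.out.println(intersect(postings, "d4", "d7")); // prints [d5, d7]
    }
}
```

In a real table the same logic would run inside the tablet server, once per
partition row, which is why the index layout matters.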

> If it's not possible with the current iterator, can I tweak the existing one?
>
If you are indexing documents similar to what the IntersectingIterator
expects then you should be able to get it to work for you. More generally,
any row-local logic can be implemented in an iterator. If you're not
building indexes then you might want to look at the RowFilter as a starting
point.

>  2. Is the partitioning suggested in document sharded indexing logical or
> physical? For example, if I have 30 partition ids, do I have to physically
> presplit the table based on the partition ids, so that I have 30 tablets
> in the table, for the AND query to run in the most efficient way?
>
You don't have to pre-split -- Accumulo will automatically split tablets at
row boundaries as they grow, so big rows end up in their own tablets.
However, there are some performance advantages to pre-splitting rather than
waiting for a tablet to get big enough to split on its own.
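
If you do pre-split, the split points can be generated from the partition
ids. The sketch below is plain Java with hypothetical names. It assumes
partition ids are zero-padded decimal strings used as row ids (so string
order matches numeric order) and that a split point is the last row of a
tablet, so ids 0 through n-2 yield n tablets with one partition each:

```java
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Accumulo.
final class PartitionSplitSketch {

    // Builds one split point per partition boundary for ids 0..numPartitions-1.
    static SortedSet<String> splitPoints(int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        // Pad to the width of the largest id so "02" sorts before "10".
        int width = String.valueOf(numPartitions - 1).length();
        SortedSet<String> splits = new TreeSet<>();
        // The last partition needs no split point after it.
        for (int p = 0; p < numPartitions - 1; p++) {
            splits.add(String.format(Locale.ROOT, "%0" + width + "d", p));
        }
        return splits;
    }

    public static void main(String[] args) {
        SortedSet<String> splits = splitPoints(30);
        System.out.println(splits.size() + " split points, from "
                + splits.first() + " to " + splits.last());
    }
}
```

The resulting strings would still need to be converted to Text and handed to
TableOperations.addSplits for the table in question.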

>  3. Lastly, can anybody suggest the number of partitions for document
> sharded indexing? What should I look for when deciding it?
>
You have to consider a few factors for this: (a) ingest parallelization,
for which you want approximately as many partitions as you have cores in
your cluster, (b) size of a partition when full, which you want to be under
about 20GB for compaction performance reasons, and (c) query parallelism,
for which you want no more than a small factor of the number of cores in
your cluster to reduce query latency. If you can't find a solution that
works for all of these factors then you will be forced to make trade-offs
(or do something complicated like time-based partitioning).
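
As a back-of-the-envelope version of those three factors, the following
plain-Java sketch computes a lower and an upper bound on the partition
count. The names are hypothetical, and so are the inputs in the example:
the total data size and the "small factor" of 4 are assumptions you would
replace with your own numbers, while the 20GB limit is the figure from
above:

```java
// Hypothetical illustration class; not part of Accumulo.
final class PartitionCountSketch {

    // Returns {lowerBound, upperBound} for the number of partitions.
    static long[] partitionBounds(int totalCores, double totalDataGb,
                                  double maxPartitionGb, int queryFactor) {
        // (a) ingest parallelism: roughly one partition per core.
        long forIngest = totalCores;
        // (b) partition size: enough partitions to keep each under the limit.
        long forSize = (long) Math.ceil(totalDataGb / maxPartitionGb);
        // (c) query parallelism: at most a small multiple of the core count.
        long upper = (long) totalCores * queryFactor;
        return new long[] {Math.max(forIngest, forSize), upper};
    }

    // True when some partition count satisfies all three factors at once.
    static boolean feasible(long[] bounds) {
        return bounds[0] <= bounds[1];
    }

    public static void main(String[] args) {
        // Assumed cluster: 40 cores, 2000GB of index data, factor of 4.
        long[] small = partitionBounds(40, 2000.0, 20.0, 4);
        System.out.println(small[0] + ".." + small[1] + " feasible=" + feasible(small));
        // Same cluster with 10000GB: the size factor exceeds the query limit.
        long[] large = partitionBounds(40, 10000.0, 20.0, 4);
        System.out.println(large[0] + ".." + large[1] + " feasible=" + feasible(large));
    }
}
```

When the lower bound exceeds the upper bound, as in the second case, no
single partition count satisfies every factor and one of the trade-offs
mentioned above has to be made.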

>  Thanks
> Vaibhav
>
