Re: Feedback about techniques for tuning batch scanning for my problem

Josh Elser Mon, 23 May 2016 04:10:07 -0700

This probably isn't a big issue unless you're running into stability issues
with Accumulo. They're both designed to scale horizontally. Unless you have
a reason that they can't be colocated, it's fine.
On May 21, 2016 2:29 PM, "David Medinets" <[email protected]> wrote:


> Why are you sharing the machines between Accumulo and Spark? Does Spark give
> you any kind of data locality that Accumulo does? Could it be better to use
> the full amount of memory for each?
> On May 21, 2016 1:15 PM, "Mario Pastorelli" <
> [email protected]> wrote:
>
>> Currently, setting the number of threads to either the number of servers or
>> the number of cores yields similar performance for scanning with
>> BatchScanner. Thanks for the advice, I will try to use half of the cores of
>> each machine in the cluster.
>>
>> Anything else?
>>
>> On Sat, May 21, 2016 at 5:03 AM, David Medinets <[email protected]
>> > wrote:
>>
>>> It's been a few years so I don't remember the specific property names.
>>> Set one thread count to the number of servers times the number of cores to
>>> start. Multiply by .5 if Spark is equally as active as Accumulo. Look in
>>> properties.java for the property names.
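
A minimal sketch of that starting-point arithmetic, assuming the intent is to halve the count when Spark is about as busy as Accumulo. The class and method names are hypothetical, the 10-server cluster size is an assumption for illustration, and only the 32 cores per machine comes from this thread.

```java
final class ScanThreads {
    /**
     * Hypothetical helper: servers * coresPerServer, scaled by the share of
     * CPU assumed to be left for Accumulo (0.5 when Spark is equally active).
     */
    static int startingThreads(int servers, int coresPerServer, double accumuloShare) {
        int threads = (int) Math.floor(servers * coresPerServer * accumuloShare);
        return Math.max(1, threads); // never go below one thread
    }

    public static void main(String[] args) {
        int servers = 10; // assumption: cluster size is not stated in the thread
        int cores = 32;   // from the thread: 32 cores per machine
        System.out.println(startingThreads(servers, cores, 0.5)); // prints 160
    }
}
```

The result is only a starting value for the BatchScanner's query thread count; whether it helps has to be measured on the actual cluster.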
>>>
>>> On Fri, May 20, 2016 at 10:09 AM, Mario Pastorelli <
>>> [email protected]> wrote:
>>>
>>>> Machines have 32 cores shared between Accumulo and Spark. Each machine
>>>> has 5 disks that hold HDFS and that Accumulo can use. How many
>>>> threads should I use?
>>>>
>>>> On Fri, May 20, 2016 at 3:49 PM, David Medinets <
>>>> [email protected]> wrote:
>>>>
>>>>> How many cores are on your servers? There are several thread counts
>>>>> you can change. Even +1 thread per server counts at some point if you have
>>>>> enough servers in the cluster.
>>>>>
>>>>> On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> You mean the BatchScanner number of threads? I've made it parametric
>>>>>> and usually I use 1 or 2 threads per tablet server. Going up doesn't seem
>>>>>> to do anything for the performance.
>>>>>>
>>>>>> On Thu, May 19, 2016 at 6:21 PM, David Medinets <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Have you tuned thread counts?
>>>>>>> On May 19, 2016 11:08 AM, "Mario Pastorelli" <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hey people,
>>>>>>>> I'm trying to tune the query performance a bit to see how fast it
>>>>>>>> can go, and I thought it would be great to have comments from the
>>>>>>>> community. The problem that I'm trying to solve in Accumulo is the
>>>>>>>> following: we want to store the entities that have been in a certain
>>>>>>>> location on a certain day. The location is a Long and the entity id is
>>>>>>>> a Long. I want to be able to scan ~1M rows in a few seconds, possibly
>>>>>>>> less than one. Right now, I'm doing the following things:
>>>>>>>>
>>>>>>>>    1. I'm using a sharding byte at the start of the rowId to keep
>>>>>>>>    the data in the same range distributed across the cluster
>>>>>>>>    2. all the records are encoded; a single record is composed of
>>>>>>>>       1. rowId: 1 shard byte + 3 bytes for the day
>>>>>>>>       2. column family: 8 bytes for the long corresponding to the
>>>>>>>>       hash of the location
>>>>>>>>       3. column qualifier: 8 bytes corresponding to the identifier
>>>>>>>>       of the entity
>>>>>>>>       4. value: 2 bytes for some additional information
>>>>>>>>    3. I use a batch scanner because I don't need sorting and it's
>>>>>>>>    faster
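
The record layout above can be sketched in plain Java as follows. This is an illustration under assumptions, not the poster's code: the class and method names are hypothetical, the day is assumed to be a non-negative integer that fits in 3 bytes (for example, days since some epoch), all fields are assumed big-endian, and the way the shard byte is derived from the entity id is invented for the example.

```java
import java.nio.ByteBuffer;

final class LocationRecord {
    /** Row id: 1 shard byte followed by the day as a 3-byte big-endian integer. */
    static byte[] rowId(byte shard, int day) {
        if (day < 0 || day > 0xFFFFFF) {
            throw new IllegalArgumentException("day must fit in 3 bytes");
        }
        return new byte[] { shard, (byte) (day >>> 16), (byte) (day >>> 8), (byte) day };
    }

    /** 8-byte big-endian encoding, used here for both column family and qualifier. */
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Assumption: pick the shard by spreading entity ids over numShards buckets. */
    static byte shardFor(long entityId, int numShards) {
        return (byte) Math.floorMod(Long.hashCode(entityId), numShards);
    }

    public static void main(String[] args) {
        long locationHash = 123456789L; // made-up sample values
        long entityId = 42L;
        byte[] row = rowId(shardFor(entityId, 8), 16942);
        byte[] family = longBytes(locationHash);
        byte[] qualifier = longBytes(entityId);
        byte[] value = new byte[2];
        // 4 + 8 + 8 + 2 bytes, matching the 22 bytes per entry mentioned in the thread
        System.out.println(row.length + family.length + qualifier.length + value.length); // prints 22
    }
}
```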
>>>>>>>>
>>>>>>>> As expected, it takes a few seconds to scan 1M rows but now I'm
>>>>>>>> wondering if I can improve it. My ideas are the following:
>>>>>>>>
>>>>>>>>    1. set table.compaction.major.ratio to 1 because I don't care
>>>>>>>>    about the ingestion performance and this should improve the query
>>>>>>>>    performance
>>>>>>>>    2. pre-split tables to match the number of servers and then use
>>>>>>>>    a shard byte as the first byte of the rowId. This should improve both
>>>>>>>>    writing and reading the data because, from what I understand, both
>>>>>>>>    should work in parallel
>>>>>>>>    3. enable bloom filters on the table
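
For the pre-splitting idea, one way to derive split points from a one-byte shard prefix is sketched below. It assumes shards are numbered 0 to N-1 and that each shard should land in its own tablet; the class and method names are hypothetical. With the Accumulo client, the resulting byte arrays would then be wrapped as Text and handed to the table operations' addSplits call, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardSplits {
    /**
     * Returns N-1 single-byte split points (1, 2, ..., N-1) for N shards.
     * Rows starting with shard byte s are longer than the split point [s],
     * so they sort after it and before [s + 1], giving one tablet per shard.
     */
    static List<byte[]> splitPoints(int numShards) {
        if (numShards < 1 || numShards > 256) {
            throw new IllegalArgumentException("numShards must be in 1..256");
        }
        List<byte[]> splits = new ArrayList<>();
        for (int s = 1; s < numShards; s++) {
            splits.add(new byte[] { (byte) s });
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumption: 8 shards, e.g. one per tablet server
        System.out.println(splitPoints(8).size()); // prints 7
    }
}
```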
>>>>>>>>
>>>>>>>> Do you think those ideas make sense? Furthermore, I have two
>>>>>>>> questions:
>>>>>>>>
>>>>>>>>    1. considering that a single entry is only 22 bytes but I'm
>>>>>>>>    going to scan ~1M records per query, do you think I should change
>>>>>>>>    the BatchScanner buffers somehow?
>>>>>>>>    2. anything else to improve the scan speed? Again, I don't care
>>>>>>>>    about the ingestion time
>>>>>>>>
>>>>>>>> Thanks for the help!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>>>
>>>>>>>> *software engineer*
>>>>>>>>
>>>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>>>> phone: +41794381682
>>>>>>>> email: [email protected]
>>>>>>>> www.teralytics.net
>>>>>>>>
>>>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>>>> Canton Zurich
>>>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>>>>>>> Schmitz, Yann de Vries
>>>>>>>>
>>>>>>>> This e-mail message contains confidential information which is for
>>>>>>>> the sole attention and use of the intended recipient. Please notify us
>>>>>>>> at once if you think that it may not be intended for you and delete it
>>>>>>>> immediately.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
