I've had success with this same model (HDFS DataNodes and TServers on the same hosts) to take advantage of those short-circuit read settings. They make a major difference if your workload is read-I/O bound, which was the case for my MapReduce/Spark applications. Depending on row count and pre-computed table splits, I have seen anywhere from 5% to 62% improvement in overall job execution time.
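For reference, HDFS short-circuit local reads are enabled with settings along these lines in hdfs-site.xml. This is a minimal sketch; the socket path is an example and varies per install:

```xml
<!-- hdfs-site.xml: enable short-circuit local reads so colocated
     TServers can read local blocks directly, bypassing the DataNode -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- example path; any local path accessible to both DataNode and client works -->
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

Both the DataNodes and the HDFS clients (here, the TServers) need these settings for the fast path to engage.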
On Mon, May 23, 2016 at 6:09 AM, Josh Elser <[email protected]> wrote:

> This probably isn't a big issue unless you're running into stability
> issues with Accumulo. They're both designed to scale horizontally. Unless
> you have a reason that they can't be colocated, it's fine.

On May 21, 2016 2:29 PM, "David Medinets" <[email protected]> wrote:

> Why are you sharing the machines between Accumulo and Spark? Does Spark
> give you any kind of data locality that Accumulo does? Could it be better
> to use the full amount of memory for each?

On May 21, 2016 1:15 PM, "Mario Pastorelli" <[email protected]> wrote:

> Currently, setting the number of threads to either the number of servers
> or the number of cores yields similar performance for scanning with a
> BatchScanner. Thanks for the advice, I will try to use half of the cores
> of each machine in the cluster.
>
> Anything else?

On Sat, May 21, 2016 at 5:03 AM, David Medinets <[email protected]> wrote:

> It's been a few years, so I don't remember the specific property names.
> Set one thread count to the number of servers times the number of cores
> to start. Halve that if Spark is equally as active as Accumulo. Look in
> Property.java for the property names.

On Fri, May 20, 2016 at 10:09 AM, Mario Pastorelli <[email protected]> wrote:

> The machines have 32 cores shared between Accumulo and Spark. Each
> machine has 5 disks on which HDFS runs and which Accumulo can use. How
> many threads should I use?

On Fri, May 20, 2016 at 3:49 PM, David Medinets <[email protected]> wrote:

> How many cores are on your servers? There are several thread counts you
> can change. Even +1 thread per server counts at some point if you have
> enough servers in the cluster.
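The sizing rule David describes (servers times cores as a starting point, halved when Spark shares the machines) is plain arithmetic; the helper below is an illustrative sketch, not an Accumulo API:

```java
// Hypothetical helper for the thread-count rule above: start from
// servers * cores, and halve it when Spark is equally active on the
// same machines. The result would be passed as numQueryThreads when
// creating a BatchScanner.
public class ThreadSizing {
    public static int batchScannerThreads(int servers, int coresPerServer,
                                          boolean sparkShares) {
        int threads = servers * coresPerServer;
        // never go below one thread
        return sparkShares ? Math.max(1, threads / 2) : threads;
    }
}
```

For the cluster in this thread (32 cores per machine, shared with Spark), the rule suggests 16 threads per server as a starting point, to be tuned from there.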
On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <[email protected]> wrote:

> You mean the BatchScanner number of threads? I've made it parametric, and
> usually I use 1 or 2 threads per tablet server. Going higher doesn't seem
> to do anything for the performance.

On Thu, May 19, 2016 at 6:21 PM, David Medinets <[email protected]> wrote:

> Have you tuned thread counts?

On May 19, 2016 11:08 AM, "Mario Pastorelli" <[email protected]> wrote:

> Hey people,
> I'm trying to tune the query performance a bit to see how fast it can go,
> and I thought it would be great to have comments from the community. The
> problem that I'm trying to solve in Accumulo is the following: we want to
> store the entities that have been in a certain location on a certain day.
> The location is a Long and the entity id is a Long. I want to be able to
> scan ~1M rows in a few seconds, possibly less than one. Right now, I'm
> doing the following things:
>
> 1. I'm using a sharding byte at the start of the rowId to keep the data
>    in the same range distributed across the cluster
> 2. all the records are encoded; a single record is composed of:
>    1. rowId: 1 shard byte + 3 bytes for the day
>    2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>    3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>    4. value: 2 bytes for some additional information
> 3. I use a BatchScanner because I don't need sorting and it's faster
>
> As expected, it takes a few seconds to scan 1M rows, but now I'm
> wondering if I can improve it. My ideas are the following:
>
> 1. set table.compaction.major.ratio to 1, because I don't care about
>    ingestion performance and this should improve query performance
> 2. pre-split tables to match the number of servers and then use a shard
>    byte as the first byte of the rowId. As far as I understand, this
>    should improve both writing and reading, because both can work in
>    parallel
> 3. enable the bloom filter on the table
>
> Do you think those ideas make sense? Furthermore, I have two questions:
>
> 1. considering that a single entry is only 22 bytes but I'm going to
>    scan ~1M records per query, do you think I should change the
>    BatchScanner buffers somehow?
> 2. anything else to improve the scan speed? Again, I don't care about
>    ingestion time.
>
> Thanks for the help!
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: [email protected]
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register
> Canton Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
> Yann de Vries
>
> This e-mail message contains confidential information which is for the
> sole attention and use of the intended recipient. Please notify us at
> once if you think that it may not be intended for you and delete it
> immediately.
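The 22-byte layout described in the original question can be sketched as below. The class and method names, the shard count, and the day encoding (days since the epoch, packed big-endian) are assumptions for illustration, not details from the thread:

```java
import java.nio.ByteBuffer;

// Sketch of the 22-byte record: 4-byte rowId (1 shard byte + 3 day
// bytes), 8-byte family (location hash), 8-byte qualifier (entity id),
// 2-byte value.
public class RecordLayout {
    static final int NUM_SHARDS = 16; // assumption: match the pre-split count

    // rowId: shard byte derived from the entity id, then the day packed
    // big-endian into 3 bytes
    public static byte[] rowId(long entityId, int daysSinceEpoch) {
        int shard = Math.floorMod(Long.hashCode(entityId), NUM_SHARDS);
        return new byte[] {
            (byte) shard,
            (byte) (daysSinceEpoch >>> 16),
            (byte) (daysSinceEpoch >>> 8),
            (byte) daysSinceEpoch
        };
    }

    // family and qualifier are both 8-byte big-endian longs
    public static byte[] longBytes(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }
}
```

The sizes add up as stated in the question: 4 (rowId) + 8 (family) + 8 (qualifier) + 2 (value) = 22 bytes per entry.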
