Re: Feedback about techniques for tuning batch scanning for my problem

Mario Pastorelli Sun, 22 May 2016 02:39:55 -0700

Accumulo and HDFS are for storage, Spark is for processing.

On Sat, May 21, 2016 at 8:29 PM, David Medinets <[email protected]>
wrote:


> Why are you sharing the machines between Accumulo and Spark? Does Spark give
> you any kind of data locality that Accumulo does? Could it be better to use
> the full amount of memory for each?
> On May 21, 2016 1:15 PM, "Mario Pastorelli" <
> [email protected]> wrote:
>
>> Currently, setting the number of threads to the number of servers or to
>> the number of cores yields similar scanning performance with
>> BatchScanner. Thanks for the advice, I will try using half of the cores of
>> each machine in the cluster.
>>
>> Anything else?
>>
>> On Sat, May 21, 2016 at 5:03 AM, David Medinets <[email protected]>
>> wrote:
>>
>>> It's been a few years, so I don't remember the specific property names.
>>> To start, set one thread count to the number of servers times the number of
>>> cores. Multiply by 0.5 (halve it) if Spark is as active as Accumulo. Look in
>>> Property.java for the property names.
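
As a rough sketch of the rule of thumb above: this is plain arithmetic, not an Accumulo API. The function name and the `spark_share` parameter are made up for illustration, and the server count in the usage note is hypothetical because the thread never states it.

```python
def starting_thread_count(servers: int, cores_per_server: int,
                          spark_share: float = 0.0) -> int:
    """Return a starting point for a scan thread count.

    spark_share is the assumed fraction of each machine's cores left to
    Spark (0.5 when Spark is about as active as Accumulo).
    """
    if servers < 1 or cores_per_server < 1:
        raise ValueError("servers and cores_per_server must be positive")
    if not 0.0 <= spark_share < 1.0:
        raise ValueError("spark_share must be in [0.0, 1.0)")
    # servers x cores, scaled down by the share of the cores Spark uses
    threads = int(servers * cores_per_server * (1.0 - spark_share))
    return max(1, threads)
```

For example, a hypothetical 10-server cluster with 32 cores per machine, shared evenly with Spark, gives `starting_thread_count(10, 32, 0.5)`, that is 160 threads, as a first value to measure against.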
>>>
>>> On Fri, May 20, 2016 at 10:09 AM, Mario Pastorelli <
>>> [email protected]> wrote:
>>>
>>>> The machines have 32 cores shared between Accumulo and Spark. Each machine
>>>> has 5 disks that hold HDFS and that Accumulo can use. How many
>>>> threads should I use?
>>>>
>>>> On Fri, May 20, 2016 at 3:49 PM, David Medinets <
>>>> [email protected]> wrote:
>>>>
>>>>> How many cores are on your servers? There are several thread counts
>>>>> you can change. Even +1 thread per server counts at some point if you have
>>>>> enough servers in the cluster.
>>>>>
>>>>> On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> You mean the BatchScanner number of threads? I've made it parametric
>>>>>> and usually I use 1 or 2 threads per tablet server. Going up doesn't seem
>>>>>> to do anything for the performance.
>>>>>>
>>>>>> On Thu, May 19, 2016 at 6:21 PM, David Medinets <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Have you tuned thread counts?
>>>>>>> On May 19, 2016 11:08 AM, "Mario Pastorelli" <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hey people,
>>>>>>>> I'm trying to tune the query performance a bit to see how fast it
>>>>>>>> can go, and I thought it would be great to have comments from the
>>>>>>>> community. The problem that I'm trying to solve in Accumulo is the
>>>>>>>> following: we want to store the entities that have been in a certain
>>>>>>>> location on a certain day. The location is a Long and the entity id is
>>>>>>>> a Long. I want to be able to scan ~1M rows in a few seconds, possibly
>>>>>>>> less than one. Right now, I'm doing the following things:
>>>>>>>>
>>>>>>>>    1. I'm using a sharding byte at the start of the rowId to keep
>>>>>>>>    the data in the same range distributed across the cluster
>>>>>>>>    2. all the records are encoded; each record is composed of
>>>>>>>>       1. rowId: 1 shard byte + 3 bytes for the day
>>>>>>>>       2. column family: 8 bytes for the long corresponding to the
>>>>>>>>       hash of the location
>>>>>>>>       3. column qualifier: 8 bytes corresponding to the identifier
>>>>>>>>       of the entity
>>>>>>>>       4. value: 2 bytes for some additional information
>>>>>>>>    3. I use a batch scanner because I don't need sorting and it's
>>>>>>>>    faster
>>>>>>>> As expected, it takes a few seconds to scan 1M rows, but now I'm
>>>>>>>> wondering if I can improve it. My ideas are the following:
>>>>>>>>
>>>>>>>>    1. set table.compaction.major.ratio to 1 because I don't care
>>>>>>>>    about the ingestion performance and this should improve the query
>>>>>>>>    performance
>>>>>>>>    2. pre-split tables to match the number of servers and then use
>>>>>>>>    a shard byte as the first byte of the rowId. This should improve
>>>>>>>>    both writing and reading the data because both should then work in
>>>>>>>>    parallel, from what I understand
>>>>>>>>    3. enable bloom filters on the table
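
To illustrate the pre-splitting idea, here is a small sketch that computes one split point per shard boundary. It only builds the byte strings and does not call Accumulo. It assumes shard bytes are numbered 0 to num_shards - 1 and that a tablet holds the rows after the previous split point up to and including its own; the function name is made up.

```python
def shard_split_points(num_shards: int) -> list:
    """Return single-byte split points that give one tablet per shard."""
    if not 1 <= num_shards <= 256:
        raise ValueError("a single shard byte allows 1 to 256 shards")
    # A split at byte s ends the tablet for shard s - 1: every rowId that
    # starts with shard s has further bytes after it, so it sorts after the
    # bare byte s and falls into the next tablet.
    return [bytes([s]) for s in range(1, num_shards)]
```

With 4 shards this yields the three split points `\x01`, `\x02` and `\x03`, which would then be added to the table before ingesting.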
>>>>>>>>
>>>>>>>> Do you think those ideas make sense? Furthermore, I have two
>>>>>>>> questions:
>>>>>>>>
>>>>>>>>    1. considering that a single entry is only 22 bytes but I'm
>>>>>>>>    going to scan ~1M records per query, do you think I should change
>>>>>>>>    the BatchScanner buffers somehow?
>>>>>>>>    2. anything else to improve the scan speed? Again, I don't care
>>>>>>>>    about the ingestion time
>>>>>>>>
>>>>>>>> Thanks for the help!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>>>
>>>>>>>> *software engineer*
>>>>>>>>
>>>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>>>> phone: +41794381682
>>>>>>>> email: [email protected]
>>>>>>>> www.teralytics.net
>>>>>>>>
>>>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>>>> Canton Zurich
>>>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>>>>>>> Schmitz, Yann de Vries
>>>>>>>>
>>>>>>>> This e-mail message contains confidential information which is for
>>>>>>>> the sole attention and use of the intended recipient. Please notify us
>>>>>>>> at once if you think that it may not be intended for you and delete it
>>>>>>>> immediately.
>>>>>>>>


