This probably isn't a big issue unless you're running into stability issues with Accumulo. They're both designed to scale horizontally. Unless you have a reason that they can't be colocated, it's fine.

On May 21, 2016 2:29 PM, "David Medinets" <[email protected]> wrote:
> Why are Accumulo and Spark sharing the same machines? Does Spark give you
> any kind of data locality that Accumulo does? Could it be better to use the
> full amount of memory for each?
>
> On May 21, 2016 1:15 PM, "Mario Pastorelli" <[email protected]> wrote:
>
>> Currently, setting the number of threads to either the number of servers
>> or the number of cores yields similar performance when scanning with
>> BatchScanner. Thanks for the advice, I will try to use half of the cores
>> of each machine in the cluster.
>>
>> Anything else?
>>
>> On Sat, May 21, 2016 at 5:03 AM, David Medinets <[email protected]> wrote:
>>
>>> It's been a few years so I don't remember the specific property names.
>>> Set one thread count to the number of servers times the number of cores
>>> to start. Multiply by 0.5 if Spark is as active as Accumulo. Look in
>>> Property.java for the property names.
>>>
>>> On Fri, May 20, 2016 at 10:09 AM, Mario Pastorelli <[email protected]> wrote:
>>>
>>>> Machines have 32 cores shared between Accumulo and Spark. Each machine
>>>> has 5 disks that hold HDFS and that Accumulo can use. How many threads
>>>> should I use?
>>>>
>>>> On Fri, May 20, 2016 at 3:49 PM, David Medinets <[email protected]> wrote:
>>>>
>>>>> How many cores are on your servers? There are several thread counts
>>>>> you can change. Even +1 thread per server counts at some point if you
>>>>> have enough servers in the cluster.
>>>>>
>>>>> On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <[email protected]> wrote:
>>>>>
>>>>>> You mean the BatchScanner number of threads? I've made it a parameter
>>>>>> and usually I use 1 or 2 threads per tablet server. Going up doesn't
>>>>>> seem to do anything for the performance.
>>>>>>
>>>>>> On Thu, May 19, 2016 at 6:21 PM, David Medinets <[email protected]> wrote:
>>>>>>
>>>>>>> Have you tuned thread counts?
>>>>>>>
>>>>>>> On May 19, 2016 11:08 AM, "Mario Pastorelli" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey people,
>>>>>>>> I'm trying to tune the query performance a bit to see how fast it
>>>>>>>> can go, and I thought it would be great to have comments from the
>>>>>>>> community. The problem that I'm trying to solve in Accumulo is the
>>>>>>>> following: we want to store the entities that have been in a certain
>>>>>>>> location on a certain day. The location is a Long and the entity id
>>>>>>>> is a Long. I want to be able to scan ~1M rows in a few seconds,
>>>>>>>> possibly less than one. Right now, I'm doing the following things:
>>>>>>>>
>>>>>>>> 1. I'm using a sharding byte at the start of the rowId to keep
>>>>>>>>    the data in the same range distributed across the cluster
>>>>>>>> 2. all the records are encoded; each record is composed of
>>>>>>>>    1. rowId: 1 shard byte + 3 bytes for the day
>>>>>>>>    2. column family: 8 bytes for the long corresponding to the
>>>>>>>>       hash of the location
>>>>>>>>    3. column qualifier: 8 bytes corresponding to the identifier
>>>>>>>>       of the entity
>>>>>>>>    4. value: 2 bytes for some additional information
>>>>>>>> 3. I use a batch scanner because I don't need sorting and it's
>>>>>>>>    faster
>>>>>>>>
>>>>>>>> As expected, it takes a few seconds to scan 1M rows, but now I'm
>>>>>>>> wondering if I can improve it. My ideas are the following:
>>>>>>>>
>>>>>>>> 1. set table.compaction.major.ratio to 1 because I don't care
>>>>>>>>    about the ingestion performance and this should improve the
>>>>>>>>    query performance
>>>>>>>> 2. pre-split tables to match the number of servers and then use
>>>>>>>>    a shard byte as the first byte of the rowId. This should improve
>>>>>>>>    both writing and reading the data because, as far as I
>>>>>>>>    understand, both should work in parallel
>>>>>>>> 3. enable bloom filters on the table
>>>>>>>>
>>>>>>>> Do you think those ideas make sense?
>>>>>>>> Furthermore, I have two questions:
>>>>>>>>
>>>>>>>> 1. considering that a single entry is only 22 bytes but I'm
>>>>>>>>    going to scan ~1M records per query, do you think I should
>>>>>>>>    change the BatchScanner buffers somehow?
>>>>>>>> 2. anything else to improve the scan speed? Again, I don't care
>>>>>>>>    about the ingestion time
>>>>>>>>
>>>>>>>> Thanks for the help!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mario Pastorelli | TERALYTICS
>>>>>>>>
>>>>>>>> *software engineer*
>>>>>>>>
>>>>>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>>>>>> phone: +41794381682
>>>>>>>> email: [email protected]
>>>>>>>> www.teralytics.net
>>>>>>>>
>>>>>>>> Company registration number: CH-020.3.037.709-7 | Trade register
>>>>>>>> Canton Zurich
>>>>>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark
>>>>>>>> Schmitz, Yann de Vries
>>>>>>>>
>>>>>>>> This e-mail message contains confidential information which is for
>>>>>>>> the sole attention and use of the intended recipient. Please notify
>>>>>>>> us at once if you think that it may not be intended for you and
>>>>>>>> delete it immediately.
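For reference, the 22-byte entry layout and the pre-split idea from the original message can be sketched roughly as below. This is a minimal Python sketch under stated assumptions, not the poster's actual code: the thread does not say how the shard byte is chosen or how the day is packed into 3 bytes, so deriving the shard from the entity id, counting days from the Unix epoch, and the names `encode_entry`, `shard_split_points`, and `NUM_SHARDS` are all hypothetical.

```python
import struct
from datetime import date

EPOCH = date(1970, 1, 1)   # assumption: days are counted from the Unix epoch
NUM_SHARDS = 16            # hypothetical; e.g. the number of tablet servers


def encode_entry(day, location_hash, entity_id, extra, num_shards=NUM_SHARDS):
    """Encode one record as (row, column_family, column_qualifier, value)."""
    if not 1 <= num_shards <= 256:
        raise ValueError("the shard must fit in a single byte")
    # Assumption: derive the shard from the entity id so that entries for the
    # same day and location are spread across all shards.
    shard = entity_id % num_shards
    days = (day - EPOCH).days
    row = bytes([shard]) + days.to_bytes(3, "big")    # 1 + 3 bytes
    column_family = struct.pack(">q", location_hash)  # 8 bytes, signed long
    column_qualifier = struct.pack(">q", entity_id)   # 8 bytes, signed long
    value = struct.pack(">H", extra)                  # 2 bytes
    return row, column_family, column_qualifier, value


def shard_split_points(num_shards=NUM_SHARDS):
    """Split points that give each shard byte its own tablet."""
    return [bytes([shard]) for shard in range(1, num_shards)]
```

With this layout a query for one day and location needs one range per shard, which is why the number of shards, the number of pre-split tablets, and the BatchScanner thread count are worth tuning together.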
