I've had success with this same model (HDFS DataNodes and TServers on the same hosts) to take advantage of those short-circuit read settings. They make a major difference if your workload is read-I/O bound, which was the case for my MapReduce/Spark applications. Depending on row count and pre-computed table splits, I have seen anywhere from 5% to 62% improvement in overall job execution time.
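For reference, HDFS short-circuit local reads are enabled with settings along these lines in hdfs-site.xml. This is a minimal sketch; the socket path is an example and varies per install:

```xml
<!-- hdfs-site.xml: enable short-circuit local reads so colocated
     TServers can read local blocks directly, bypassing the DataNode -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- example path; any local path accessible to both DataNode and client works -->
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

Both the DataNodes and the HDFS clients (here, the TServers) need these settings for the fast path to engage.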
On Mon, May 23, 2016 at 6:09 AM, Josh Elser <[email protected]> wrote:

> This probably isn't a big issue unless you're running into stability
> issues with Accumulo. They're both designed to scale horizontally. Unless
> you have a reason that they can't be colocated, it's fine.

On May 21, 2016 2:29 PM, "David Medinets" <[email protected]> wrote:

> Why are you sharing the machines between Accumulo and Spark? Does Spark
> give you any kind of data locality that Accumulo does? Could it be better
> to use the full amount of memory for each?

On May 21, 2016 1:15 PM, "Mario Pastorelli" <[email protected]> wrote:

> Currently, setting the number of threads to either the number of servers
> or the number of cores yields similar performance for scanning with a
> BatchScanner. Thanks for the advice, I will try to use half of the cores
> of each machine in the cluster.
>
> Anything else?

On Sat, May 21, 2016 at 5:03 AM, David Medinets <[email protected]> wrote:

> It's been a few years, so I don't remember the specific property names.
> Set one thread count to the number of servers times the number of cores
> to start. Halve that if Spark is equally as active as Accumulo. Look in
> Property.java for the property names.

On Fri, May 20, 2016 at 10:09 AM, Mario Pastorelli <[email protected]> wrote:

> The machines have 32 cores shared between Accumulo and Spark. Each
> machine has 5 disks on which HDFS runs and which Accumulo can use. How
> many threads should I use?

On Fri, May 20, 2016 at 3:49 PM, David Medinets <[email protected]> wrote:

> How many cores are on your servers? There are several thread counts you
> can change. Even +1 thread per server counts at some point if you have
> enough servers in the cluster.
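The sizing rule David describes (servers times cores as a starting point, halved when Spark shares the machines) is plain arithmetic; the helper below is an illustrative sketch, not an Accumulo API:

```java
// Hypothetical helper for the thread-count rule above: start from
// servers * cores, and halve it when Spark is equally active on the
// same machines. The result would be passed as numQueryThreads when
// creating a BatchScanner.
public class ThreadSizing {
    public static int batchScannerThreads(int servers, int coresPerServer,
                                          boolean sparkShares) {
        int threads = servers * coresPerServer;
        // never go below one thread
        return sparkShares ? Math.max(1, threads / 2) : threads;
    }
}
```

For the cluster in this thread (32 cores per machine, shared with Spark), the rule suggests 16 threads per server as a starting point, to be tuned from there.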
On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <[email protected]> wrote:

> You mean the BatchScanner number of threads? I've made it parametric, and
> usually I use 1 or 2 threads per tablet server. Going higher doesn't seem
> to do anything for the performance.

On Thu, May 19, 2016 at 6:21 PM, David Medinets <[email protected]> wrote:

> Have you tuned thread counts?

On May 19, 2016 11:08 AM, "Mario Pastorelli" <[email protected]> wrote:

> Hey people,
> I'm trying to tune the query performance a bit to see how fast it can go,
> and I thought it would be great to have comments from the community. The
> problem that I'm trying to solve in Accumulo is the following: we want to
> store the entities that have been in a certain location on a certain day.
> The location is a Long and the entity id is a Long. I want to be able to
> scan ~1M rows in a few seconds, possibly less than one. Right now, I'm
> doing the following things:
>
> 1. I'm using a sharding byte at the start of the rowId to keep the data
>    in the same range distributed across the cluster
> 2. all the records are encoded; a single record is composed of:
>    1. rowId: 1 shard byte + 3 bytes for the day
>    2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>    3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>    4. value: 2 bytes for some additional information
> 3. I use a BatchScanner because I don't need sorting and it's faster
>
> As expected, it takes a few seconds to scan 1M rows, but now I'm
> wondering if I can improve it. My ideas are the following:
>
> 1. set table.compaction.major.ratio to 1, because I don't care about
>    ingestion performance and this should improve query performance
> 2. pre-split tables to match the number of servers and then use a shard
>    byte as the first byte of the rowId. As far as I understand, this
>    should improve both writing and reading, because both can work in
>    parallel
> 3. enable the bloom filter on the table
>
> Do you think those ideas make sense? Furthermore, I have two questions:
>
> 1. considering that a single entry is only 22 bytes but I'm going to
>    scan ~1M records per query, do you think I should change the
>    BatchScanner buffers somehow?
> 2. anything else to improve the scan speed? Again, I don't care about
>    ingestion time.
>
> Thanks for the help!
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: [email protected]
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register
> Canton Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
> Yann de Vries
>
> This e-mail message contains confidential information which is for the
> sole attention and use of the intended recipient. Please notify us at
> once if you think that it may not be intended for you and delete it
> immediately.
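The 22-byte layout described in the original question can be sketched as below. The class and method names, the shard count, and the day encoding (days since the epoch, packed big-endian) are assumptions for illustration, not details from the thread:

```java
import java.nio.ByteBuffer;

// Sketch of the 22-byte record: 4-byte rowId (1 shard byte + 3 day
// bytes), 8-byte family (location hash), 8-byte qualifier (entity id),
// 2-byte value.
public class RecordLayout {
    static final int NUM_SHARDS = 16; // assumption: match the pre-split count

    // rowId: shard byte derived from the entity id, then the day packed
    // big-endian into 3 bytes
    public static byte[] rowId(long entityId, int daysSinceEpoch) {
        int shard = Math.floorMod(Long.hashCode(entityId), NUM_SHARDS);
        return new byte[] {
            (byte) shard,
            (byte) (daysSinceEpoch >>> 16),
            (byte) (daysSinceEpoch >>> 8),
            (byte) daysSinceEpoch
        };
    }

    // family and qualifier are both 8-byte big-endian longs
    public static byte[] longBytes(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }
}
```

The sizes add up as stated in the question: 4 (rowId) + 8 (family) + 8 (qualifier) + 2 (value) = 22 bytes per entry.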
