Hi Maxim,
What you're seeing is an artifact of the threading model that Accumulo
uses. When you launch a query, Accumulo tablet servers will coordinate RPCs
via Thrift in one thread pool (which grows unbounded) and queue up scans
(rfile lookups, decryption/decompression, iterators, etc.) in
Watch out for ACCUMULO-4578 if you're using --cancel on one of the affected
versions (1.7.2, 1.8.0, or earlier).
Adam
On Tue, Dec 12, 2017 at 7:57 AM, Mike Walch wrote:
> There should be a mention of the --cancel option in the docs. I created a
> PR to add it to the 2.0
Sven,
You might consider using a combination of AccumuloInputFormat and
AccumuloFileOutputFormat in a map/reduce job. The job will run in parallel,
speeding up your transformation, the map/reduce framework should help with
hiccups, and the bulk load at the end provides an atomic, eventually
cache hit rate was?
Adam
On Mon, Sep 12, 2016 at 9:14 AM, Josh Elser <josh.el...@gmail.com> wrote:
> 5 iterations, figured that would be apparent from the log messages :)
>
> The code is already posted in my original message.
>
> Adam Fuchs wrote:
>
>> Josh,
>>
Josh,
Two questions:
1. How many iterations did you do? I would like to see an absolute number
of lookups per second to compare against other observations.
2. Can you post your code somewhere so I can run it?
Thanks,
Adam
On Sat, Sep 10, 2016 at 3:01 PM, Josh Elser
Cyrille,
I think you're going to have to do a few things to get the nodes to act as
a cluster:
1. How would you like your Zookeeper cluster to be set up? If you're
planning on using a one-node Zookeeper instance on the master node, then
you may need to turn zookeeper off on your second node and
I'll be there.
Adam
On Thu, May 19, 2016 at 11:01 AM, Josh Elser wrote:
> Out of curiosity, are there going to be any Accumulo-folks at Hadoop
> Summit in San Jose, CA at the end of June?
>
> - Josh
>
Nice writeup!
Thanks,
Adam
On Tue, Jan 12, 2016 at 11:59 AM, Keith Turner wrote:
> We just completed a three day test of Fluo using Common Crawl data that
> went pretty well.
>
> http://fluo.io/webindex-long-run/
>
>
>
I totally agree, Christopher. I have also run into a few situations where
it would have been nice to have something like a mutation listener hook,
particularly for generating indexing and stats records.
Adam
On Tue, Dec 8, 2015 at 5:59 PM, Christopher wrote:
> In the
Mike,
I suspect if you get rid of the "localhost" line and restart Accumulo then
you will get services listening on the non-loopback IPs. Right now you have
some of your processes accessible outside your VM and others only
accessible from inside, and you probably have two tablet servers when you
Josef,
If these are intermittent failures, you might consider turning on the
watcher [1] to automatically restart your processes. This should keep your
cluster from atrophying over time. You'll still have to take administrative
action to fix the DNS problem, but your availability should be
I bet what you're seeing is more efficient batching in the latter case.
BatchWriter goes through a binning phase whenever it fills up half of its
buffer, binning everything in the buffer into tablets. If you give it
sorted data it will probably be binning into a subset of the tablets
instead of
Rob,
I would use something like an IteratorChain [1] and feed it
Scanner.iterator() objects. If you setReadaheadThreshold(0) on the scanner
then calling Scanner.iterator() is a fairly lightweight operation, and
you'll be able to plop a bunch of iterators into the IteratorChain so that
they are
Try using the Range.exact(...) and Range.prefix(...) helper methods to
generate specific ranges. Key.followingKey(...) might also be helpful.
Cheers,
Adam
On Wed, Oct 14, 2015 at 9:59 AM, Lu Qin wrote:
> In my accumulo cluster ,the table has this data:
> 0 cf0:cq0 []v0
Here are a few other factors to consider:
1. Tablets may not be used uniformly. If there is a temporal element to the
row key then writes and reads may be skewed to go to a portion of the
tablets. If some tables are big but more archival in nature then they will
skew the stats as well. It's
>> 2) Checked that against changes I know my system has made
>>
>> 3) If my system is not the originator of the change, update
>> internal state to reflect the change.
>>
>>
>>
>> Examples of state I may need to update include an ElasticSearch i
Hi Tom,
Sqrrl uses a document-distributed indexing strategy extensively. On top of
the reasons you mentioned, we also like the ability to explicitly structure
our index entries in both information content and sort order. This gives us
the ability to do interesting things like build custom indexes
Jon,
You might think about putting a constraint on your table. I think the API
for constraints is flexible enough for your purpose, but I'm not exactly
sure how you would want to manage the results / side effects of your
observations.
Adam
On Tue, Sep 29, 2015 at 5:41 PM, Parise, Jonathan
You could cat the splits to a temp file, then use the -sf option of
createtable, piping the command to the accumulo shell's standard in:
$ echo "createtable ycsb_tablename -sf /tmp/ycsb_splits.txt" | accumulo
shell -u user -p password -z instancename zoohost:2181
Not sure if the row keys are
Hi Roman,
What's the used for in your previous key design?
As I'm sure you've figured out, it's generally a bad idea to have a fully
unique hash in your key, especially if you're trying to support extensive
secondary indexing. What we've found is that it's not just the size of the
key but also
Sqrrl uses a hybrid approach. For records that are relatively static we use
a compacted form, but for maintaining aggregates and for making updates to
the compacted form documents we use a more explicit form. This is done
mostly through iterators and a fairly complex type system. The big
trade-off
Hey Accumulopers,
I thought you might like to know that the Rya project just proposed to join
the incubator. Rya is a mature project that supports RDF on top of
Accumulo. Feel free to join the discussion or show support on the incubator
general list.
Cheers,
Adam
Vaibhav,
I have included some answers below.
Cheers,
Adam
On Mon, Jul 13, 2015 at 11:19 AM, vaibhav thapliyal
vaibhav.thapliyal...@gmail.com wrote:
Dear all,
I have the following questions on intersecting iterator and partition ids
used in document sharded indexing:
1. Can we run a
I think this might be the same concept as in-mapper combining, but applied
to data being sent to a BatchWriter rather than an OutputCollector. See
[1], section 3.1.1. A similar performance analysis and probably a lot of
the same code should apply here.
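That pattern can be sketched in plain Java (my illustration, not code from [1]; the LocalCombiner name and the sink callback are hypothetical, with the sink standing in for something like BatchWriter.addMutation):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// In-mapper-combining sketch: sum updates per key in a local map and emit
// only the combined totals, instead of one emit per input record.
class LocalCombiner {
    private final Map<String, Long> buffer = new HashMap<>();
    private final int maxEntries;
    private final BiConsumer<String, Long> sink;

    LocalCombiner(int maxEntries, BiConsumer<String, Long> sink) {
        this.maxEntries = maxEntries;
        this.sink = sink;
    }

    void add(String key, long count) {
        // Combine locally; flush when the buffer gets too big.
        buffer.merge(key, count, Long::sum);
        if (buffer.size() >= maxEntries) flush();
    }

    void flush() {
        buffer.forEach(sink);
        buffer.clear();
    }
}
```

The same performance argument applies: each flushed entry replaces many individual writes when keys repeat.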
Cheers,
Adam
[1]
This can also be done with a row-doesn't-fit-into-memory constraint. You
won't need to hold the second column in-memory if your iterator tree deep
copies, filters, transforms and merges. Exhibit A:
[HeapIterator-derivative]
|_
| \
.
-Met with some great folks (special shout out to Josh Elser and
Adam Fuchs for their time and patience answering questions).
-Can’t wait for next year’s summit.
Any idea when the slides for the presentations will be available?
Thanks,
Mike G.
On Wed, Apr 15, 2015 at 10:20 AM, Keith Turner ke...@deenlo.com wrote:
Random thought on revamp. Immutable key values with enough primitives to
make most operations efficient (avoid constant alloc/copy) might be
something to consider for the iterator API
So, is this a tradeoff in the
Dylan,
The effect of a major compaction is never seen in queries before the major
compaction completes. At the end of the major compaction there is a
multi-phase commit which eventually replaces all of the old files with the
new file. At that point the major compaction will have completely
, Adam Fuchs afu...@apache.org wrote:
Dylan,
If I recall correctly (which I give about 30% odds), the original purpose
of the side channel was to split up things like delete tombstone entries
from regular entries so that other iterators sitting on top of a
bifurcating iterator wouldn't have to handle the special tombstone
of adding another
data stream as a top-level source, but Fig. B is possible too.
Regards,
Dylan Hutchison
On Mon, Feb 16, 2015 at 11:34 AM, Adam Fuchs scubafu...@gmail.com wrote:
Hi Dave,
As long as your combiner is associative and commutative both of the
values should be represented in the combined result. The
non-determinism is really around ordering, which generally doesn't
matter for a combiner.
Adam
On Mon, Feb 9, 2015 at 3:49 PM, Dave Hardcastle
Ara,
What kind of query load are you generating within your batch scanners?
Are you using an iterator that seeks around a lot? Are you grabbing
many small batches (only a few keys per range) from the batch scanner?
As a wild guess, this could be the result of lots of seeks with a low
cache hit
On Mon, Jan 12, 2015 at 4:10 PM, Josh Elser josh.el...@gmail.com wrote:
seek()'ing doesn't always imply an increase in performance -- remember that
RFiles (the files that back Accumulo tables), are composed of multiple
blocks/sections with an index of them. A seek is comprised of using that
Neato!
Adam
On Mon, Dec 15, 2014 at 3:25 PM, Christopher ctubb...@apache.org wrote:
Accumulators,
Fedora Linux now ships with Accumulo 1.6 packaged and available in its yum
repositories, as of Fedora 21. Simply run yum install accumulo to get
started. You can also just install
Jeff,
Density is an interesting measure here, because RFiles are going to
be sorted such that, even when the file is split between tablets, a
read of the file is going to be (mostly) a sequential scan. I think
instead you might want to look at a few other metrics: network
overhead, name node
Accumulo tservers typically listen on a single interface. If you have a
server with multiple interfaces (e.g. loopback and eth0), you might have a
problem in which the tablet servers are not listening on externally
reachable interfaces. Tablet servers will list the interfaces that they are
Paul,
Here are a few suggestions:
1. Reduce the number of concurrent compaction threads
(tserver.compaction.major.concurrent.max, and
tserver.compaction.minor.concurrent.max). You probably want to lean
towards twice as many major compaction threads as minor, but that
somewhat depends on how
You can change compression codecs at any time on a per-table basis. This
only affects how new files are written. Existing files will still be read
the same way. See the table.file.compress.type parameter.
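For example, in the Accumulo shell (table name hypothetical; this assumes the Snappy codec is actually available on the tablet servers' classpath):

```
config -t mytable -s table.file.compress.type=snappy
```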
One caveat is that you need to make sure your codec is supported before
switching to it or
, 2014 4:42 PM, Mike Hugo m...@piragua.com wrote:
On Tue, Apr 8, 2014 at 4:35 PM, Adam Fuchs afu...@apache.org wrote:
Mike,
What version of Accumulo are you using, how many tablets do you have, and
how many threads are you using for minor and major compaction pools? Also,
how big are the keys
Maybe this could be used to speed up WAL recovery for use cases that demand
really high availability and low latency?
Adam
On Feb 25, 2014 10:50 AM, Donald Miner dmi...@clearedgeit.com wrote:
HDFS caching is part of the new Hadoop 2.3 release. From what I
understand, it allows you to mark
One thing you can do is reduce the replication factor for the WAL. We have
found that makes a pretty significant difference in write performance. That
can be modified with the tserver.wal.replication property. Setting it to 2
instead of the default (probably 3) should give you some performance
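In the Accumulo shell that would look something like this (property name as given above; it affects write-ahead logs created after the change):

```
config -s tserver.wal.replication=2
```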
Never underestimate the power of ascii art!
Adam
On Oct 2, 2013 11:28 PM, Eric Newton eric.new...@gmail.com wrote:
I'll use ASCII graphics to demonstrate the size of a tablet.
Small: []
Medium: [    ]
Large: [        ]
Think of it like this... if you are running age-off... you probably have
lots
To follow up on this, I think maybe the config should be
<name>dfs.datanode.synconclose</name>, not <name>dfs.data.synconclose</name>.
Was that a typo, Eric?
Thanks,
Adam
On Thu, Sep 12, 2013 at 2:31 PM, Eric Newton eric.new...@gmail.com wrote:
Add:
<property>
<name>dfs.support.append</name>
Heath,
In your case, the question that you are effectively asking is within each
partition, which documents' index entries include all of the given terms.
Since you have partitions aligned by field and only a single index entry
per field you will not get any matches for queries with more than one
Matt,
Did you include any patches that have not been committed to the 1.5 branch
in your snapshot?
Adam
On Sep 30, 2013 6:25 PM, Dickson, Matt MR matt.dick...@defence.gov.au
wrote:
*UNOFFICIAL*
1.5.1-SNAPSHOT from 20/09/13.
--
*From:* Sean Busbey
The addMutations method blocks when the client-side buffer fills up, so you
may see a lot of time spent in that method due to a bottleneck downstream.
There are a number of things you could try to speed that up. Here are a few:
1. Increase the BatchWriter's buffer size. This can smooth out the
Seems like a question as common and complex as which IP address to listen on
would have a fair amount of precedent in open-source projects that we could
pull from. Are we reinventing the wheel? Does anyone have an example of an
application like ours with the same set of supported platforms that has
Chris,
Did you copy the conf/accumulo.policy.example to conf/accumulo.policy? If
so, you may need to make some changes to account for changes to hadoop
security. I suspect the problem is that the codebase
file:${hadoop.home.dir}/lib/* reference doesn't include your CDH3
libraries. You could
Looks like the src part of the distribution is
accumulo-project-1.5.0-src.tar.gz.
For the same reasons that we removed the assemble tag from the bin
package, shouldn't we remove the project tag from the src package? This
also has implications as to whether we can just untar both the bin and src
Thanks for putting up with us picky people, Chris!
Adam
On May 17, 2013 6:15 PM, Christopher ctubb...@apache.org wrote:
So,
I've fixed the problem with the src tarball including binaries, and I
believe I've satisfied all the concerns regarding the naming
conventions.
I'm going to go ahead
Terry,
To properly secure your Accumulo install it's important that the shared
secret in the Accumulo configs only be shared with the Accumulo processes,
so I would recommend using a separate accumulo user.
In HDFS you can create the directory that Accumulo writes to (/accumulo by
default) and
At sqrrl, we tend to use a Tuple class that implements List<String>
(List<ByteBuffer> would also work), and has conversions to and from
ByteBuffer. To encode the tuple into a byte buffer, change all the \1s to
\1\2, change all the \0s to \1\1, and put a \0 byte between
elements. \1 is used as an
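A sketch of that escaping scheme in plain Java (my reconstruction of the idea, not Sqrrl's actual Tuple class):

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Within each element, \x00 becomes \x01\x01 and \x01 becomes \x01\x02;
// elements are then joined with a raw \x00. The encoded form contains no
// raw \x00 except separators, so byte-wise comparison of two encodings
// matches element-by-element comparison of the original tuples.
class TupleCodec {
    static byte[] encode(List<byte[]> elements) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean first = true;
        for (byte[] elem : elements) {
            if (!first) out.write(0); // separator sorts below all escaped bytes
            first = false;
            for (byte b : elem) {
                if (b == 0) { out.write(1); out.write(1); }
                else if (b == 1) { out.write(1); out.write(2); }
                else out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```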
There are a few primary reasons why your tablet server would die:
1. Lost lock in Zookeeper. If the tablet server and zookeeper can't
communicate with each other then the lock will timeout and the tablet
server will kill itself. This should show up as several messages in the
tserver log. If this
never did see anything in our log files or .out / .err logs indicating
the source of the problem, but the above is my best guess as to what was
going on.
Thanks again for all the tips and pointers!
Mike
On Wed, Feb 27, 2013 at 11:24 AM, Adam Fuchs afu...@apache.org wrote:
There are a few
Is that related to https://issues.apache.org/jira/browse/ACCUMULO-837? Do
you have a stack trace you can share?
Adam
On Fri, Feb 8, 2013 at 10:34 AM, David Medinets david.medin...@gmail.com wrote:
I am running a map-reduce job. As soon as my mapper tried to serialize
a Mutation I run into a
Mike,
The way to do that is to remove the versioning iterator entirely. Just
delete the configuration parameters for that iterator: something like
config -t tablename -d table.iterator.scan.vers in the accumulo shell,
for each of the six configuration parameters.
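Assuming the default iterator configuration, the six deletions would look like this in the shell (table name hypothetical):

```
config -t mytable -d table.iterator.scan.vers
config -t mytable -d table.iterator.scan.vers.opt.maxVersions
config -t mytable -d table.iterator.minc.vers
config -t mytable -d table.iterator.minc.vers.opt.maxVersions
config -t mytable -d table.iterator.majc.vers
config -t mytable -d table.iterator.majc.vers.opt.maxVersions
```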
Adam
On Mon, Jan 28, 2013 at
David,
The core challenge here is to be able to continue scans under failure
conditions. There are several places where we tear down the iterator tree
and rebuild it, including when tablet servers die, when we need to free
resources to support concurrency, and a few others. In order to continue a
Using the Java API through JRuby or Jython would be another option. With
Jython, that would look something like this:
export
am definitely using the same key to update and retrieve the data.
At least update the timestamp to the current time (or old timestamp + 1).
-Eric
On Thu, Nov 29, 2012 at 10:38 AM, Adam Fuchs afu...@apache.org wrote:
Josh,
Can you share your junit test code so I can replicate this behavior
+1
The only problem I have found is that the example policy file is still not
included (ACCUMULO-364), but that has been corrected for the next version
for real this time. The release notes are slightly wrong in that respect,
but I don't think this should delay release.
Checked signatures,
4. In supporting dynamic column families, was there a design trade-off with
respect to the original BigTable or current HBase design? What might be a
benefit of doing it the other way?
One trade-off is that pinning locality groups in memory (i.e. making them
ephemeral) would be
Krishmin,
There are a few extremes to keep in mind when choosing a manual
partitioning strategy:
1. Parallelism and balance at ingest time. You need to find a happy medium
between too few partitions (not enough parallelism) and too many partitions
(tablet server resource contention and
Oops, looks like Eric and I owe donuts.
Anyone know how to get vim to automatically add license headers? ;-)
Adam
On Fri, Oct 26, 2012 at 11:14 AM, Billie Rinaldi bil...@apache.org wrote:
-1
These files don't have licenses:
For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) *
p), where n is the size of the file, P is the total number of tablets
(proportional to tablet servers), and p is the number of tablets that get
assigned that file?
For the BatchWriter case, there's a client-side
Another way to say this is that cross-data center replication for Accumulo
is left to a layer on top of Accumulo (or the application space). Cassandra
supports a mode in which you can have a bigger write replication than write
quorum, allowing writes to eventually propagate and reads to happen on
John is referring to the streaming ingest, not the bulk ingest. Dave is
correct on this one. Basically, we don't count the records when you bulk
ingest so that we can get sub-linear runtime on the bulk ingest operation.
Adam
On Fri, Sep 21, 2012 at 4:22 PM, ameet kini ameetk...@gmail.com wrote:
had wanted.
Matt
*From:* user-return-1330-MATTHEW.J.MOORE=saic@accumulo.apache.org[mailto:
user-return-1330-MATTHEW.J.MOORE=saic@accumulo.apache.org] *On Behalf
Of *Adam Fuchs
*Sent:* Tuesday, September 11, 2012 5:30 PM
*To:* user@accumulo.apache.org
Matthew,
I don't know of anyone who has done this, but I believe you could:
1. mount a RAM disk
2. point the hdfs core-site.xml fs.default.name property to file:///
3. point the accumulo-site.xml instance.dfs.dir property to a directory on
the RAM disk
4. disable the WAL for all tables by setting
fetchColumn is agglomerative, so if you call it multiple times it will
fetch multiple columns.
Adam
On Sep 10, 2012 6:25 PM, bob.thor...@l-3com.com wrote:
Billie
That’s what I’m doing at the moment, but I’d like to give the iterator a
collection of CF/CQ to filter on. Is that
Fred,
One tracer is fine, and you can set that to be the same as the master node.
You also need to set the username and password for the tracer in
accumulo-site.xml if you haven't already.
Adam
On Sep 5, 2012 1:22 PM, Fred Wolfinger fred.wolfin...@g2-inc.com wrote:
Hey Marc,
I can't tell you
*SNIP
3. Compressed reverse-timestamp using Unicode tricks?
--
I see code in Accumulo like
// We're past the index column family, so return a term that will sort
// lexicographically last. The last unicode character should suffice
Jim,
The HdfsZooInstance looks for accumulo-site.xml on the classpath to find
the directory in HDFS to look for the instance ID. If accumulo-site.xml is
not on the classpath then it will default to /accumulo, which is probably
different from the directory you are using. accumulo-site.xml also
Sounds like a good upgrade to me. Could even be done as part of that
warning message.
Adam
On Thu, Jul 12, 2012 at 9:28 PM, David Medinets david.medin...@gmail.com wrote:
I am seeing the following output in my monitor_lasho.log file. Would
it be possible to display the host and port that is
John,
This was a fun one, but we figured it out. Thanks for providing code --
that helped a lot. The quick workaround is to set the priority of the
WholeRowIterator to 21, above the VersioningIterator. Turns out the two
iterators are not commutative, so order matters.
Solution: when you set up
Hi Patrick,
The short answer is yes, but there are a few caveats:
1. As you said, information that is sitting in the in-memory map and in the
write-ahead log will not be in those files. You can periodically call flush
(Connector.getTableOperations().flush(...)) to guarantee that your data has
+1
Signature looks good
Hashes look good
Installs and runs well (configured, installed, started, attached with
shell, created table, inserted, scanned, flushed, compacted, shutdowned)
Adam
On Tue, Jul 3, 2012 at 1:57 PM, Eric Newton eric.new...@gmail.com wrote:
I've recreated the build
You can't scan backwards in Accumulo, but you probably don't need to. What
you can do instead is use the last timestamp in the range as the key like
this:
key=2 value={a.1 b.1 c.2 d.2}
key=5 value={m.3 n.4 o.5}
key=7 value={x.6 y.6 z.7}
As long as your ranges are
The tradeoff would be convenience versus complexity in the API. I would
lean towards having fewer ways to create a Key.
Has this debate played out before?
http://www.wikivs.com/wiki/Python_vs_Ruby#Philosophy
Adam
On Tue, Jun 26, 2012 at 9:17 AM, David Medinets david.medin...@gmail.com wrote:
, Jun 26, 2012 at 10:20 AM, Adam Fuchs afu...@apache.org wrote:
There's also the concern of elements of the document that are too large by
themselves. A general purpose streaming solution would include support for
any kind of objects passed in, not just XML with small elements. I think
the fact that it is an XML document is probably a red herring in this case.
Nope, we currently only support one sort order. The closest you can come is
by using an encoding that flips the sort order. In this case, you would take
every byte and subtract it from 255 to get your new row, so:
void convert(byte[] row)
{
    for (int i = 0; i < row.length; i++)
        row[i] = (byte) (255 - (row[i] & 0xff));
}
One of the differences you'll see between WholeRowIterator and RowFilter is
that WholeRowIterator buffers an entire row in memory while RowFilter does
not. Each includes a boolean method that you would override in a subclass
-- acceptRow(...) in RowFilter or filter(...) in WholeRowIterator. In
One issue here is you are mixing Iterator and Iterable in the same object.
Usually, an Iterable will return an iterator at the beginning of some
logical sequence, but your iterable returns the same iterator object over
and over again. This state sharing would make it so that you can really
only
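For contrast, a well-behaved Iterable in plain Java hands out a fresh Iterator on each call, so two consumers traverse the full sequence independently (illustrative class name):

```java
import java.util.Iterator;
import java.util.List;

// Each call to iterator() returns a new Iterator, so no position state
// is shared between callers -- unlike an Iterable that caches and
// returns the same Iterator object every time.
class Repeatable<T> implements Iterable<T> {
    private final List<T> items;

    Repeatable(List<T> items) {
        this.items = items;
    }

    @Override
    public Iterator<T> iterator() {
        return items.iterator(); // fresh iterator per call
    }
}
```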
Small correction: the branching factor would not have to be exactly 1, but
it would be small on average (close to 1).
Adam
On Thu, Apr 12, 2012 at 12:50 PM, Adam Fuchs adam.p.fu...@ugov.gov wrote:
This probably won't work, unless all node names are unique at a given
level. For example, given
Sam,
Yes, Accumulo 1.4.0 should be compatible with Hadoop 1.0.1 after you remove
that check. We've run with it some, but mostly we've tested with 0.20.x.
Please let us know if you see any compatibility problems.
There are two possibilities for why your second tablet server did not
start. Either