Thanks, Eric. I was afraid that would be the case. If I understand you correctly, putting a multi-GB file into Accumulo would be a bad idea. Given that, are there any strategies available to ensure that a given file in HDFS is co-located with the index info for that file in Accumulo? (I would assume not.) In my case, I could use Accumulo to store my indexes for fast query, but then have them return a URL/URI to the actual file. However, I have to process each of those files further to get to my final result, and I was hoping to do the second stage of processing without having to return intermediate results. Am I correct in assuming that this can't be done?
Thanks,
Tejay

From: Eric Newton [mailto:[email protected]]
Sent: Thursday, August 23, 2012 3:06 PM
To: [email protected]
Subject: Re: EXTERNAL: Re: Large files in Accumulo

An entire mutation needs to fit in memory several times, so you should not attempt to push in a single mutation larger than 100MB unless you have a lot of memory in your tserver/logger. And while I'm at it, large keys will create large indexes, so try to keep your (row, cf, cq, cv) under 100K.

-Eric

On Thu, Aug 23, 2012 at 4:37 PM, Cardon, Tejay E <[email protected]> wrote:

In my case I'll be doing a document-based index store (like the wikisearch example), but my documents may be as large as several GB. I just wanted to pick the collective brain of the group to see if I'm walking into a major headache. If it's never been tried before, then I'll give it a shot and report back.

Tejay

From: William Slacum [mailto:[email protected]]
Sent: Thursday, August 23, 2012 2:07 PM
To: [email protected]
Subject: EXTERNAL: Re: Large files in Accumulo

Are these RFiles as a whole? I know at some point HBase needed to have entire rows fit into memory; Accumulo does not have this restriction.

On Thu, Aug 23, 2012 at 12:55 PM, Cardon, Tejay E <[email protected]> wrote:

Alright, this one's a quick question. I've been told that HBase does not perform well if large (> 100MB) files are stored in it. Does Accumulo have similar trouble? If so, can it be overcome by storing the large files in their own locality group?

Thanks,
Tejay
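For what it's worth, the index-plus-pointer pattern Tejay describes can be sketched without a cluster. In this hypothetical layout, Accumulo holds only small index entries whose values are HDFS URIs, never the multi-GB files themselves; a java.util.TreeMap stands in for a sorted Accumulo table so the sketch runs standalone (a real version would write with a BatchWriter and read with a Scanner over a row range). The key shape and names here are assumptions for illustration:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical document index: values are HDFS URIs, not file contents.
// A TreeMap stands in for an Accumulo table; both keep keys sorted
// lexicographically, so the lookup shape is the same.
public class DocIndex {
    private final TreeMap<String, String> index = new TreeMap<>();

    // Key shape: term \0 docId, so all entries for a term sort together.
    public void addEntry(String term, String docId, String hdfsUri) {
        index.put(term + "\u0000" + docId, hdfsUri);
    }

    // Returns key -> HDFS URI for every document matching the term;
    // the caller then opens each file from HDFS for further processing.
    public SortedMap<String, String> lookup(String term) {
        return index.subMap(term + "\u0000", term + "\u0001");
    }

    public static void main(String[] args) {
        DocIndex idx = new DocIndex();
        idx.addEntry("accumulo", "doc1", "hdfs://namenode/data/doc1.bin");
        System.out.println(idx.lookup("accumulo"));
    }
}
```

As far as I know this trades away co-location entirely: Accumulo doesn't offer a way to pin a tablet to the datanodes holding a particular HDFS file's blocks, so the second processing stage pays the cost of remote reads.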
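If the documents really must live in Accumulo, Eric's numbers suggest breaking each file into many small key/value pairs rather than pushing one huge mutation (Accumulo's filedata example follows a pattern like this). A minimal sketch of the chunk-boundary arithmetic; the 16MB chunk size and "chunk-%09d" qualifier format are assumptions for illustration, not anything Accumulo requires:

```java
// Sketch: split one large document across many mutations so no single
// mutation approaches the ~100MB memory ceiling Eric describes.
public class Chunker {
    static final long CHUNK_SIZE = 16L * 1024 * 1024; // 16MB per mutation

    // Zero-padded qualifiers keep chunks in order under Accumulo's
    // lexicographic key sort: chunk-000000000, chunk-000000001, ...
    public static String[] chunkQualifiers(long fileLength) {
        int n = (int) ((fileLength + CHUNK_SIZE - 1) / CHUNK_SIZE);
        String[] quals = new String[n];
        for (int i = 0; i < n; i++) {
            quals[i] = String.format("chunk-%09d", i);
        }
        return quals;
    }

    public static void main(String[] args) {
        // A 2560MB file splits into 160 chunks of 16MB each.
        System.out.println(chunkQualifiers(2_560L * 1024 * 1024).length);
    }
}
```

Each chunk would go in under the same row (the document ID) with its qualifier carrying the chunk number, so a sequential scan of the row reassembles the file in order.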
