Hi all, I have another question about dealing with large amounts of data…

I'm trying to store large blobs of data in Accumulo by means of directory
imports. These blobs are binary and are referenced by other tables. They can
also get quite large. In an effort to cut down on the time spent compacting
this data, I've taken to using what amounts to an increasing sequence number
for the row IDs, so a major compaction now amounts to a copy of the data, with
no merging required. I can also play with the table.split.threshold property
for the table to keep tablets from splitting. But sometimes a compaction will
still occur, which results in a lot of data being unnecessarily copied from one
RFile to another. So, my question: is there any way to signal to Accumulo that
the RFiles I'm passing to importdirectory should just be used as is, with no
compaction desired (i.e., just move the RFiles into the table directory rather
than moving them to a temp directory for later merging upon compaction)? The
paradigm I'm shooting for here is like Oracle partitioned tables, where you can
fill a tmp table with new data and then swap that tmp table with an empty
partition on the target table. The whole process takes seconds, since no data
moves, just pointers in the guts of the DB.
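
To make the sequence-number row IDs above concrete, here is a minimal sketch
of the idea, not my actual code. It assumes the row ID is a zero-padded decimal
string, so that the lexicographic order Accumulo sorts rows by matches numeric
order, and each new batch of rows sorts strictly after everything already in
the table. The class name, the method names, and the 19-digit width are all
made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: none of these names come from Accumulo itself.
final class SequenceRowIds {
    // Assumed width: 19 digits is enough for any non-negative long.
    static final int WIDTH = 19;

    // Zero-pad the sequence number so string order equals numeric order.
    static String rowId(long sequence) {
        if (sequence < 0) {
            throw new IllegalArgumentException("sequence must be non-negative");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", sequence);
    }

    // True if every new row sorts strictly after the last existing row,
    // i.e. the new file's key range does not overlap the existing data.
    static boolean sortsAfter(String lastExistingRow, List<String> newRows) {
        for (String row : newRows) {
            if (row.compareTo(lastExistingRow) <= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String last = rowId(100);
        List<String> batch = Arrays.asList(rowId(101), rowId(102));
        System.out.println(last + " -> batch sorts after: " + sortsAfter(last, batch));
    }
}
```

Without the padding, "10" would sort before "9" and a new file could overlap
rows that are already in the table, which is exactly the merging I'm trying to
avoid.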

If there's no current way to do this, would such a mechanism be desirable to
anyone other than me? I wouldn't mind taking a stab at implementing this, but I
don't want to start if it's a feature that no one would want or that is thought
to be totally stupid in the first place :) (As an aside, yes, I've thought of
storing the data in HDFS and keeping a pointer to it in Accumulo, but the way I
want to interact with the data is way easier if it's all in Accumulo tables.)

Thanks,
Ed
