On Fri, Jul 11, 2014 at 7:25 AM, <[email protected]> wrote:

> The entirety of both data corpuses was re-loaded every night?
Yes.

> What did the users do while the data was dropped and reloaded?

The technique of "dropping and reloading" was not used, so users were not
impacted. For the original system, we used a combination of Sqoop and the
Fair Scheduler in Hadoop to throttle the export. For Accumulo, we created
new tables using a date-based naming convention, and queries used a lookup
process to find the current table. When the new table was ready, it was
used automatically (see the sketch at the end of this message).

> What happened in the middle of the night if the job failed?

Why has this conversation turned into "Is David competent to design an
ingest system"?

> Couldn’t you identify the incremental updates to the two sources
> and incrementally load the new data into the combined target?

Yes, we could. But, for reasons not germane to this conversation, we pulled
the whole corpus.

> This brute force implementation is only applicable to a few use
> cases with lax SLAs.

Ok.

>> From: David Medinets [mailto:[email protected]]
>>
>> Last year, I used Accumulo's rapid ingest ability to join two data
>> silos into one dataset. Every field was fully indexed. Having all of
>> the data in one place allowed cross-referencing queries to be
>> executed. For various reasons, this kind of query was not possible
>> using the existing technology. The rapid ingest was important because
>> a new copy of the data silos was pulled every night.
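For anyone curious, here is a minimal sketch of that lookup against the
Accumulo Java client (1.x-era API). The "silo_" prefix, the yyyyMMdd
suffix, and the connection details are illustrative assumptions, not the
names from the actual system:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;
import java.util.SortedSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class CurrentTableLookup {

  // Illustrative naming convention: silo_20140711, silo_20140712, ...
  private static final String PREFIX = "silo_";

  // The nightly ingest writes into a brand-new table named for the day,
  // so readers never see rows disappear from the table they are using.
  static String newTableName() {
    return PREFIX + new SimpleDateFormat("yyyyMMdd").format(new Date());
  }

  // Queries resolve "the current table" at read time: the table list is
  // sorted, so the last name with our prefix is the most recent load.
  static String currentTable(Connector conn) {
    SortedSet<String> tables = conn.tableOperations().list();
    String current = null;
    for (String name : tables) {
      if (name.startsWith(PREFIX)) {
        current = name;
      }
    }
    if (current == null) {
      throw new IllegalStateException("no ingested table found");
    }
    return current;
  }

  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("reader", new PasswordToken("secret"));

    // Every query starts with the lookup, so a freshly loaded table is
    // picked up automatically once the ingest job finishes creating it.
    Scanner scan = conn.createScanner(currentTable(conn), Authorizations.EMPTY);
    for (Map.Entry<Key, Value> entry : scan) {
      System.out.println(entry.getKey());
    }
  }
}

One easy way to make "ready" explicit is to have the ingest job write into
silo_YYYYMMDD_loading and rename it to silo_YYYYMMDD only after the load
succeeds; a failed job then simply leaves the previous day's table as the
current one.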
