And to make things a bit more explicit, Sqoop is the name of a project (short for "SQL to Hadoop"): http://sqoop.apache.org/
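To make Chris's steps 1 and 3 a bit more concrete, here is a rough sketch in Python that only builds and prints the two Sqoop command lines (it does not run them). The JDBC URL, user, password file, table names and HDFS paths are placeholders I made up; the flags are standard `sqoop import` / `sqoop export` arguments, but check them against the Sqoop version you actually install.

```python
import shlex


def sqoop_import_cmd(jdbc_url, username, password_file, table, target_dir,
                     num_mappers=4):
    """Build the argv for `sqoop import`: copy one database table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]


def sqoop_export_cmd(jdbc_url, username, password_file, table, export_dir):
    """Build the argv for `sqoop export`: load HDFS files into a database table."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--export-dir", export_dir,
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not a real setup.
    url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    step1 = sqoop_import_cmd(url, "report_user", "/user/etl/.pw",
                             "ORDERS", "/data/raw/orders", num_mappers=8)
    step3 = sqoop_export_cmd(url, "report_user", "/user/etl/.pw",
                             "ORDERS_SUMMARY", "/data/summary/orders")
    print(shlex.join(step1))
    print(shlex.join(step3))
```

The summarizing itself (step 2) would sit between the two commands, and the target table for the export has to exist in the database beforehand.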
As for crunch, I guess Chris used the generic term. You could use MapReduce jobs with the Java API, Pig, Hive, Cascading, Crunch (indeed)...

Regards

Bertrand

On Thu, Dec 19, 2013 at 10:59 PM, Vinay Bagare <[email protected]> wrote:

> I would also look at current setup.
> I agree with Chris that 500 GB is fairly insignificant.
>
> Best,
> Vinay Bagare
>
> On Dec 19, 2013, at 12:51 PM, Chris Embree <[email protected]> wrote:
>
> > In big data terms, 500G isn't big. But, moving that much data around
> > every night is not trivial either. I'm going to guess at a lot here,
> > but at a very high level.
> >
> > 1. Sqoop the data required to build the summary tables into Hadoop.
> > 2. Crunch the summaries into new tables (really just files on Hadoop)
> > 3. Sqoop the summarized data back out to Oracle
> > 4. Build Indices as needed.
> >
> > Depending on the size of the data being sqoop'd, this might help. It
> > might also take longer. A real solution would require more details
> > and analysis.
> >
> > Chris
> >
> > On 12/19/13, Jay Vee <[email protected]> wrote:
> >> We have a large relational database ( ~ 500 GB, hundreds of tables ).
> >>
> >> We have summary tables that we rebuild from scratch each night that takes
> >> about 10 hours.
> >> From these summary tables, we have a web interface that accesses the
> >> summary tables to build reports.
> >>
> >> There is a business reason for doing a complete rebuild of the summary
> >> tables each night, and using
> >> views (as in the sense of Oracle views) is not an option at this time.
> >>
> >> If I wanted to leverage Big Data technologies to speed up the summary table
> >> rebuild, what would be the first step into getting all data into some big
> >> data storage technology?
> >>
> >> Ideally in the end, we want to retain the summary tables in a relational
> >> database and have reporting work the same without modifications.
> >>
> >> It's just the crunching of the data and building these relational summary
> >> tables where we need a significant performance increase.
