Hi, once again, it is all about tooling.
Regards,
Gourav Sengupta

On Sun, Mar 6, 2016 at 7:52 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> What is the current size of your relational database?
>
> Are we talking about a row-based RDBMS (Oracle, Sybase) or a columnar one
> (Teradata/Sybase IQ)?
>
> I assume that you will be using SQL wherever you migrate to. The
> SQL-on-Hadoop tools range from well-thought-out solutions like Hive, which
> can genuinely serve as your data warehouse infrastructure, down to plain
> query engines. Many SQL query engines, whether Impala, Drill, Spark SQL or
> Presto, have varying capabilities to query data in Hive. So here Spark is
> effectively a query engine. However, you still have to migrate your data
> first. You can easily use Sqoop to migrate data from your RDBMS to Hive;
> it is pretty straightforward (it will do table creation and population in
> Hive via JDBC). You mentioned Hazelcast, but that is just a data grid,
> much like Oracle Coherence Cache. You can of course push your data from
> your RDBMS to JMS or something similar in XML format using triggers or a
> replication server (GoldenGate/SAP Replication Server), and eventually you
> will want to store that data somewhere in Big Data once it has passed
> through the data grid. I have explained the architecture here
> <https://www.linkedin.com/pulse/data-grid-big-architecture-hadoop-hive-mich-talebzadeh-ph-d-?trk=pulse_spock-articles>
>
> So there are a few questions to be asked:
>
> 1. Choose a data warehouse in Big Data. The likelihood is that it will be
> something like Hive, which supports ACID properties and is the nearest
> thing to ANSI SQL on Big Data. Your users will be productive on it,
> assuming they know SQL (which they ought to).
> 2. Once you have chosen your target data warehouse, consider the various
> query tools, such as Spark, which provides the spark-shell and spark-sql
> tools among other things. It offers a SQL interface plus functional
> programming through Scala etc. It is a pretty impressive query engine
> with in-memory computation and DAG-based execution.
> 3. You can also use visualisation tools like Tableau etc. for the user
> interface.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 6 March 2016 at 19:14, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> SPARK is just tooling, and in a sense it is not even tooling. You can
>> consider SPARK a distributed operating system, like YARN. You should
>> read books like Hadoop Application Architecture and Big Data (Nathan
>> Marz), and study the discipline, before starting to consider how the
>> solution is built.
>>
>> Most big data projects (like any other BI projects) fail to deliver
>> value, or turn out extremely expensive to maintain, because the approach
>> is that tools solve the problem.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Sun, Mar 6, 2016 at 5:25 PM, Guillaume Bilodeau <
>> guillaume.bilod...@gmail.com> wrote:
>>
>>> The data is currently stored in a relational database, but a migration
>>> to a document-oriented database such as MongoDB is something we are
>>> definitely considering. How does this factor in?
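To make Mich's second point concrete: once Sqoop has landed the data in a
Hive table, Spark can query that table directly. Below is a minimal sketch
in Scala against the Spark 1.x API current at the time of this thread; the
table name survey_answers and its columns are hypothetical stand-ins for
whatever Sqoop actually creates:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveQueryExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HiveQueryExample"))
        val hiveContext = new HiveContext(sc)

        // Aggregate straight against the Hive warehouse; Spark plans the
        // query as a DAG and executes it in memory across the cluster.
        val avgRatings = hiveContext.sql(
          """SELECT survey_id, question_id, AVG(rating) AS avg_rating
            |FROM survey_answers
            |GROUP BY survey_id, question_id""".stripMargin)

        avgRatings.show()
      }
    }

The same query would run unmodified from the spark-sql shell Mich
mentions; the Scala wrapper only matters once you want to combine SQL with
functional transformations.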
>>>
>>> On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> That depends on a lot of things, but as a starting point I would ask
>>>> whether you are planning to store your data in JSON format?
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi <
>>>> guillaume.bilod...@gmail.com> wrote:
>>>>
>>>>> Our problem space is survey analytics. Each survey comprises a set of
>>>>> questions, with each question having a set of possible answers.
>>>>> Survey fill-out tasks are sent to users, who have until a certain
>>>>> date to complete them. Based on these survey fill-outs, reports need
>>>>> to be generated. Each report deals with a subset of the survey
>>>>> fill-outs and comprises a set of data points (average rating for
>>>>> question 1, min/max for question 2, etc.).
>>>>>
>>>>> We are dealing with rather large data sets, although reading the
>>>>> internet we get the impression that everyone is analyzing petabytes
>>>>> of data...
>>>>>
>>>>> Users: up to 100,000
>>>>> Surveys: up to 100,000
>>>>> Questions per survey: up to 100
>>>>> Possible answers per question: up to 10
>>>>> Survey fill-outs / user: up to 10
>>>>> Reports: up to 100,000
>>>>> Data points per report: up to 100
>>>>>
>>>>> Data is currently stored in a relational database, but a migration to
>>>>> a different kind of store is possible.
>>>>>
>>>>> The naive algorithm for report generation can be summed up as this:
>>>>>
>>>>> for each report to be generated {
>>>>>   for each report data point to be calculated {
>>>>>     calculate data point
>>>>>     add data point to report
>>>>>   }
>>>>>   publish report
>>>>> }
>>>>>
>>>>> In order to deal with the upper limits of these values, we will need
>>>>> to distribute this algorithm across a compute/data cluster as much as
>>>>> possible.
>>>>>
>>>>> I've read about frameworks such as Apache Spark, but also Hadoop,
>>>>> GridGain, Hazelcast and several others, and am still confused as to
>>>>> how each of these can help us and how they fit together.
>>>>>
>>>>> Is Spark the right framework for us?
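For reference, the naive nested loop in the original question maps quite
naturally onto Spark's programming model: each (report, data point) pair
becomes a task, Spark computes the points in parallel, and the points are
then grouped back into reports. A minimal Scala sketch follows; the types
DataPointSpec/DataPoint/Report and the calculate/publish functions are
hypothetical placeholders for the real survey domain model:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical domain types standing in for the survey model above.
    case class DataPointSpec(reportId: Long, name: String)
    case class DataPoint(name: String, value: Double)
    case class Report(reportId: Long, points: Seq[DataPoint])

    object ReportGenerator {
      // Placeholder for the real per-data-point computation
      // (average rating for question 1, min/max for question 2, etc.).
      def calculate(spec: DataPointSpec): DataPoint = DataPoint(spec.name, 0.0)

      // Placeholder for whatever "publish report" means (write to a store, etc.).
      def publish(report: Report): Unit = println(report)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ReportGenerator"))

        // Load the up-to 100,000 reports x 100 data points worth of specs here.
        val specs: Seq[DataPointSpec] = Seq()

        sc.parallelize(specs)
          .map(spec => (spec.reportId, calculate(spec))) // compute each point in parallel
          .groupByKey()                                  // gather points per report
          .map { case (id, points) => Report(id, points.toSeq) }
          .foreach(publish)                              // publish from the executors
      }
    }

At the stated upper bounds (100,000 reports x 100 data points, i.e. about
10 million data points), this is well within what a modest Spark cluster
handles; the harder question, as the rest of the thread argues, is where
the underlying survey data lives and how each calculate() reaches it.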