Hi, once again, it is all about tooling.
Regards,
Gourav Sengupta

On Sun, Mar 6, 2016 at 7:52 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> What is the current size of your relational database?
>
> Are we talking about a row-based RDBMS (Oracle, Sybase) or a columnar one
> (Teradata/Sybase IQ)?
>
> I assume that you will be using SQL wherever you migrate to. The
> SQL-on-Hadoop tools range from well-thought-out solutions like Hive, which
> can genuinely serve as your data warehouse infrastructure, down to plain
> query engines. Many SQL query engines, whether Impala, Drill, Spark SQL or
> Presto, have varying capabilities to query data in Hive. So here Spark is
> effectively a query engine. However, you still have to migrate your data
> first. You can easily use Sqoop to migrate data from your RDBMS to Hive;
> it is pretty straightforward (it will do table creation and population in
> Hive via JDBC). You mentioned Hazelcast, but that is just a data grid,
> much like Oracle Coherence Cache. You can of course push your data from
> your RDBMS to JMS or something similar in XML format using triggers or a
> replication server (GoldenGate/SAP Replication Server), and eventually you
> will want to store that data somewhere in Big Data once it has passed
> through the data grid. I have explained the architecture here
> <https://www.linkedin.com/pulse/data-grid-big-architecture-hadoop-hive-mich-talebzadeh-ph-d-?trk=pulse_spock-articles>
>
> So there are a few questions to be asked:
>
> 1. Choose a data warehouse in Big Data. The likelihood is that it will be
> something like Hive, which supports ACID properties and is the nearest
> thing to ANSI SQL on Big Data. Your users will be productive on it,
> assuming they know SQL (which they ought to).
> 2. Once you have chosen your target data warehouse, consider the various
> query tools, such as Spark, which provides the spark-shell and spark-sql
> tools among other things. It offers a SQL interface plus functional
> programming through Scala etc. It is a pretty impressive query engine
> with in-memory computation and DAG-based execution.
> 3. You can also use visualisation tools like Tableau etc. for the user
> interface.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 6 March 2016 at 19:14, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> SPARK is just tooling, and in a sense it is not even tooling. You can
>> consider SPARK a distributed operating system, like YARN. You should
>> read books like Hadoop Application Architecture and Big Data (Nathan
>> Marz), and study the discipline, before starting to consider how the
>> solution is built.
>>
>> Most big data projects (like any other BI projects) fail to deliver
>> value, or turn out extremely expensive to maintain, because the approach
>> is that tools solve the problem.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Sun, Mar 6, 2016 at 5:25 PM, Guillaume Bilodeau <
>> guillaume.bilod...@gmail.com> wrote:
>>
>>> The data is currently stored in a relational database, but a migration
>>> to a document-oriented database such as MongoDB is something we are
>>> definitely considering. How does this factor in?
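To make Mich's second point concrete: once Sqoop has landed the data in a
Hive table, Spark can query that table directly. Below is a minimal sketch
in Scala against the Spark 1.x API current at the time of this thread; the
table name survey_answers and its columns are hypothetical stand-ins for
whatever Sqoop actually creates:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveQueryExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("HiveQueryExample"))
        val hiveContext = new HiveContext(sc)

        // Aggregate straight against the Hive warehouse; Spark plans the
        // query as a DAG and executes it in memory across the cluster.
        val avgRatings = hiveContext.sql(
          """SELECT survey_id, question_id, AVG(rating) AS avg_rating
            |FROM survey_answers
            |GROUP BY survey_id, question_id""".stripMargin)

        avgRatings.show()
      }
    }

The same query would run unmodified from the spark-sql shell Mich
mentions; the Scala wrapper only matters once you want to combine SQL with
functional transformations.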
>>>
>>> On Sun, Mar 6, 2016 at 12:23 PM, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> That depends on a lot of things, but as a starting point I would ask
>>>> whether you are planning to store your data in JSON format?
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Sun, Mar 6, 2016 at 5:17 PM, Laumegui Deaulobi <
>>>> guillaume.bilod...@gmail.com> wrote:
>>>>
>>>>> Our problem space is survey analytics. Each survey comprises a set of
>>>>> questions, with each question having a set of possible answers.
>>>>> Survey fill-out tasks are sent to users, who have until a certain
>>>>> date to complete them. Based on these survey fill-outs, reports need
>>>>> to be generated. Each report deals with a subset of the survey
>>>>> fill-outs and comprises a set of data points (average rating for
>>>>> question 1, min/max for question 2, etc.).
>>>>>
>>>>> We are dealing with rather large data sets, although reading the
>>>>> internet we get the impression that everyone is analyzing petabytes
>>>>> of data...
>>>>>
>>>>> Users: up to 100,000
>>>>> Surveys: up to 100,000
>>>>> Questions per survey: up to 100
>>>>> Possible answers per question: up to 10
>>>>> Survey fill-outs / user: up to 10
>>>>> Reports: up to 100,000
>>>>> Data points per report: up to 100
>>>>>
>>>>> Data is currently stored in a relational database, but a migration to
>>>>> a different kind of store is possible.
>>>>>
>>>>> The naive algorithm for report generation can be summed up as this:
>>>>>
>>>>> for each report to be generated {
>>>>>   for each report data point to be calculated {
>>>>>     calculate data point
>>>>>     add data point to report
>>>>>   }
>>>>>   publish report
>>>>> }
>>>>>
>>>>> In order to deal with the upper limits of these values, we will need
>>>>> to distribute this algorithm across a compute/data cluster as much as
>>>>> possible.
>>>>>
>>>>> I've read about frameworks such as Apache Spark, but also Hadoop,
>>>>> GridGain, Hazelcast and several others, and am still confused as to
>>>>> how each of these can help us and how they fit together.
>>>>>
>>>>> Is Spark the right framework for us?
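For reference, the naive nested loop in the original question maps quite
naturally onto Spark's programming model: each (report, data point) pair
becomes a task, Spark computes the points in parallel, and the points are
then grouped back into reports. A minimal Scala sketch follows; the types
DataPointSpec/DataPoint/Report and the calculate/publish functions are
hypothetical placeholders for the real survey domain model:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical domain types standing in for the survey model above.
    case class DataPointSpec(reportId: Long, name: String)
    case class DataPoint(name: String, value: Double)
    case class Report(reportId: Long, points: Seq[DataPoint])

    object ReportGenerator {
      // Placeholder for the real per-data-point computation
      // (average rating for question 1, min/max for question 2, etc.).
      def calculate(spec: DataPointSpec): DataPoint = DataPoint(spec.name, 0.0)

      // Placeholder for whatever "publish report" means (write to a store, etc.).
      def publish(report: Report): Unit = println(report)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ReportGenerator"))

        // Load the up-to 100,000 reports x 100 data points worth of specs here.
        val specs: Seq[DataPointSpec] = Seq()

        sc.parallelize(specs)
          .map(spec => (spec.reportId, calculate(spec))) // compute each point in parallel
          .groupByKey()                                  // gather points per report
          .map { case (id, points) => Report(id, points.toSeq) }
          .foreach(publish)                              // publish from the executors
      }
    }

At the stated upper bounds (100,000 reports x 100 data points, i.e. about
10 million data points), this is well within what a modest Spark cluster
handles; the harder question, as the rest of the thread argues, is where
the underlying survey data lives and how each calculate() reaches it.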