The batch layer will be able to read streaming data from the Flume files
if needed, using Spark CSV. It may take a bit longer, but low latency is
not the focus of the batch layer.
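As a sketch of that batch read in plain Python (the real pipeline would use spark.read.csv against the Flume sink directory on HDFS; the file names and columns here are invented):

```python
import csv, glob, os, tempfile

def read_flume_batch(pattern):
    """Read every CSV file matching pattern into a list of dicts,
    in file-name order (Flume rolls files with increasing suffixes)."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

# Demo with two tiny files standing in for Flume-rolled output.
d = tempfile.mkdtemp()
for name, body in [("FlumeData.1", "security,price\nXYZ,100.5\n"),
                   ("FlumeData.2", "security,price\nABC,10.2\n")]:
    with open(os.path.join(d, name), "w") as f:
        f.write(body)
print(read_flume_batch(os.path.join(d, "FlumeData.*")))
```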
All real-time data will come through the speed layer using Spark
Streaming, where the real-time alerts/notifications will also be produced.
A case in point: immediate notification of liquidity risk associated with
a certain security.
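A minimal sketch of such a speed-layer alert rule, in plain Python rather than Spark Streaming; the spread threshold and field names are invented:

```python
def liquidity_alerts(ticks, max_spread=0.05):
    """Yield an alert for each tick whose bid/ask spread exceeds
    max_spread, a crude proxy for liquidity risk on a security."""
    alerts = []
    for tick in ticks:
        spread = (tick["ask"] - tick["bid"]) / tick["bid"]
        if spread > max_spread:
            alerts.append((tick["security"], round(spread, 4)))
    return alerts

ticks = [
    {"security": "XYZ", "bid": 100.0, "ask": 101.0},  # 1% spread: fine
    {"security": "ABC", "bid": 10.0,  "ask": 11.0},   # 10% spread: alert
]
print(liquidity_alerts(ticks))  # [('ABC', 0.1)]
```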
Combined data will be on offer through the serving layer, and there we
may need to create pre-aggregated data in the batch layer to be combined
with real-time data from the speed layer.
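A toy sketch of that serving-layer merge, with plain dicts standing in for the pre-aggregated batch view and the speed-layer state (all names illustrative):

```python
def serving_view(batch_counts, speed_counts):
    """Merge pre-aggregated batch counts with the fresher counts
    accumulated by the speed layer since the last batch run."""
    merged = dict(batch_counts)
    for key, n in speed_counts.items():
        merged[key] = merged.get(key, 0) + n
    return merged

batch = {"XYZ": 1000, "ABC": 500}  # e.g. trades up to the last batch run
speed = {"XYZ": 12, "DEF": 3}      # trades since then, from the speed layer
print(serving_view(batch, speed))  # {'XYZ': 1012, 'ABC': 500, 'DEF': 3}
```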
Dr Mich Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
On 18 September 2016 at 11:08, Jörn Franke <jornfra...@gmail.com> wrote:
> Ignite has a special cache for HDFS data (which is not a Java cache),
> for RDDs, etc. So you are right, in this sense it is very different.
> Besides caching, what I see from data scientists is that for interactive
> queries and model evaluation they do not browse the complete data anyway.
> Even with in-memory solutions this is painfully slow if you receive
> several TB of data per hour.
> What they do is sampling: e.g. select a relevant small subset of the
> data, evaluate several different models on the sampled data in "real
> time", and then calculate the winning model as a batch job later.
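The sample-first workflow described above can be sketched as follows (the models and scoring here are trivial stand-ins):

```python
import random

def pick_winner(data, models, sample_size=1000, seed=42):
    """Score each candidate model on a small random sample ("real time"
    step); the returned winner would then be refit on the full data as
    a batch job."""
    random.seed(seed)
    sample = random.sample(data, min(sample_size, len(data)))
    scores = {name: score(sample) for name, score in models.items()}
    return max(scores, key=scores.get)

data = list(range(100_000))
models = {
    # Higher score = better; closer to the true mean (~50,000) wins.
    "mean_model": lambda s: -abs(sum(s) / len(s) - 50_000),
    "zero_model": lambda s: -abs(0 - 50_000),
}
print(pick_winner(data, models))  # mean_model
```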
> Additionally, probabilistic data structures are employed in some cases.
> For example, if you want to count the number of unique viewers of a web
> site, it does not make sense to browse through the logs for user ids all
> the time; instead, employ a HyperLogLog structure, which needs little
> memory and can be accessed in real time.
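A toy HyperLogLog along those lines, counting unique user ids in a few KB of registers (a sketch, not production code):

```python
import hashlib, math

class HyperLogLog:
    """Approximate distinct counter with m = 2**p registers."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Standard bias-correction constant for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64-p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:                 # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user{i}")
print(hll.count())  # roughly 50,000, within a few percent
```

With p=10 the standard error is about 1.04/sqrt(1024), i.e. roughly 3%, while the sketch itself stays around 1 KB regardless of how many ids are added.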
> For the case of visualizations, I think in the area of big data it also
> makes a lot of sense to visualize aggregations based on sampling. If you
> really need the last 0.0001% of precision, then you can click on the
> visualization and the system takes some time to calculate it.
> On 18 Sep 2016, at 10:54, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Thanks everyone for ideas.
> Sounds like Ignite has been taken over by GridGain, so it is similar to
> Hazelcast open source in name only. However, an in-memory Java cache may
> or may not help.
> The other options, like faster databases, are on the table depending on
> who wants what (those are normally decisions that include more than
> technical criteria). For example, if the customer already has Tableau,
> persuading them to go for QlikView instead may not work.
> So my view is to build the batch layer foundation and leave these finer
> choices to the customer. We will offer Zeppelin with Parquet and ORC,
> with a certain refresh interval for these tables, and let the customer
> decide. I stand corrected otherwise.
> BTW I did these simple tests using Zeppelin (running on Spark Standalone):
> 1) Read data using Spark SQL from Flume text files on HDFS (real time): 2 min, 16 sec
> 2) Read data using Spark SQL from an ORC table in Hive (lagging by 15 min): 1 min, 1 sec
> 3) Read data using Spark SQL from a Parquet table in Hive (lagging by 15 min): 1 min, 6 sec
> So unless one splits the atom, ORC and Parquet on Hive look similar.
> In all probability the customer has a data warehouse that uses Tableau,
> QlikView or similar. Their BAs will carry on using these tools. If they
> have data scientists, then they will either use R, which has a built-in
> UI, or use Spark SQL with Zeppelin. Also, one can fire up Zeppelin on
> each node of Spark, or even on the same node on a different port. Then of
> course one has to think about adequate response times in a concurrent
> environment.
> Dr Mich Talebzadeh
> On 18 September 2016 at 08:52, Sean Owen <so...@cloudera.com> wrote:
>> Alluxio isn't a database though; it's storage. I may still be harping
>> on the wrong solution for you, but as we discussed offline, that's
>> also what Impala, Drill et al. are for.
>> Sorry if this was mentioned before but Ignite is what GridGain became,
>> if that helps.
>> On Sat, Sep 17, 2016 at 11:00 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>> > Thanks Todd
>> > As I thought, Apache Ignite is a data fabric, much like Oracle
>> > Coherence or Hazelcast.
>> > The use case is different between an in-memory database (IMDB) and a
>> > data fabric. The build that I am dealing with has a 'database centric'
>> > view of its data (i.e. it accesses its data using Spark SQL and JDBC),
>> > so an in-memory database will be a better fit. On the other hand, if
>> > the application deals solely with Java objects, does not have any
>> > notion of a 'database', does not need SQL-style queries, and really
>> > just wants a distributed, high-performance object storage grid, then I
>> > think Ignite will likely be the preferred choice.
>> > So I will likely go, if needed, for an in-memory database like
>> > Alluxio. I have seen a rather debatable comparison between Spark and
>> > Ignite that looks like a one-sided rant.
>> > HTH
>> > Dr Mich Talebzadeh
>> > LinkedIn
>> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJ
>> > http://talebzadehmich.wordpress.com