I am fishing for some ideas here.

In the current design, prices flow from Kafka into Flume and are stored
on HDFS as text files. We then use Spark with Zeppelin to present the
data to the users.

This works. However, I am aware that once the volume of flat files rises,
some housekeeping is needed. You don't want to read all the files on every
query.
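
To illustrate the point about not reading every file, here is a minimal
sketch. It assumes the files land in hourly directories named
dt=YYYY-MM-DD/hr=HH under a base directory; that layout, the base path
/data/prices and the function name hourly_paths are all hypothetical and
not part of the current setup. Given a time window, it lists only the
directories that a query would need to read.

```python
from datetime import datetime, timedelta

# Hypothetical base directory; the actual HDFS path is not stated in this post.
BASE_DIR = "/data/prices"


def hourly_paths(start, end, base_dir=BASE_DIR):
    """Return one directory per hourly bucket overlapping [start, end)."""
    # Round the start down to the beginning of its hour.
    current = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while current < end:
        # Assumed layout: <base>/dt=YYYY-MM-DD/hr=HH
        paths.append(f"{base_dir}/dt={current:%Y-%m-%d}/hr={current:%H}")
        current += timedelta(hours=1)
    return paths
```

The resulting list could then be handed to whatever reads the text files,
so that only the hours of interest are scanned.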

A more viable alternative would be to load the data periodically into some
form of tables (Hive etc.) through an hourly cron job, so that batch
processes have accurate data that is complete up to the last hour.
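
As a rough sketch of what such an hourly job might do, the snippet below
builds the Hive statement that registers the last complete hour as a
partition of an external table. The table name prices, the partition
columns dt and hr, and the base path /data/prices are assumptions made
for illustration only; the snippet just builds the statement text and
does not connect to Hive.

```python
from datetime import datetime, timedelta


def last_complete_hour(now):
    """Return the start of the most recent hour that has fully elapsed."""
    return now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)


def add_partition_sql(now, table="prices", base_dir="/data/prices"):
    """Build a HiveQL statement adding the last complete hour as a partition.

    The table name, partition columns and directory layout are hypothetical.
    """
    hour = last_complete_hour(now)
    dt, hr = f"{hour:%Y-%m-%d}", f"{hour:%H}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}', hr='{hr}') "
        f"LOCATION '{base_dir}/dt={dt}/hr={hr}'"
    )
```

A crontab schedule such as "5 * * * *" would run the job five minutes
past each hour, leaving a short margin for the previous hour's files to
be closed.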

That would certainly be an easier option for the users as well.

I was wondering what the best strategy would be here: Druid, Hive, or
something else?

The business case here is that users may want to access older data, so
would a database of some sort be a better solution? In all likelihood they
will want a week's data.
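
If a week of history is indeed all that is needed, the housekeeping could
amount to dropping daily partitions older than seven days. The sketch
below only works out which dates fall outside that window; the seven-day
figure is taken from the guess above, and the function name and the idea
of daily partitions are assumptions.

```python
from datetime import date, timedelta


def expired_dates(partition_dates, today, keep_days=7):
    """Return the partition dates older than the retention window, sorted."""
    # Anything strictly before the cutoff is outside the retention window.
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in partition_dates if d < cutoff)
```

The returned dates could then feed whatever mechanism removes old
partitions or files.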


Dr Mich Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
