analytics server proposal
=========================

tl;dr
-----
measurement team is setting up a server for you to play with raw metrics data.
please check below if it meets your requirements and if you have any 
suggestions for the setup.


Motivation + Setup
------------------
An idea came up during the Berlin dev meeting: it would be good to have all 
measurement data on one server, loaded into a database, with a suite of 
analytics software readily set up, available to everybody who needs to do 
aggregation, analytics, research etc on the data. 

Right now everybody who wants to analyze Tor metrics data first has to downlod 
raw data from the metrics website, load it into a local database, massage it, 
setup a query environment etc. That can be cumbersome and use siginificant 
computing resources, making your notebook unusable for hours or eating up 
hundreds of gigabytes of disk space if you don’t optimize properly. That makes 
on the fly quick and dirty prototyping and analysis quite unfeasable and 
prohibitively tedious.

We plan to change this by setting up an analytics server free for everybody to 
use:
* containing as much raw metrics data as possible
* in a HBase database on Hadoop/HDFS 
* with analytic tools like Spark and Drill on top
* supporting diverse languages - SQL, R, Python, Java, Scala, Clojure -
* with gateways to desktop-ish tools like Tableau and MongoDB 

Please see the detailed description below, tell us if you have any suggestions 
concerning the setup and request a login (to be available soon)!


Data
----
The data will be available as raw as possible, like it is served by CollecTor, 
preloaded into a database and maybe also in some pre-aggregated sets that 
concentrate on “popular” aspects of the data. The complete dataset is about 50 
GB compressed, or 5 TB uncompressed. For a start we may need to concentrate on 
the last 2 or 3 years.


Analytics
————
The amount of data practically rules out the usual RDBMS suspects (PostgreSQL 
etc). We also want to have easy access to the analytical tools that are 
available in the big data ecosystem.

Hadoop [0] is a popular candidate for Big Data processing, but has a reputation 
of being slightly outdated, not so well engineered [1] and is not available in 
Debian stable (which likely won’t change [2]). It seems like choosing Hadoop 
means risking to run into nasty little problems down the road. Most solutions 
outlined below are or can be based on Hadoop File System (HDFS) but that is 
available independently of the Hadoop MapReduce system.

Disco [3] is another MapReduce engine. Seems solid, heavyly uses Python, but 
relies on it’s own file system [4] which would make it harder or impossible to 
use with other tools like Spark, that require HDFS (or compatible). Would need 
more research/googling to be really sure but right now it doesn't seems worth 
the effort.

Spark [5] is faster than Hadoop MapReduce when working in memory and can work 
equally well from disk. Storage is based on Hadoop and "works with any Hadoop 
compatible data source including HDFS, HBase, Cassandra, etc.” [6]. It can work 
with Amazons Elastic MapReduce EMR instances (which in turn allows running 
Spark on EMR). EMR pricing is not prohibitive [7]. Ooni uses this setup [8].
Spark supports R, Scala, Java, Python and Clojure. It provides an interactive 
shell for Python and Scala and plugins for SQL-like queries, MLib (machine 
learning lib) and some graph processing. Its performance but also the SQL 
plugin and the wide range of supported programming languages make Spark quite 
attractive.

Hive [9] is another analytics engine on top of Hadoop that has similar features 
to Spark [10] (its feature set was lagging behind Spark for some time but has 
catched up recently). Since Ooni uses Spark and the two are not very different 
with respect to feature set we favor Spark (and hope for synergies). 

MongoDB [11], which is available in Debian stable and supports Map Reduce [12], 
may not be performant enough and is in some ways rather shoddy.

Redis [13] has a good reputation and while optimized for in-memory operations 
it can work from disk too. It is available as Debian stable [14]. Map reduce 
operations are available as add ons [15] but not in Debian stable. Like MongoDB 
there is not much interesting analytics software available besides MapReduce.

Drill [16] is a nice analytic tool: a 'schema-free SQL query engine for Big 
Data’ which accepts a variety of data sources including HDFS, MongoDB, Amazon 
S3, with impressive performance and a wide range of possible applications. 

Both Spark and Drill support Tableau, a nice Desktop data visualization tool 
that allows to work with data in a rather intuitive way.

In summary it seems like Spark and Drill are the aggregation and analytic 
softwares that we want to have. Together they support a nice array of languages 
(SQL, R, Java, Python, Scala, Clojure) and desktop-ish tools (MongoDB, Tableau) 
which should make them readily useable for a lot of people.


Storage
-------
Amazons EMR is an option but for now we’ll install our own box. We could go 
with CSV files in the Hadoop Distributed File System HDFS to keep complexity 
down but given the amount of data a proper DBMS seems justified. Two databases 
are supported by both Spark and Drill: Apache Cassandra [17][18] and Apache 
HBase [19][20][21]. Both are column-oriented key-value-store. HBase follows the 
same design principles as Cassandra but comes out of the Hadoop community and 
is better integrated wiht other offerings. Also while Cassandra is more 
optimized for easy scaling and write performance HBase has the emphasis on fast 
read [22]. It also seems like getting Drill to work with Cassandra is not as 
easy as advertized [23]. So HBase seems like the best fit.

Hopefully Drill and Spark will work from the same data in HBase and not want to 
import it into their own store (thereby duplicating it) but we’re not sure 
about that right now. If they do import the data into their own stores anyway 
then HBase might be overkill and CSV files on HDFS sufficient. Or Parquet 
[24][25]? We plan to figure that out during the next days. Any help and remarks 
appreciated!

Redis and MongoDB, the 2 systems that are available as Debian stable packages, 
are not well enough supported by analytic softwares to be useful.


Conclusion
----------
Measurement team will set up a hosted server with 32 GB RAM and 5 TB storage 
(partly as SSD), install Apache HBase as database and Apache Spark and Apache 
Drill as analytics softwares and import a lot of raw measurement data into 
HBase. We then give login to the machine on request and hope that you have fun 
with map reducing, data sciencing, visualizing...
This is a test: we plan to keep the server running for at least the next 3 
months or so. If it doesn’t prove useful we will probably shut it down 
afterwards, but we hope otherwise.


---
[0] http://hadoop.apache.org/
[1] https://news.ycombinator.com/item?id=4298284 - see the 4. comment thread 
for a nice rundown on the cons of Hadoop. 
[2] https://wiki.debian.org/Hadoop
[3] http://discoproject.org/
[4] http://disco.readthedocs.org/en/develop/howto/ddfs.html?highlight=hdfs
[5] http://spark.apache.org/
[6] http://www.infoq.com/articles/apache-spark-introduction
[7] https://aws.amazon.com/elasticmapreduce/pricing/
[8] https://github.com/TheTorProject/ooni-pipeline
[9] http://hive.apache.org/
[10] http://www.zdnet.com/article/sql-and-hadoop-its-complicated/
[11] https://www.mongodb.org/
[12] https://docs.mongodb.org/manual/core/map-reduce/
[13] https://redis.io 
[14] 
https://packages.debian.org/search?suite=jessie&arch=any&searchon=names&keywords=redis
[15] http://heynemann.github.io/r3/ - maybe also others although Google didn’t 
really help
[16] https://drill.apache.org/, https://drill.apache.org/faq/
[17] http://cassandra.apache.org/
[18] 
http://www.planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
[19] http://hbase.apache.org/
[20] 
http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
[21] http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012
[22] 
http://www.infoworld.com/article/2848722/nosql/mongodb-cassandra-hbase-three-nosql-databases-to-watch.html
[23] http://stackoverflow.com/questions/31017755
[24] https://parquet.apache.org/documentation/latest/
[25] http://arnon.me/2015/08/spark-parquet-s3/




_______________________________________________
tor-dev mailing list
[email protected]
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

Reply via email to