Hi

We have configured a total of *11 nodes*. Each node has 8 cores and 32 GB
of RAM.

*Technologies and their versions:*
Apache Spark 1.5.2 and YARN              : 6 nodes
DSE 4.7 [Cassandra 2.1.8 and Solr]      : 5 nodes
HDFS (Hadoop version 2.7.1)                : 3 nodes

*Stack:*
3 separate nodes for HDFS
3 separate nodes for Spark + YARN
2 separate seed nodes for DSE Cassandra
3 nodes shared by Cassandra and Spark

HDFS and Cassandra replication factor: 3
DSE Solr is used for indexing records in Cassandra.
Programming language: Java.


*Job flow:*

   1. The driver program initializes Spark and Cassandra with the 2 seed
   nodes.
   2. Fetch the JSON file from HDFS.
   3. Call mapPartitions on the file and use a FlatMap function to iterate
   over the data.
   4. Each line of the file represents a record. In the FlatMap function, we
   use Gson to convert the JSON to a POJO.
   5. Invoke Solr HTTP GET requests based on the fields of the POJO. We
   invoke roughly 10 HTTP requests per POJO constructed in the previous step.
   Each HTTP request targets any one of the 5 Cassandra IPs, to distribute
   the GET request load across nodes.
   6. These POJOs are collected in an ArrayList and returned to the driver.
   7. We then invoke the mapToRow function to insert these RDDs into
   Cassandra.
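
To make step 5 concrete, here is a minimal sketch of how the GET requests could be spread over the Cassandra/Solr nodes. It assumes a simple round-robin rotation, which may differ from what our code actually does; the class name SolrHostRotation and the host strings passed to it are hypothetical placeholders, not our real IPs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hands out Solr hosts in round-robin order (assumed strategy, for illustration). */
public class SolrHostRotation {
    private final List<String> hosts;
    private final AtomicInteger counter = new AtomicInteger();

    public SolrHostRotation(List<String> hosts) {
        if (hosts == null || hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        // Defensive copy so later changes to the caller's list do not affect rotation.
        this.hosts = new ArrayList<String>(hosts);
    }

    /** Returns the next host; safe to call from several threads at once. */
    public String nextHost() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), hosts.size());
        return hosts.get(index);
    }
}
```

With one instance per executor JVM, each of the roughly 10 requests per POJO would call nextHost() before building its URL, so consecutive requests go to different nodes.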


*Queries:*

   1. Deployment - From a deployment standpoint, does the technology stack
   on each node make sense?
   2. How should we determine the partition size? We are currently using the
   formula => size in MB / 16. Should we determine the number of cores,
   executors and memory based on the data size or on the number of rows in
   the file?
   3. TableWriter issue - While writing RDDs into Cassandra, the computation
   halts and takes more time to complete. We are using the YJP profiler for
   monitoring these stats. How can we overcome this latency?
   4. Are there any performance-related parameters in Spark, Cassandra or
   Solr which will reduce the job time?
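
For clarity on query 2, the formula can be sketched as below. This reads "size in MB / 16" as the number of partitions for a file, which is one interpretation; the rounding up and the minimum of one partition are assumptions added for the sketch, and the class and method names are hypothetical.

```java
/** Sketch of the partition formula from query 2: partitions = size in MB / 16. */
public class PartitionSizing {
    // Divisor taken from the formula; the value 16 is the only part given by the formula itself.
    public static final long MB_PER_PARTITION = 16;

    /** Returns the partition count for a file of the given size in bytes. */
    public static int numPartitions(long fileSizeBytes) {
        long sizeMb = fileSizeBytes / (1024L * 1024L);
        // Round up so a trailing partial chunk still gets its own partition (assumption).
        long partitions = (sizeMb + MB_PER_PARTITION - 1) / MB_PER_PARTITION;
        // Never return fewer than one partition (assumption).
        return (int) Math.max(1L, partitions);
    }

    public static void main(String[] args) {
        long fileSizeBytes = 160L * 1024 * 1024; // example input: a 160 MB file
        System.out.println(numPartitions(fileSizeBytes));
    }
}
```

The result would be passed as the partition count when reading or repartitioning the file, so it depends only on data size and not on the number of rows.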


Any help to increase the performance will be appreciated.
Thanks


-- 
Ashish Gadkari
