Folks,
    I am trying to install Zeppelin in our environment. I am really excited
about the promise of Zeppelin; if it does what it says on the tin, I can
get rid of the "reporting engine" of my application.

I have a cluster of 3 machines with

- Spark 1.5
- Cassandra 2.1.8
- libthrift 0.9.2
- cassandra-thrift 2.1.8
- spark-cassandra-connector_2.10-1.5.0-RC1.jar
- spark-cassandra-connector-java_2.10-1.5.0-RC1.jar

I cannot change this setup as my core application is built on the above
versions.

I did a git clone of master and built it using the command

 mvn clean package -Pcassandra-spark-1.5 -Dhadoop.version=2.6.0 -Phadoop-2.6 -DskipTests

so I suppose I have Zeppelin 0.6.0.

The Spark script in Zeppelin simply fetches records from Cassandra and
displays them:

%spark
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
import com.datastax.spark.connector.CassandraRow
import com.cybersparta.athena.events.checkpointfw.CheckpointfwlogEvent
import com.cybersparta.athena.events.correlation.CorrelatedEvent
import com.cybersparta.athena.events.checkpointfw.CheckpointfwlogEventSerde
import com.google.gson.JsonParser
import com.google.gson.JsonObject


val KEYSPACE = "cybersparta"
val TABLE    = "correlatedevents"

var stratioQueryStr = "{filter: {upper: \"2016-03-18 22:17:01.145+0000\", lower: \"2016-03-16 22:17:01.145+0000\", include_lower: \"true\","
stratioQueryStr += "field: \"TIMESTAMP\", include_upper: \"true\", type: \"range\" }, "
stratioQueryStr += "query: {field: \"RULE_CODE\", type: \"match\", value: \"B:0010\" }}"

val rdd = sc.cassandraTable(KEYSPACE, TABLE).where("stratio_col=?", stratioQueryStr)
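
(As an aside, the same query string can be built more readably with a
triple-quoted string; I believe this is equivalent to the += concatenation
above, apart from the embedded newlines, which the JSON-style query should
tolerate:)

// Same Stratio query as above, written as one triple-quoted literal
val stratioQueryStr =
  """{filter: {upper: "2016-03-18 22:17:01.145+0000",
    |lower: "2016-03-16 22:17:01.145+0000", include_lower: "true",
    |field: "TIMESTAMP", include_upper: "true", type: "range" },
    |query: {field: "RULE_CODE", type: "match", value: "B:0010" }}""".stripMargin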


The query runs fine and returns 539 records. Now I am trying to print the
elements so that I can manipulate the fields and draw charts.


The issue is with the next set of lines:


rdd.count()

val nonStdRdd = rdd.map {
    case (row: CassandraRow) =>
        // Formatter for the stored timestamp, e.g. "2016-03-18 22:17:01.145+0000"
        val DATE_TIME_STORE_FORMATTER = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSSZ")
        val evtTime = DateTime.parse(row.getString("TIMESTAMP"), DATE_TIME_STORE_FORMATTER)

        // Pull out the correlated-event strings and keep only the first one
        val lst = row.getList[String]("CORRELATED_EVENTS").map {
            case (obj) => obj.asInstanceOf[String]
        }
        lst(0)
}

nonStdRdd.take(10).foreach(println)
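
(For context, the end goal is to feed these rows into Zeppelin's built-in
charts via its %table output format (tab-separated columns,
newline-separated rows). A rough sketch of what I have in mind; the single
column header "event" is just a placeholder:)

// Print in Zeppelin's %table format so the notebook renders a table/chart
val sample = nonStdRdd.take(10)
print("%table event\n" + sample.mkString("\n"))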


1) If I run this on a Spark standalone cluster, the app is accepted by the
Spark master (I can see it in the master UI) but it stays in a WAITING
state. Nothing else is running on the cluster at the moment and I just
cannot get it out of this state. There is nothing in the Spark logs beyond
the line saying the application was accepted, and none of the workers have
picked up the app.
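
(In case the configuration matters: Zeppelin is pointed at the standalone
master via conf/zeppelin-env.sh, roughly as below, with the host and path
being placeholders:)

# conf/zeppelin-env.sh  (host and path below are placeholders)
export MASTER=spark://<master-host>:7077
export SPARK_HOME=/opt/spark-1.5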

2) If I run it with "local[*]" it works fine, but if I replace
'nonStdRdd.take(10).foreach(println)' with 'nonStdRdd.foreach(println)',
the browser hangs and takes up 30% of the CPU. It looks as if the
SparkContext in this scenario is running within the browser, and it is
strange that printing 539 strings should make it hang.
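
(For what it is worth, the same print routed through collect(), which
pulls everything back to the driver before printing, would look like the
sketch below; I would expect that to show all 539 lines in the notebook
output:)

// Pull the 539 strings back to the driver, then print them in the
// interpreter process instead of inside executor threads
nonStdRdd.collect().foreach(println)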

I am stuck either way; any help would be appreciated.

I suspect a library version mismatch but have not been able to pinpoint
it. I am confident that the libraries and versions listed above are what
Zeppelin is picking up, as I have removed all other libs; if I remove any
of them, Zeppelin does not initialize.
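
(To try to rule that out, one rough check would be to dump the versions
the build actually resolves, along these lines:)

 # List what the Zeppelin build resolves for the connector and thrift
 mvn dependency:tree -Pcassandra-spark-1.5 -Phadoop-2.6 -Dhadoop.version=2.6.0 | grep -iE "cassandra|thrift"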


----------------------------------- output -------------------------------------------------


rdd: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15

stuck here forever



Regards
-Ravi Gurram
