Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

2017-10-27 Thread Cassa L
No, I don't use YARN. This is the standalone Spark that ships with the DataStax
Enterprise version of Cassandra.
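
For reference, one way to get the two count() jobs scheduled at the same time is to
trigger them from separate driver threads; Spark's scheduler is thread-safe, and with
spark.scheduler.mode=FAIR the concurrent jobs can also share executors more evenly.
A minimal sketch only: ParallelCounts/countBoth are hypothetical names, and rdd3/rdd4
stand for the transformed RDDs in the code further down.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.spark.api.java.JavaRDD;

// Hypothetical helper: submits both count() actions from separate driver threads
// so their Spark jobs can run concurrently instead of back to back.
public final class ParallelCounts {
    public static void countBoth(JavaRDD<?> rdd3, JavaRDD<?> rdd4) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Long> c3 = pool.submit((Callable<Long>) rdd3::count);
        Future<Long> c4 = pool.submit((Callable<Long>) rdd4::count);
        System.out.println("RDD3 count = " + c3.get() + ", RDD4 count = " + c4.get());
        pool.shutdown();
    }
}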

On Thu, Oct 26, 2017 at 11:22 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Do you use yarn ? Then you need to configure the queues with the right
> scheduler and method.
>
> On 27. Oct 2017, at 08:05, Cassa L <lcas...@gmail.com> wrote:
>
> Hi,
> I have a Spark job with the following use case:
> RDD1 and RDD2 are read from Cassandra tables. These two RDDs then go through some
> transformations, and after that I do a count on the transformed data.
>
> Code somewhat  looks like this:
>
> RDD1=JavaFunctions.cassandraTable(...)
> RDD2=JavaFunctions.cassandraTable(...)
> RDD3 = RDD1.flatMap(..)
> RDD4 = RDD2.flatMap(..)
>
> RDD3.count
> RDD4.count
>
> In the Spark UI I see the count() jobs getting executed one after another.
> How do I make them run in parallel? I also looked at the discussion below from Cloudera,
> but it does not show how to run driver-side actions in parallel. Do I just create an
> Executor and run them in threads?
>
> https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Getting-Spark-stages-to-run-in-parallel-inside-an-application/td-p/38515
>
> Attaching a UI snapshot here.
>
>
> Thanks.
> LCassa
>
>


Re: Spark 2.0 and Oracle 12.1 error

2017-07-24 Thread Cassa L
Hi,
Another related question: has anyone tried transactions using Oracle JDBC and Spark?
How do you do it, given that the code will be distributed across workers? Do I need to
combine certain queries to make sure they don't get split up?

Regards,
Leena
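
For reference, one common pattern for this is to do the transactional writes inside
foreachPartition, so each executor opens its own connection and commits or rolls back
its partition as a single unit. A sketch only; the table, columns, and the EmpUpdate
POJO are placeholders, not taken from this thread:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.api.java.JavaRDD;

// Sketch: per-partition JDBC transaction. Each partition gets one connection, all of
// its rows go into one batch, and the whole partition rolls back on any failure.
public final class OracleTxWriter {
    public static void write(JavaRDD<EmpUpdate> rows, String jdbcUrl, String user, String pwd) {
        rows.foreachPartition(partition -> {
            Connection conn = DriverManager.getConnection(jdbcUrl, user, pwd);
            conn.setAutoCommit(false);
            try (PreparedStatement ps =
                     conn.prepareStatement("UPDATE employee SET status = ? WHERE id = ?")) {
                while (partition.hasNext()) {
                    EmpUpdate r = partition.next();
                    ps.setString(1, r.getStatus());
                    ps.setLong(2, r.getId());
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();      // the partition commits as one unit
            } catch (Exception e) {
                conn.rollback();    // undo the whole partition on failure
                throw e;
            } finally {
                conn.close();
            }
        });
    }
}

Anything that has to be atomic across the whole dataset (rather than per partition) can't
really be expressed this way, since partitions commit independently; the Oracle JDBC driver
also needs to be shipped to the executors (e.g. via --jars).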

On Fri, Jul 21, 2017 at 1:50 PM, Cassa L <lcas...@gmail.com> wrote:

> Hi Xiao,
> I am trying JSON sample table provided by Oracle 12C. It is on the website
> -https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6371
>
> CREATE TABLE j_purchaseorder
>(id  RAW (16) NOT NULL,
> date_loaded TIMESTAMP WITH TIME ZONE,
> po_document CLOB
> CONSTRAINT ensure_json CHECK (po_document IS JSON));
>
> Data that I inserted was -
>
> { "PONumber" : 1600,
>   "Reference": "ABULL-20140421",
>   "Requestor": "Alexis Bull",
>   "User" : "ABULL",
>   "CostCenter"   : "A50",
>   "ShippingInstructions" : { "name"   : "Alexis Bull",
>  "Address": { "street"  : "200 Sporting Green",
>   "city": "South San Francisco",
>   "state"   : "CA",
>   "zipCode" : 99236,
>   "country" : "United States of 
> America" },
>  "Phone" : [ { "type" : "Office", "number" : 
> "909-555-7307 <(909)%20555-7307>" },
>  { "type" : "Mobile", "number" : 
> "415-555-1234 <(415)%20555-1234>" } ] },
>   "Special Instructions" : null,
>   "AllowPartialShipment" : false,
>   "LineItems": [ { "ItemNumber" : 1,
>"Part"   : { "Description" : "One Magic 
> Christmas",
> "UnitPrice"   : 19.95,
> "UPCCode" : 13131092899 },
>"Quantity"   : 9.0 },
>      { "ItemNumber" : 2,
>"Part"   : { "Description" : "Lethal 
> Weapon",
> "UnitPrice"   : 19.95,
> "UPCCode" : 85391628927 },
>"Quantity"   : 5.0 } ] }
>
>
> On Fri, Jul 21, 2017 at 10:12 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>
>> Could you share the schema of your Oracle table and open a JIRA?
>>
>> Thanks!
>>
>> Xiao
>>
>>
>> 2017-07-21 9:40 GMT-07:00 Cassa L <lcas...@gmail.com>:
>>
>>> I am using 2.2.0. I resolved the problem by removing SELECT * and adding
>>> column names to the SELECT statement. That works. I'm wondering why SELECT
>>> * will not work.
>>>
>>> Regards,
>>> Leena
>>>
>>> On Fri, Jul 21, 2017 at 8:21 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> Could you try 2.2? We fixed multiple Oracle related issues in the
>>>> latest release.
>>>>
>>>> Thanks
>>>>
>>>> Xiao
>>>>
>>>>
>>>> On Wed, 19 Jul 2017 at 11:10 PM Cassa L <lcas...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I am trying to use Spark to read from Oracle (12.1) table using Spark
>>>>> 2.0. My table has JSON data.  I am getting below exception in my code. Any
>>>>> clue?
>>>>>
>>>>> >>>>>
>>>>> java.sql.SQLException: Unsupported type -101
>>>>>
>>>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.o
>>>>> rg$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$ge
>>>>> tCatalystType(JdbcUtils.scala:233)
>>>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$a
>>>>> nonfun$8.apply(JdbcUtils.scala:290)
>>>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$a
>>>>> nonfun$8.apply(JdbcUtils.scala:290)
>>>>> at scala.Option.getOrElse(Option.scala:121)
>>>>> at
>>>>>
>>>>> ==
>>>>> My code is very simple.
>>>>>
>>>>> SparkSession spark = SparkSession
>>>>> .builder()
>>>>> .appName("Oracle Example")
>>>>> .master("local[4]")
>>>>> .getOrCreate();
>>>>>
>>>>> final Properties connectionProperties = new Properties();
>>>>> connectionProperties.put("user", *"some_user"*));
>>>>> connectionProperties.put("password", "some_pwd"));
>>>>>
>>>>> final String dbTable =
>>>>> "(select *  from  MySampleTable)";
>>>>>
>>>>> Dataset jdbcDF = spark.read().jdbc(*URL*, dbTable, 
>>>>> connectionProperties);
>>>>>
>>>>>
>>>
>>
>


Spark Structured Streaming - Spark Consumer does not display messages

2017-07-21 Thread Cassa L
Hi,
This is the first time I am trying Structured Streaming with Kafka. I have
simple code to read from Kafka and display it on the console. The messages are in
JSON format. However, when I run my code, nothing after the lines below gets
printed.

17/07/21 13:43:41 INFO AppInfoParser: Kafka commitId : a7a17cdec9eaa6c5
17/07/21 13:43:41 INFO StreamExecution: Starting new streaming query.
17/07/21 13:43:42 INFO AbstractCoordinator: Discovered coordinator XXX:9092
(id: 2147483647 rack: null) for group
spark-kafka-source-085b8fda-7c01-435d-99db-b67a94dafa3f-1814340906-driver-0.
17/07/21 13:43:42 INFO AbstractCoordinator: Marking the coordinatorXXX:9092
(id: 2147483647 rack: null) dead for group
spark-kafka-source-085b8fda-7c01-435d-99db-b67a94dafa3f-1814340906-driver-0


Code is -

Dataset<Row> kafkaStream = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers",
                config.getString("kafka.host") + ":" + config.getString("kafka.port"))
        .option("subscribe", "test")
        .load();
//kafkaStream.printSchema();

//JSON ::: {"id":1,"name":"MySelf"}
StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("name", DataTypes.StringType, false)});

Dataset<KafkaMessage> streamingSelectDF =
        kafkaStream.selectExpr("CAST(value AS STRING) as message")
                .select(functions.from_json(functions.col("message"), schema).as("json"))
                .select("json.*")
                .as(Encoders.bean(KafkaMessage.class));
streamingSelectDF.createOrReplaceTempView("MyView");

Dataset<Row> streamData = spark.sql("SELECT count(*) from MyView");

StreamingQuery streamingQuery = streamData.writeStream()
        .format("console")
        .outputMode("complete")
        .trigger(Trigger.ProcessingTime("10 seconds"))
        .start();
try {
    streamingQuery.awaitTermination();
} catch (StreamingQueryException e) {
    e.printStackTrace();
}


Regards,

Leena


Re: Spark 2.0 and Oracle 12.1 error

2017-07-21 Thread Cassa L
Hi Xiao,
I am trying the JSON sample table provided by Oracle 12c. It is on the website:
https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6371

CREATE TABLE j_purchaseorder
   (id  RAW (16) NOT NULL,
date_loaded TIMESTAMP WITH TIME ZONE,
po_document CLOB
CONSTRAINT ensure_json CHECK (po_document IS JSON));

Data that I inserted was -

{ "PONumber" : 1600,
  "Reference": "ABULL-20140421",
  "Requestor": "Alexis Bull",
  "User" : "ABULL",
  "CostCenter"   : "A50",
  "ShippingInstructions" : { "name"   : "Alexis Bull",
 "Address": { "street"  : "200 Sporting Green",
  "city": "South San Francisco",
  "state"   : "CA",
  "zipCode" : 99236,
  "country" : "United States
of America" },
 "Phone" : [ { "type" : "Office", "number"
: "909-555-7307" },
 { "type" : "Mobile", "number"
: "415-555-1234" } ] },
  "Special Instructions" : null,
  "AllowPartialShipment" : false,
  "LineItems": [ { "ItemNumber" : 1,
   "Part"   : { "Description" : "One
Magic Christmas",
"UnitPrice"   : 19.95,
"UPCCode" : 13131092899 },
   "Quantity"   : 9.0 },
 { "ItemNumber" : 2,
   "Part"   : { "Description" : "Lethal Weapon",
"UnitPrice"   : 19.95,
"UPCCode" : 85391628927 },
   "Quantity"   : 5.0 } ] }


On Fri, Jul 21, 2017 at 10:12 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> Could you share the schema of your Oracle table and open a JIRA?
>
> Thanks!
>
> Xiao
>
>
> 2017-07-21 9:40 GMT-07:00 Cassa L <lcas...@gmail.com>:
>
>> I am using 2.2.0. I resolved the problem by removing SELECT * and adding
>> column names to the SELECT statement. That works. I'm wondering why SELECT
>> * will not work.
>>
>> Regards,
>> Leena
>>
>> On Fri, Jul 21, 2017 at 8:21 AM, Xiao Li <gatorsm...@gmail.com> wrote:
>>
>>> Could you try 2.2? We fixed multiple Oracle related issues in the latest
>>> release.
>>>
>>> Thanks
>>>
>>> Xiao
>>>
>>>
>>> On Wed, 19 Jul 2017 at 11:10 PM Cassa L <lcas...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I am trying to use Spark to read from Oracle (12.1) table using Spark
>>>> 2.0. My table has JSON data.  I am getting below exception in my code. Any
>>>> clue?
>>>>
>>>> >>>>>
>>>> java.sql.SQLException: Unsupported type -101
>>>>
>>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.o
>>>> rg$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$
>>>> getCatalystType(JdbcUtils.scala:233)
>>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$a
>>>> nonfun$8.apply(JdbcUtils.scala:290)
>>>> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$a
>>>> nonfun$8.apply(JdbcUtils.scala:290)
>>>> at scala.Option.getOrElse(Option.scala:121)
>>>> at
>>>>
>>>> ==
>>>> My code is very simple.
>>>>
>>>> SparkSession spark = SparkSession
>>>> .builder()
>>>> .appName("Oracle Example")
>>>> .master("local[4]")
>>>> .getOrCreate();
>>>>
>>>> final Properties connectionProperties = new Properties();
>>>> connectionProperties.put("user", *"some_user"*));
>>>> connectionProperties.put("password", "some_pwd"));
>>>>
>>>> final String dbTable =
>>>> "(select *  from  MySampleTable)";
>>>>
>>>> Dataset jdbcDF = spark.read().jdbc(*URL*, dbTable, 
>>>> connectionProperties);
>>>>
>>>>
>>
>


Re: Spark 2.0 and Oracle 12.1 error

2017-07-21 Thread Cassa L
I am using 2.2.0. I resolved the problem by removing SELECT * and adding
column names to the SELECT statement. That works. I'm wondering why SELECT
* will not work.

Regards,
Leena
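
For anyone hitting the same error, this is roughly what the working read looks like. A
sketch only: it assumes the -101 in the exception is Oracle's TIMESTAMP WITH TIME ZONE
type (which matches the date_loaded column in the j_purchaseorder sample table shown
earlier in this archive), and the class/method names and connection details are just for
illustration.

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of the workaround described above: name the columns instead of SELECT *.
public final class OracleJsonRead {
    public static Dataset<Row> read(SparkSession spark, String url, Properties props) {
        // date_loaded is TIMESTAMP WITH TIME ZONE, which the JDBC source rejected here,
        // so cast it to a plain TIMESTAMP in the pushed-down subquery (or leave it out).
        String dbTable = "(select id, cast(date_loaded as timestamp) as date_loaded, po_document"
                       + " from j_purchaseorder)";
        return spark.read().jdbc(url, dbTable, props);
    }
}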

On Fri, Jul 21, 2017 at 8:21 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> Could you try 2.2? We fixed multiple Oracle related issues in the latest
> release.
>
> Thanks
>
> Xiao
>
>
> On Wed, 19 Jul 2017 at 11:10 PM Cassa L <lcas...@gmail.com> wrote:
>
>> Hi,
>> I am trying to use Spark to read from Oracle (12.1) table using Spark
>> 2.0. My table has JSON data.  I am getting below exception in my code. Any
>> clue?
>>
>> >>>>>
>> java.sql.SQLException: Unsupported type -101
>>
>> at org.apache.spark.sql.execution.datasources.jdbc.
>> JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$
>> getCatalystType(JdbcUtils.scala:233)
>> at org.apache.spark.sql.execution.datasources.jdbc.
>> JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
>> at org.apache.spark.sql.execution.datasources.jdbc.
>> JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
>> at scala.Option.getOrElse(Option.scala:121)
>> at
>>
>> ==
>> My code is very simple.
>>
>> SparkSession spark = SparkSession
>> .builder()
>> .appName("Oracle Example")
>> .master("local[4]")
>> .getOrCreate();
>>
>> final Properties connectionProperties = new Properties();
>> connectionProperties.put("user", *"some_user"*));
>> connectionProperties.put("password", "some_pwd"));
>>
>> final String dbTable =
>> "(select *  from  MySampleTable)";
>>
>> Dataset jdbcDF = spark.read().jdbc(*URL*, dbTable, 
>> connectionProperties);
>>
>>


How to use Update statement or call stored procedure of Oracle from Spark

2017-07-20 Thread Cassa L
Hi,
I want to use Spark to parallelize some update operations on an Oracle
database. However, I could not find a way to call UPDATE statements (UPDATE
Employee WHERE ???), use transactions, or call stored procedures from
Spark/JDBC.
Has anyone had this use case before, and how did you solve it?

Thanks,
Leena
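
For what it's worth, one common approach is to run the UPDATE or stored-procedure call
over plain JDBC inside foreachPartition, since the built-in JDBC writer only does
INSERT-style saves. A sketch only; the EMP_UPDATE procedure, its parameters, and the
EmpChange POJO are placeholders:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

import org.apache.spark.api.java.JavaRDD;

// Sketch: call an Oracle stored procedure for each record, one connection per partition.
public final class OracleProcCaller {
    public static void callProc(JavaRDD<EmpChange> changes, String jdbcUrl, String user, String pwd) {
        changes.foreachPartition(partition -> {
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pwd);
                 CallableStatement cs = conn.prepareCall("{call EMP_UPDATE(?, ?)}")) {
                while (partition.hasNext()) {
                    EmpChange c = partition.next();
                    cs.setLong(1, c.getEmployeeId());
                    cs.setString(2, c.getNewStatus());
                    cs.execute();
                }
            }
        });
    }
}

The driver only coordinates; the JDBC calls themselves run on the executors, so the Oracle
driver jar has to be distributed with the job.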


Spark-2.0 and Oracle 12.1 error: Unsupported type -101

2017-07-20 Thread Cassa L
Hi,
I am trying to read data into Spark from Oracle using ojdb7 driver. The
data is in JSON format. I am getting below error. Any idea on how to
resolve it?

ava.sql.SQLException: Unsupported type -101

at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystTy


My code is very simple

SparkSession spark = SparkSession
        .builder()
        .appName("Oracle Example")
        .master("local[4]")
        .getOrCreate();

final Properties connectionProperties = new Properties();
connectionProperties.put("user", "XXX");
connectionProperties.put("password", "XXX");

final String dbTable =
        "(select * from MyTable)";

Dataset<Row> jdbcDF = spark.read().jdbc(X, dbTable, connectionProperties);
//jdbcDF.show();


Spark 2.0 and Oracle 12.1 error

2017-07-20 Thread Cassa L
Hi,
I am trying to use Spark to read from Oracle (12.1) table using Spark 2.0.
My table has JSON data.  I am getting below exception in my code. Any clue?

>
java.sql.SQLException: Unsupported type -101

at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:233)
at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
at scala.Option.getOrElse(Option.scala:121)
at

==
My code is very simple.

SparkSession spark = SparkSession
        .builder()
        .appName("Oracle Example")
        .master("local[4]")
        .getOrCreate();

final Properties connectionProperties = new Properties();
connectionProperties.put("user", "some_user");
connectionProperties.put("password", "some_pwd");

final String dbTable =
        "(select * from MySampleTable)";

Dataset<Row> jdbcDF = spark.read().jdbc(URL, dbTable, connectionProperties);


Spark job server pros and cons

2016-12-09 Thread Cassa L
Hi,
So far, I have run Spark jobs directly using spark-submit options. I now have a use
case for running jobs through the Spark Job Server, and I wanted to find out the pros and
cons of using it. If anyone can share their experience, that would be great.
My jobs usually connect to multiple data sources like Kafka, a custom
receiver, Cassandra, etc. Will these use cases work as-is in the job server?

Thanks,
Leena


Custom receiver for WebSocket in Spark not working

2016-11-02 Thread Cassa L
Hi,
I am using Spark 1.6. I wrote a custom receiver to read from a WebSocket. But
when I start my Spark job, it connects to the WebSocket but doesn't get
any messages. The same code, written as a standalone Scala class, works and
prints messages from the WebSocket. Is anything missing in my Spark code? There
are no errors in the Spark console.

Here is my receiver -

import org.apache.spark.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.jfarcand.wcs.{MessageListener, TextListener, WebSocket}

/**
  * Custom receiver for WebSocket
  */
class WebSocketReceiver extends
Receiver[String](StorageLevel.MEMORY_ONLY) with Runnable with Logging
{

  private var webSocket: WebSocket = _

  @transient
  private var thread: Thread = _

  override def onStart(): Unit = {
thread = new Thread(this)
thread.start()
  }

  override def onStop(): Unit = {
setWebSocket(null)
thread.interrupt()
  }

  override def run(): Unit = {
println("Received ")
receive()
  }

  private def receive(): Unit = {
    val connection = WebSocket().open("ws://localhost:3001")
    println("WebSocket Connected ...")
    println("Connected --- " + connection)
    setWebSocket(connection)

    connection.listener(new TextListener {
      override def onMessage(message: String) {
        System.out.println("Message in Spark client is --> " + message)
      }
    })
  }

  private def setWebSocket(newWebSocket: WebSocket) = synchronized {
    if (webSocket != null) {
      webSocket.shutDown
    }
    webSocket = newWebSocket
  }

}


=

Here is the code for the Spark job:


object WebSocketTestApp {

  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Test Web Socket")
      .setMaster("local[20]")
      .set("test", "")
    val ssc = new StreamingContext(conf, Seconds(5))

    val stream: ReceiverInputDStream[String] =
      ssc.receiverStream(new WebSocketReceiver())
    stream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

==


Thanks,

LCassa


Reading old tweets from twitter in spark

2016-10-26 Thread Cassa L
Hi,
I am using Spark Streaming to read tweets from Twitter. It works fine. Now
I want to be able to fetch older tweets in my Spark code. Twitter4j has an API
to set the date:
http://twitter4j.org/oldjavadocs/4.0.4/twitter4j/Query.html

Is there a way to set this using TwitterUtils or do I need to write
different code?


Thanks.
LCassa


Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Cassa L
On Thu, Jun 16, 2016 at 5:27 AM, Deepak Goel <deic...@gmail.com> wrote:

> What is your hardware configuration like which you are running Spark on?
>
> It is 24 cores, 128 GB RAM.

> Hey
>
> Namaskara~Nalama~Guten Tag~Bonjour
>
>
>--
> Keigu
>
> Deepak
> 73500 12833
> www.simtree.net, dee...@simtree.net
> deic...@gmail.com
>
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
>
> "Contribute to the world, environment and more :
> http://www.gridrepublic.org
> "
>
> On Thu, Jun 16, 2016 at 5:33 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> What do you see under Executors and Details for Stage (for the
>> affected stages)? Anything weird memory-related?
>>
>> How does your "I am reading data from Kafka into Spark and writing it
>> into Cassandra after processing it." pipeline look like?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Mon, Jun 13, 2016 at 11:56 PM, Cassa L <lcas...@gmail.com> wrote:
>> > Hi,
>> >
>> > I'm using spark 1.5.1 version. I am reading data from Kafka into Spark
>> and
>> > writing it into Cassandra after processing it. Spark job starts fine and
>> > runs all good for some time until I start getting below errors. Once
>> these
>> > errors come, job start to lag behind and I see that job has scheduling
>> and
>> > processing delays in streaming  UI.
>> >
>> > Worker memory is 6GB, executor-memory is 5GB, I also tried to tweak
>> > memoryFraction parameters. Nothing works.
>> >
>> >
>> > 16/06/13 21:26:02 INFO MemoryStore: ensureFreeSpace(4044) called with
>> > curMem=565394, maxMem=2778495713
>> > 16/06/13 21:26:02 INFO MemoryStore: Block broadcast_69652_piece0 stored
>> as
>> > bytes in memory (estimated size 3.9 KB, free 2.6 GB)
>> > 16/06/13 21:26:02 INFO TorrentBroadcast: Reading broadcast variable
>> 69652
>> > took 2 ms
>> > 16/06/13 21:26:02 WARN MemoryStore: Failed to reserve initial memory
>> > threshold of 1024.0 KB for computing block broadcast_69652 in memory.
>> > 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache
>> > broadcast_69652 in memory! (computed 496.0 B so far)
>> > 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) +
>> 2.6 GB
>> > (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6
>> GB.
>> > 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to
>> disk
>> > instead.
>> > 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
>> > 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID
>> > 452316). 2043 bytes result sent to driver
>> >
>> >
>> > Thanks,
>> >
>> > L
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-16 Thread Cassa L
Hi,

>
> What do you see under Executors and Details for Stage (for the
> affected stages)? Anything weird memory-related?
>
Under the Executors tab, the logs show these warnings:

16/06/16 20:45:40 INFO TorrentBroadcast: Reading broadcast variable
422145 took 1 ms
16/06/16 20:45:40 WARN MemoryStore: Failed to reserve initial memory
threshold of 1024.0 KB for computing block broadcast_422145 in memory.
16/06/16 20:45:40 WARN MemoryStore: Not enough space to cache
broadcast_422145 in memory! (computed 496.0 B so far)
16/06/16 20:45:40 INFO MemoryStore: Memory use = 147.9 KB (blocks) +
2.2 GB (scratch space shared across 0 tasks(s)) = 2.2 GB. Storage
limit = 2.2 GB.
16/06/16 20:45:40 WARN MemoryStore: Persisting block broadcast_422145
to disk instead.
16/06/16 20:45:40 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 70278, fetching them

16/06/16 20:45:40 INFO MapOutputTrackerWorker: Doing the fetch; tracker
endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://
sparkDriver@17.40.240.71:46187/user/MapOutputTracker#-1794035569])

I don't see any memory-related errors on the Stages tab.

>
> How does your "I am reading data from Kafka into Spark and writing it
> into Cassandra after processing it." pipeline look like?
>
This part has no issues. Reading from Kafka is always up to date; there
are no offset lags. Writing to Cassandra is also fine, taking less than 1 ms
per write.


> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Jun 13, 2016 at 11:56 PM, Cassa L <lcas...@gmail.com> wrote:
> > Hi,
> >
> > I'm using spark 1.5.1 version. I am reading data from Kafka into Spark
> and
> > writing it into Cassandra after processing it. Spark job starts fine and
> > runs all good for some time until I start getting below errors. Once
> these
> > errors come, job start to lag behind and I see that job has scheduling
> and
> > processing delays in streaming  UI.
> >
> > Worker memory is 6GB, executor-memory is 5GB, I also tried to tweak
> > memoryFraction parameters. Nothing works.
> >
> >
> > 16/06/13 21:26:02 INFO MemoryStore: ensureFreeSpace(4044) called with
> > curMem=565394, maxMem=2778495713
> > 16/06/13 21:26:02 INFO MemoryStore: Block broadcast_69652_piece0 stored
> as
> > bytes in memory (estimated size 3.9 KB, free 2.6 GB)
> > 16/06/13 21:26:02 INFO TorrentBroadcast: Reading broadcast variable 69652
> > took 2 ms
> > 16/06/13 21:26:02 WARN MemoryStore: Failed to reserve initial memory
> > threshold of 1024.0 KB for computing block broadcast_69652 in memory.
> > 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache
> > broadcast_69652 in memory! (computed 496.0 B so far)
> > 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6
> GB
> > (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6
> GB.
> > 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to
> disk
> > instead.
> > 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
> > 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID
> > 452316). 2043 bytes result sent to driver
> >
> >
> > Thanks,
> >
> > L
>


Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-15 Thread Cassa L
Hi,
I did set --driver-memory 4G. I still run into this issue after 1 hour of
data load.

I also tried version 1.6 in a test environment; I hit this issue much faster
than in the 1.5.1 setup.
LCassa
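
For context, these are the kinds of knobs being tweaked here. A sketch only: the exact
memory-fraction values are illustrative, the helper class is hypothetical, and on Spark
1.5.x these are the legacy memory-manager settings (1.6+ replaces them with unified memory
via spark.memory.fraction).

import org.apache.spark.SparkConf;

// Hypothetical helper collecting the memory settings discussed in this thread.
public final class MemoryTuning {
    public static SparkConf tunedConf() {
        return new SparkConf()
            .setAppName("kafka-to-cassandra")
            .set("spark.executor.memory", "5g")          // executor heap (worker has 6 GB)
            .set("spark.storage.memoryFraction", "0.4")  // share for cached blocks/broadcasts (default 0.6)
            .set("spark.shuffle.memoryFraction", "0.4"); // share for shuffle/aggregation buffers (default 0.2)
        // Driver memory normally has to be passed as --driver-memory on spark-submit,
        // since the driver JVM is already running by the time SparkConf is read.
    }
}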

On Tue, Jun 14, 2016 at 3:57 PM, Gaurav Bhatnagar <gaura...@gmail.com>
wrote:

> try setting the option --driver-memory 4G
>
> On Tue, Jun 14, 2016 at 3:52 PM, Ben Slater <ben.sla...@instaclustr.com>
> wrote:
>
>> A high level shot in the dark but in our testing we found Spark 1.6 a lot
>> more reliable in low memory situations (presumably due to
>> https://issues.apache.org/jira/browse/SPARK-1). If it’s an option,
>> probably worth a try.
>>
>> Cheers
>> Ben
>>
>> On Wed, 15 Jun 2016 at 08:48 Cassa L <lcas...@gmail.com> wrote:
>>
>>> Hi,
>>> I would appreciate any clue on this. It has become a bottleneck for our
>>> spark job.
>>>
>>> On Mon, Jun 13, 2016 at 2:56 PM, Cassa L <lcas...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm using spark 1.5.1 version. I am reading data from Kafka into Spark and 
>>>> writing it into Cassandra after processing it. Spark job starts fine and 
>>>> runs all good for some time until I start getting below errors. Once these 
>>>> errors come, job start to lag behind and I see that job has scheduling and 
>>>> processing delays in streaming  UI.
>>>>
>>>> Worker memory is 6GB, executor-memory is 5GB, I also tried to tweak 
>>>> memoryFraction parameters. Nothing works.
>>>>
>>>>
>>>> 16/06/13 21:26:02 INFO MemoryStore: ensureFreeSpace(4044) called with 
>>>> curMem=565394, maxMem=2778495713
>>>> 16/06/13 21:26:02 INFO MemoryStore: Block broadcast_69652_piece0 stored as 
>>>> bytes in memory (estimated size 3.9 KB, free 2.6 GB)
>>>> 16/06/13 21:26:02 INFO TorrentBroadcast: Reading broadcast variable 69652 
>>>> took 2 ms
>>>> 16/06/13 21:26:02 WARN MemoryStore: Failed to reserve initial memory 
>>>> threshold of 1024.0 KB for computing block broadcast_69652 in memory.
>>>> 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache 
>>>> broadcast_69652 in memory! (computed 496.0 B so far)
>>>> 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6 
>>>> GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6 
>>>> GB.
>>>> 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to 
>>>> disk instead.
>>>> 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
>>>> 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID 
>>>> 452316). 2043 bytes result sent to driver
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> L
>>>>
>>>>
>>> --
>> 
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>
>


Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-14 Thread Cassa L
Hi,
I would appreciate any clue on this. It has become a bottleneck for our
spark job.

On Mon, Jun 13, 2016 at 2:56 PM, Cassa L <lcas...@gmail.com> wrote:

> Hi,
>
> I'm using spark 1.5.1 version. I am reading data from Kafka into Spark and 
> writing it into Cassandra after processing it. Spark job starts fine and runs 
> all good for some time until I start getting below errors. Once these errors 
> come, job start to lag behind and I see that job has scheduling and 
> processing delays in streaming  UI.
>
> Worker memory is 6GB, executor-memory is 5GB, I also tried to tweak 
> memoryFraction parameters. Nothing works.
>
>
> 16/06/13 21:26:02 INFO MemoryStore: ensureFreeSpace(4044) called with 
> curMem=565394, maxMem=2778495713
> 16/06/13 21:26:02 INFO MemoryStore: Block broadcast_69652_piece0 stored as 
> bytes in memory (estimated size 3.9 KB, free 2.6 GB)
> 16/06/13 21:26:02 INFO TorrentBroadcast: Reading broadcast variable 69652 
> took 2 ms
> 16/06/13 21:26:02 WARN MemoryStore: Failed to reserve initial memory 
> threshold of 1024.0 KB for computing block broadcast_69652 in memory.
> 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache broadcast_69652 
> in memory! (computed 496.0 B so far)
> 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6 GB 
> (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6 GB.
> 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to disk 
> instead.
> 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
> 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID 
> 452316). 2043 bytes result sent to driver
>
>
> Thanks,
>
> L
>
>


Spark Memory Error - Not enough space to cache broadcast

2016-06-13 Thread Cassa L
Hi,

I'm using Spark version 1.5.1. I am reading data from Kafka into Spark
and writing it into Cassandra after processing it. The Spark job starts
fine and runs well for some time until I start getting the errors below.
Once these errors appear, the job starts to lag behind and I see that the
job has scheduling and processing delays in the streaming UI.

Worker memory is 6GB and executor memory is 5GB; I also tried to tweak the
memoryFraction parameters. Nothing works.


16/06/13 21:26:02 INFO MemoryStore: ensureFreeSpace(4044) called with
curMem=565394, maxMem=2778495713
16/06/13 21:26:02 INFO MemoryStore: Block broadcast_69652_piece0
stored as bytes in memory (estimated size 3.9 KB, free 2.6 GB)
16/06/13 21:26:02 INFO TorrentBroadcast: Reading broadcast variable
69652 took 2 ms
16/06/13 21:26:02 WARN MemoryStore: Failed to reserve initial memory
threshold of 1024.0 KB for computing block broadcast_69652 in memory.
16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache
broadcast_69652 in memory! (computed 496.0 B so far)
16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) +
2.6 GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage
limit = 2.6 GB.
16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652
to disk instead.
16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0
(TID 452316). 2043 bytes result sent to driver


Thanks,

L


Re: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Cassa L
I tried all combinations of spark-cassandra connector. Didn't work.
Finally, I downgraded spark to 1.5.1 and now it works.
LCassa

On Wed, May 18, 2016 at 11:11 AM, Mohammed Guller <moham...@glassbeam.com>
wrote:

> As Ben mentioned, Spark 1.5.2 does work with C*.  Make sure that you are
> using the correct version of the Spark Cassandra Connector.
>
>
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
> *Sent:* Tuesday, May 17, 2016 11:00 PM
> *To:* u...@cassandra.apache.org; Mohammed Guller
> *Cc:* user
>
> *Subject:* Re: Accessing Cassandra data from Spark Shell
>
>
>
> It definitely should be possible for 1.5.2 (I have used it with
> spark-shell and cassandra connector with 1.4.x). The main trick is in
> lining up all the versions and building an appropriate connector jar.
>
>
>
> Cheers
>
> Ben
>
>
>
> On Wed, 18 May 2016 at 15:40 Cassa L <lcas...@gmail.com> wrote:
>
> Hi,
>
> I followed instructions to run SparkShell with Spark-1.6. It works fine.
> However, I need to use spark-1.5.2 version. With it, it does not work. I
> keep getting NoSuchMethod Errors. Is there any issue running Spark Shell
> for Cassandra using older version of Spark?
>
>
>
>
>
> Regards,
>
> LCassa
>
>
>
> On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller <moham...@glassbeam.com>
> wrote:
>
> Yes, it is very simple to access Cassandra data using Spark shell.
>
>
>
> Step 1: Launch the spark-shell with the spark-cassandra-connector package
>
> $SPARK_HOME/bin/spark-shell --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0
>
>
>
> Step 2: Create a DataFrame pointing to your Cassandra table
>
> val dfCassTable = sqlContext.read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
>   .load()
>
>
>
> From this point onward, you have complete access to the DataFrame API. You
> can even register it as a temporary table, if you would prefer to use
> SQL/HiveQL.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
> *Sent:* Monday, May 9, 2016 9:28 PM
> *To:* u...@cassandra.apache.org; user
> *Subject:* Re: Accessing Cassandra data from Spark Shell
>
>
>
> You can use SparkShell to access Cassandra via the Spark Cassandra
> connector. The getting started article on our support page will probably
> give you a good steer to get started even if you’re not using Instaclustr:
> https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-
>
>
>
> Cheers
>
> Ben
>
>
>
> On Tue, 10 May 2016 at 14:08 Cassa L <lcas...@gmail.com> wrote:
>
> Hi,
>
> Has anyone tried accessing Cassandra data using SparkShell? How do you do
> it? Can you use HiveContext for Cassandra data? I'm using community version
> of Cassandra-3.0
>
>
>
> Thanks,
>
> LCassa
>
> --
>
> 
>
> Ben Slater
>
> Chief Product Officer, Instaclustr
>
> +61 437 929 798
>
>
>
> --
>
> 
>
> Ben Slater
>
> Chief Product Officer, Instaclustr
>
> +61 437 929 798
>


Re: Accessing Cassandra data from Spark Shell

2016-05-17 Thread Cassa L
Hi,
I followed the instructions to run the Spark shell with Spark 1.6. It works fine.
However, I need to use Spark 1.5.2. With that version it does not work; I
keep getting NoSuchMethodError exceptions. Is there any issue running the Spark shell
against Cassandra with the older version of Spark?


Regards,
LCassa

On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller <moham...@glassbeam.com>
wrote:

> Yes, it is very simple to access Cassandra data using Spark shell.
>
>
>
> Step 1: Launch the spark-shell with the spark-cassandra-connector package
>
> $SPARK_HOME/bin/spark-shell --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0
>
>
>
> Step 2: Create a DataFrame pointing to your Cassandra table
>
> val dfCassTable = sqlContext.read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
>   .load()
>
>
>
> From this point onward, you have complete access to the DataFrame API. You
> can even register it as a temporary table, if you would prefer to use
> SQL/HiveQL.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
> *Sent:* Monday, May 9, 2016 9:28 PM
> *To:* u...@cassandra.apache.org; user
> *Subject:* Re: Accessing Cassandra data from Spark Shell
>
>
>
> You can use SparkShell to access Cassandra via the Spark Cassandra
> connector. The getting started article on our support page will probably
> give you a good steer to get started even if you’re not using Instaclustr:
> https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-
>
>
>
> Cheers
>
> Ben
>
>
>
> On Tue, 10 May 2016 at 14:08 Cassa L <lcas...@gmail.com> wrote:
>
> Hi,
>
> Has anyone tried accessing Cassandra data using SparkShell? How do you do
> it? Can you use HiveContext for Cassandra data? I'm using community version
> of Cassandra-3.0
>
>
>
> Thanks,
>
> LCassa
>
> --
>
> 
>
> Ben Slater
>
> Chief Product Officer, Instaclustr
>
> +61 437 929 798
>


Accessing Cassandra data from Spark Shell

2016-05-09 Thread Cassa L
Hi,
Has anyone tried accessing Cassandra data using SparkShell? How do you do
it? Can you use HiveContext for Cassandra data? I'm using community version
of Cassandra-3.0

Thanks,
LCassa


Re: Regarding sliding window example from Databricks for DStream

2016-01-12 Thread Cassa L
Any thoughts on this? I want to know when the window duration is complete,
not just each slide of the window. Is there a way I can catch the end of the
window duration, or do I need to keep track of it myself, and if so, how?

LCassa

On Mon, Jan 11, 2016 at 3:09 PM, Cassa L <lcas...@gmail.com> wrote:

> Hi,
>  I'm trying to work with sliding window example given by databricks.
>
> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html
>
> It works fine as expected.
> My question is: how do I determine when the sliding has covered the full window
> duration? I want to perform a final operation and notify another system when the
> end of the window duration has been reached, e.g. in the example below
> from Databricks:
>
>
> JavaDStream<ApacheAccessLog> windowDStream =
>     accessLogDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL);
> windowDStream.foreachRDD(accessLogs -> {
>   if (accessLogs.count() == 0) {
>     System.out.println("No access logs in this time interval");
>     return null;
>   }
>
>   // Insert code verbatim from LogAnalyzer.java or LogAnalyzerSQL.java here.
>
>   // Calculate statistics based on the content size.
>   JavaRDD<Long> contentSizes =
>       accessLogs.map(ApacheAccessLog::getContentSize).cache();
>   System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
>       contentSizes.reduce(SUM_REDUCER) / contentSizes.count(),
>       contentSizes.min(Comparator.naturalOrder()),
>       contentSizes.max(Comparator.naturalOrder())));
>
>   // ...
> });
>
> I want to check whether the overall average at the end of the window falls below a
> certain value and send an alert. How do I do this?
>
>
> Thanks,
> LCassa
>
>


Regarding sliding window example from Databricks for DStream

2016-01-11 Thread Cassa L
Hi,
 I'm trying to work with sliding window example given by databricks.
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter1/windows.html

It works fine as expected.
My question is: how do I determine when the sliding has covered the full window
duration? I want to perform a final operation and notify another system when the
end of the window duration has been reached, e.g. in the example below
from Databricks:


JavaDStream<ApacheAccessLog> windowDStream =
    accessLogDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL);
windowDStream.foreachRDD(accessLogs -> {
  if (accessLogs.count() == 0) {
    System.out.println("No access logs in this time interval");
    return null;
  }

  // Insert code verbatim from LogAnalyzer.java or LogAnalyzerSQL.java here.

  // Calculate statistics based on the content size.
  JavaRDD<Long> contentSizes =
      accessLogs.map(ApacheAccessLog::getContentSize).cache();
  System.out.println(String.format("Content Size Avg: %s, Min: %s, Max: %s",
      contentSizes.reduce(SUM_REDUCER) / contentSizes.count(),
      contentSizes.min(Comparator.naturalOrder()),
      contentSizes.max(Comparator.naturalOrder())));

  // ...
});

I want to check whether the overall average at the end of the window falls
below a certain value and send an alert. How do I do this?


Thanks,
LCassa


Spark streaming job hangs

2015-11-30 Thread Cassa L
Hi,
I am reading data from Kafka into Spark. It runs fine for some time but
then hangs forever with the following output. I don't see any errors in the logs.
How do I debug this?

2015-12-01 06:04:30,697 [dag-scheduler-event-loop] INFO  (Logging.scala:59)
- Adding task set 19.0 with 4 tasks
2015-12-01 06:04:30,872 [pool-13-thread-1] INFO  (Logging.scala:59) -
Disconnected from Cassandra cluster: APG DEV Cluster
2015-12-01 06:04:35,060 [JobGenerator] INFO  (Logging.scala:59) - Added
jobs for time 1448949875000 ms
2015-12-01 06:04:40,054 [JobGenerator] INFO  (Logging.scala:59) - Added
jobs for time 144894988 ms
2015-12-01 06:04:45,034 [JobGenerator] INFO  (Logging.scala:59) - Added
jobs for time 1448949885000 ms
2015-12-01 06:04:50,100 [JobGenerator] INFO  (Logging.scala:59) - Added
jobs for time 144894989 ms
2015-12-01 06:04:55,064 [JobGenerator] INFO  (Logging.scala:59) - Added
jobs for time 1448949895000 ms
2015-12-01 06:05:00,125 [JobGenerator] INFO  (Logging.scala:59) - Added
jobs for time 144894990 ms


Thanks
LCassa


how to group timestamp data and filter on it

2015-11-18 Thread Cassa L
Hi,
I have a data stream (JavaDStream) in following format-
timestamp=second1,  map(key1=value1, key2=value2)
timestamp=second2,map(key1=value3, key2=value4)
timestamp=second2, map(key1=value1, key2=value5)


I want to group the data by 'timestamp' first and then filter each RDD for
key1=value1 or key1=value3, etc.

Each of the above rows is represented by a POJO in the RDD, like:

public class Data {
    long timestamp;
    Map map;
}

How do I do this in Spark? I am trying to figure out whether I need to use map,
flatMap, etc.
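
For what it's worth, one possible shape of this is sketched below. It assumes the map
values are strings, the elements are the Data POJO above with hypothetical
getTimestamp()/getMap() accessors, and the class/method names are just for illustration;
it would be called on each micro-batch, e.g. from the DStream's foreachRDD.

import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// Sketch: group one micro-batch by timestamp, then filter on key1.
public final class GroupAndFilter {
    public static JavaRDD<Data> process(JavaRDD<Data> rdd) {
        // timestamp -> all Data records sharing that timestamp
        JavaPairRDD<Long, Iterable<Data>> byTimestamp = rdd.groupBy(d -> d.getTimestamp());
        System.out.println("distinct timestamps in this batch: " + byTimestamp.count());

        // keep only records whose key1 is value1 or value3
        return rdd.filter(d -> {
            Map<String, String> m = d.getMap();
            return "value1".equals(m.get("key1")) || "value3".equals(m.get("key1"));
        });
    }
}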

Thanks,
LCassa


Re: Rule Engine for Spark

2015-11-04 Thread Cassa L
Thanks for the reply. How about Drools? Does it work with Spark?


LCassa
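
For reference, Drools can be used from Spark as an ordinary Java library. A sketch only:
it builds one stateful KieSession per partition (sessions are not serializable), assumes
the rules (kmodule.xml plus .drl files) are on the executor classpath, and Event is a
placeholder for the fact class the rules match on. On Spark 1.x, mapPartitions expects an
Iterable return instead of an Iterator, so the last line would return the list itself.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

// Sketch: run Drools rules over each partition of an RDD.
public final class DroolsOnSpark {
    public static JavaRDD<Event> applyRules(JavaRDD<Event> events) {
        return events.mapPartitions(it -> {
            KieContainer container = KieServices.Factory.get().getKieClasspathContainer();
            KieSession session = container.newKieSession();
            List<Event> out = new ArrayList<>();
            while (it.hasNext()) {
                Event e = it.next();
                session.insert(e);      // rules may flag or enrich the fact
                session.fireAllRules();
                out.add(e);
            }
            session.dispose();
            return out.iterator();
        });
    }
}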

On Wed, Nov 4, 2015 at 3:02 AM, Adrian Tanase <atan...@adobe.com> wrote:

> Another way to do it is to extract your filters as SQL code and load it in
> a transform – which allows you to change the filters at runtime.
>
> Inside the transform you could apply the filters by goind RDD -> DF -> SQL
> -> RDD.
>
> Lastly, depending on how complex your filters are, you could skip SQL and
> create your own mini-DSL that runs inside transform. I’d definitely start
> here if the filter predicates are simple enough…
>
> -adrian
>
> From: Stefano Baghino
> Date: Wednesday, November 4, 2015 at 10:15 AM
> To: Cassa L
> Cc: user
> Subject: Re: Rule Engine for Spark
>
> Hi LCassa,
> unfortunately I don't have actual experience on this matter, however for a
> similar use case I have briefly evaluated Decision
> <https://github.com/Stratio/Decision> (then called literally Streaming
> CEP Engine) and it looked interesting. I hope it may help.
>
> On Wed, Nov 4, 2015 at 1:42 AM, Cassa L <lcas...@gmail.com> wrote:
>
>> Hi,
>>  Has anyone used rule engine with spark streaming? I have a case where
>> data is streaming from Kafka and I need to apply some rules on it (instead
>> of hard coding in a code).
>>
>> Thanks,
>> LCassa
>>
>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>


Protobuff 3.0 for Spark

2015-11-04 Thread Cassa L
Hi,
Does Spark support protobuf 3.0? I used protobuf 2.5 with Spark 1.4
built for HDP 2.3. Given that protobuf has compatibility issues, I want to
know whether Spark supports protobuf 3.0.

LCassa


Re: Rule Engine for Spark

2015-11-04 Thread Cassa L
ok. Let me try it.

Thanks,
LCassa

On Wed, Nov 4, 2015 at 4:44 PM, Cheng, Hao <hao.ch...@intel.com> wrote:

> Or try Streaming SQL? Which is a simple layer on top of the Spark
> Streaming. J
>
>
>
> https://github.com/Intel-bigdata/spark-streamingsql
>
>
>
>
>
> *From:* Cassa L [mailto:lcas...@gmail.com]
> *Sent:* Thursday, November 5, 2015 8:09 AM
> *To:* Adrian Tanase
> *Cc:* Stefano Baghino; user
> *Subject:* Re: Rule Engine for Spark
>
>
>
> Thanks for the reply. How about Drools? Does it work with Spark?
>
> LCassa
>
>
>
> On Wed, Nov 4, 2015 at 3:02 AM, Adrian Tanase <atan...@adobe.com> wrote:
>
> Another way to do it is to extract your filters as SQL code and load it in
> a transform – which allows you to change the filters at runtime.
>
>
>
> Inside the transform you could apply the filters by goind RDD -> DF -> SQL
> -> RDD.
>
>
>
> Lastly, depending on how complex your filters are, you could skip SQL and
> create your own mini-DSL that runs inside transform. I’d definitely start
> here if the filter predicates are simple enough…
>
>
>
> -adrian
>
>
>
> *From: *Stefano Baghino
> *Date: *Wednesday, November 4, 2015 at 10:15 AM
> *To: *Cassa L
> *Cc: *user
> *Subject: *Re: Rule Engine for Spark
>
>
>
> Hi LCassa,
>
> unfortunately I don't have actual experience on this matter, however for a
> similar use case I have briefly evaluated Decision
> <https://github.com/Stratio/Decision> (then called literally Streaming
> CEP Engine) and it looked interesting. I hope it may help.
>
>
>
> On Wed, Nov 4, 2015 at 1:42 AM, Cassa L <lcas...@gmail.com> wrote:
>
> Hi,
>
>  Has anyone used rule engine with spark streaming? I have a case where
> data is streaming from Kafka and I need to apply some rules on it (instead
> of hard coding in a code).
>
> Thanks,
>
> LCassa
>
>
>
>
>
> --
>
> BR,
>
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>
>
>


Rule Engine for Spark

2015-11-03 Thread Cassa L
Hi,
Has anyone used a rule engine with Spark Streaming? I have a case where data
is streaming from Kafka and I need to apply some rules to it (instead of
hard-coding them in the code).

Thanks,
LCassa


Re: How to send RDD result to REST API?

2015-08-31 Thread Cassa L
Hi Ted,
My server is expecting JSON. Can I just use an HTTP client in the Spark job
and push the result of the RDD action to the server? I'm trying to figure out how
to achieve this.

LCassa
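
For what it's worth, this is roughly what that could look like with nothing but the JDK's
HttpURLConnection, so no extra client library is needed. A sketch only: the helper class,
the endpoint URL, and the JSON shape are placeholders, and it is meant to be called on the
driver (for example inside foreachRDD, after collecting the reduced counts) in place of
print().

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;

import scala.Tuple2;

// Hypothetical helper: POST the per-batch counts to a REST endpoint as a small JSON object.
public final class RestPusher {
    public static void postCounts(List<Tuple2<String, Integer>> counts, String endpoint) throws IOException {
        StringBuilder json = new StringBuilder("{");
        for (Tuple2<String, Integer> kv : counts) {
            if (json.length() > 1) json.append(",");
            json.append("\"").append(kv._1()).append("\":").append(kv._2()); // naive escaping, fine for simple keys
        }
        json.append("}");

        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.toString().getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("REST endpoint returned HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}

Since the reduced counts are tiny, collecting them to the driver and posting from there keeps
the HTTP connection handling out of the executors.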

On Fri, Aug 28, 2015 at 9:45 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> What format does your REST server expect ?
>
> You may have seen this:
>
> https://www.paypal-engineering.com/2014/02/13/hello-newman-a-rest-client-for-scala/
>
> On Fri, Aug 28, 2015 at 9:35 PM, Cassa L <lcas...@gmail.com> wrote:
>
>> Hi,
>> If I have RDD that counts something e.g.:
>>
>> JavaPairDStream<String, Integer> successMsgCounts = successMsgs
>> .flatMap(buffer -> Arrays.asList(buffer.getType()))
>> .mapToPair(txnType -> new Tuple2<String,
>> Integer>("Success " + txnType, 1))
>> .reduceByKey((count1, count2) -> count1 + count2);
>>
>> successMsgCounts.print();
>>
>> Instead of printing this output how can I push it to REST API? I have a
>> server which needs this information to be fed via REST.
>>
>>
>> thanks
>> LCassa
>>
>
>


How to send RDD result to REST API?

2015-08-28 Thread Cassa L
Hi,
If I have RDD that counts something e.g.:

JavaPairDStream<String, Integer> successMsgCounts = successMsgs
        .flatMap(buffer -> Arrays.asList(buffer.getType()))
        .mapToPair(txnType -> new Tuple2<String, Integer>("Success " + txnType, 1))
        .reduceByKey((count1, count2) -> count1 + count2);

successMsgCounts.print();

Instead of printing this output, how can I push it to a REST API? I have a
server which needs this information fed to it via REST.


thanks
LCassa


Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cassa L
Just to confirm, is this what you are referring to? Is there an
example of how to set it up? I believe it is for the 0.8.3 version?

https://cwiki.apache.org/confluence/display/KAFKA/Multiple+Listeners+for+Kafka+Brokers


On Fri, Aug 28, 2015 at 12:52 PM, Sriharsha Chintalapani ka...@harsha.io
wrote:

 You can configure PLAINTEXT listener as well with the broker and use that
 port for spark.

 --
 Harsha


 On August 28, 2015 at 12:24:45 PM, Sourabh Chandak (sourabh3...@gmail.com)
 wrote:

 Can we use the existing kafka spark streaming jar to connect to a kafka
 server running in SSL mode?

 We are fine with non SSL consumer as our kafka cluster and spark cluster
 are in the same network


 Thanks,
 Sourabh

 On Fri, Aug 28, 2015 at 12:03 PM, Gwen Shapira g...@confluent.io wrote:
 I can't speak for the Spark Community, but checking their code,
 DirectKafkaStream and KafkaRDD use the SimpleConsumer API:


 https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala

 https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala

 On Fri, Aug 28, 2015 at 11:32 AM, Cassa L lcas...@gmail.com wrote:

  Hi I am using below Spark jars with Direct Stream API.
spark-streaming-kafka_2.10
 
  When I look at its pom.xml, Kafka libraries that its pulling in is
 groupIdorg.apache.kafka/groupId
 artifactIdkafka_${scala.binary.version}/artifactId
 version0.8.2.1/version
 
 
  I believe this DirectStream API uses SimpleConsumer API. Can someone from
  Spark community confirm too?
 
  Thanks,
  LCassa.
 
  On Fri, Aug 28, 2015 at 11:12 AM, Sriharsha Chintalapani 
 ka...@harsha.io
  wrote:
 
   SSL is supported for new producer and consumer api and old api (simple
   consumer and high-level consumer) is not supported.
   I think spark uses simple consumer? if so its not supported.
  
   Thanks,
   Harsha
  
  
   On August 28, 2015 at 11:00:30 AM, Cassa L (lcas...@gmail.com) wrote:
  
   Hi,
   I was going through SSL setup of Kafka.
  
 
 https://cwiki.apache.org/confluence/display/KAFKA/Deploying+SSL+for+Kafka
   However, I am also using Spark-Kafka streaming to read data from Kafka.
  Is
   there a way to activate SSL for spark streaming API or not possible at
   all?
  
   Thanks,
   LCassa
  
  
 




SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cassa L
Hi,
I was going through the SSL setup for Kafka:
https://cwiki.apache.org/confluence/display/KAFKA/Deploying+SSL+for+Kafka
However, I am also using Spark-Kafka streaming to read data from Kafka. Is
there a way to enable SSL for the Spark streaming API, or is it not possible at all?

Thanks,
LCassa


Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cassa L
Hi, I am using the Spark jar below with the Direct Stream API:
  spark-streaming-kafka_2.10

When I look at its pom.xml, the Kafka dependency it pulls in is:
   <groupId>org.apache.kafka</groupId>
   <artifactId>kafka_${scala.binary.version}</artifactId>
   <version>0.8.2.1</version>


I believe this Direct Stream API uses the SimpleConsumer API. Can someone from
the Spark community confirm?

Thanks,
LCassa.

On Fri, Aug 28, 2015 at 11:12 AM, Sriharsha Chintalapani ka...@harsha.io
wrote:

 SSL is supported for new producer and consumer api and old api (simple
 consumer and high-level consumer) is not supported.
 I think spark uses simple consumer? if so its not supported.

 Thanks,
 Harsha


 On August 28, 2015 at 11:00:30 AM, Cassa L (lcas...@gmail.com) wrote:

 Hi,
 I was going through SSL setup of Kafka.
 https://cwiki.apache.org/confluence/display/KAFKA/Deploying+SSL+for+Kafka
 However, I am also using Spark-Kafka streaming to read data from Kafka. Is
 there a way to activate SSL for spark streaming API or not possible at
 all?

 Thanks,
 LCassa




Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
Hi,
I am using Spark 1.4 and Kafka 0.8.2.1.
Per Google suggestions, I rebuilt all the classes against the protobuf 2.5
dependency. My new protobuf classes are compiled with 2.5. However, now my Spark
job does not start; it throws a different error. Does Spark or any of
its dependencies use the old protobuf 2.4?

Exception in thread main java.lang.VerifyError: class
com.apple.ist.retail.xcardmq.serializers.SampleProtobufMessage$ProtoBuff
overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at
com.apple.ist.retail.spark.kafka.consumer.SparkMQProcessor.start(SparkProcessor.java:68)
at
com.apple.ist.retail.spark.kafka.consumer.SparkMQConsumer.main(SparkConsumer.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


On Mon, Aug 24, 2015 at 6:53 PM, Ted Yu yuzhih...@gmail.com wrote:

 Can you show the complete stack trace ?

 Which Spark / Kafka release are you using ?

 Thanks

 On Mon, Aug 24, 2015 at 4:58 PM, Cassa L lcas...@gmail.com wrote:

 Hi,
  I am storing messages in Kafka using protobuf and reading them into
 Spark. I upgraded protobuf version from 2.4.1 to 2.5.0. I got
 java.lang.UnsupportedOperationException for older messages. However, even
 for new messages I get the same error. Spark does convert it though. I see
 my messages. How do I get rid of this error?
 java.lang.UnsupportedOperationException: This is supposed to be
 overridden by subclasses.
 at
 com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
 at
 org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$FsPermissionProto.getSerializedSize(HdfsProtos.java:5407)
 at
 com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)





Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
I downloaded the binary version of Spark below:
spark-1.4.1-bin-cdh4

On Tue, Aug 25, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote:

 Did your spark build with Hive?

 I met the same problem before because the hive-exec jar in the maven
 itself include protobuf class, which will be included in the Spark jar.

 Yong

 --
 Date: Tue, 25 Aug 2015 12:39:46 -0700
 Subject: Re: Protobuf error when streaming from Kafka
 From: lcas...@gmail.com
 To: yuzhih...@gmail.com
 CC: user@spark.apache.org


 Hi,
  I am using Spark-1.4 and Kafka-0.8.2.1
 As per google suggestions, I rebuilt all the classes with protobuff-2.5
 dependencies. My new protobuf is compiled using 2.5. However now, my spark
 job does not start. Its throwing different error. Does Spark or any other
 its dependencies uses old protobuff-2.4?

 Exception in thread main java.lang.VerifyError: class
 com.apple.ist.retail.xcardmq.serializers.SampleProtobufMessage$ProtoBuff
 overrides final method
 getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
 at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
 com.apple.ist.retail.spark.kafka.consumer.SparkMQProcessor.start(SparkProcessor.java:68)
 at
 com.apple.ist.retail.spark.kafka.consumer.SparkMQConsumer.main(SparkConsumer.java:43)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
 at
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


 On Mon, Aug 24, 2015 at 6:53 PM, Ted Yu yuzhih...@gmail.com wrote:

 Can you show the complete stack trace ?

 Which Spark / Kafka release are you using ?

 Thanks

 On Mon, Aug 24, 2015 at 4:58 PM, Cassa L lcas...@gmail.com wrote:

 Hi,
  I am storing messages in Kafka using protobuf and reading them into
 Spark. I upgraded protobuf version from 2.4.1 to 2.5.0. I got
 java.lang.UnsupportedOperationException for older messages. However, even
 for new messages I get the same error. Spark does convert it though. I see
 my messages. How do I get rid of this error?
 java.lang.UnsupportedOperationException: This is supposed to be overridden
 by subclasses.
 at
 com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
 at
 org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$FsPermissionProto.getSerializedSize(HdfsProtos.java:5407)
 at
 com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)






Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
Do you think this binary could have that issue? Do I need to build Spark from
source?

On Tue, Aug 25, 2015 at 1:06 PM, Cassa L lcas...@gmail.com wrote:

 I downloaded below binary version of spark.
 spark-1.4.1-bin-cdh4

 On Tue, Aug 25, 2015 at 1:03 PM, java8964 java8...@hotmail.com wrote:

 Did your spark build with Hive?

 I met the same problem before because the hive-exec jar in the maven
 itself include protobuf class, which will be included in the Spark jar.

 Yong

 --
 Date: Tue, 25 Aug 2015 12:39:46 -0700
 Subject: Re: Protobuf error when streaming from Kafka
 From: lcas...@gmail.com
 To: yuzhih...@gmail.com
 CC: user@spark.apache.org


 Hi,
  I am using Spark-1.4 and Kafka-0.8.2.1
 As per google suggestions, I rebuilt all the classes with protobuff-2.5
 dependencies. My new protobuf is compiled using 2.5. However now, my spark
 job does not start. Its throwing different error. Does Spark or any other
 its dependencies uses old protobuff-2.4?

 Exception in thread main java.lang.VerifyError: class
 com.apple.ist.retail.xcardmq.serializers.SampleProtobufMessage$ProtoBuff
 overrides final method
 getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
 at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
 com.apple.ist.retail.spark.kafka.consumer.SparkMQProcessor.start(SparkProcessor.java:68)
 at
 com.apple.ist.retail.spark.kafka.consumer.SparkMQConsumer.main(SparkConsumer.java:43)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
 at
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


 On Mon, Aug 24, 2015 at 6:53 PM, Ted Yu yuzhih...@gmail.com wrote:

 Can you show the complete stack trace ?

 Which Spark / Kafka release are you using ?

 Thanks

 On Mon, Aug 24, 2015 at 4:58 PM, Cassa L lcas...@gmail.com wrote:

 Hi,
  I am storing messages in Kafka using protobuf and reading them into
 Spark. I upgraded protobuf version from 2.4.1 to 2.5.0. I got
 java.lang.UnsupportedOperationException for older messages. However, even
 for new messages I get the same error. Spark does convert it though. I see
 my messages. How do I get rid of this error?
 java.lang.UnsupportedOperationException: This is supposed to be
 overridden by subclasses.
 at
 com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
 at
 org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$FsPermissionProto.getSerializedSize(HdfsProtos.java:5407)
 at
 com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)







Protobuf error when streaming from Kafka

2015-08-24 Thread Cassa L
Hi,
I am storing messages in Kafka using protobuf and reading them into Spark.
I upgraded the protobuf version from 2.4.1 to 2.5.0 and I get
java.lang.UnsupportedOperationException for older messages. However, even
for new messages I get the same error. Spark does convert them, though; I see
my messages. How do I get rid of this error?
java.lang.UnsupportedOperationException: This is supposed to be overridden
by subclasses.
at
com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
at
org.apache.hadoop.hdfs.protocol.proto.HdfsProtos$FsPermissionProto.getSerializedSize(HdfsProtos.java:5407)
at
com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)