Re: Spark/HIVE Insert Into values Error

2014-10-26 Thread arthur.hk.c...@gmail.com
Hi,

I have already found a way to do “INSERT INTO HIVE_TABLE VALUES (…..)”.

Regards
Arthur

On 18 Oct, 2014, at 10:09 pm, Cheng Lian lian.cs@gmail.com wrote:

 Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT INTO 
 ... VALUES ... syntax.
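 
 A common workaround on Hive 0.12 is to route the literals through a SELECT
 against an existing table instead of VALUES. A minimal sketch from Spark SQL,
 assuming a hypothetical one-row helper table named dummy (names are illustrative):
 
 import org.apache.spark.SparkContext
 import org.apache.spark.sql.hive.HiveContext
 
 val sc = new SparkContext("local[2]", "insert-workaround")
 val hiveContext = new HiveContext(sc)
 // Hive 0.12 rejects INSERT INTO ... VALUES but accepts INSERT INTO ... SELECT,
 // so the constants are selected from any table that has at least one row.
 hiveContext.sql(
   "INSERT INTO TABLE students SELECT 'fred flintstone', 35, 1 FROM dummy LIMIT 1")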
 
 On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote:
 Hi,
 
 When trying to insert records into HIVE, I got error,
 
 My Spark is 1.1.0 and Hive 0.12.0
 
 Any idea what would be wrong?
 Regards
 Arthur
 
 
 
 hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);
 OK
 
 hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
 NoViableAltException(26@[])
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
  at 
 org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
  at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
  at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
 FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '(' 
 ''fred flintstone'' in select clause
 
 
 



Create table error from Hive in spark-assembly-1.0.2.jar

2014-10-26 Thread Jacob Chacko - Catalyst Consulting
Hi All

We are trying to create a table in Hive from spark-assembly-1.0.2.jar file.


CREATE TABLE IF NOT EXISTS src (key INT, value STRING)

  JavaSparkContext sc = CC2SparkManager.sharedInstance().getSparkContext();
  JavaHiveContext sqlContext = new JavaHiveContext(sc);
  sqlContext.sql("CREATE TABLE srcd (key INT, value STRING)");

On Spark 1.1 we get the following error:

FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to 
instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

On Spark 1.0.2 we get the following error:

 failure: ``UNCACHE'' expected but identifier CREATE found


Could you help us understand why we get these errors?

Thanks
Jacob



Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-26 Thread Marius Soutier
I tried that already, same exception. I also tried using an accumulator to 
collect all filenames. The filename is not the problem.

Even this crashes with the same exception:

sc.parallelize(files.value).map { fileName =>
  println(s"Scanning $fileName")
  try {
    println(s"Scanning $fileName")
    sc.textFile(fileName).take(1)
    s"Successfully scanned $fileName"
  } catch {
    case t: Throwable => s"Failed to process $fileName, reason ${t.getStackTrace.head}"
  }
}
.saveAsTextFile(output)

The output file contains “Failed to process …” for each file.
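
Note that the SparkContext itself can only be used on the driver, never inside a
closure that runs on the executors, which is the likely culprit here. A minimal
sketch that runs the probe from the driver instead, assuming files.value is a
plain local Seq[String] of paths (names follow the snippet above):

val report = files.value.map { fileName =>
  try {
    // Each probe is its own small Spark job, driven from the driver.
    sc.textFile(fileName).take(1)
    s"Successfully scanned $fileName"
  } catch {
    case t: Throwable =>
      s"Failed to process $fileName, reason ${t.getStackTrace.head}"
  }
}
sc.parallelize(report).saveAsTextFile(output)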


On 26.10.2014, at 00:10, Buttler, David buttl...@llnl.gov wrote:

 This sounds like expected behavior to me.  The foreach call should be 
 distributed on the workers.  Perhaps you want to use map instead, and then 
 collect the failed file names locally, or save the whole thing out to a file.
 
 From: Marius Soutier [mps@gmail.com]
 Sent: Friday, October 24, 2014 6:35 AM
 To: user@spark.apache.org
 Subject: scala.collection.mutable.ArrayOps$ofRef$.length$extension since 
 Spark 1.1.0
 
 Hi,
 
 I’m running a job whose simple task it is to find files that cannot be read 
 (sometimes our gz files are corrupted).
 
 With 1.0.x, this worked perfectly. Since 1.1.0 however, I’m getting an 
 exception: 
 scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
 
 sc.wholeTextFiles(input)
   .foreach { case (fileName, _) =>
     try {
       println(s"Scanning $fileName")
       sc.textFile(fileName).take(1)
       println(s"Successfully scanned $fileName")
     } catch {
       case t: Throwable => println(s"Failed to process $fileName, reason ${t.getStackTrace.head}")
     }
   }
 
 
 Also since 1.1.0, the printlns are no longer visible on the console, only in 
 the Spark UI worker output.
 
 
 Thanks for any help
 - Marius
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Implement Count by Minute in Spark Streaming

2014-10-26 Thread Ji ZHANG
Hi,

Suppose I have a stream of logs and I want to count them by minute.
The result is like:

2014-10-26 18:38:00 100
2014-10-26 18:39:00 150
2014-10-26 18:40:00 200

One way to do this is to set the batch interval to 1 min, but each
batch would be quite large.

Or I can use updateStateByKey where key is like '2014-10-26 18:38:00',
but I have two questions:

1. How to persist the result to MySQL? Do I need to flush them every batch?
2. How to delete the old state? For example, now is 18:50 but the
18:40's state is still in Spark. One solution is to set the key's
state to None when there's no data for this key in the batch. But what
if the log volume is low and some batches get zero logs for a key? For
instance:

18:40:00~18:40:10 has 10 logs - key 18:40's value is set to 10
18:40:10~18:40:20 has no log - key 18:40 is deleted
18:40:20~18:40:30 has 5 logs - key 18:40's value is set to 5

You can see the result is wrong. Maybe I can use an 'update' approach
when flushing, i.e. check MySQL whether there's already an entry of
18:40 and add the result to that. But how about a unique count? I
can't store all unique values in MySQL per se.

So I'm looking for a better way to store count-by-minute result into
rdbms (or nosql?). Any idea would be appreciated.

Thanks.

-- 
Jerry

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Bug in Accumulators...

2014-10-26 Thread octavian.ganea
Sorry, I forgot to say that this gives the above error just when run on a
cluster, not in local mode.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p17277.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark as Relational Database

2014-10-26 Thread Peter Wolf
My understanding is the SparkSQL allows one to access Spark data as if it
were stored in a relational database.  It compiles SQL queries into a
series of calls to the Spark API.

I need the performance of a SQL database, but I don't care about doing
queries with SQL.

I create the input to MLib by doing a massive JOIN query.  So, I am
creating a single collection by combining many collections.  This sort of
operation is very inefficient in Mongo, Cassandra or HDFS.

I could store my data in a relational database, and copy the query results
to Spark for processing.  However, I was hoping I could keep everything in
Spark.

On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 1. What data store do you want to store your data in ? HDFS, HBase,
 Cassandra, S3 or something else?
 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?

 One option is to process the data in Spark and then store it in the
 relational database of your choice.




 On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote:

 Hello all,

 We are considering Spark for our organization.  It is obviously a superb
 platform for processing massive amounts of data... how about retrieving it?

 We are currently storing our data in a relational database in a star
 schema.  Retrieving our data requires doing many complicated joins across
 many tables.

 Can we use Spark as a relational database?  Or, if not, can we put Spark
 on top of a relational database?

 Note that we don't care about SQL.  Accessing our data via standard
 queries is nice, but we are equally happy (or more happy) to write Scala
 code.

 What is important to us is doing relational queries on huge amounts of
 data.  Is Spark good at this?

 Thank you very much in advance
 Peter





Re: Implement Count by Minute in Spark Streaming

2014-10-26 Thread Asit Parija
Hi,
  You can use Redis to store the key and its count, doing an update whenever you
receive that minute's key; being an in-memory database, it would be faster than
SQL. You can do an update at the end of each batch to increment the count of the
key if it exists, or create it in case of a new entry.
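
A minimal sketch of the same per-batch update idea against MySQL, assuming a
table counts(minute VARCHAR PRIMARY KEY, cnt BIGINT) and a minuteOf(log) helper
that truncates a log's timestamp to the minute (all names here are illustrative):

logStream                                  // DStream[String] of raw log lines
  .map(log => (minuteOf(log), 1L))         // key by "yyyy-MM-dd HH:mm:00"
  .reduceByKey(_ + _)                      // partial counts for this batch only
  .foreachRDD { rdd =>
    rdd.foreachPartition { part =>
      // One connection per partition, created on the executor.
      val conn = java.sql.DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass")
      val stmt = conn.prepareStatement(
        "INSERT INTO counts (minute, cnt) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)")
      part.foreach { case (minute, cnt) =>
        stmt.setString(1, minute)
        stmt.setLong(2, cnt)
        stmt.executeUpdate()
      }
      conn.close()
    }
  }

Because each minute's row accumulates across batches, sparse or empty batches do
not clobber counts written earlier, and no per-key state has to be kept in Spark.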

Sent from my iPhone

 On Oct 26, 2014, at 4:33 PM, Ji ZHANG zhangj...@gmail.com wrote:
 
 Hi,
 
 Suppose I have a stream of logs and I want to count them by minute.
 The result is like:
 
 2014-10-26 18:38:00 100
 2014-10-26 18:39:00 150
 2014-10-26 18:40:00 200
 
 One way to do this is to set the batch interval to 1 min, but each
 batch would be quite large.
 
 Or I can use updateStateByKey where key is like '2014-10-26 18:38:00',
 but I have two questions:
 
 1. How to persist the result to MySQL? Do I need to flush them every batch?
 2. How to delete the old state? For example, now is 18:50 but the
 18:40's state is still in Spark. One solution is to set the key's
 state to None when there's no data for this key in the batch. But what
 if the log volume is low and some batches get zero logs for a key? For
 instance:
 
 18:40:00~18:40:10 has 10 logs - key 18:40's value is set to 10
 18:40:10~18:40:20 has no log - key 18:40 is deleted
 18:40:20~18:40:30 has 5 logs - key 18:40's value is set to 5
 
 You can see the result is wrong. Maybe I can use an 'update' approach
 when flushing, i.e. check MySQL whether there's already an entry of
 18:40 and add the result to that. But how about a unique count? I
 can't store all unique values in MySQL per se.
 
 So I'm looking for a better way to store count-by-minute result into
 rdbms (or nosql?). Any idea would be appreciated.
 
 Thanks.
 
 -- 
 Jerry
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark as Relational Database

2014-10-26 Thread Rick Richardson
Spark's API definitely covers all of the things that a relational database
can do. It will probably outperform a relational star schema if all of your
*working* data set can fit into RAM on your cluster. It will still perform
quite well if most of the data fits and some has to spill over to disk.

What are your requirements exactly?
What is massive amounts of data exactly?
How big is your cluster?

Note that Spark is not for data storage, only data analysis. It pulls data
into working data sets called RDD's.

As a migration path, you could probably pull the data out of a relational
database to analyze. But in the long run, I would recommend using a more
purpose built, huge storage database such as Cassandra. If your data is
very static, you could also just store it in files.
 On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote:

 My understanding is the SparkSQL allows one to access Spark data as if it
 were stored in a relational database.  It compiles SQL queries into a
 series of calls to the Spark API.

 I need the performance of a SQL database, but I don't care about doing
 queries with SQL.

 I create the input to MLib by doing a massive JOIN query.  So, I am
 creating a single collection by combining many collections.  This sort of
 operation is very inefficient in Mongo, Cassandra or HDFS.

 I could store my data in a relational database, and copy the query results
 to Spark for processing.  However, I was hoping I could keep everything in
 Spark.

 On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com
  wrote:

 1. What data store do you want to store your data in ? HDFS, HBase,
 Cassandra, S3 or something else?
 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?

 One option is to process the data in Spark and then store it in the
 relational database of your choice.




 On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote:

 Hello all,

 We are considering Spark for our organization.  It is obviously a superb
 platform for processing massive amounts of data... how about retrieving it?

 We are currently storing our data in a relational database in a star
 schema.  Retrieving our data requires doing many complicated joins across
 many tables.

 Can we use Spark as a relational database?  Or, if not, can we put Spark
 on top of a relational database?

 Note that we don't care about SQL.  Accessing our data via standard
 queries is nice, but we are equally happy (or more happy) to write Scala
 code.

 What is important to us is doing relational queries on huge amounts of
 data.  Is Spark good at this?

 Thank you very much in advance
 Peter






Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-26 Thread Akhil Das
Just tried the below code and it works for me; not sure why the SparkContext is
being sent inside the mapPartitions function in your case. Can you try with a
simple map() instead of mapPartitions?

val ac = sc.accumulator(0)
val or = sc.parallelize(1 to 1)
val ps = or.map(x => (x, x + 2)).map(x => ac += 1)
val test = ps.collect()
println(ac.value)




Thanks
Best Regards

On Sun, Oct 26, 2014 at 1:44 AM, octavian.ganea octavian.ga...@inf.ethz.ch
wrote:

 Hi all,

 I tried to use accumulators without any success so far.

 My code is simple:

   val sc = new SparkContext(conf)
   val accum = sc.accumulator(0)
   val partialStats = sc.textFile(f.getAbsolutePath())
     .map(line => { val key = line.split("\t").head; (key, line) })
     .groupByKey(128)
     .mapPartitions { iter => accum += 1; foo(iter) }
     .reduce(_ + _)
   println(accum.value)

 Now, if I remove the 'accum += 1', everything works fine. If I keep it, I
 get this weird error:

 Exception in thread main 14/10/25 21:58:56 INFO TaskSchedulerImpl:
 Cancelling stage 0
 org.apache.spark.SparkException: Job aborted due to stage failure: Task not
 serializable: java.io.NotSerializableException:
 org.apache.spark.SparkContext
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:890)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:887)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:887)
 at scala.Option.foreach(Option.scala:236)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:887)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:886)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at

 org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:886)
 at

 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at

 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at

 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at

 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Can someone please help!

 Thank you!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Accumulators-Task-not-serializable-java-io-NotSerializableException-org-apache-spark-SparkContext-tp17262.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark as Relational Database

2014-10-26 Thread Soumya Simanta
@Peter - as Rick said - Spark's main usage is data analysis and not
storage.

Spark allows you to plugin different storage layers based on your use cases
and quality attribute requirements. So in essence if your relational
database is meeting your storage requirements you should think about how to
use that with Spark. Because even if you decide not to use your
relational database you will have to select some storage layer, most likely
a distributed storage layer.

Another option to think about is: can you possibly restructure your data
schema so that you don't have to do such a large number of joins? If this
is an option then you can potentially think about using stores such as
Cassandra, HBase, HDFS etc.

Spark really excels at processing large volumes of data really fast (given
enough memory) on horizontally scalable commodity hardware. As Rick pointed
out - It will probably outperform a relational star schema if all of your
*working* data set can fit into RAM on your cluster. However, if your data
size is much larger than your cluster memory you don't have a choice but to
select a datastore.

HTH

-Soumya








On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson rick.richard...@gmail.com
 wrote:

 Spark's API definitely covers all of the things that a relational database
 can do. It will probably outperform a relational star schema if all of your
 *working* data set can fit into RAM on your cluster. It will still perform
 quite well if most of the data fits and some has to spill over to disk.

 What are your requirements exactly?
 What is massive amounts of data exactly?
 How big is your cluster?

 Note that Spark is not for data storage, only data analysis. It pulls data
 into working data sets called RDD's.

 As a migration path, you could probably pull the data out of a relational
 database to analyze. But in the long run, I would recommend using a more
 purpose built, huge storage database such as Cassandra. If your data is
 very static, you could also just store it in files.
  On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote:

 My understanding is the SparkSQL allows one to access Spark data as if it
 were stored in a relational database.  It compiles SQL queries into a
 series of calls to the Spark API.

 I need the performance of a SQL database, but I don't care about doing
 queries with SQL.

 I create the input to MLib by doing a massive JOIN query.  So, I am
 creating a single collection by combining many collections.  This sort of
 operation is very inefficient in Mongo, Cassandra or HDFS.

 I could store my data in a relational database, and copy the query
 results to Spark for processing.  However, I was hoping I could keep
 everything in Spark.

 On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta 
 soumya.sima...@gmail.com wrote:

 1. What data store do you want to store your data in ? HDFS, HBase,
 Cassandra, S3 or something else?
 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?

 One option is to process the data in Spark and then store it in the
 relational database of your choice.




 On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote:

 Hello all,

 We are considering Spark for our organization.  It is obviously a
 superb platform for processing massive amounts of data... how about
 retrieving it?

 We are currently storing our data in a relational database in a star
 schema.  Retrieving our data requires doing many complicated joins across
 many tables.

 Can we use Spark as a relational database?  Or, if not, can we put
 Spark on top of a relational database?

 Note that we don't care about SQL.  Accessing our data via standard
 queries is nice, but we are equally happy (or more happy) to write Scala
 code.

 What is important to us is doing relational queries on huge amounts of
 data.  Is Spark good at this?

 Thank you very much in advance
 Peter






Re: Spark as Relational Database

2014-10-26 Thread Helena Edelson
Hi,
It is very easy to integrate Cassandra in a use case such as this. For instance,
do your joins in Spark and your data storage in Cassandra, which allows a very
flexible schema (unlike a relational DB), is fast and fault tolerant, and, with
Spark colocated for data locality, faster still.
If you use the Spark Cassandra Connector, reading and writing to Cassandra is 
as simple as:

write - DStream or RDD
stream.map(RawData(_)).saveToCassandra("keyspace", "table")

read - SparkContext or StreamingContext
ssc.cassandraTable[Double]("keyspace", "dailytable")
  .select("precipitation")
  .where("weather_station = ? AND year = ?", wsid, year)
  .map(doWork)

In your build:
"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-alpha4"  // our 1.1.0 is for Spark 1.1
 
https://github.com/datastax/spark-cassandra-connector
docs: https://github.com/datastax/spark-cassandra-connector/tree/master/doc

- Helena
twitter: @helenaedelson

On Oct 26, 2014, at 10:05 AM, Rick Richardson rick.richard...@gmail.com wrote:

 Spark's API definitely covers all of the things that a relational database 
 can do. It will probably outperform a relational star schema if all of your 
 *working* data set can fit into RAM on your cluster. It will still perform 
 quite well if most of the data fits and some has to spill over to disk.
 
 What are your requirements exactly? 
 What is massive amounts of data exactly?
 How big is your cluster?
 
 Note that Spark is not for data storage, only data analysis. It pulls data 
 into working data sets called RDD's.
 
 As a migration path, you could probably pull the data out of a relational 
 database to analyze. But in the long run, I would recommend using a more 
 purpose built, huge storage database such as Cassandra. If your data is very 
 static, you could also just store it in files. 
 On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote:
 My understanding is the SparkSQL allows one to access Spark data as if it 
 were stored in a relational database.  It compiles SQL queries into a series 
 of calls to the Spark API.
 
 I need the performance of a SQL database, but I don't care about doing 
 queries with SQL.
 
 I create the input to MLib by doing a massive JOIN query.  So, I am creating 
 a single collection by combining many collections.  This sort of operation is 
 very inefficient in Mongo, Cassandra or HDFS.
 
 I could store my data in a relational database, and copy the query results to 
 Spark for processing.  However, I was hoping I could keep everything in Spark.
 
 On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com 
 wrote:
 1. What data store do you want to store your data in ? HDFS, HBase, 
 Cassandra, S3 or something else? 
 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? 
 
 One option is to process the data in Spark and then store it in the 
 relational database of your choice.
 
 
 
 
 On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote:
 Hello all,
 
 We are considering Spark for our organization.  It is obviously a superb 
 platform for processing massive amounts of data... how about retrieving it?
 
 We are currently storing our data in a relational database in a star schema.  
 Retrieving our data requires doing many complicated joins across many tables.
 
 Can we use Spark as a relational database?  Or, if not, can we put Spark on 
 top of a relational database?
 
 Note that we don't care about SQL.  Accessing our data via standard queries 
 is nice, but we are equally happy (or more happy) to write Scala code. 
 
 What is important to us is doing relational queries on huge amounts of data.  
 Is Spark good at this?
 
 Thank you very much in advance
 Peter
 
 



what classes are needed to register in KryoRegistrator, e.g. Row?

2014-10-26 Thread Fengyun RAO
In Tuning Spark https://spark.apache.org/docs/latest/tuning.html, it says,
Spark automatically includes Kryo serializers for the *many commonly-used
core Scala classes* covered in the AllScalaRegistrar from the Twitter chill
https://github.com/twitter/chill library.

I looked into the AllScalaRegistrar Javadoc, it only says,

/** Registers all the scala (and java) serializers we have */

It seems to register only all scala and java primitive and collection
classes, is that right?

What about classes in Spark, do we need to register them ourselves,
especially Row, GenericRow, MutableRow in Spark SQL?
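
For reference, a minimal registrator sketch; which classes to register depends on
what actually gets shuffled or cached in your job, and the Spark SQL row class
below is an assumption rather than a requirement:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

case class Record(id: Int, name: String)   // example application class

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Application classes that cross the wire:
    kryo.register(classOf[Record])
    // Spark SQL rows, if they appear in your shuffles (assumption):
    kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.GenericRow])
  }
}

// Enable with spark.serializer=org.apache.spark.serializer.KryoSerializer and
// spark.kryo.registrator set to the fully qualified name of MyRegistrator.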


Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-26 Thread octavian.ganea
Hi Akhil,

Please see this related message. 
http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-td17263.html

I am curious if this works for you also.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Accumulators-Task-not-serializable-java-io-NotSerializableException-org-apache-spark-SparkContext-tp17262p17287.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How do you use the thrift-server to get data from a Spark program?

2014-10-26 Thread Edward Sargisson
Hi all,
This feels like a dumb question but bespeaks my lack of understanding: what
is the Spark thrift-server for? Especially if there's an existing Hive
installation.

Background:
We want to use Spark to do some processing starting from files (in probably
MapRFS). We want to be able to read the result using SQL so that we can
report the results using Eclipse BIRT.

My confusion:
Spark 1.1 includes a thrift-server for accessing data via JDBC. However, I
don't understand how to make data available in it from the rest of Spark.

I have a small program that does what I want in spark-shell. It reads some
JSON, does some manipulation using SchemaRDDs and then has the data ready.
If I've started the shell with the hive-site.xml pointing to a Hive
installation I can use SchemaRDD.saveToTable to put it into Hive - and then
I can use beeline to read it.

But that's using the *Hive* thrift-server and not the Spark thrift-server.
That doesn't seem to be the intention of having a separate thrift-server in
Spark. Before I started on this I assumed that you could run a Spark
program (in, say, Java) and then make those results accessible for the JDBC
interface.

So, please, fill me in. What am I missing?

Many thanks,
Edward


Spark optimization

2014-10-26 Thread Morbious
I wonder if there is any tool to tweak Spark (worker and master).
I have 6 workers (192 GB RAM, 32 CPU cores each) with 2 masters and see
only a small difference between Hadoop MapReduce and Spark.
I've tested word count on a 50 GB file. During the tests Spark hung on 2 nodes
for a few minutes with this message:

14/10/26 21:38:52 INFO scheduler.DAGScheduler: Submitting 2 missing tasks
from Stage 0 (MappedRDD[8] at saveAsTextFile at
NativeMethodAccessorImpl.java:-2)
14/10/26 21:38:52 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with
2 tasks
14/10/26 21:38:52 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 0 to sp...@spark-s2.test.org:41437
14/10/26 21:38:52 INFO spark.MapOutputTrackerMaster: Size of output statuses
for shuffle 0 is 5942 bytes
14/10/26 21:38:52 INFO spark.MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 0 to sp...@spark-s4.test.org:34546

Best regards,

Morbious
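
For what it's worth, the "2 missing tasks" in that stage suggests the final save
is running with very little parallelism. A minimal word-count sketch that pins
the shuffle parallelism explicitly (the paths and the 384-partition figure are
only illustrative, roughly 2x the total cores):

val lines = sc.textFile("hdfs:///data/input-50g", 384)
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _, 384)   // map-side combine, then 384 reduce tasks
counts.saveAsTextFile("hdfs:///data/wordcount-output")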



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-optimization-tp17290.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How do you use the thrift-server to get data from a Spark program?

2014-10-26 Thread Michael Armbrust
This is very experimental and mostly unsupported, but you can start the
JDBC server from within your own programs
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45
by
passing it the HiveContext.
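
A minimal sketch of that, assuming a build recent enough to expose
HiveThriftServer2.startWithContext (table and path names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("jdbc-export"))
val hiveContext = new HiveContext(sc)

// Build whatever SchemaRDDs you want to expose and register them as tables.
val events = hiveContext.jsonFile("hdfs:///data/events.json")
events.registerTempTable("events")

// Start the Thrift/JDBC server inside this application; beeline or BIRT can then
// connect and query the registered tables for as long as the driver stays up.
HiveThriftServer2.startWithContext(hiveContext)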

On Sun, Oct 26, 2014 at 12:16 PM, Edward Sargisson ejsa...@gmail.com
wrote:

 Hi all,
 This feels like a dumb question but bespeaks my lack of understanding:
 what is the Spark thrift-server for? Especially if there's an existing Hive
 installation.

 Background:
 We want to use Spark to do some processing starting from files (in
 probably MapRFS). We want to be able to read the result using SQL so that
 we can report the results using Eclipse BIRT.

 My confusion:
 Spark 1.1 includes a thrift-server for accessing data via JDBC. However, I
 don't understand how to make data available in it from the rest of Spark.

 I have a small program that does what I want in spark-shell. It reads some
 JSON, does some manipulation using SchemaRDDs and then has the data ready.
 If I've started the shell with the hive-site.xml pointing to a Hive
 installation I can use SchemaRDD.saveToTable to put it into Hive - and then
 I can use beeline to read it.

 But that's using the *Hive* thrift-server and not the Spark thrift-server.
 That doesn't seem to be the intention of having a separate thrift-server in
 Spark. Before I started on this I assumed that you could run a Spark
 program (in, say, Java) and then make those results accessible for the JDBC
 interface.

 So, please, fill me in. What am I missing?

 Many thanks,
 Edward



Spark SQL configuration

2014-10-26 Thread Pagliari, Roberto
I'm a newbie with Spark. After installing it on all the machines I want to use,
do I need to tell it about the Hadoop configuration, or will it be able to find
it by itself?

Thank you,



Re: Spark as Relational Database

2014-10-26 Thread Rick Richardson
I agree with Soumya. A relational database is usually the worst kind of
database to receive a constant event stream.

That said, the best solution is one that already works :)

If your system is meeting your needs, then great.  When you get so many
events that your db can't keep up, I'd look into Cassandra to receive the
events, and spark to analyze them.
On Oct 26, 2014 9:14 PM, Soumya Simanta soumya.sima...@gmail.com wrote:

 Given that you are storing event data (which is basically things that have
 happened in the past AND cannot be modified) you should definitely look at
 Event sourcing.
 http://martinfowler.com/eaaDev/EventSourcing.html

 If all you are doing is storing events then I don't think you need a
 relational database. Rather an event log is ideal. Please see -
 http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

 There are many other datastores that can do a better job at storing your
 events. You can process your data and then store them in a relational
 database to query later.





 On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf opus...@gmail.com wrote:

 Thanks for all the useful responses.

 We have the usual task of mining a stream of events coming from our many
 users.  We need to store these events, and process them.  We use a standard
 Star Schema to represent our data.

 For the moment, it looks like we should store these events in SQL.  When
 appropriate, we will do analysis with relational queries.  Or, when
 appropriate we will extract data into working sets in Spark.

 I imagine this is a pretty common use case for Spark.

 On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson 
 rick.richard...@gmail.com wrote:

 Spark's API definitely covers all of the things that a relational
 database can do. It will probably outperform a relational star schema if
 all of your *working* data set can fit into RAM on your cluster. It will
 still perform quite well if most of the data fits and some has to spill
 over to disk.

 What are your requirements exactly?
 What is massive amounts of data exactly?
 How big is your cluster?

 Note that Spark is not for data storage, only data analysis. It pulls
 data into working data sets called RDD's.

 As a migration path, you could probably pull the data out of a
 relational database to analyze. But in the long run, I would recommend
 using a more purpose built, huge storage database such as Cassandra. If
 your data is very static, you could also just store it in files.
  On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote:

 My understanding is the SparkSQL allows one to access Spark data as if
 it were stored in a relational database.  It compiles SQL queries into a
 series of calls to the Spark API.

 I need the performance of a SQL database, but I don't care about doing
 queries with SQL.

 I create the input to MLib by doing a massive JOIN query.  So, I am
 creating a single collection by combining many collections.  This sort of
 operation is very inefficient in Mongo, Cassandra or HDFS.

 I could store my data in a relational database, and copy the query
 results to Spark for processing.  However, I was hoping I could keep
 everything in Spark.

 On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta 
 soumya.sima...@gmail.com wrote:

 1. What data store do you want to store your data in ? HDFS, HBase,
 Cassandra, S3 or something else?
 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?

 One option is to process the data in Spark and then store it in the
 relational database of your choice.




 On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com
 wrote:

 Hello all,

 We are considering Spark for our organization.  It is obviously a
 superb platform for processing massive amounts of data... how about
 retrieving it?

 We are currently storing our data in a relational database in a star
 schema.  Retrieving our data requires doing many complicated joins across
 many tables.

 Can we use Spark as a relational database?  Or, if not, can we put
 Spark on top of a relational database?

 Note that we don't care about SQL.  Accessing our data via standard
 queries is nice, but we are equally happy (or more happy) to write Scala
 code.

 What is important to us is doing relational queries on huge amounts
 of data.  Is Spark good at this?

 Thank you very much in advance
 Peter








Re: Spark LIBLINEAR

2014-10-26 Thread Chih-Jen Lin
Debasish Das writes:
  If the SVM is not already migrated to BFGS, that's the first thing you should
  try...Basically following LBFGS Logistic Regression come up with LBFGS based
  linear SVM...
  
  About integrating TRON in mllib, David already has a version of TRON in 
  breeze
  but someone needs to validate it for linear SVM and do experiment to see if 
  it
  can improve upon LBFGS based linear SVM...Based on lib-linear papers, it 
  should
  but I don't expect substantial difference...
  
  I am validating TRON for use-cases related to this PR (but I need more 
  features
  on top of TRON):
  
  https://github.com/apache/spark/pull/2705
  

We are also working on integrating TRON into MLlib, though we haven't checked
the above work.

We are also doing a serious comparison between quasi-Newton and Newton methods,
though this will take some time.

Chih-Jen
  On Fri, Oct 24, 2014 at 2:09 PM, k.tham kevins...@gmail.com wrote:
  
  Just wondering, any update on this? Is there a plan to integrate CJ's 
  work
  with mllib? I'm asking since  SVM impl in MLLib did not give us good
  results
  and we have to resort to training our svm classifier in a serial manner 
  on
  the driver node with liblinear.
 
  Also, it looks like CJ Lin is coming to the bay area in the coming weeks
  (http://www.meetup.com/sfmachinelearning/events/208078582/) might be a 
  good
  time to connect with him.
  
  --
  View this message in context: http://
  apache-spark-user-list.1001560.n3.nabble.com/
  Spark-LIBLINEAR-tp5546p17236.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
  

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.1.0 ClassNotFoundException issue when submit with multi jars using CLUSTER MODE

2014-10-26 Thread xing_bing
HI

 

   I am using Spark 1.1.0 configured with the STANDALONE cluster manager and
CLUSTER deploy mode. I want to submit multiple jars with spark-submit using the
--jars option, but I get a ClassNotFoundException. By the way, in my code I also
use the thread context class loader to load custom classes.

 

The strange thing is that when I use CLIENT deploy mode, the exception is not
thrown.

 

Can anyone explain Spark's class loader logic, or why this happens when using
cluster mode?

 

 

 

 

14/10/24 14:18:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.lang.RuntimeException: Cannot load class: cn.cekasp.al.demo.SimpleInputFormat3
        at cn.cekasp.algorithm.util.ReflectUtil.findClass(ReflectUtil.java:12)
        at cn.cekasp.algorithm.util.ReflectUtil.newInstance(ReflectUtil.java:18)
        at cn.cekasp.algorithm.reader.JdbcSourceReader$1.call(JdbcSourceReader.java:95)
        at cn.cekasp.algorithm.reader.JdbcSourceReader$1.call(JdbcSourceReader.java:90)
        at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:923)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1167)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:904)
        at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:904)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: cn.cekasp.al.demo.SimpleInputFormat3
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at cn.cekasp.algorithm.util.ReflectUtil.findClass(ReflectUtil.java:10)
        ... 16 more
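
The trace shows a plain Class.forName(name), which resolves against the class
loader that defined ReflectUtil. Jars shipped with --jars are added at runtime to
Spark's own loader on the executor, which Spark also installs as the thread
context class loader, so a commonly suggested change is to resolve through that
loader instead. A sketch (ReflectUtil's real code isn't shown above, so this is
an outline only):

// Resolve user classes through the current thread's context class loader so
// that classes arriving via --jars are visible.
object ReflectUtilSketch {
  def findClass(name: String): Class[_] =
    Class.forName(name, true, Thread.currentThread().getContextClassLoader)

  def newInstance[T](name: String): T =
    findClass(name).newInstance().asInstanceOf[T]
}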

 

 

 



Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread guoxu1231
Any update?

I encountered same issue in my environment.

Here are my steps as usual: 
git clone https://github.com/apache/spark
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean
package

build successfully by maven.

Import into IDEA as a Maven project, click Build > Make Project.

2 compile errors are found; one of them is in DataTypeConversions.scala:





Error:scala: 
 while compiling:
/opt/Development/spark/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala
during phase: jvm
 library version: version 2.10.4
compiler version: version 2.10.4
  reconstructed args: -nobootcp -javabootclasspath : -deprecation -feature
-Xplugin:/home/shawguo/host/maven_repo/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar
-unchecked -classpath

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Stephen Boesch
Yes it is necessary to do a mvn clean when encountering this issue.
Typically you would have changed one or more of the profiles/options -
which leads to this occurring.

2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com:

 I started building Spark / running Spark tests this weekend and on maybe
 5-10 occasions have run into a compiler crash while compiling
 DataTypeConversions.scala.

 Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a
 full gist of an innocuous test command (mvn test
 -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts on
 L512
 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512
 and there’s a final stack trace at the bottom
 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671
 .

 mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue
 while compiling with each tool), but are annoying/time-consuming to do,
 obvs, and it’s happening pretty frequently for me when doing only small
 numbers of incremental compiles punctuated by e.g. checking out different
 git commits.

 Have other people seen this? This post
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html
 on this list is basically the same error, but in TestSQLContext.scala and this
 SO post
 http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m
 claims to be hitting it when trying to build in intellij.

 It seems likely to be a bug in scalac; would finding a consistent repro
 case and filing it somewhere be useful?
 ​



Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Ryan Williams
I heard from one person offline who regularly builds Spark on OSX and Linux
and they felt like they only ever saw this error on OSX; if anyone can
confirm whether they've seen it on Linux, that would be good to know.

Stephen: good to know re: profiles/options. I don't think changing them is
a necessary condition as I believe I've run into it without doing that, but
any set of steps to reproduce this would be welcome so that we could
escalate to Typesafe as appropriate.

On Sun, Oct 26, 2014 at 11:46 PM, Stephen Boesch java...@gmail.com wrote:

 Yes it is necessary to do a mvn clean when encountering this issue.
 Typically you would have changed one or more of the profiles/options -
 which leads to this occurring.

 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com:

 I started building Spark / running Spark tests this weekend and on maybe
 5-10 occasions have run into a compiler crash while compiling
 DataTypeConversions.scala.

 Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a
 full gist of an innocuous test command (mvn test
 -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts
 on L512
 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512
 and there’s a final stack trace at the bottom
 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671
 .

 mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue
 while compiling with each tool), but are annoying/time-consuming to do,
 obvs, and it’s happening pretty frequently for me when doing only small
 numbers of incremental compiles punctuated by e.g. checking out different
 git commits.

 Have other people seen this? This post
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html
 on this list is basically the same error, but in TestSQLContext.scala
 and this SO post
 http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m
 claims to be hitting it when trying to build in intellij.

 It seems likely to be a bug in scalac; would finding a consistent repro
 case and filing it somewhere be useful?
 ​





Re: Setting only master heap

2014-10-26 Thread Keith Simmons
Hi Guys,

Here's some lines from the log file before the OOM.  They don't look that
helpful, so let me know if there's anything else I should be sending.  I am
running in standalone mode.

spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError:
Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:00:36 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkMaster-akka.actor.default-dispatcher-52] shutting down ActorSystem
[sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError:
Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:Exception
in thread qtp2057079871-30 java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:00:07 WARN AbstractNioSelector: Unexpected exception in the selector
loop.
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:02:51 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkMaster-8] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError:
Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkMaster-akka.actor.default-dispatcher-38] shutting down ActorSystem
[sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError:
Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkMaster-6] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError:
Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkMaster-akka.actor.default-dispatcher-43] shutting down ActorSystem
[sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError:
Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkMaster-akka.actor.default-dispatcher-13] shutting down ActorSystem
[sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError:
Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkMaster-5] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError:
Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22
05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkMaster-akka.actor.default-dispatcher-12] shutting down ActorSystem
[sparkMaster]

On Thu, Oct 23, 2014 at 2:10 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 Hmm…

 my observation is that the master in Spark 1.1 has a higher frequency of GC……

 Also, before 1.1 I never encountered GC overtime in the Master; after upgrading
 to 1.1 I have hit it twice (we upgraded soon after the 1.1 release)….

 Best,

 --
 Nan Zhu

 On Thursday, October 23, 2014 at 1:08 PM, Andrew Or wrote:

 Yeah, as Sameer commented, there is unfortunately not an equivalent
 `SPARK_MASTER_MEMORY` that you can set. You can work around this by
 starting the master and the slaves separately with different settings of
 SPARK_DAEMON_MEMORY each time.

 AFAIK there haven't been any major changes in the standalone master in
 1.1.0, so I don't see an immediate explanation for what you're observing.
 In general the Spark master doesn't use that much memory, and even if there
 are many applications it will discard the old ones appropriately, so unless
 you have a ton (like thousands) of concurrently running applications
 connecting to it there's little likelihood for it to OOM. At least that's
 my understanding.

 -Andrew

 2014-10-22 15:51 GMT-07:00 Sameer Farooqui same...@databricks.com:

 Hi Keith,

 Would be helpful if you could post the error message.

 Are you running Spark in Standalone mode or with YARN?

 In general, the Spark Master is only used for scheduling and it should be
 fine with the default setting of 512 MB RAM.

 Is it actually the Spark Driver's memory that you intended to change?



 *++ If in Standalone mode ++*
 You're right that SPARK_DAEMON_MEMORY set the memory to allocate to the
 Spark Master, Worker and even HistoryServer daemons together.

 SPARK_WORKER_MEMORY is slightly confusing. In Standalone 

Spark SQL Exists Clause

2014-10-26 Thread agg212
Hey, I'm trying to run TPC-H Query 4 (shown below), and get the following
error:

Exception in thread main java.lang.RuntimeException: [11.25] failure:
``UNION'' expected but `select' found

It seems like Spark SQL doesn't support the exists clause. Is this true?

select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '1993-07-01'
    and o_orderdate < date '1993-10-01'
    and exists (
        select *
        from lineitem
        where l_orderkey = o_orderkey
          and l_commitdate < l_receiptdate
    )
group by
    o_orderpriority
order by
    o_orderpriority;


Thanks
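
For what it's worth, the plain SQLContext parser in 1.1 does not accept EXISTS,
and Hive only gained WHERE-clause subqueries in 0.13, so the bundled Hive 0.12
dialect needs a rewrite too. A common rewrite for this correlated EXISTS is a
LEFT SEMI JOIN run through a HiveContext (here named hiveContext); a sketch,
untested against real TPC-H data:

val q4 = hiveContext.sql("""
  SELECT o_orderpriority, count(*) AS order_count
  FROM orders
  LEFT SEMI JOIN (
    SELECT l_orderkey FROM lineitem WHERE l_commitdate < l_receiptdate
  ) late ON (late.l_orderkey = orders.o_orderkey)
  WHERE o_orderdate >= '1993-07-01' AND o_orderdate < '1993-10-01'
  GROUP BY o_orderpriority
  ORDER BY o_orderpriority
""")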



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Exists-Clause-tp17307.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Stephen Boesch
I see the errors regularly on linux under the conditions of having changed
profiles.

2014-10-26 20:49 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com:

 I heard from one person offline who regularly builds Spark on OSX and
 Linux and they felt like they only ever saw this error on OSX; if anyone
 can confirm whether they've seen it on Linux, that would be good to know.

 Stephen: good to know re: profiles/options. I don't think changing them is
 a necessary condition as I believe I've run into it without doing that, but
 any set of steps to reproduce this would be welcome so that we could
 escalate to Typesafe as appropriate.

 On Sun, Oct 26, 2014 at 11:46 PM, Stephen Boesch java...@gmail.com
 wrote:

 Yes it is necessary to do a mvn clean when encountering this issue.
 Typically you would have changed one or more of the profiles/options -
 which leads to this occurring.

 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com:

 I started building Spark / running Spark tests this weekend and on maybe
 5-10 occasions have run into a compiler crash while compiling
 DataTypeConversions.scala.

 Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a
 full gist of an innocuous test command (mvn test
 -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts
 on L512
 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512
 and there’s a final stack trace at the bottom
 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671
 .

 mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the
 issue while compiling with each tool), but are annoying/time-consuming to
 do, obvs, and it’s happening pretty frequently for me when doing only small
 numbers of incremental compiles punctuated by e.g. checking out different
 git commits.

 Have other people seen this? This post
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html
 on this list is basically the same error, but in TestSQLContext.scala
 and this SO post
 http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m
 claims to be hitting it when trying to build in intellij.

 It seems likely to be a bug in scalac; would finding a consistent repro
 case and filing it somewhere be useful?
 ​






Re: Spark as Relational Database

2014-10-26 Thread Michael Hausenblas

 Given that you are storing event data (which is basically things that have 
 happened in the past AND cannot be modified) you should definitely look at 
 Event sourcing. 
 http://martinfowler.com/eaaDev/EventSourcing.html


Agreed. In this context: a lesser known fact is that the Lambda Architecture 
is, in a nutshell, an extension of Fowler’s ES, so you might also want to check 
out:

https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark


Cheers,
Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

 On 27 Oct 2014, at 01:14, Soumya Simanta soumya.sima...@gmail.com wrote:
 
 Given that you are storing event data (which is basically things that have 
 happened in the past AND cannot be modified) you should definitely look at 
 Event sourcing. 
 http://martinfowler.com/eaaDev/EventSourcing.html
 
 If all you are doing is storing events then I don't think you need a 
 relational database. Rather an event log is ideal. Please see - 
 http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
 
 There are many other datastores that can do a better job at storing your 
 events. You can process your data and then store them in a relational 
 database to query later. 
 
  
 
 
 
 On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf opus...@gmail.com wrote:
 Thanks for all the useful responses.
 
 We have the usual task of mining a stream of events coming from our many 
 users.  We need to store these events, and process them.  We use a standard 
 Star Schema to represent our data. 
 
 For the moment, it looks like we should store these events in SQL.  When 
 appropriate, we will do analysis with relational queries.  Or, when 
 appropriate we will extract data into working sets in Spark. 
 
 I imagine this is a pretty common use case for Spark.
 
 On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson rick.richard...@gmail.com 
 wrote:
 Spark's API definitely covers all of the things that a relational database 
 can do. It will probably outperform a relational star schema if all of your 
 *working* data set can fit into RAM on your cluster. It will still perform 
 quite well if most of the data fits and some has to spill over to disk.
 
 What are your requirements exactly? 
 What is “massive amounts of data” exactly?
 How big is your cluster?
 
 Note that Spark is not for data storage, only data analysis. It pulls data 
 into working data sets called RDDs.
 
 As a migration path, you could probably pull the data out of a relational 
 database to analyze. But in the long run, I would recommend using a more 
 purpose built, huge storage database such as Cassandra. If your data is very 
 static, you could also just store it in files. 
 On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote:
 My understanding is that Spark SQL allows one to access Spark data as if it
 were stored in a relational database.  It compiles SQL queries into a series 
 of calls to the Spark API.
 
 I need the performance of a SQL database, but I don't care about doing 
 queries with SQL.
 
 I create the input to MLlib by doing a massive JOIN query.  So, I am creating
 a single collection by combining many collections.  This sort of operation is 
 very inefficient in Mongo, Cassandra or HDFS.
 
 I could store my data in a relational database, and copy the query results to 
 Spark for processing.  However, I was hoping I could keep everything in Spark.
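 To make this concrete, the kind of thing I have in mind is roughly the
 sketch below (hypothetical schema and table names, untested, written
 against the Spark 1.1 SQL API):

   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.sql.SQLContext

   // Hypothetical star-schema fragments: an event fact table and a user dimension.
   case class Event(userId: Int, productId: Int, amount: Double)
   case class User(userId: Int, country: String)

   object JoinSketch {
     def main(args: Array[String]): Unit = {
       val sc = new SparkContext(new SparkConf().setAppName("join-sketch"))
       val sqlContext = new SQLContext(sc)
       import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion (Spark 1.x)

       val events = sc.parallelize(Seq(Event(1, 10, 9.99), Event(2, 11, 5.0)))
       val users  = sc.parallelize(Seq(User(1, "BE"), User(2, "IE")))

       events.registerTempTable("events")
       users.registerTempTable("users")

       // The big JOIN expressed in SQL; Spark SQL compiles it down to RDD operations.
       val joined = sqlContext.sql(
         """SELECT e.userId, e.amount, u.country
           |FROM events e JOIN users u ON e.userId = u.userId""".stripMargin)

       joined.collect().foreach(println)
     }
   }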
 
 On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com 
 wrote:
 1. What data store do you want to store your data in ? HDFS, HBase, 
 Cassandra, S3 or something else? 
 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? 
 
 One option is to process the data in Spark and then store it in the 
 relational database of your choice.
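 A rough, untested sketch of that last step (hypothetical table name and
 connection string, plain JDBC from within Spark; the JDBC driver must be on
 the executors' classpath):

   import java.sql.DriverManager
   import org.apache.spark.rdd.RDD

   // Write processed results back to a relational database, one JDBC connection per partition.
   def saveToWarehouse(results: RDD[(String, Double)]): Unit = {
     results.foreachPartition { rows =>
       val conn = DriverManager.getConnection(
         "jdbc:postgresql://dbhost/warehouse", "user", "password")  // hypothetical DSN
       val stmt = conn.prepareStatement(
         "INSERT INTO event_totals (country, total) VALUES (?, ?)")
       rows.foreach { case (country, total) =>
         stmt.setString(1, country)
         stmt.setDouble(2, total)
         stmt.executeUpdate()
       }
       stmt.close()
       conn.close()
     }
   }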
 
 
 
 
 On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote:
 Hello all,
 
 We are considering Spark for our organization.  It is obviously a superb 
 platform for processing massive amounts of data... how about retrieving it?
 
 We are currently storing our data in a relational database in a star schema.  
 Retrieving our data requires doing many complicated joins across many tables.
 
 Can we use Spark as a relational database?  Or, if not, can we put Spark on 
 top of a relational database?
 
 Note that we don't care about SQL.  Accessing our data via standard queries 
 is nice, but we are equally happy (or more happy) to write Scala code. 
 
 What is important to us is doing relational queries on huge amounts of data.  
 Is Spark good at this?
 
 Thank you very much in advance
 Peter
 
 
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Create table error from Hive in spark-assembly-1.0.2.jar

2014-10-26 Thread Cheng, Hao
Can you paste the hive-site.xml? Most of the time when I meet this exception, it 
is because the JDBC driver for the Hive metastore is not set correctly, or the 
wrong driver classes are included in the assembly jar.

By default, the assembly jar contains derby.jar, which is the embedded Derby 
JDBC driver.
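For reference, a minimal hive-site.xml pointing at the embedded Derby metastore 
typically looks something like the following (the databaseName is just an 
example; for a shared metastore you would point these properties at MySQL or 
PostgreSQL instead):

  <configuration>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    </property>
  </configuration>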

From: Jacob Chacko - Catalyst Consulting 
[mailto:jacob.cha...@catalystconsulting.be]
Sent: Sunday, October 26, 2014 3:37 PM
To: u...@spark.incubator.apache.org; user@spark.apache.org
Subject: Create table error from Hive in spark-assembly-1.0.2.jar

Hi All

We are trying to create a table in Hive from spark-assembly-1.0.2.jar file.


CREATE TABLE IF NOT EXISTS src (key INT, value STRING)

  JavaSparkContext sc = CC2SparkManager.sharedInstance().getSparkContext();
  JavaHiveContext sqlContext = new JavaHiveContext(sc);
  sqlContext.sql("CREATE TABLE srcd (key INT, value STRING)");

On Spark 1.1 we get the following error:

FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to 
instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

On Spark 1.0.2 we get the following error:

 failure: ``UNCACHE'' expected but identifier CREATE found


Could you help understand why we get this error

Thanks
Jacob



Re: RDD to DStream

2014-10-26 Thread Jianshi Huang
I have a similar requirement. But instead of grouping it by chunkSize, I
would have the timestamp be part of the data. So the function I want has
the following signature:

  // RDD of (timestamp, value)
  def rddToDStream[T](data: RDD[(Long, T)], timeWindow: Long)(implicit ssc:
StreamingContext): DStream[T]

And DStream should respect the timestamp part. This is important for
simulation, right?
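For concreteness, here is a rough, untested sketch of what I have in mind (it
adds a ClassTag bound, assumes the distinct window keys fit in driver memory,
and replays one time window per streaming batch):

  import scala.reflect.ClassTag
  import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on Spark 1.x)
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.{StreamingContext, Time}
  import org.apache.spark.streaming.dstream.{DStream, InputDStream}

  // Replay an RDD of (timestampMillis, value) pairs as a DStream, one time window per batch.
  def rddToDStream[T: ClassTag](data: RDD[(Long, T)], timeWindow: Long)
                               (implicit ssc: StreamingContext): DStream[T] = {
    // Bucket each record by its time window and walk the buckets in chronological order.
    val bucketed = data.map { case (ts, v) => (ts / timeWindow, v) }.cache()
    val windowKeys = bucketed.keys.distinct().collect().sorted.iterator

    new InputDStream[T](ssc) {
      override def start(): Unit = {}
      override def stop(): Unit = {}
      // Emit the next window's records on each batch; None once the data is exhausted.
      override def compute(validTime: Time): Option[RDD[T]] = {
        if (windowKeys.hasNext) {
          val key = windowKeys.next()
          Some(bucketed.filter(_._1 == key).values)
        } else {
          None
        }
      }
    }
  }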

Do you have any good solution for this?

Jianshi


On Thu, Aug 7, 2014 at 9:32 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 Hey Aniket,

 Great thoughts! I understand the use case. But as you have realized
 yourself, it is not trivial to cleanly stream an RDD as a DStream. Since RDD
 operations are defined to be scan based, it is not efficient to define an RDD
 based on slices of data within a partition of another RDD using pure RDD
 transformations. What you have done is a decent, and probably the only
 feasible, solution, with its limitations.

 Also the requirements of converting a batch of data to a stream of data
 can be pretty diverse. What rate, what # of events per batch, how many
 batches, is it efficient? Hence, it is not trivial to define a good, clean
 public API for that. If anyone has any thoughts, ideas, etc. on this, you
 are more than welcome to share them.

 TD


 On Mon, Aug 4, 2014 at 12:43 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 The use case for converting an RDD into a DStream is that I want to simulate a
 stream from already persisted data for testing analytics. It is trivial
 to create an RDD from any persisted data, but not so much for a DStream.
 Therefore, my idea is to create a DStream from an RDD. For example, let's say you
 are trying to implement analytics on time series data using the Lambda
 architecture. This means you would have to implement the same analytics on
 streaming data (in streaming mode) as well as persisted data (in batch
 mode). The workflow for implementing the analytics would be to first
 implement it in batch mode using RDD operations and then simulate a stream to
 test the analytics in streaming mode. The simulated stream should produce the
 elements at a specified rate. So the solution may be to read the data into an
 RDD, split (chunk) it into multiple RDDs, each RDD holding the number of
 elements that need to be streamed per time unit, and then finally stream
 each RDD using the compute function.

 The problem with using QueueInputDStream is that it will stream data as
 per the batch duration specified in the streaming context, and one cannot
 specify a custom slide duration. Moreover, the class QueueInputDStream is
 private to the streaming package, so I can't really use it or extend it from an
 external package. Also, I could not find a good solution to split an RDD into
 equal-sized smaller RDDs that can be fed into an extended version of
 QueueInputDStream.

 Finally, here is what I came up with:

 import scala.reflect.ClassTag
 import org.apache.spark.rdd.RDD
 import org.apache.spark.streaming.{Duration, StreamingContext, Time}
 import org.apache.spark.streaming.dstream.{DStream, InputDStream}

 class RDDExtension[T: ClassTag](rdd: RDD[T]) {
   def toStream(streamingContext: StreamingContext, chunkSize: Int,
                slideDurationMilli: Option[Long] = None): DStream[T] = {
     new InputDStream[T](streamingContext) {

       // WARNING: each partition must fit in the RAM of the local machine.
       private val iterator = rdd.toLocalIterator
       private val grouped = iterator.grouped(chunkSize)

       override def start(): Unit = {}

       override def stop(): Unit = {}

       // Emit the next chunk of elements as an RDD on each batch; None ends the stream.
       override def compute(validTime: Time): Option[RDD[T]] = {
         if (grouped.hasNext) {
           Some(rdd.sparkContext.parallelize(grouped.next()))
         } else {
           None
         }
       }

       override def slideDuration = {
         slideDurationMilli.map(duration => new Duration(duration))
           .getOrElse(super.slideDuration)
       }
     }
   }
 }

 This aims to stream chunkSize elements every slideDurationMilli
 milliseconds (defaulting to the batch duration of the streaming context). It's
 still not perfect (for example, the timing is not precise), but given that this
 will only be used for testing purposes, I haven't looked for ways to optimize
 it further.

 Thanks,
 Aniket



 On 2 August 2014 04:07, Mayur Rustagi mayur.rust...@gmail.com wrote:

 Nice question :)
 Ideally you should use the queueStream interface to push RDDs into a queue,
 and then Spark Streaming can handle the rest.
 Though why are you looking to convert an RDD to a DStream? Another workaround
 folks use is to source the DStream from a folder and move files that they need
 reprocessed back into the folder; it's a hack, but much less headache.
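 A rough sketch of the queueStream route (untested; splitting into ~10 chunks
 is arbitrary):

   import scala.collection.mutable
   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.rdd.RDD
   import org.apache.spark.streaming.{Seconds, StreamingContext}

   object QueueStreamSketch {
     def main(args: Array[String]): Unit = {
       val sc  = new SparkContext(new SparkConf().setAppName("queue-stream-sketch"))
       val ssc = new StreamingContext(sc, Seconds(1))

       // Pretend this is the persisted data we want to replay as a stream.
       val historical: RDD[String] = sc.parallelize(1 to 1000).map(i => "event-" + i)

       // Split it into ~10 chunks; queueStream consumes one queued RDD per batch by default.
       val queue = new mutable.Queue[RDD[String]]()
       queue ++= historical.randomSplit(Array.fill(10)(0.1))

       val stream = ssc.queueStream(queue)
       stream.count().print()

       ssc.start()
       ssc.awaitTermination()
     }
   }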

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi



 On Fri, Aug 1, 2014 at 10:21 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 Hi everyone

 I haven't been receiving replies to my queries on the distribution
 list. I'm not upset, but I am curious to know whether my messages are
 actually going through or not. Can someone please confirm that my messages
 are getting delivered via this distribution list?

 Thanks,
 Aniket


 On 1 August