Re: Spark/HIVE Insert Into values Error
Hi, I have already found a way to work around the missing “INSERT INTO HIVE_TABLE VALUES (…..)” syntax. Regards Arthur On 18 Oct, 2014, at 10:09 pm, Cheng Lian lian.cs@gmail.com wrote: Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT INTO ... VALUES ... syntax. On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote: Hi, When trying to insert records into Hive, I got an error. My Spark is 1.1.0 and Hive is 0.12.0. Any idea what could be wrong? Regards Arthur

hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);
OK
hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
NoViableAltException(26@[])
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374)
at org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '(' ''fred flintstone'' in select clause
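For anyone hitting the same parse error: since Hive 0.12 has no INSERT INTO ... VALUES, one common workaround is to stage the rows somewhere and use INSERT ... SELECT instead. A minimal sketch through Spark's HiveContext; the staging table name and file path are made up for illustration:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Stage a single dummy row, then project constants over it.
hiveContext.sql("CREATE TABLE IF NOT EXISTS one_row (dummy STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH '/tmp/one_row.txt' OVERWRITE INTO TABLE one_row")
// INSERT INTO TABLE ... SELECT is supported by Hive 0.12, unlike INSERT ... VALUES.
hiveContext.sql("INSERT INTO TABLE students SELECT 'fred flintstone', 35, 1 FROM one_row")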
Create table error from Hive in spark-assembly-1.0.2.jar
Hi All, We are trying to create a table in Hive from the spark-assembly-1.0.2.jar file: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)

JavaSparkContext sc = CC2SparkManager.sharedInstance().getSparkContext();
JavaHiveContext sqlContext = new JavaHiveContext(sc);
sqlContext.sql("CREATE TABLE srcd (key INT, value STRING)");

On Spark 1.1 we get the following error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient On Spark 1.0.2 we get the following error: failure: ``UNCACHE'' expected but identifier CREATE found Could you help us understand why we get these errors? Thanks Jacob
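A note on the 1.0.2 parse failure, based on my reading of the 1.0/1.1 API: on Spark 1.0.x, HiveQL statements went through the hql() method, while sql() used Spark's basic SQL parser, whose error messages look exactly like the ``UNCACHE'' expected one above. A minimal Scala sketch, reusing the table name from the question:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Spark 1.0.x: HiveQL must go through hql()
hiveContext.hql("CREATE TABLE IF NOT EXISTS srcd (key INT, value STRING)")
// Spark 1.1: hql() is deprecated; HiveContext.sql() parses HiveQL directly
hiveContext.sql("CREATE TABLE IF NOT EXISTS srcd (key INT, value STRING)")

The Spark 1.1 metastore error is a different problem: the HiveContext cannot reach a correctly configured metastore (see the hive-site.xml discussion later in this digest).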
Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0
I tried that already, same exception. I also tried using an accumulator to collect all filenames. The filename is not the problem. Even this crashes with the same exception:

sc.parallelize(files.value).map { fileName =>
  try {
    println(s"Scanning $fileName")
    sc.textFile(fileName).take(1)
    s"Successfully scanned $fileName"
  } catch {
    case t: Throwable => s"Failed to process $fileName, reason ${t.getStackTrace.head}"
  }
}.saveAsTextFile(output)

The output file contains “Failed to process …” for each file. On 26.10.2014, at 00:10, Buttler, David buttl...@llnl.gov wrote: This sounds like expected behavior to me. The foreach call should be distributed on the workers. Perhaps you want to use map instead, and then collect the failed file names locally, or save the whole thing out to a file. From: Marius Soutier [mps@gmail.com] Sent: Friday, October 24, 2014 6:35 AM To: user@spark.apache.org Subject: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0 Hi, I’m running a job whose only task is to find files that cannot be read (sometimes our gz files are corrupted). With 1.0.x, this worked perfectly. Since 1.1.0 however, I’m getting an exception: scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)

sc.wholeTextFiles(input).foreach { case (fileName, _) =>
  try {
    println(s"Scanning $fileName")
    sc.textFile(fileName).take(1)
    println(s"Successfully scanned $fileName")
  } catch {
    case t: Throwable => println(s"Failed to process $fileName, reason ${t.getStackTrace.head}")
  }
}

Also since 1.1.0, the printlns are no longer visible on the console, only in the Spark UI worker output. Thanks for any help - Marius
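Note that both snippets call sc.textFile inside a transformation, i.e. they use the SparkContext on the executors, which is not supported. A minimal sketch of a version that validates gzip files directly on the executors without touching sc; the Hadoop FileSystem usage here is an assumption about where the files live, not code from the thread:

import java.net.URI
import java.util.zip.GZIPInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

sc.parallelize(fileNames).map { fileName =>
  try {
    // Open the file with the HDFS client and try to decompress the first bytes.
    val fs = FileSystem.get(new URI(fileName), new Configuration())
    val in = new GZIPInputStream(fs.open(new Path(fileName)))
    try { in.read() } finally { in.close() }
    s"Successfully scanned $fileName"
  } catch {
    case t: Throwable => s"Failed to process $fileName, reason ${t.getMessage}"
  }
}.saveAsTextFile(output)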
Implement Count by Minute in Spark Streaming
Hi, Suppose I have a stream of logs and I want to count them by minute. The result is like:

2014-10-26 18:38:00 100
2014-10-26 18:39:00 150
2014-10-26 18:40:00 200

One way to do this is to set the batch interval to 1 min, but each batch would be quite large. Or I can use updateStateByKey where the key is like '2014-10-26 18:38:00', but then I have two questions: 1. How to persist the result to MySQL? Do I need to flush it every batch? 2. How to delete the old state? For example, now it is 18:50 but 18:40's state is still in Spark. One solution is to set the key's state to None when there's no data for this key in this batch. But what if the log volume is low and some batches get zero logs for a key? For instance:

18:40:00~18:40:10 has 10 logs - key 18:40's value is set to 10
18:40:10~18:40:20 has no log - key 18:40 is deleted
18:40:20~18:40:30 has 5 logs - key 18:40's value is set to 5

You can see the result is wrong. Maybe I can use an 'update' approach when flushing, i.e. check MySQL whether there's already an entry for 18:40 and add the result to that. But how about a unique count? I can't store all unique values in MySQL per se. So I'm looking for a better way to store count-by-minute results in an RDBMS (or NoSQL?). Any idea would be appreciated. Thanks. -- Jerry
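For the counting part, the additive 'update' approach Jerry describes avoids the deleted-key problem entirely: compute per-batch partial counts and let MySQL accumulate them. A minimal sketch, assuming a DStream of log lines, a hypothetical minuteOf() helper that truncates a line's timestamp to the minute, and a counts table with a unique key on the minute column:

import java.sql.DriverManager

val perBatch = logs.map(line => (minuteOf(line), 1L)).reduceByKey(_ + _)
perBatch.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection per partition, not per record.
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    val stmt = conn.prepareStatement(
      "INSERT INTO counts (minute, cnt) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)")
    partition.foreach { case (minute, n) =>
      stmt.setString(1, minute)
      stmt.setLong(2, n)
      stmt.addBatch()
    }
    stmt.executeBatch()
    stmt.close()
    conn.close()
  }
}

Partial counts are additive, so no state needs to live in Spark across batches. Unique counts are not additive, though, so this trick does not answer the distinct-count part of the question.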
Re: Bug in Accumulators...
Sorry, I forgot to say that this gives the above error just when run on a cluster, not in local mode.
Re: Spark as Relational Database
My understanding is that Spark SQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLlib by doing a massive JOIN query. So, I am creating a single collection by combining many collections. This sort of operation is very inefficient in Mongo, Cassandra or HDFS. I could store my data in a relational database, and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark. On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com wrote: 1. What data store do you want to store your data in ? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote: Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins across many tables. Can we use Spark as a relational database? Or, if not, can we put Spark on top of a relational database? Note that we don't care about SQL. Accessing our data via standard queries is nice, but we are equally happy (or more happy) to write Scala code. What is important to us is doing relational queries on huge amounts of data. Is Spark good at this? Thank you very much in advance Peter
Re: Implement Count by Minute in Spark Streaming
Hi, You can use Redis to store the keys, with the count as the value, by doing an update whenever you receive that minute's key; being an in-memory database it would be faster than SQL. You can do an update at the end of each batch to update the count of the key if it exists, or create it in case of a new entry. Sent from my iPhone On Oct 26, 2014, at 4:33 PM, Ji ZHANG zhangj...@gmail.com wrote: Hi, Suppose I have a stream of logs and I want to count them by minute. The result is like: 2014-10-26 18:38:00 100 2014-10-26 18:39:00 150 2014-10-26 18:40:00 200 One way to do this is to set the batch interval to 1 min, but each batch would be quite large. Or I can use updateStateByKey where key is like '2014-10-26 18:38:00', but I have two questions: 1. How to persist the result to MySQL? Do I need to flush them every batch? 2. How to delete the old state? For example, now is 18:50 but the 18:40's state is still in Spark. One solution is to set the key's state to None when there's no data of this key in this batch. But what if the log is not so much, and some batches get zero logs? For instance 18:40:00~18:40:10 has 10 logs - key 18:40's value is set to 10 18:40:10~18:40:20 has no log - key 18:40 is deleted 18:40:20~18:40:30 has 5 logs - key 18:40's value is set to 5 You can see the result is wrong. Maybe I can use an 'update' approach when flushing, i.e. check MySQL whether there's already an entry of 18:40 and add the result to that. But how about a unique count? I can't store all unique values in MySQL per se. So I'm looking for a better way to store count-by-minute result into rdbms (or nosql?). Any idea would be appreciated. Thanks. -- Jerry
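A sketch of that Redis update from Spark Streaming, using the Jedis client; the client choice, host name, and key format are assumptions, not from the thread:

import redis.clients.jedis.Jedis

// perBatch: DStream[(String, Long)] of (minute, count) pairs for the current batch
perBatch.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val jedis = new Jedis("redis-host") // hypothetical host
    partition.foreach { case (minute, n) =>
      jedis.incrBy(s"count:$minute", n) // atomic add; creates the key on first use
    }
    jedis.close()
  }
}

INCRBY gives exactly the update-if-exists, create-otherwise semantics in one atomic operation.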
Re: Spark as Relational Database
Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk. What are your requirements exactly? What is massive amounts of data exactly? How big is your cluster? Note that Spark is not for data storage, only data analysis. It pulls data into working data sets called RDD's. As a migration path, you could probably pull the data out of a relational database to analyze. But in the long run, I would recommend using a more purpose built, huge storage database such as Cassandra. If your data is very static, you could also just store it in files. On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote: My understanding is the SparkSQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLib by doing a massive JOIN query. So, I am creating a single collection by combining many collections. This sort of operation is very inefficient in Mongo, Cassandra or HDFS. I could store my data in a relational database, and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark. On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com wrote: 1. What data store do you want to store your data in ? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote: Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins across many tables. Can we use Spark as a relational database? Or, if not, can we put Spark on top of a relational database? Note that we don't care about SQL. Accessing our data via standard queries is nice, but we are equally happy (or more happy) to write Scala code. What is important to us is doing relational queries on huge amounts of data. Is Spark good at this? Thank you very much in advance Peter
Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
Just tried the below code and it works for me; not sure why the SparkContext is being pulled into the mapPartitions function in your case. Can you try with a simple map() instead of mapPartitions?

val ac = sc.accumulator(0)
val or = sc.parallelize(1 to 1)
val ps = or.map(x => (x, x + 2)).map(x => ac += 1)
val test = ps.collect()
println(ac.value)

Thanks Best Regards On Sun, Oct 26, 2014 at 1:44 AM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: Hi all, I tried to use accumulators without any success so far. My code is simple:

val sc = new SparkContext(conf)
val accum = sc.accumulator(0)
val partialStats = sc.textFile(f.getAbsolutePath())
  .map(line => { val key = line.split("\t").head; (key, line) })
  .groupByKey(128)
  .mapPartitions { iter => { accum += 1; foo(iter) } }
  .reduce(_ + _)
println(accum.value)

Now, if I remove the 'accum += 1', everything works fine. If I keep it, I get this weird error:

Exception in thread main 14/10/25 21:58:56 INFO TaskSchedulerImpl: Cancelling stage 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:890)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:887)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:887)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:887)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:886)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:886)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1204)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Can someone please help! Thank you!
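One common cause of this exact NotSerializableException is that the closure references a field of an enclosing object, so Scala serializes the whole object, including its SparkContext. A minimal sketch of the usual fix; the class and names here are illustrative, not from the thread: copy what the closure needs into a local val first.

class Job(sc: org.apache.spark.SparkContext) {
  val accum = sc.accumulator(0)
  def run(path: String): Int = {
    val acc = accum // local val: the closure now captures acc, not `this` (and not sc)
    sc.textFile(path)
      .map { line => acc += 1; line.length }
      .reduce(_ + _)
  }
}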
Re: Spark as Relational Database
@Peter - as Rick said - Spark's main usage is data analysis and not storage. Spark allows you to plug in different storage layers based on your use cases and quality attribute requirements. So in essence, if your relational database is meeting your storage requirements, you should think about how to use it with Spark, because even if you decide not to use your relational database you will have to select some storage layer, most likely a distributed one. Another option to think about: can you possibly restructure your data schema so that you don't have to do such a large number of joins? If that is an option then you can potentially think about using stores such as Cassandra, HBase, HDFS etc. Spark really excels at processing large volumes of data really fast (given enough memory) on horizontally scalable commodity hardware. As Rick pointed out, it will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. However, if your data size is much larger than your cluster memory you don't have a choice but to select a datastore. HTH -Soumya On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson rick.richard...@gmail.com wrote: Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk. What are your requirements exactly? What is massive amounts of data exactly? How big is your cluster? Note that Spark is not for data storage, only data analysis. It pulls data into working data sets called RDD's. As a migration path, you could probably pull the data out of a relational database to analyze. But in the long run, I would recommend using a more purpose built, huge storage database such as Cassandra. If your data is very static, you could also just store it in files. On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote: My understanding is the SparkSQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLib by doing a massive JOIN query. So, I am creating a single collection by combining many collections. This sort of operation is very inefficient in Mongo, Cassandra or HDFS. I could store my data in a relational database, and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark. On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com wrote: 1. What data store do you want to store your data in ? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote: Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins across many tables. Can we use Spark as a relational database? Or, if not, can we put Spark on top of a relational database? Note that we don't care about SQL.
Accessing our data via standard queries is nice, but we are equally happy (or more happy) to write Scala code. What is important to us is doing relational queries on huge amounts of data. Is Spark good at this? Thank you very much in advance Peter
Re: Spark as Relational Database
Hi, It is very easy to integrate Cassandra in a use case such as this. For instance, do your joins in Spark and do your data storage in Cassandra, which allows a very flexible schema, unlike a relational DB, and is much faster and fault tolerant; with Spark colocation for data locality it is faster still. If you use the Spark Cassandra Connector, reading and writing to Cassandra is as simple as:

write - DStream or RDD
stream.map(RawData(_)).saveToCassandra("keyspace", "table")

read - SparkContext or StreamingContext
ssc.cassandraTable[Double]("keyspace", "dailytable")
  .select("precipitation")
  .where("weather_station = ? AND year = ?", wsid, year)
  .map(doWork)

In your build:
"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-alpha4" // our 1.1.0 is for Spark 1.1

https://github.com/datastax/spark-cassandra-connector docs: https://github.com/datastax/spark-cassandra-connector/tree/master/doc - Helena twitter: @helenaedelson On Oct 26, 2014, at 10:05 AM, Rick Richardson rick.richard...@gmail.com wrote: Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk. What are your requirements exactly? What is massive amounts of data exactly? How big is your cluster? Note that Spark is not for data storage, only data analysis. It pulls data into working data sets called RDD's. As a migration path, you could probably pull the data out of a relational database to analyze. But in the long run, I would recommend using a more purpose built, huge storage database such as Cassandra. If your data is very static, you could also just store it in files. On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote: My understanding is the SparkSQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLib by doing a massive JOIN query. So, I am creating a single collection by combining many collections. This sort of operation is very inefficient in Mongo, Cassandra or HDFS. I could store my data in a relational database, and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark. On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com wrote: 1. What data store do you want to store your data in ? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote: Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins across many tables. Can we use Spark as a relational database? Or, if not, can we put Spark on top of a relational database? Note that we don't care about SQL. Accessing our data via standard queries is nice, but we are equally happy (or more happy) to write Scala code. What is important to us is doing relational queries on huge amounts of data.
Is Spark good at this? Thank you very much in advance Peter
what classes are needed to register in KryoRegistrator, e.g. Row?
In Tuning Spark https://spark.apache.org/docs/latest/tuning.html, it says, Spark automatically includes Kryo serializers for the *many commonly-used core Scala classes* covered in the AllScalaRegistrar from the Twitter chill https://github.com/twitter/chill library. I looked into the AllScalaRegistrar doc; it only says: /** Registers all the scala (and java) serializers we have */ It seems to register only the Scala and Java primitive and collection classes. Is that right? What about classes in Spark? Do we need to register them ourselves, especially Row, GenericRow, and MutableRow in Spark SQL?
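For application classes (and, if needed, Spark SQL's row classes), registration goes through a custom KryoRegistrator. A minimal sketch; the GenericRow class path is my reading of the Spark 1.1 source, and MyRecord is a hypothetical application class:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Spark SQL row implementation (assumption: this package as of Spark 1.1)
    kryo.register(classOf[org.apache.spark.sql.catalyst.expressions.GenericRow])
    kryo.register(classOf[MyRecord]) // hypothetical application class
  }
}

// Wire it up via the SparkConf:
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// conf.set("spark.kryo.registrator", "MyRegistrator")

Registration is an optimization (it avoids writing full class names into the stream); unregistered classes still serialize unless spark.kryo.registrationRequired is enabled.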
Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
Hi Akhil, Please see this related message. http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-td17263.html I am curious if this works for you also.
How do you use the thrift-server to get data from a Spark program?
Hi all, This feels like a dumb question but bespeaks my lack of understanding: what is the Spark thrift-server for? Especially if there's an existing Hive installation. Background: We want to use Spark to do some processing starting from files (probably in MapRFS). We want to be able to read the result using SQL so that we can report the results using Eclipse BIRT. My confusion: Spark 1.1 includes a thrift-server for accessing data via JDBC. However, I don't understand how to make data available in it from the rest of Spark. I have a small program that does what I want in spark-shell. It reads some JSON, does some manipulation using SchemaRDDs and then has the data ready. If I've started the shell with the hive-site.xml pointing to a Hive installation I can use SchemaRDD.saveAsTable to put it into Hive - and then I can use beeline to read it. But that's using the *Hive* thrift-server and not the Spark thrift-server. That doesn't seem to be the intention of having a separate thrift-server in Spark. Before I started on this I assumed that you could run a Spark program (in, say, Java) and then make those results accessible via the JDBC interface. So, please, fill me in. What am I missing? Many thanks, Edward
Spark optimization
I wonder if there is any tool to tune Spark (worker and master). I have 6 workers (192 GB RAM, 32 CPU cores each) with 2 masters, and I see only a small difference between Hadoop MapReduce and Spark. I've tested word count on a 50 GB file. During the tests Spark hung on 2 nodes for a few minutes with these messages:

14/10/26 21:38:52 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[8] at saveAsTextFile at NativeMethodAccessorImpl.java:-2)
14/10/26 21:38:52 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/10/26 21:38:52 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sp...@spark-s2.test.org:41437
14/10/26 21:38:52 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 5942 bytes
14/10/26 21:38:52 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sp...@spark-s4.test.org:34546

Best regards, Morbious
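One thing the log does show: the saveAsTextFile stage has only 2 tasks, so at that point at most 2 of the 192 cores are doing any work. A minimal sketch of a word count that forces more reduce partitions; the paths and partition count are illustrative, not from the post:

// Roughly one reduce partition per core in the cluster.
val counts = sc.textFile("hdfs:///input/50gb-file.txt")
  .flatMap(_.split(" "))
  .map((_, 1L))
  .reduceByKey(_ + _, 192)
counts.saveAsTextFile("hdfs:///output/wordcount")

Whether this closes the gap to MapReduce depends on the job, but a 2-task reduce phase on a 6-node cluster will certainly serialize the run.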
Re: How do you use the thrift-server to get data from a Spark program?
This is very experimental and mostly unsupported, but you can start the JDBC server from within your own programs https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45 by passing it the HiveContext. On Sun, Oct 26, 2014 at 12:16 PM, Edward Sargisson ejsa...@gmail.com wrote: Hi all, This feels like a dumb question but bespeaks my lack of understanding: what is the Spark thrift-server for? Especially if there's an existing Hive installation. Background: We want to use Spark to do some processing starting from files (in probably MapRFS). We want to be able to read the result using SQL so that we can report the results using Eclipse BIRT. My confusion: Spark 1.1 includes a thrift-server for accessing data via JDBC. However, I don't understand how to make data available in it from the rest of Spark. I have a small program that does what I want in spark-shell. It reads some JSON, does some manipulation using SchemaRDDs and then has the data ready. If I've started the shell with the hive-site.xml pointing to a Hive installation I can use SchemaRDD.saveToTable to put it into Hive - and then I can use beeline to read it. But that's using the *Hive* thrift-server and not the Spark thrift-server. That doesn't seem to be the intention of having a separate thrift-server in Spark. Before I started on this I assumed that you could run a Spark program (in, say, Java) and then make those results accessible for the JDBC interface. So, please, fill me in. What am I missing? Many thanks, Edward
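To make that concrete, a rough sketch (assuming the startWithContext entry point; depending on your exact build you may instead need to instantiate the class in the linked file directly): build your SchemaRDD, register it as a table on a HiveContext, then hand that context to the thrift server so beeline/JDBC clients can see it.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
val events = hiveContext.jsonFile("maprfs:///data/events.json") // hypothetical input
events.registerTempTable("events") // queryable by JDBC clients of this server
HiveThriftServer2.startWithContext(hiveContext)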
Spark SQL configuration
I'm a newbie with Spark. After installing it on all the machines I want to use, do I need to tell it about the Hadoop configuration, or will it be able to find it by itself? Thank you,
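Spark does not discover a Hadoop installation on its own; it reads the client configuration from the directory you point it at. A minimal sketch for conf/spark-env.sh on each machine; the path is a placeholder for wherever your core-site.xml and hdfs-site.xml live:

# conf/spark-env.sh
export HADOOP_CONF_DIR=/etc/hadoop/conf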
Re: Spark as Relational Database
I agree with Soumya. A relational database is usually the worst kind of database to receive a constant event stream. That said, the best solution is one that already works :) If your system is meeting your needs, then great. When you get so many events that your db can't keep up, I'd look into Cassandra to receive the events, and spark to analyze them. On Oct 26, 2014 9:14 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Given that you are storing event data (which is basically things that have happened in the past AND cannot be modified) you should definitely look at Event sourcing. http://martinfowler.com/eaaDev/EventSourcing.html If all you are doing is storing events then I don't think you need a relational database. Rather an event log is ideal. Please see - http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying There are many other datastores that can do a better job at storing your events. You can process your data and then store them in a relational database to query later. On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf opus...@gmail.com wrote: Thanks for all the useful responses. We have the usual task of mining a stream of events coming from our many users. We need to store these events, and process them. We use a standard Star Schema to represent our data. For the moment, it looks like we should store these events in SQL. When appropriate, we will do analysis with relational queries. Or, when appropriate we will extract data into working sets in Spark. I imagine this is a pretty common use case for Spark. On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson rick.richard...@gmail.com wrote: Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk. What are your requirements exactly? What is massive amounts of data exactly? How big is your cluster? Note that Spark is not for data storage, only data analysis. It pulls data into working data sets called RDD's. As a migration path, you could probably pull the data out of a relational database to analyze. But in the long run, I would recommend using a more purpose built, huge storage database such as Cassandra. If your data is very static, you could also just store it in files. On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote: My understanding is the SparkSQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLib by doing a massive JOIN query. So, I am creating a single collection by combining many collections. This sort of operation is very inefficient in Mongo, Cassandra or HDFS. I could store my data in a relational database, and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark. On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com wrote: 1. What data store do you want to store your data in ? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. 
On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote: Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins across many tables. Can we use Spark as a relational database? Or, if not, can we put Spark on top of a relational database? Note that we don't care about SQL. Accessing our data via standard queries is nice, but we are equally happy (or more happy) to write Scala code. What is important to us is doing relational queries on huge amounts of data. Is Spark good at this? Thank you very much in advance Peter
Re: Spark LIBLINEAR
Debasish Das writes: If the SVM is not already migrated to BFGS, that's the first thing you should try... Basically, following the LBFGS Logistic Regression, come up with an LBFGS-based linear SVM... About integrating TRON in MLlib: David already has a version of TRON in breeze, but someone needs to validate it for linear SVM and run experiments to see if it can improve upon the LBFGS-based linear SVM... Based on the liblinear papers it should, but I don't expect a substantial difference... I am validating TRON for use-cases related to this PR (but I need more features on top of TRON): https://github.com/apache/spark/pull/2705 We are also working on integrating TRON into MLlib, though we haven't checked the above work. We are also doing some serious comparison between quasi-Newton and Newton, though this will take some time. Chih-Jen On Fri, Oct 24, 2014 at 2:09 PM, k.tham kevins...@gmail.com wrote: Just wondering, any update on this? Is there a plan to integrate CJ's work with MLlib? I'm asking since the SVM impl in MLlib did not give us good results and we have to resort to training our SVM classifier in a serial manner on the driver node with liblinear. Also, it looks like CJ Lin is coming to the bay area in the coming weeks (http://www.meetup.com/sfmachinelearning/events/208078582/) - might be a good time to connect with him.
Spark 1.1.0 ClassNotFoundException issue when submit with multi jars using CLUSTER MODE
Hi, I am using Spark 1.1.0 configured with the STANDALONE cluster manager and CLUSTER deploy mode. I want to submit multiple jars with spark-submit using the --jars option, but I get a ClassNotFoundException; by the way, in my code I also use the thread context class loader to load custom classes. The strange thing is that when I use CLIENT deploy mode, the exception is not thrown. Can anyone explain the class loader logic of Spark, or what the issue is when using cluster mode?

14/10/24 14:18:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: Cannot load class: cn.cekasp.al.demo.SimpleInputFormat3
at cn.cekasp.algorithm.util.ReflectUtil.findClass(ReflectUtil.java:12)
at cn.cekasp.algorithm.util.ReflectUtil.newInstance(ReflectUtil.java:18)
at cn.cekasp.algorithm.reader.JdbcSourceReader$1.call(JdbcSourceReader.java:95)
at cn.cekasp.algorithm.reader.JdbcSourceReader$1.call(JdbcSourceReader.java:90)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:923)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1167)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:904)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:904)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: cn.cekasp.al.demo.SimpleInputFormat3
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at cn.cekasp.algorithm.util.ReflectUtil.findClass(ReflectUtil.java:10)
... 16 more
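For reference, a minimal sketch of the kind of lookup that tends to behave better across client and cluster deploy modes; this is an illustrative rewrite of a ReflectUtil-style helper, not the poster's actual code: resolve through the thread context class loader, falling back to the class's own loader.

object ReflectUtil {
  def findClass(name: String): Class[_] = {
    val loader = Option(Thread.currentThread.getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    // Class.forName with an explicit loader avoids pinning the lookup to
    // whichever loader happened to load this utility class.
    Class.forName(name, true, loader)
  }
}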
Re: scalac crash when compiling DataTypeConversions.scala
Any update? I encountered the same issue in my environment. Here are my usual steps: git clone https://github.com/apache/spark then mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package, which builds successfully with Maven. Then I import it into IDEA as a Maven project and click Build -> Make Project. Two compile errors are found, one of them in DataTypeConversions.scala: Error:scala: while compiling: /opt/Development/spark/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala during phase: jvm library version: version 2.10.4 compiler version: version 2.10.4 reconstructed args: -nobootcp -javabootclasspath : -deprecation -feature -Xplugin:/home/shawguo/host/maven_repo/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar -unchecked -classpath
Re: scalac crash when compiling DataTypeConversions.scala
Yes it is necessary to do a mvn clean when encountering this issue. Typically you would have changed one or more of the profiles/options - which leads to this occurring. 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com: I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a full gist of an innocuous test command (mvn test -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts on L512 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512 and there’s a final stack trace at the bottom https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671 . mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue while compiling with each tool), but are annoying/time-consuming to do, obvs, and it’s happening pretty frequently for me when doing only small numbers of incremental compiles punctuated by e.g. checking out different git commits. Have other people seen this? This post http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html on this list is basically the same error, but in TestSQLContext.scala and this SO post http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m claims to be hitting it when trying to build in intellij. It seems likely to be a bug in scalac; would finding a consistent repro case and filing it somewhere be useful?
Re: scalac crash when compiling DataTypeConversions.scala
I heard from one person offline who regularly builds Spark on OSX and Linux and they felt like they only ever saw this error on OSX; if anyone can confirm whether they've seen it on Linux, that would be good to know. Stephen: good to know re: profiles/options. I don't think changing them is a necessary condition as I believe I've run into it without doing that, but any set of steps to reproduce this would be welcome so that we could escalate to Typesafe as appropriate. On Sun, Oct 26, 2014 at 11:46 PM, Stephen Boesch java...@gmail.com wrote: Yes it is necessary to do a mvn clean when encountering this issue. Typically you would have changed one or more of the profiles/options - which leads to this occurring. 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com: I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a full gist of an innocuous test command (mvn test -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts on L512 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512 and there’s a final stack trace at the bottom https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671 . mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue while compiling with each tool), but are annoying/time-consuming to do, obvs, and it’s happening pretty frequently for me when doing only small numbers of incremental compiles punctuated by e.g. checking out different git commits. Have other people seen this? This post http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html on this list is basically the same error, but in TestSQLContext.scala and this SO post http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m claims to be hitting it when trying to build in intellij. It seems likely to be a bug in scalac; would finding a consistent repro case and filing it somewhere be useful?
Re: Setting only master heap
Hi Guys, Here's some lines from the log file before the OOM. They don't look that helpful, so let me know if there's anything else I should be sending. I am running in standalone mode.

spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:00:36 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-52] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:Exception in thread qtp2057079871-30 java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:00:07 WARN AbstractNioSelector: Unexpected exception in the selector loop.
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:02:51 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-8] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-38] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-6] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-43] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-13] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-5] shutting down ActorSystem [sparkMaster]
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java heap space
spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5-14/10/22 05:03:22 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkMaster-akka.actor.default-dispatcher-12] shutting down ActorSystem [sparkMaster]

On Thu, Oct 23, 2014 at 2:10 PM, Nan Zhu zhunanmcg...@gmail.com wrote: h… my observation is that, master in Spark 1.1 has higher frequency of GC…… Also, before 1.1, I never encounter GC overtime in Master, after upgrade to 1.1, I have met for 2 times (we upgrade soon after 1.1
release)…. Best, -- Nan Zhu On Thursday, October 23, 2014 at 1:08 PM, Andrew Or wrote: Yeah, as Sameer commented, there is unfortunately not an equivalent `SPARK_MASTER_MEMORY` that you can set. You can work around this by starting the master and the slaves separately with different settings of SPARK_DAEMON_MEMORY each time. AFAIK there haven't been any major changes in the standalone master in 1.1.0, so I don't see an immediate explanation for what you're observing. In general the Spark master doesn't use that much memory, and even if there are many applications it will discard the old ones appropriately, so unless you have a ton (like thousands) of concurrently running applications connecting to it there's little likelihood for it to OOM. At least that's my understanding. -Andrew 2014-10-22 15:51 GMT-07:00 Sameer Farooqui same...@databricks.com: Hi Keith, Would be helpful if you could post the error message. Are you running Spark in Standalone mode or with YARN? In general, the Spark Master is only used for scheduling and it should be fine with the default setting of 512 MB RAM. Is it actually the Spark Driver's memory that you intended to change? *++ If in Standalone mode ++* You're right that SPARK_DAEMON_MEMORY set the memory to allocate to the Spark Master, Worker and even HistoryServer daemons together. SPARK_WORKER_MEMORY is slightly confusing. In Standalone
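A sketch of Andrew's start-them-separately workaround in shell form; the heap sizes are illustrative: give the master a larger daemon heap, then launch the workers with the normal setting.

# On the master host:
SPARK_DAEMON_MEMORY=2g ./sbin/start-master.sh
# Then launch the workers with the default:
SPARK_DAEMON_MEMORY=512m ./sbin/start-slaves.sh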
Spark SQL Exists Clause
Hey, I'm trying to run TPC-H Query 4 (shown below), and get the following error: Exception in thread main java.lang.RuntimeException: [11.25] failure: ``UNION'' expected but `select' found It seems like Spark SQL doesn't support the exists clause. Is this true?

select
  o_orderpriority,
  count(*) as order_count
from
  orders
where
  o_orderdate >= date '1993-07-01'
  and o_orderdate < date '1993-10-01'
  and exists (
    select *
    from lineitem
    where l_orderkey = o_orderkey
      and l_commitdate < l_receiptdate
  )
group by o_orderpriority
order by o_orderpriority;

Thanks
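Spark SQL's own parser in 1.1 indeed has no EXISTS support. One workaround sketch, untested here (and the date literals are simplified to plain strings, on the assumption that the parser lacks date-literal syntax too): rewrite the correlated EXISTS as a join against the distinct qualifying order keys.

val q4 = sqlContext.sql("""
  select o_orderpriority, count(*) as order_count
  from orders
  join (select distinct l_orderkey
        from lineitem
        where l_commitdate < l_receiptdate) late_items
    on l_orderkey = o_orderkey
  where o_orderdate >= '1993-07-01'
    and o_orderdate < '1993-10-01'
  group by o_orderpriority
  order by o_orderpriority""")

The distinct in the subquery preserves the EXISTS semantics: each order is counted at most once even if several of its line items were late.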
Re: scalac crash when compiling DataTypeConversions.scala
I see the errors regularly on linux under the conditions of having changed profiles. 2014-10-26 20:49 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com: I heard from one person offline who regularly builds Spark on OSX and Linux and they felt like they only ever saw this error on OSX; if anyone can confirm whether they've seen it on Linux, that would be good to know. Stephen: good to know re: profiles/options. I don't think changing them is a necessary condition as I believe I've run into it without doing that, but any set of steps to reproduce this would be welcome so that we could escalate to Typesafe as appropriate. On Sun, Oct 26, 2014 at 11:46 PM, Stephen Boesch java...@gmail.com wrote: Yes it is necessary to do a mvn clean when encountering this issue. Typically you would have changed one or more of the profiles/options - which leads to this occurring. 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com: I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a full gist of an innocuous test command (mvn test -Dsuites='*KafkaStreamSuite') exhibiting this behavior. Problem starts on L512 https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L512 and there’s a final stack trace at the bottom https://gist.github.com/ryan-williams/7673d7da928570907f4d#file-stdout-L671 . mvn clean or ./sbt/sbt clean “fix” it (I believe I’ve observed the issue while compiling with each tool), but are annoying/time-consuming to do, obvs, and it’s happening pretty frequently for me when doing only small numbers of incremental compiles punctuated by e.g. checking out different git commits. Have other people seen this? This post http://apache-spark-user-list.1001560.n3.nabble.com/spark-github-source-build-error-td10532.html on this list is basically the same error, but in TestSQLContext.scala and this SO post http://stackoverflow.com/questions/25211071/compilation-errors-in-spark-datatypeconversions-scala-on-intellij-when-using-m claims to be hitting it when trying to build in intellij. It seems likely to be a bug in scalac; would finding a consistent repro case and filing it somewhere be useful?
Re: Spark as Relational Database
Given that you are storing event data (which is basically things that have happened in the past AND cannot be modified) you should definitely look at Event sourcing. http://martinfowler.com/eaaDev/EventSourcing.html Agreed. In this context: a lesser known fact is that the Lambda Architecture is, in a nutshell, an extension of Fowler’s ES, so you might also want to check out: https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark Cheers, Michael -- Michael Hausenblas Ireland, Europe http://mhausenblas.info/ On 27 Oct 2014, at 01:14, Soumya Simanta soumya.sima...@gmail.com wrote: Given that you are storing event data (which is basically things that have happened in the past AND cannot be modified) you should definitely look at Event sourcing. http://martinfowler.com/eaaDev/EventSourcing.html If all you are doing is storing events then I don't think you need a relational database. Rather an event log is ideal. Please see - http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying There are many other datastores that can do a better job at storing your events. You can process your data and then store them in a relational database to query later. On Sun, Oct 26, 2014 at 9:01 PM, Peter Wolf opus...@gmail.com wrote: Thanks for all the useful responses. We have the usual task of mining a stream of events coming from our many users. We need to store these events, and process them. We use a standard Star Schema to represent our data. For the moment, it looks like we should store these events in SQL. When appropriate, we will do analysis with relational queries. Or, when appropriate we will extract data into working sets in Spark. I imagine this is a pretty common use case for Spark. On Sun, Oct 26, 2014 at 10:05 AM, Rick Richardson rick.richard...@gmail.com wrote: Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk. What are your requirements exactly? What is massive amounts of data exactly? How big is your cluster? Note that Spark is not for data storage, only data analysis. It pulls data into working data sets called RDD's. As a migration path, you could probably pull the data out of a relational database to analyze. But in the long run, I would recommend using a more purpose built, huge storage database such as Cassandra. If your data is very static, you could also just store it in files. On Oct 26, 2014 9:19 AM, Peter Wolf opus...@gmail.com wrote: My understanding is the SparkSQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLib by doing a massive JOIN query. So, I am creating a single collection by combining many collections. This sort of operation is very inefficient in Mongo, Cassandra or HDFS. I could store my data in a relational database, and copy the query results to Spark for processing. However, I was hoping I could keep everything in Spark. On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com wrote: 1. What data store do you want to store your data in ? HDFS, HBase, Cassandra, S3 or something else? 2. 
Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf opus...@gmail.com wrote: Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins across many tables. Can we use Spark as a relational database? Or, if not, can we put Spark on top of a relational database? Note that we don't care about SQL. Accessing our data via standard queries is nice, but we are equally happy (or more happy) to write Scala code. What is important to us is doing relational queries on huge amounts of data. Is Spark good at this? Thank you very much in advance Peter
RE: Create table error from Hive in spark-assembly-1.0.2.jar
Can you paste the hive-site.xml? Most of the times I meet this exception, it is because the JDBC driver for the Hive metastore is not correctly set, or the wrong driver classes are included in the assembly jar. By default, the assembly jar contains derby.jar, which is the embedded Derby JDBC driver. From: Jacob Chacko - Catalyst Consulting [mailto:jacob.cha...@catalystconsulting.be] Sent: Sunday, October 26, 2014 3:37 PM To: u...@spark.incubator.apache.org; user@spark.apache.org Subject: Create table error from Hive in spark-assembly-1.0.2.jar Hi All, We are trying to create a table in Hive from the spark-assembly-1.0.2.jar file: CREATE TABLE IF NOT EXISTS src (key INT, value STRING) JavaSparkContext sc = CC2SparkManager.sharedInstance().getSparkContext(); JavaHiveContext sqlContext = new JavaHiveContext(sc); sqlContext.sql("CREATE TABLE srcd (key INT, value STRING)"); On Spark 1.1 we get the following error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient On Spark 1.0.2 we get the following error: failure: ``UNCACHE'' expected but identifier CREATE found Could you help us understand why we get these errors? Thanks Jacob
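For reference, a minimal hive-site.xml sketch pointing the metastore at MySQL instead of the embedded Derby; the host, database, and credentials are placeholders, and the matching JDBC driver jar must be on the classpath:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>secret</value>
  </property>
</configuration>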
Re: RDD to DStream
I have a similar requirement. But instead of grouping it by chunkSize, I would have the timestamp be part of the data. So the function I want has the following signature:

// RDD of (timestamp, value)
def rddToDStream[T](data: RDD[(Long, T)], timeWindow: Long)(implicit ssc: StreamingContext): DStream[T]

And the DStream should respect the timestamp part. This is important for simulation, right? Do you have any good solution for this? Jianshi On Thu, Aug 7, 2014 at 9:32 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Hey Aniket, Great thoughts! I understand the usecase. But as you have realized yourself it is not trivial to cleanly stream a RDD as a DStream. Since RDD operations are defined to be scan based, it is not efficient to define RDD based on slices of data within a partition of another RDD, using pure RDD transformations. What you have done is a decent, and probably the only feasible solution, with its limitations. Also the requirements of converting a batch of data to a stream of data can be pretty diverse. What rate, what # of events per batch, how many batches, is it efficient? Hence, it is not trivial to define a good, clean public API for that. If any one has any thoughts, ideas, etc on this, you are more than welcome to share them. TD On Mon, Aug 4, 2014 at 12:43 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: The use case for converting RDD into DStream is that I want to simulate a stream from already persisted data for testing analytics. It is trivial to create a RDD from any persisted data but not so much for DStream. Therefore, my idea is to create a DStream from a RDD. For example, let's say you are trying to implement analytics on time series data using Lambda architecture. This means you would have to implement the same analytics on streaming data (in streaming mode) as well as persisted data (in batch mode). The workflow for implementing the analytics would be to first implement it in batch mode using RDD operations and then simulate a stream to test the analytics in stream mode. The simulated stream should produce the elements at a specified rate. So the solution may be to read data into a RDD, split (chunk) it into multiple RDDs with each RDD having the size of elements that need to be streamed per time unit, and then finally stream each RDD using the compute function. The problem with using QueueInputDStream is that it will stream data as per the batch duration specified in the streaming context and one cannot specify a custom slide duration. Moreover, the class QueueInputDStream is private to the streaming package, so I can't really use it/extend it from an external package. Also, I could not find a good solution to split a RDD into equal sized smaller RDDs that can be fed into an extended version of QueueInputDStream. Finally, here is what I came up with:

class RDDExtension[T: ClassTag](rdd: RDD[T]) {
  def toStream(streamingContext: StreamingContext, chunkSize: Int,
               slideDurationMilli: Option[Long] = None): DStream[T] = {
    new InputDStream[T](streamingContext) {
      private val iterator = rdd.toLocalIterator // WARNING: each partition must fit in RAM of the local machine.
      private val grouped = iterator.grouped(chunkSize)
      override def start(): Unit = {}
      override def stop(): Unit = {}
      override def compute(validTime: Time): Option[RDD[T]] = {
        if (grouped.hasNext) {
          Some(rdd.sparkContext.parallelize(grouped.next()))
        } else {
          None
        }
      }
      override def slideDuration = {
        slideDurationMilli.map(duration => new Duration(duration))
          .getOrElse(super.slideDuration)
      }
    }
  }
}

This aims to stream chunkSize elements every slideDurationMilli milliseconds (defaulting to the batch duration of the streaming context). It's still not perfect (for example, the streaming is not precise) but given that this will only be used for testing purposes, I don't look for ways to further optimize it. Thanks, Aniket On 2 August 2014 04:07, Mayur Rustagi mayur.rust...@gmail.com wrote: Nice question :) Ideally you should use a queuestream interface to push RDDs into a queue, then Spark Streaming can handle the rest. Though why are you looking to convert RDD to DStream? Another workaround folks use is to source the DStream from folders and move files that they need reprocessed back into the folder; it's a hack but much less of a headache. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Aug 1, 2014 at 10:21 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi everyone I haven't been receiving replies to my queries in the distribution list. Not pissed but I am actually curious to know if my messages are actually going through or not. Can someone please confirm that my msgs are getting delivered via this distribution list? Thanks, Aniket On 1 August
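For the timestamp-respecting variant Jianshi asks for at the top of the thread, a minimal sketch along the same InputDStream lines; it is inefficient (one filter pass over the cached RDD per batch) and carries the same caveats as Aniket's version:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}

def rddToDStream[T: ClassTag](data: RDD[(Long, T)], timeWindow: Long,
                              ssc: StreamingContext): DStream[T] = {
  // Bucket records by time window, then replay one bucket per streaming batch.
  val bucketed = data.map { case (ts, v) => (ts / timeWindow, v) }.cache()
  val buckets = bucketed.keys.distinct().collect().sorted.iterator
  new InputDStream[T](ssc) {
    override def start(): Unit = {}
    override def stop(): Unit = {}
    override def compute(validTime: Time): Option[RDD[T]] =
      if (buckets.hasNext) {
        val bucket = buckets.next()
        Some(bucketed.filter(_._1 == bucket).values)
      } else None
  }
}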