Re: Spark SQL and running parquet tables?

2014-09-12 Thread DanteSama
Turns out it was Spray with a bad route -- the results weren't updating even
though the table was. This thread can be ignored.






Re: Spark SQL and running parquet tables?

2014-09-12 Thread DanteSama
So, after toying around a bit, here's what I ended up with. First off, in
1.0.2 there's no function "registerTempTable" -- "registerAsTable" seems to be
enough (it's the same whether called directly on a SchemaRDD or through a
SQLContext that's been handed an RDD). The problem I ran into after that was
reloading a table in one actor and referencing it in another.

The environment I set up has two kinds of Akka actors, a Query and a
Refresher. They share a reference to the same SQLContext (passed in on
creation via Props(classOf[Actor], sqlContext)). The Refresher would simply
reload the parquet file and re-register the table:

sqlContext
  .parquetFile(dataDir)
  .registerAsTable(tableName)

The Query actor, which backs the Spray web service, would query it:

sqlContext.sql("query with tableName").collect()

This would break: the Refresher actor worked and could query the table, but
the Query actor would report that the table doesn't exist.


I've now removed the Refresher and just updated the Query actor to refresh
its table itself whenever it's stale.
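
Roughly, the stale-check version looks like this (again a sketch -- the
refresh interval and names are made up):

import akka.actor.Actor
import org.apache.spark.sql.SQLContext

// The Query actor re-registers the table itself when it's stale, so
// registration and querying always happen in the same place.
class Query(sqlContext: SQLContext, dataDir: String, tableName: String,
            maxAgeMs: Long) extends Actor {
  private var lastRefresh = 0L

  def receive = {
    case query: String =>
      val now = System.currentTimeMillis()
      if (now - lastRefresh > maxAgeMs) {
        sqlContext.parquetFile(dataDir).registerAsTable(tableName)
        lastRefresh = now
      }
      sender ! sqlContext.sql(query).collect()
  }
}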






Re: Spark SQL and running parquet tables?

2014-09-11 Thread DanteSama
Michael Armbrust wrote
> You'll need to run parquetFile("path").registerTempTable("name") to
> refresh the table.

I'm not seeing that function on SchemaRDD in 1.0.2 -- is there something I'm
missing?

SchemaRDD Scaladoc






Spark SQL and running parquet tables?

2014-09-11 Thread DanteSama
I've been under the impression that creating and registering a parquet table
will pick up on updates to the table, such as inserts. I have a program
running that does the following:

// Create context (conf is a SparkConf defined elsewhere)
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Register the parquet directory as a table
sqlContext
  .parquetFile("hdfs://somewhere/users/sql/")
  .registerAsTable("mytable")

This program is continuously running. Over time, queries get fired off to
that sqlContext:

// Query the registered table, collect and return
sqlContext.sql(query)
  .collect()


Then, elsewhere, I have processes that insert data into that same table,
like so:

// Create context (conf is a SparkConf defined elsewhere)
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.SQLContext

val ssc = new StreamingContext(conf, Seconds(3600))
val sqlContext = new SQLContext(ssc.sparkContext)

// Bring createParquetFile / createSchemaRDD into scope
import sqlContext._

// Register table (Row here is presumably our own case class, not Spark
// SQL's -- createParquetFile needs a Product)
createParquetFile[Row]("hdfs://somewhere/users/sql/")
  .registerAsTable("mytable")

// Insert into (rdd exists and is an RDD[Row])
createSchemaRDD[Row](rdd)
  .coalesce(1)
  .insertInto("mytable")


In a local test, the first program does pick up the changes the second
program makes. But when deployed with real data, outside of that local test
case, the running table "mytable" doesn't get updated. If I kill the query
program and restart it, it comes back up with the current state of "mytable".

Thoughts?






Re: SchemaRDD - Parquet - "insertInto" makes many files

2014-09-04 Thread DanteSama
Yep, that worked out. Does this solution have any performance implications
beyond all the work being done on (probably) one node?
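
For what it's worth, coalesce also takes a shuffle flag. With the default
shuffle = false, the narrow dependency can pull the upstream computation down
to a single task as well; shuffle = true keeps the upstream stages parallel
and only funnels the final write through one task. A sketch, assuming the
same setup as in the earlier snippets:

// shuffle = true: upstream stays parallel, only the write is single-task
createSchemaRDD[T](rdd)
  .coalesce(1, shuffle = true)
  .insertInto("table")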






SchemaRDD - Parquet - "insertInto" makes many files

2014-09-04 Thread DanteSama
It seems that running insertInto on a SchemaRDD with a ParquetRelation
creates an individual file for each item in the RDD. Sometimes it puts
multiple rows in one file, and sometimes it writes only the column headers.

My question is: is it possible to have it write the entire RDD as one file,
but still have it associated and registered as a table? Right now I'm doing
the following:

// Create the Parquet "file" and register it as a table
// (assumes import sqlContext._ as usual)
createParquetFile[T]("hdfs://somewhere/folder").registerAsTable("table")

// rdd is some existing RDD[T]
val rdd = ...

// Insert the RDD's items into the table
createSchemaRDD[T](rdd).insertInto("table")

However, this ends up with a separate file for each row, named in the format
"part-r-${partition + offset}.parquet" (snagged from ParquetTableOperations >
AppendingParquetOutputFormat).

I know that I can create a single parquet file from an RDD by using
SchemaRDD.saveAsParquetFile, but that prevents me from being able to load a
table once and be aware of any changes.

I'm fine with each insertInto call making a new parquet file in the table
directory, but a file per row is a little over the top... Perhaps there are
Hadoop configs that I'm missing?
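
For reference, the fix that came out of this (see the reply above) was to
coalesce the RDD down to one partition before inserting, since insertInto
appears to write one part-r-*.parquet file per partition of the inserted RDD:

// One partition in, one part file out per insertInto call
createSchemaRDD[T](rdd)
  .coalesce(1)
  .insertInto("table")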


