Sean
DStream.saveAsTextFiles internally calls foreachRDD and saveAsTextFile for
each batch interval:
def saveAsTextFiles(prefix: String, suffix: String = "") {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsTextFile(file)
  }
  this.foreachRDD(saveFunc)
}
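From the application side, a minimal usage sketch might look like this (the host, port, paths and batch interval are made-up example values, assuming a spark-shell where sc is already defined):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
// one output directory is written per batch, named from the prefix, the batch time and the suffix
lines.saveAsTextFiles("hdfs:///tmp/stream-output/part", "txt")
ssc.start()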
Hi
I wanted to store DataFrames as partitioned Hive tables. Is there a way to
do this via the saveAsTable call? The set of options does not seem to be
documented.
def saveAsTable(tableName: String, source: String, mode: SaveMode, options: Map[String, String]): Unit
(Scala-specific) Creates a
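For illustration, a hedged sketch of writing a partitioned table, assuming Spark 1.4+ where the DataFrameWriter API with partitionBy is available; df is assumed to be an existing DataFrame, and the table name, columns and format are made up:

import org.apache.spark.sql.SaveMode

// write df as a Hive table partitioned by the (hypothetical) year and month columns
df.write
  .format("parquet")
  .partitionBy("year", "month")
  .mode(SaveMode.Overwrite)
  .saveAsTable("events_partitioned")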
Hi
Shark supported both the HiveServer1 and HiveServer2 thrift interfaces
(using $ bin/shark -service sharkserver[1 or 2]).
Spark SQL seems to support only HiveServer2. I was wondering what is involved
in adding support for HiveServer1. Is this something straightforward that I
can embark on?
Hi Landon
I had a problem very similar to yours, where we had to process around 5
million relatively small files on NFS. After trying various options, we did
something similar to what Matei suggested:
1) take the original path and find the subdirectories under that path and
then parallelize the
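A rough sketch of that subdirectory approach (the mount point, the per-file handling and the partitioning are hypothetical, assuming a spark-shell where sc is defined and each worker can see the NFS mount):

import java.io.File

// enumerate subdirectories on the driver, parallelize the list, and let each
// task read the files under its assigned subdirectory locally over NFS
val rootPath = "/mnt/nfs/data"  // example mount point
val subDirs  = new File(rootPath).listFiles().filter(_.isDirectory).map(_.getPath)

val records = sc.parallelize(subDirs, subDirs.length).flatMap { dir =>
  new File(dir).listFiles().filter(_.isFile).map { f =>
    (f.getPath, scala.io.Source.fromFile(f).mkString)
  }
}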
This is my first implementation. There are a few rough edges, but when I run
this I get the following exception. The class extends Partitioner which in
turn extends Serializable. Any idea what I am doing wrong?
scala> res156.partitionBy(new EqualWeightPartitioner(1000, res156, weightFunction))
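For comparison, a minimal self-contained custom Partitioner sketch; the class name and the key-to-partition map are made up, and it deliberately captures only plain values computed on the driver, since capturing an RDD or other driver-side objects inside a Partitioner is a common cause of serialization errors:

import org.apache.spark.Partitioner

// hypothetical sketch: weights are reduced to a key -> partition map on the
// driver before the partitioner is constructed
class WeightMapPartitioner(numParts: Int, keyToPartition: Map[String, Int])
    extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int =
    keyToPartition.getOrElse(key.asInstanceOf[String], 0)
}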
Hi
I am using Spark to distribute computationally intensive tasks across the
cluster. Currently I partition my RDD of tasks randomly. There is a large
variation in how long each job takes to complete, leading to most
partitions being processed quickly while a couple of partitions take
Hi
I am calling a C++ library from Spark using JNI. Occasionally the C++
library causes the JVM to crash. The task terminates on the MASTER, but the
driver does not return. I am not sure why the driver does not terminate. I
also notice that after such an occurrence, I lose some workers
Christopher
Sorry, I might be missing the obvious, but how do I get my function called on
all executors used by the app? I don't want to use RDDs unless necessary.
Once I start my shell or app, how do I get
TaskNonce.getSingleton().doThisOnce() executed on each executor?
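One commonly suggested workaround is to spread a dummy RDD across the cluster and make the call inside foreachPartition. This is only an approximation (several partitions can land on the same executor, so TaskNonce is assumed to guard against repeat runs), and the parallelism here is just an example:

// run the singleton call on every task slot rather than strictly once per executor
val slots = sc.defaultParallelism
sc.parallelize(1 to slots, slots).foreachPartition { _ =>
  TaskNonce.getSingleton().doThisOnce()
}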
@dmpour
Hi
Is there a way in Spark to run a function on each executor just once? I have
a couple of use cases.
a) I use an external library that is a singleton. It keeps some global state
and provides some functions to manipulate it (e.g. reclaim memory, etc.). I
want to check the global state of this
Jaonary
val loadedData: RDD[(String, (String, Array[Byte]))] =
  sc.objectFile(yourObjectFileName)
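For completeness, a sketch of the full save/reload round trip; the path and the sample record are placeholders, assuming a spark-shell where sc is defined:

import org.apache.spark.rdd.RDD

val path = "hdfs:///tmp/my-object-file"  // placeholder path
val data: RDD[(String, (String, Array[Byte]))] =
  sc.parallelize(Seq(("key1", ("meta1", Array[Byte](1, 2, 3)))))
data.saveAsObjectFile(path)

val reloaded: RDD[(String, (String, Array[Byte]))] = sc.objectFile(path)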
Deenar
Matei
It turns out that saveAsObjectFile(), saveAsSequenceFile() and
saveAsHadoopFile() currently do not pick up the Hadoop settings, as Aureliano
found out in this post:
http://apache-spark-user-list.1001560.n3.nabble.com/Turning-kryo-on-does-not-decrease-binary-output-tp212p249.html
Deenar
Aureliano
Apologies for hijacking this thread.
Matei
On the subject of processing lots (millions) of small input files on HDFS,
what are the best practices to follow in Spark? Currently my code looks
something like this. Without coalesce there is one task and one output file
per input file.
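A hypothetical sketch of the pattern described (not the original code, which was not included here; the paths and the partition count are made up):

// read many small files, then coalesce so the job does not write one output
// file per input file
val raw = sc.textFile("hdfs:///data/small-files/*")
raw.coalesce(100)
   .saveAsTextFile("hdfs:///data/combined")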