jar not found :(
Seems if I create a directory symlink so that the share path is the same on
the unix mount point as in Windows, and submit from the drive where the mount
point is, then it works. Granted, that's quite an ugly hack.
Reverting to serving the jar off http (i.e. using a relative
Hi,
I have log files with lines that begin with a time stamp. However those
lines continue onto new lines representing Java stack traces. I want to be
able to search for the line then pull out the corresponding stack traces.
I was thinking of using either take(n) or reduce to 'peek' ahead at
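A minimal sketch of one way to keep each stack trace with its leading timestamped
line, assuming every entry starts with a timestamp such as 2014-11-12 19:07:16
(the path and the regex are illustrative, not from the original post):

val entries = sc.wholeTextFiles("hdfs:///logs/*.log").flatMap { case (_, text) =>
  // split the file before every line that begins with a timestamp,
  // so each record is one timestamped line plus any stack trace that follows it
  text.split("(?m)(?=^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})")
}
val withTraces = entries.filter(_.contains("Exception"))

This only works when whole files fit comfortably in memory on an executor; for
very large files a mapPartitions-based approach would be needed.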
Hi,
Do you mean that with Java, I shouldn't have the Issue class as a property
(attribute) in the Instrument class?
Ex:
class Issue {
int a;
}
class Instrument {
Issue issue;
}
How about Scala? Does it support such user-defined datatypes in classes, e.g. a
case class Issue?
case class Issue(a: Int = 0)
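Yes; a case class can also hold another case class as a field. A minimal sketch:

case class Issue(a: Int = 0)
case class Instrument(issue: Issue = Issue())

val inst = Instrument(Issue(42))
println(inst.issue.a) // prints 42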
Hi
I got a problem when reading a text file which contains nested complex
type data and got a type mismatch problem. Any hint will be appreciated.
The problem takes place at map(s => s.map as type mismatch; found
:
Hi,
Can we pass RDD to functions?
Like, can we do the following?
def func (temp: RDD[String]): RDD[String] = {
//body of the function
}
Thank You
Jeremy,
Did you complete this benchmark in a way that's shareable with those
interested here?
Andrew
On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I'd also be interested in seeing such a benchmark.
On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira
Yes, you can create more and more pipelines with your RDDs.
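For example, a minimal sketch (the transformation and path are illustrative):

import org.apache.spark.rdd.RDD

def func(temp: RDD[String]): RDD[String] = {
  // any ordinary transformations can go in the body
  temp.filter(_.nonEmpty).map(_.trim)
}

val result = func(sc.textFile("hdfs:///some/path"))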
Thanks
Best Regards
On Wed, Nov 12, 2014 at 3:24 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
Can we pass RDD to functions?
Like, can we do the following?
def func (temp: RDD[String]): RDD[String] = {
//body of the
I was about to ask this question.
On Wed, Nov 12, 2014 at 3:42 PM, Andrew Ash and...@andrewash.com wrote:
Jeremy,
Did you complete this benchmark in a way that's shareable with those
interested here?
Andrew
On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas
nicholas.cham...@gmail.com
Hi All,
We are exploring insertion into an RDBMS (SQL Server) through Spark via the
JDBC driver. The excerpt from the code is as follows:
We are doing the insertion inside an action:
Integer res = flatMappedRDD.reduce(new Function2<Integer, Integer, Integer>() {
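A common pattern for JDBC inserts is foreachPartition with one connection per
partition rather than doing them inside reduce. A Scala sketch, assuming an
RDD[String] of rows; the URL, credentials and SQL are placeholders:

rowsRDD.foreachPartition { rows =>
  // one connection and one prepared statement per partition, not per record
  val conn = java.sql.DriverManager.getConnection(
    "jdbc:sqlserver://host:1433;databaseName=mydb", "user", "password")
  val stmt = conn.prepareStatement("INSERT INTO my_table (col) VALUES (?)")
  try {
    rows.foreach { row =>
      stmt.setString(1, row)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}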
Hi list,
In an excellent blog post on Kafka and Spark Streaming integration (
http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/),
Michael Noll poses an assumption about the number of partitions of the RDDs
created by input DStreams. He says his
Hi,
How do I set the above properties on JavaSQLContext? I am not able to see a
setConf method on the JavaSQLContext object.
I have added spark core jar and spark assembly jar to my build path. And I am
using spark 1.1.0 and hadoop 2.4.0
--Naveen
Thanks for your replies.
Actually we can kill a driver with the command bin/spark-class
org.apache.spark.deploy.Client kill <spark-master> <driver-id> if you know
the driver id.
2014-11-11 22:35 GMT+08:00 Ritesh Kumar Singh riteshoneinamill...@gmail.com
:
There is a property :
I think it's OK; feel free to treat an RDD like a common object.
qinwei
From: Deep Pradhan
Date: 2014-11-12 18:24
To: user@spark.apache.org
Subject: Pass RDD to functions
Hi, Can we pass RDD to functions? Like, can we do the following?
def func (temp: RDD[String]): RDD[String] = { //body of the
Hi guys, I am starting to work with Spark from Java and when I run the
following code:
SparkConf conf = new SparkConf().setMaster("spark://10.0.2.20:7077")
.setAppName("SparkTest");
JavaSparkContext sc = new JavaSparkContext(conf);
I received the following error and the Java process exits:
Thanks Akhil.
-Naveen
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Wednesday, November 12, 2014 6:38 PM
To: Naveen Kumar Pokala
Cc: user@spark.apache.org
Subject: Re: Spark SQL configurations
JavaSQLContext.sqlContext.setConf is available.
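For example (the property shown is only an illustration):

javaSqlContext.sqlContext.setConf("spark.sql.shuffle.partitions", "200")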
Thanks
Best Regards
On Wed, Nov 12, 2014
Hi, everyone!
I consider flatMap a narrow dependency, but why does it shuffle?
as shown on the web UI:
my code is as below :
val transferRDD = sc.textFile("hdfs://host:port/path")
val rdd = transferRDD.map(line => {
val trunks = line.split("\t")
Sounds like an ipython notebook issue, not an ISpark one. Might want to reinstall:
pip install ipython[notebook], which will grab the necessary notebook
components like tornado.
Try spinning up the ispark console instead of the notebook to see if the ISpark
kernel is functioning:
ipython console --profile
Hi,
I am facing the following problem when I am trying to save my RDD as a
parquet file.
14/11/12 07:43:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0
(TID 48,): org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
Hi DB,
DB Tsai wrote
I also worry about that the author of JPMML changed the license of
jpmml-evaluator due to his interest of his commercial business, and he
might change the license of jpmml-model in the future.
I am the principal author of the said Java PMML API projects and I want to
Hello,
I'm trying to run a classification task using mllib decision trees. After
successfully training the model, I was trying to test the model using some
sample rows when I hit this exception.
The code snippet that caused this error is :
model = DecisionTree.trainClassifier(parsedData,
I have 2 tables in a hive context, and I want to select one field of each
table where ids of each table are equal. For example,
val tmp2=sqlContext.sql("select a.ult_fecha,b.pri_fecha from
fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id=b.id")
but I get an error:
Hi,
Try adding this in spark-env.sh
export
JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64
export
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64
export
We are running spark on yarn with combined memory 1TB and when trying to
cache a table partition (which is 100G), we are seeing a lot of failed collect
stages in the UI and this never succeeds. Because of the failed collect, it
seems like the mapPartitions keep getting resubmitted. We have more than
What if the window is of 5 seconds, and the file takes longer than 5
seconds to be completely scanned? It will still attempt to load the whole
file?
On Mon, Nov 10, 2014 at 6:24 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:
Entire file in a window.
On Mon, Nov 10, 2014 at 9:20 AM, Saiph
Hi all,
I am trying to run spark with the latest build (from branch-1.2), as far as
I can see, all the paths are set and SparkContext starts up OK, however, I
cannot run anything that goes to the nodes. I get the following error:
Error from python worker:
/usr/bin/python2.7: No module named
Sean,
Thanks a lot for your reply!
A few follow up questions:
1. numIterations should be 100, not 100*trainingSetSize, right?
2. My training set has 90k positive data points (with label 1) and 60k
negative data points (with label 0).
I set my numIterations to 100 as default. I still got the same
OK, it's not class imbalance. Yes, 100 iterations.
My other guess is that the stepSize of 1 is way too big for your data.
I'd suggest you look at the weights / intercept of the resulting model to
see if it makes any sense.
You can call clearThreshold on the model, and then it will 'predict' the
This is a bug, will be fixed by https://github.com/apache/spark/pull/3230
On Wed, Nov 12, 2014 at 7:20 AM, rprabhu rpra...@ufl.edu wrote:
Hello,
I'm trying to run a classification task using mllib decision trees. After
successfully training the model, I was trying to test the model using some
Hi,
This is my code,
import org.apache.hadoop.hbase.CellUtil
/**
* JF: convert a Result object into a string with column family and
qualifier names. Something like
*
'columnfamily1:columnqualifier1:value1;columnfamily2:columnqualifier2:value2'
etc.
* k-v pairs are separated by ';'. different
Yes, you can always specify a minimum number of partitions and that would
force some parallelism (assuming you have enough cores).
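For example, a minimal sketch for a plain (non-streaming) read:

// the second argument asks for at least 8 partitions
val lines = sc.textFile("hdfs:///path/to/file", 8)
println(lines.partitions.length)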
On Wed, Nov 12, 2014 at 9:36 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
What if the window is of 5 seconds, and the file takes longer than 5
seconds to be
As far as I know you basically have two options: let partitions be
recomputed (possibly caching / persisting memory only), or persist to disk
(and memory) and suffer the cost of writing to disk. The question is which
will be more expensive in your case. My experience is you're better off
letting
Please use join syntax.
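For example, using the table names from the question (a sketch):

val tmp2 = sqlContext.sql(
  "select a.ult_fecha, b.pri_fecha " +
  "from fecha_ult_compra_u3m a join fecha_pri_compra_u3m b on a.id = b.id")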
On Wed, Nov 12, 2014 at 8:57 AM, Franco Barrientos
franco.barrien...@exalitica.com wrote:
I have 2 tables in a hive context, and I want to select one field of each
table where id’s of each table are equal. For example,
val tmp2=sqlContext.sql(select
Hi
I'd like to use the result of one RDD1 in another RDD2. Normally I would use
something like a barrier to make the 2nd RDD wait till the computation of the
1st RDD is done, then include the result from RDD1 in the closure for RDD2.
Currently I create another RDD, RDD3, out of the result of RDD1
Hi Nick,
I saw the HBase api has experienced lots of changes. If I remember
correctly, the default hbase in spark 1.1.0 is 0.94.6. The one I am using is
0.98.1. To get the column family names and qualifier names, we need to call
different methods for these two different versions. I don't know how
After comparing with previous code, I got it to work by making the return a Some
instead of a Tuple2. Perhaps some day I will understand this.
spr wrote
--code
val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int,
Time)]) => {
val currentCount = if
Hello all,
I have a really strange thing going on.
I have a test data set with 500K lines in a gzipped csv file.
I have an array of column processors, one for each column in the dataset.
A Processor tracks aggregate state and has a method process(v : String)
I'm calling:
val processors:
Hi all,
I'm trying to read an hbase table using this an example from github (
https://github.com/apache/spark/blob/master/examples/src/main/python/hbase_inputformat.py),
however I have two qualifiers in a column family.
Ex.:
ROW COLUMN+CELL row1 column=f1:1, timestamp=1401883411986,
My understanding is that the reason you have an Option is so you could filter
out tuples when None is returned. This way your state data won't grow forever.
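A minimal sketch of an updateStateByKey update function that drops a key by
returning None (the stream name and the Int counts are illustrative):

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val newCount = state.getOrElse(0) + values.sum
  // returning None removes the key from the state collection
  if (newCount == 0) None else Some(newCount)
}
val stateStream = pairStream.updateStateByKey(updateFunc)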
-Original Message-
From: spr [mailto:s...@yarcdata.com]
Sent: November-12-14 2:25 PM
To: u...@spark.incubator.apache.org
Subject:
You can't use RDDs inside of RDDs, so this won't work anyway. You could
collect the result of RDD1 and broadcast it, perhaps. collect() blocks.
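A minimal sketch of that pattern (names are illustrative):

val resultFromRdd1 = rdd1.collect()          // blocks until RDD1 is computed
val resultBc = sc.broadcast(resultFromRdd1)  // shipped once to each executor

val rdd2 = otherData.map { x =>
  // use resultBc.value inside the closure instead of referencing RDD1 itself
  (x, resultBc.value.length)
}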
On Wed, Nov 12, 2014 at 6:41 PM, Adrian Mocanu amoc...@verticalscope.com
wrote:
Hi
I’d like to use the result of one RDD1 in another RDD2. Normally
This is the log output:
2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation
(Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS
SELECT * FROM xyz where date_prefix = 20141112'
2014-11-12 19:07:17,455 INFO Configuration.deprecation
Hi,
I just cloned Spark from GitHub and I'm trying to build it to generate a
tarball.
I'm doing: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
-DskipTests clean package
Although the build is successful, I don't see the tar.gz generated.
Am I running the wrong command ?
--
Thanks,
Hi Akshat,
If your application is to serve results directly from a SparkContext, you
may want to take a look at http://prediction.io. It integrates Spark with
spray.io (another REST/web toolkit by Typesafe). Some heavy lifting is done
here:
Just making sure but are you looking for the tar in assembly/target dir ?
On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar ashwinshanka...@gmail.com
wrote:
Hi,
I just cloned spark from the github and I'm trying to build to generate a
tar ball.
I'm doing : mvn -Pyarn -Phadoop-2.4
I've also tried setting the aforementioned properties using
System.setProperty() as well as on the command line while submitting the
job using --conf key=value. All to no success. When I go to the Spark UI
and click on that particular streaming job and then the Environment tab,
I can see the
Yes, I'm looking at assembly/target. I don't see the tar ball.
I only see scala-2.10/spark-assembly-1.2.0-SNAPSHOT-hadoop2.4.0.jar
,classes,test-classes,
maven-shared-archive-resources,spark-test-classpath.txt.
On Wed, Nov 12, 2014 at 12:16 PM, Sadhan Sood sadhan.s...@gmail.com wrote:
Just
mvn package doesn't make tarballs. It creates artifacts that will generally
appear in target/ and subdirectories, and likewise within modules. Look at
make-distribution.sh
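For example, something like the following should produce a .tgz (the flags
mirror the mvn command above; check the script's usage on your branch):

./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive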
On Wed, Nov 12, 2014 at 8:14 PM, Ashwin Shankar ashwinshanka...@gmail.com
wrote:
Hi,
I just cloned spark from the github
Can you give us a bit more detail:
hbase release you're using.
whether you can reproduce using hbase shell.
I did the following using hbase shell against 0.98.4:
hbase(main):001:0> create 'test', 'f1'
0 row(s) in 2.9140 seconds
=> Hbase::Table - test
hbase(main):002:0> put 'test', 'row1', 'f1:1',
I think you can provide -Pbigtop-dist to build the tar.
On Wed, Nov 12, 2014 at 3:21 PM, Sean Owen so...@cloudera.com wrote:
mvn package doesn't make tarballs. It creates artifacts that will
generally appear in target/ and subdirectories, and likewise within
modules. Look at
Adrian, do you know if this is documented somewhere? I was also under the
impression that setting a key's value to None would cause the key to be
discarded (without any explicit filtering on the user's part) but can not
find any official documentation to that effect
On Wed, Nov 12, 2014 at 2:43
To my knowledge, Spark 1.1 comes with HBase 0.94
To utilize HBase 0.98, you will need:
https://issues.apache.org/jira/browse/SPARK-1297
You can apply the patch and build Spark yourself.
Cheers
On Wed, Nov 12, 2014 at 12:57 PM, Alan Prando a...@scanboo.com.br wrote:
Hi Ted! Thanks for
I am trying to determine how effective partitioning is at parallelizing my
tasks. So far I suspect that all work is done in one task. My plan is to
create a number of accumulators - one for each task and have functions
increment the accumulator for the appropriate task (or slave) the values
Looking at HBaseResultToStringConverter :
override def convert(obj: Any): String = {
val result = obj.asInstanceOf[Result]
Bytes.toStringBinary(result.value())
}
Here is the code for Result.value():
public byte [] value() {
if (isEmpty()) {
return null;
}
I'm using spark 1.1 and the provided ec2 scripts to start my cluster
(r3.8xlarge machines). From the spark-shell, I can verify that the environment
variables are set
scala> System.getenv("SPARK_LOCAL_DIRS")
res0: String = /mnt/spark,/mnt2/spark
However, when I look on the workers, the directories
Well it looks like this is a scala problem after all. I loaded the file
using pure scala and ran the exact same Processors without Spark and I got
20 seconds (with the code in the same file as the 'main') vs 30 seconds
(with the exact same code in a different file) on the 500K rows.
--
View
I'm loading sequence files containing json blobs in the value, transforming
them into RDD[String] and then using hiveContext.jsonRDD(). It looks like
Spark reads the files twice: once when I define the jsonRDD() and then
again when I actually make my call to hiveContext.sql().
Looking @ the
You are correct; the filtering I’m talking about is done implicitly. You don’t
have to do it yourself. Spark will do it for you and remove those entries from
the state collection.
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: November-12-14 3:50 PM
To: Adrian Mocanu
Cc: spr;
Hi Xiangrui, thank you very much for your response. I looked for the .so as
you suggested.
It is not here:
$ jar tf
assembly/target/spark-assembly_2.10-1.1.0-dist/spark-assembly-1.1.0-hadoop2.4.0.jar
| grep netlib-native_system-linux-x86_64.so
or here:
$ jar tf
There are a few things you can do here:
- Infer the schema on a subset of the data, pass that inferred schema
(schemaRDD.schema) as the second argument of jsonRDD.
- Hand construct a schema and pass it as the second argument including the
fields you are interested in.
- Instead load the data
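A sketch of the first option, assuming a HiveContext named hiveContext and an
RDD[String] named jsonStrings (both names illustrative):

// infer the schema from a small sample, then reuse it when loading the full data
val sampleSchema = hiveContext.jsonRDD(jsonStrings.sample(false, 0.01, 42)).schema
val table = hiveContext.jsonRDD(jsonStrings, sampleSchema)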
forgot to mention, that this setup works in spark standalone mode, only
problem when I run on yarn.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18777.html
Sent from the Apache Spark User List mailing list
...@gmail.com wrote:
This is the log output:
2014-11-12 19:07:16,561 INFO thriftserver.SparkExecuteStatementOperation
(Logging.scala:logInfo(59)) - Running query 'CACHE TABLE xyz_cached AS
SELECT * FROM xyz where date_prefix = 20141112'
2014-11-12 19:07:17,455 INFO Configuration.deprecation
JavaSparkContext currentContext = ...;
Accumulator<Integer> accumulator = currentContext.accumulator(0,
"MyAccumulator");
will create an Accumulator of Integers. For many large Data problems
Integer is too small and Long is a better type.
I see a call like the following
We noticed while caching data from our hive tables which contain data in
compressed sequence file format that it gets uncompressed in memory when
getting cached. Is there a way to turn this off and cache the compressed
data as is ?
That means the -Pnetlib-lgpl option didn't work. Could you use sbt
to build the assembly jar and see whether the .so file is inside the
assembly jar? Which system and Java version are you using? -Xiangrui
On Wed, Nov 12, 2014 at 2:22 PM, jpl jlefe...@soe.ucsc.edu wrote:
Hi Xiangrui, thank you
Interesting result here. I'm trying to parallelize a list for some simple
tests with spark and Ganglia. It seems that spark.parallelize doesn't create
partitions except for on the master node on our cluster. The image below
shows the CPU utilization per node over three tests. The first two compute
Hey all
I am doing a groupby on nearly 2TB of data and I am getting this error:
2014-11-13 00:25:30 ERROR org.apache.spark.MapOutputTrackerMasterActor - Map
output statuses were 32163619 bytes which exceeds spark.akka.frameSize
(10485760 bytes).
org.apache.spark.SparkException: Map output
Hi all,
I have a Spark streaming job which constantly receives messages from Kafka.
I was using Spark 1.0.2 and the job has been running for a month. However,
when I am currently using Spark 1.1.0. the Spark streaming job cannot
receive any messages from Kafka. I have not made any change to the
Hi all,
I have a Spark streaming job which constantly receives messages from Kafka.
I was using Spark 1.0.2 and the job has been running for a month. However,
when I am currently using Spark 1.1.0. the Spark streaming job cannot
receive any messages from Kafka. I have not made any change to the
Bill,
However, when I am currently using Spark 1.1.0. the Spark streaming job
cannot receive any messages from Kafka. I have not made any change to the
code.
Do you see any suspicious messages in the log output?
Tobias
I have figured out that building the fat jar with sbt does not seem to
included the pyspark scripts using the following command:
sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean
publish-local assembly
however the maven command works OK:
mvn -Pdeb -Pyarn -Phadoop-2.3
Hey
Thanks for responding so fast.
I ran the code with the fix and it works great.
Regards,
Rahul
--
View this message in context:
Thanks. Will it work with sbt at some point?
On Thu, 13 Nov 2014 01:03 Xiangrui Meng men...@gmail.com wrote:
You need to use maven to include python files. See
https://github.com/apache/spark/pull/1223 . -Xiangrui
On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote:
I have
Hi all,
Thanks for the information. I am running Spark streaming in a yarn cluster
and the configuration should be correct. I followed the KafkaWordCount to
write the current code three months ago. It has been working for several
months. The messages are in json format. Actually, this code worked
You need to use maven to include python files. See
https://github.com/apache/spark/pull/1223 . -Xiangrui
On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote:
I have figured out that building the fat jar with sbt does not seem to
included the pyspark scripts using the following
I use this command to submit a Spark application to a YARN cluster:
export YARN_CONF_DIR=conf
bin/spark-submit --class Mining
--master yarn-cluster
--executor-memory 512m ./target/scala-2.10/mining-assembly-0.1.jar
In the Web UI, it is stuck on UNDEFINED.
In
Adding a call to rdd.repartition() after randomizing the keys has no effect
either. Code:
//partitioning is done like partitionIdx = f(key) % numPartitions
//we use random keys to get even partitioning
val uniform = other_stream.transform(rdd => {
rdd.map({ kv =>
val k
Hey Jamborta,
What java version did you build the jar with?
2014-11-12 16:48 GMT-08:00 jamborta jambo...@gmail.com:
I have figured out that building the fat jar with sbt does not seem to
included the pyspark scripts using the following command:
sbt/sbt -Pdeb -Pyarn -Phadoop-2.3
Hi,
I am doing a flatMap followed by mapPartitions to do some blocked
operation...flatMap is shuffling data but this shuffle is strictly
shuffling to disk and not over the network right ?
Thanks.
Deb
I have made some progress - the partitioning is very uneven, and everything
goes to one partition. I see that spark partitions by key, so I tried this:
//partitioning is done like partitionIdx = f(key) % numPartitions
//we use random keys to get even partitioning
val uniform =
I am having a problem trying to figure out how to solve a problem. I would
like to stream events from Kafka to my Spark Streaming app and write the
contents of each RDD out to a HDFS directory. Each event that comes into
the app via kafka will be JSON and have an event field with the name of the
I'm missing something simpler (I think). That is, why do I need a Some instead
of Tuple2? Because a Some might or might not be there, but a Tuple2 must be
there? Or something like that?
From: Adrian Mocanu
amoc...@verticalscope.commailto:amoc...@verticalscope.com
You are correct; the
Did you configure the Spark master as local? It should be local[n], n > 1, for
local mode. Besides, there's a Kafka wordcount example in the Spark Streaming
examples; you can try that. I've tested with the latest master; it's OK.
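For example (a sketch):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// at least two local threads: one for the Kafka receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(2))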
Thanks
Jerry
From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Thursday,
Currently there’s no way to cache the compressed sequence file directly.
Spark SQL uses in-memory columnar format while caching table rows, so we
must read all the raw data and convert them into columnar format.
However, you can enable in-memory columnar compression by setting
Hi,
I have a set of input files for a spark program, with each file
corresponding to a logical data partition. What is the API/mechanism to
assign each input file (or a set of files) to a spark partition, when
initializing RDDs?
When I create a Spark RDD pointing to the directory of files, my
Thanks! I used sbt (command below) and the .so file is now there (shown
below). Now that I have this new assembly.jar, how do I run the spark-shell
so that it can see the .so file when I call the kmeans function? Thanks
again for your help with this.
sbt/sbt -Dhadoop.version=2.4.0 -Pyarn
Dear Liu:
I have tested this issue under Spark-1.1.0. The problem is solved under
this newer version.
On Wed, Nov 12, 2014 at 3:18 PM, Bo Liu bli...@cse.ust.hk wrote:
Dear Liu:
Thank you for your reply. I will set up an experimental environment for
spark-1.1 and test it.
On Wed, Nov 12,
It's the exact same API you've already found, and it's documented:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.AccumulatorParam
JavaSparkContext has helper methods for int and double but not long. You
can just make your own little implementation of
The fact that the caching percentage went down is highly suspicious. It
should generally not decrease unless other cached data took its place, or
if unless executors were dying. Do you know if either of these were the
case?
On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld
flatMap would have to shuffle data only if the output RDD is expected to be
partitioned by some key.
RDD[X].flatMap(X => RDD[Y])
If it has to shuffle it should be local.
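One quick way to check is the lineage printed by toDebugString; a narrow
flatMap should not introduce a ShuffledRDD (sketch):

val mapped = sc.textFile("hdfs:///path").flatMap(_.split("\t"))
println(mapped.toDebugString) // no ShuffledRDD should appear for a plain flatMap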
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Nov 13,
I see Javadoc-style documentation but nothing that looks like a code sample.
I tried the following before asking:
public static class LongAccumulableParam implements
AccumulableParam<Long, Long>, Serializable {
@Override
public Long addAccumulator(final Long r, final Long t) {
Look again, the type is AccumulatorParam, not AccumulableParam. But
yes that's what you do.
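In Scala the equivalent is a small AccumulatorParam[Long]; a minimal sketch (a
Java version is analogous, implementing the same two methods):

import org.apache.spark.AccumulatorParam

object LongAccumulatorParam extends AccumulatorParam[Long] {
  def zero(initialValue: Long): Long = 0L
  def addInPlace(r1: Long, r2: Long): Long = r1 + r2
}

val longAcc = sc.accumulator(0L)(LongAccumulatorParam)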
On Thu, Nov 13, 2014 at 4:32 AM, Steve Lewis lordjoe2...@gmail.com wrote:
I see Javadoc Style documentation but nothing that looks like a code sample
I tried the following before asking
public
Sorry, I think I was not clear in what I meant.
I didn't mean it went down within a run, with the same instance.
I meant I'd run the whole app, and one time, it would cache 100%, and the
next run, it might cache only 83%
Within a run, it doesn't change.
On Wed, Nov 12, 2014 at 11:31 PM, Aaron
As of now my approach is to fetch all data from tables located in different
databases in separate RDDs and then make a union of them and then query
them together. I want to know whether I can perform a query on it directly
along with creating an RDD. i.e. Instead of creating two RDDs , firing
We have all our data in Cassandra so I’d prefer to not have to bring up
Hadoop/HDFS as that’s just another thing that can break.
But I’m reading that spark requires a shared filesystem like HDFS or S3…
Can I use Tachyon or this or something simple for a shared filesystem?
--
Founder/CEO
Hi Kevin,
Yes, Spark can read and write to Cassandra without Hadoop. Have you seen
this:
https://github.com/datastax/spark-cassandra-connector
Harold
On Wed, Nov 12, 2014 at 9:28 PM, Kevin Burton bur...@spinn3r.com wrote:
We have all our data in Cassandra so I’d prefer to not have to bring
I also discussed this with Liancheng two weeks ago. He suggested using
toLocalIterator to collect partitions of the RDD to the driver (same order if the
RDD is sorted), and then turning each partition into an RDD and putting them in the queue.
So: To turn RDD[(timestamp, value)] to DStream
1) Group by
Yup, it is very important that n > 1 for Spark Streaming jobs; if local, use
local[2].
The thing to remember is that your Spark receiver will take a thread to itself
and produce data, so you need another thread to consume it.
In a cluster manager like YARN or Mesos, the word thread is not used
Yes. That’s what I was planning on using actually.
I was just curious whether intermediate data had to be kept in HDFS but
this answers my question. thanks.
On Wed, Nov 12, 2014 at 9:33 PM, Harold Nguyen har...@nexgate.com wrote:
Hi Kevin,
Yes, Spark can read and write to Cassandra without
Spark's scheduling is pretty simple: it will allocate tasks to open cores
on executors, preferring ones where the data is local. It even performs
delay scheduling, which means waiting a bit to see if an executor where
the data resides locally becomes available.
Are your tasks seeing very skewed
Hi,
I have two RDDs A and B which are created from reading file from HDFS.
I have a third RDD C which is created by taking join of A and B. All three
RDDs (A, B and C ) are not cached.
Now if I perform any action on C (let say collect), action is served without
reading any data from the disk.
Hi Xiangrui,
All is well. Got it working now, I just recompiled with sbt with the
additional package flag and that created all the /bin files. Then when I
start spark-shell, the webUI environment shows the assembly jar is in Spark's
classpath entries and now the kmeans function finds it -- no