I am a beginner to Scala/Spark. Could you please elaborate on how to make an
RDD of the results of func() and collect it?
On Tue, Feb 24, 2015 at 2:27 PM, Sean Owen so...@cloudera.com wrote:
They aren't the same 'lst'. One is on your driver. It gets copied to
executors when the tasks are executed.
It's something like the average error in rating, but a bit different
-- it's the square root of the average squared error. But if you think of
the ratings as 'stars', you could roughly read 0.86 as 'generally
off by 0.86 stars', and that would be somewhat right.
Whether that's good depends on what
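For concreteness, a minimal sketch of how such an RMSE is typically computed, assuming an RDD of (prediction, actual rating) pairs built from the model's predictions; the names here are illustrative, not from the original code:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// predsAndRatings: one (predictedRating, actualRating) pair per test example
def rmse(predsAndRatings: RDD[(Double, Double)]): Double = {
  val meanSquaredError = predsAndRatings
    .map { case (pred, actual) => (pred - actual) * (pred - actual) } // squared error
    .mean()                                                           // average it
  math.sqrt(meanSquaredError)                                         // root of the mean
}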
Instead of
...foreach {
  edgesBySrc => {
    lst ++= func(edgesBySrc)
  }
}
try
...flatMap { edgesBySrc => func(edgesBySrc) }
or even more succinctly
...flatMap(func)
This returns an RDD that basically has the list you are trying to
build, I believe.
You can collect() to the driver but
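A minimal sketch of that suggestion, assuming `edges` is the RDD being processed and func returns an Iterable of results (names are illustrative):

val resultsRdd = edges.flatMap(edgesBySrc => func(edgesBySrc))
val results = resultsRdd.collect() // brings the flattened results back to the driver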
Thanks, but it still doesn't seem to work.
Below is my entire code.
var mp = scala.collection.mutable.Map[VertexId, Int]()
var myRdd = graph.edges.groupBy[VertexId](f).flatMap {
  edgesBySrc => func(edgesBySrc, a, b)
}
myRdd.foreach {
  node => {
    mp(node) = 1
  }
}
Yep, much better with 0.1.
The best model was trained with rank = 12, lambda = 0.1, and numIter =
20, and its RMSE on the test set is 0.869092 (Spark 1.3.0)
Question: What is the intuition behind an RMSE of 0.86 vs 1.3? I know the
smaller the better. But is it that much better? And what is a good
If you're thinking of the YARN memory overhead, then yes, I have increased
that as well. However, I'm glad to say that my job finally finished
successfully. Besides the timeout and memory settings, performing repartitioning
(with shuffling) at the right time seems to be the key to making this large
job
Hi all,
I have been running a simple SQL program on Spark. To test concurrency,
I created 10 threads inside the program, all threads using the same
SQLContext object. When I ran the program on my EC2 cluster using
spark-submit, only 3 threads were running in parallel. I have repeated the
println occurs on the machine where the task executes, which may or
may not be the same as your local driver process. collect()-ing brings
data back to the driver, so printing there definitely occurs on the
driver.
On Tue, Feb 24, 2015 at 9:48 AM, patcharee patcharee.thong...@uni.no wrote:
Hi,
Hi Sreeharsha,
My data is in HDFS. I am trying to use Spark HiveContext (instead of
SQLContext) to fire queries on my data just because HiveContext supports
more operations.
Sreeharsha wrote
Change Derby to MySQL and check once; I faced the same issue too.
I am pretty new to Spark and
So say I want to calculate the top K users visiting a page in the past 2 hours,
updated every 5 mins.
so here I want to maintain something like this
Page_01 = {user_01:32, user_02:3, user_03:7...}
...
Basically a count of the number of times a user visited a page. Here my key is
the page name/id and the state
You can use a reduceByKeyAndWindow with your specific time window. You can
specify the inverse function in reduceByKeyAndWindow.
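A minimal sketch of that pattern, assuming `events` is a DStream[(String, String)] of (pageId, userId) pairs and `ssc` is the StreamingContext; names, paths, and durations are illustrative:

import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint("checkpoint-dir") // required when using the inverse-reduce form

val perPageUserCounts = events
  .map { case (page, user) => ((page, user), 1) }
  .reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b, // add counts as batches enter the window
    (a: Int, b: Int) => a - b, // subtract counts as batches leave the window
    Minutes(120),              // window length: 2 hours
    Minutes(5))                // slide interval: 5 minutes

From there you can compute the top K users per page on each slide, e.g. by grouping the windowed counts by page.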
On Tue, Feb 24, 2015 at 1:36 PM, Ashish Sharma ashishonl...@gmail.com
wrote:
So say I want to calculate top K users visiting a page in the past 2 hours
updated every
My issue is posted here on stack-overflow. What am I doing wrong here?
http://stackoverflow.com/questions/28689186/facing-error-while-extending-scala-class-with-product-interface-to-overcome-limi
I think this could be of some help to you.
https://issues.apache.org/jira/browse/SPARK-3660
On Tue, Feb 24, 2015 at 2:18 AM, Matus Faro matus.f...@kik.com wrote:
Hi,
Our application is being designed to operate at all times on a large
sliding window (day+) of data. The operations
Hi,
I would like to print the content of an RDD[String]. I tried
1) linesWithSpark.foreach(println)
2) linesWithSpark.collect().foreach(println)
I submitted the job by spark-submit. 1) did not print, but 2) did.
But when I used the shell, both 1) and 2) printed.
Any ideas why 1) behaves
I assume this is a difference between your local driver classpath and
remote worker classpath. It may not be a question of whether the class
is there, but classpath visibility issues. Have you looked into
settings like spark.files.userClassPathFirst?
On Tue, Feb 24, 2015 at 4:43 AM, necro351 .
Thanks Akhil.
Not sure whether the low-level consumer will be officially supported by Spark
Streaming. So far, I don't see it mentioned/documented in the Spark Streaming
programming guide.
bit1...@163.com
From: Akhil Das
Date: 2015-02-24 16:21
To: bit1...@163.com
CC: user
Subject: Re: Many
There is a newly introduced JDBC data source in Spark 1.3.0 (not the
JdbcRDD in Spark core), which may be useful. However, currently there's
no SQL Server-specific logic implemented. I'd assume standard SQL
queries should work.
Cheng
On 2/24/15 7:02 PM, Suhel M wrote:
Hey,
I am trying to
Did you happen to have a look at
https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
Thanks
Best Regards
On Tue, Feb 24, 2015 at 3:39 PM, anu anamika.guo...@gmail.com wrote:
My issue is posted here on stack-overflow. What am I doing wrong
Hi all,
Maybe some of you are interested: I wrote a new guide on how to start using
Spark from Clojure. The tutorial covers
* setting up a project,
* doing REPL- or Test Driven Development of Spark jobs
* Running Spark jobs locally.
Just read it on
Hi,
I have just signed up for Amazon AWS because I learnt that it provides
service for free for the first 12 months.
I want to run Spark on EC2 cluster. Will they charge me for this?
Thank You
The free tier includes 750 hours of t2.micro instance time per month.
http://aws.amazon.com/free/
That's basically a month of hours, so it's all free if you run one
instance only at a time. If you run 4, you'll be able to run your
cluster of 4 for about a week free.
A t2.micro has 1GB of memory,
Thank You Sean.
I was just trying to experiment with the performance of Spark applications
with various worker instances (I hope you remember that we discussed
the worker instances).
I thought it would be a good one to try on EC2. So it doesn't work out,
does it?
Thank You
On Tue, Feb 24,
Great to know, thanks Xiangrui.
*Sebastián Ramírez*
Diseñador de Algoritmos
http://www.senseta.com
Tel: (+571) 795 7950 ext: 1012
Cel: (+57) 300 370 77 10
Calle 73 No 7 - 06 Piso 4
Linkedin: co.linkedin.com/in/tiangolo/
Twitter: @tiangolo https://twitter.com/tiangolo
No, I think I am ok with the time it takes.
It's just that, as I increase the partitions along with the number of
workers, I want to see the improvement in the performance of
the application.
I just want to see this happen.
Any comments?
Thank You
On Tue, Feb 24, 2015 at 8:52
Hi John,
This would be a potential application for the Spark Kernel project (
https://github.com/ibm-et/spark-kernel). The Spark Kernel serves as your
driver application, allowing you to feed it snippets of code (or load up
entire jars via magics) in Scala to execute against a Spark cluster.
Hi,
I am sorry that I made a mistake on AWS pricing. You can read the email from
Sean Owen, which explains the strategies for running Spark on AWS better.
For your question: it means that you just download Spark and unzip it. Then
run the Spark shell via ./bin/spark-shell or ./bin/pyspark. It is useful to get
But how will I specify my state there?
On Tue, Feb 24, 2015 at 12:50 AM Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
You can use a reduceByKeyAndWindow with your specific time window. You can
specify the inverse function in reduceByKeyAndWindow.
On Tue, Feb 24, 2015 at 1:36 PM, Ashish
Thank You Akhil. Will look into it.
It's free, isn't it? I am still a student :)
On Tue, Feb 24, 2015 at 9:06 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
If you signup for Google Compute Cloud, you will get free $300 credits for
3 months and you can start a pretty good cluster for your
Hi Suhel,
My team is currently working with a lot of SQL Server databases as one of
our many data sources and ultimately we pull the data into HDFS from SQL
Server. As we had a lot of SQL databases to hit, we used the jTDS driver
and SQOOP to extract the data out of SQL Server and into HDFS
You can definitely, easily, try a 1-node standalone cluster for free.
Just don't be surprised when the CPU capping kicks in within about 5
minutes of any non-trivial computation and suddenly the instance is
very s-l-o-w.
I would consider just paying the ~$0.07/hour to play with an
m3.medium,
This should help you understand the cost of running a Spark cluster for a
short period of time:
http://www.ec2instances.info/
If you run an instance for even 1 second of a single hour you are charged
for that complete hour. So before you shut down your miniature cluster make
sure you really are
If you signup for Google Compute Cloud, you will get free $300 credits for
3 months and you can start a pretty good cluster for your testing purposes.
:)
Thanks
Best Regards
On Tue, Feb 24, 2015 at 8:25 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I have just signed up for Amazon AWS
Yes it is :)
Thanks
Best Regards
On Tue, Feb 24, 2015 at 9:09 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Thank You Akhil. Will look into it.
Its free, isn't it? I am still a student :)
On Tue, Feb 24, 2015 at 9:06 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
If you signup for
Kindly bear with my questions as I am new to this.
If you run spark on local mode on a ec2 machine
What does this mean? Is it that I launch the Spark cluster from my local
machine, i.e., by running the shell script that is in /spark/ec2?
On Tue, Feb 24, 2015 at 8:32 PM, gen tang
Hi,
As a real Spark cluster needs at least one master and one slave, you need
to launch two machines. Therefore the second machine is not free.
However, if you run Spark in local mode on a single EC2 machine, it is free.
The AWS charge depends on how many machines you launch and their types,
Thank You All.
I think I will look into paying ~$0.07/hr as Sean suggested.
On Tue, Feb 24, 2015 at 9:01 PM, gen tang gen.tan...@gmail.com wrote:
Hi,
I am sorry that I made a mistake on AWS pricing. You can read the email from
Sean Owen, which explains the strategies for running Spark on AWS better.
Interesting. Accumulators are shown on Web UI if you are using the
ordinary SparkContext (Spark 1.2). It just has to be named (and that's
what you did).
scala> val acc = sc.accumulator(0, "test accumulator")
acc: org.apache.spark.Accumulator[Int] = 0
scala> val rdd = sc.parallelize(1 to 1000)
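A minimal continuation sketch (not part of the original message) that exercises the named accumulator so it shows up on the stage page in the Web UI:

rdd.foreach(x => acc += x) // runs on the executors; updates flow back to the driver
println(acc.value)         // the accumulated value is only readable on the driver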
Hi,
I finally solved the problem by setting spark.yarn.executor.memoryOverhead
with the option --conf spark.yarn.executor.memoryOverhead= for
spark-submit, as pointed out in
http://stackoverflow.com/questions/28404714/yarn-why-doesnt-task-go-out-of-heap-space-but-container-gets-killed
and
I have been posting on the Mesos list, as I am looking to see whether
it's possible or not to share Spark drivers. Obviously, in standalone
cluster mode, the Master handles requests, and you can
instantiate a new SparkContext against a currently running master. However
in Mesos (and perhaps YARN) I
Hello,
I have built Spark 1.3. I can successfully use the dataframe api. However, I
am not able to find its api documentation in Python. Do you know when the
documentation will be available?
Best Regards,
poiuytrez
Hello Subacini,
Until someone more knowledgeable suggests a better, more straightforward,
and simpler approach with a working code snippet, I suggest the following
workaround / hack:
inputStream.foreachRDD(rdd => {
  val myStr = rdd.toDebugString
  // process myStr string value, e.g. using
Tathagata, yes, I was using StreamingContext.getOrCreate. My question is
about the design decision here. I was expecting that if I have a streaming
application that say crashed, and I wanted to give the executors more
memory, I would be able to restart, using the checkpointed RDD but with
more
I'm running a cluster of 3 Amazon EC2 machines (small number because it's
expensive when experiments keep crashing after a day!).
Today's crash looks like this (stacktrace at end of message).
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0
On my
Hi Imran,
I will say your explanation is extremely helpful :)
I tested some ideas according to your explanation and it makes perfect sense to
me. I modified my code to use cogroup+mapValues instead of union+reduceByKey to
preserve the partitioning, which gives me more than 100% performance gain
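For readers following the thread, a minimal sketch of the cogroup+mapValues idea described above, with illustrative data; the key point is that both RDDs share a partitioner and mapValues preserves it:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

val part = new HashPartitioner(8)
val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(part)
val rdd2 = sc.parallelize(Seq(("a", 3), ("c", 4))).partitionBy(part)

// cogroup of two RDDs with the same partitioner avoids an extra shuffle, and
// mapValues (unlike map) keeps the partitioner for downstream stages.
val merged = rdd1.cogroup(rdd2).mapValues { case (vs1, vs2) => vs1.sum + vs2.sum }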
Hi Colin,
Here is how I have configured my Hadoop cluster to have YARN logs available
through both the YARN CLI and the YARN history server (with gzip compression
and 10 days retention):
1. Add the following properties in the yarn-site.xml on each node manager and
on the resource manager:
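(The original property list was cut off. The snippet below is only a sketch of the kind of properties typically involved in YARN log aggregation with gzip compression and 10-day retention, not the poster's exact configuration:)

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>864000</value> <!-- 10 days -->
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.compression-type</name>
  <value>gz</value>
</property>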
Looks like in my tired state, I didn't mention spark the whole time.
However, it might be implied by the application log above. Spark log
aggregation appears to be working, since I can run the yarn command above.
I do have yarn logging setup for the yarn history server. I was trying to
use the
the spark history server and the yarn history server are totally
independent. Spark knows nothing about yarn logs, and vice versa, so
unfortunately there isn't any way to get all the info in one place.
On Tue, Feb 24, 2015 at 12:36 PM, Colin Kincaid Williams disc...@uw.edu
wrote:
Looks like in
So back to my original question.
I can see the spark logs using the example above:
yarn logs -applicationId application_1424740955620_0009
This shows yarn log aggregation working. I can see the std out and std
error in that container information above. Then how can I get this
information in a
Sorry for the mistake, I actually have it this way:
val myObject = new MyObject();
val myObjectBroadcasted = sc.broadcast(myObject);
val rdd1 = sc.textFile("/file1").map(e =>
{
myObjectBroadcasted.value.insert(e._1);
(e._1,1)
});
rdd.cache.count(); //to make sure it is transformed.
val rdd2 =
Shark used to have shark.map.tasks variable. Is there an equivalent for
Spark SQL?
We are trying a scenario with heavily partitioned Hive tables. We end up
with a UnionRDD with a lot of partitions underneath and hence too many
tasks:
You're not using the broadcast variable within your map operations. You're
attempting to modify myObject directly, which won't work because you are
modifying the serialized copy on the executor. You want to do
myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup.
Sent with
Hi all,
I am trying to do the following.
val myObject = new MyObject();
val myObjectBroadcasted = sc.broadcast(myObject);
val rdd1 = sc.textFile("/file1").map(e =>
{
myObject.insert(e._1);
(e._1,1)
});
rdd.cache.count(); //to make sure it is transformed.
val rdd2 = sc.textFile("/file2").map(e =>
{
I am aware of that, but two things are working against me here with
spark-kernel. Python is our language, and we are really looking for a
supported way to approach this for the enterprise. I like the
concept, it just doesn't work for us given our constraints.
This does raise an interesting point
It's hard to tell. I have not run this on EC2 but this worked for me:
The only thing that I can think of is that the scheduling mode is set to
- *Scheduling Mode:* FAIR
val pool: ExecutorService = Executors.newFixedThreadPool(poolSize)
while_loop to get curr_job
pool.execute(new
Hi,
I plan to run a parameter search varying the number of cores, epoch, and
parallelism. The web console provides a way to archive the previous runs,
though is there a way to view in the console the throughput? Rather than
logging the throughput separately to the log files and correlating the
Usually it happens on Linux when an application deletes a file without double
checking that there are no open FDs (resource leak). In this case, Linux
holds all the allocated space and does not release it until the application exits
(crashes in your case). You check the file system and everything is normal, you
have
Here is a tool which may give you some clue:
http://file-leak-detector.kohsuke.org/
Cheers
On Tue, Feb 24, 2015 at 11:04 AM, Vladimir Rodionov
vrodio...@splicemachine.com wrote:
Usually it happens in Linux when application deletes file w/o double
checking that there are no open FDs (resource
Hi there,
I assume you are using spark 1.2.1 right?
I faced the exact same issue and switched to 1.1.1 with the same
configuration and it was solved.
On 24 Feb 2015 19:22, Ted Yu yuzhih...@gmail.com wrote:
Here is a tool which may give you some clue:
http://file-leak-detector.kohsuke.org/
They aren't the same 'lst'. One is on your driver. It gets copied to
executors when the tasks are executed. Those copies are updated. But
the updates will never reflect in the local copy back in the driver.
You may just wish to make an RDD of the results of func() and
collect() them back to the
If you make `Image` a case class, then select("image.data") should work.
On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
I have a DataFrame that contains a user defined type. The type is an image
with the following attribute
class Image(w: Int, h: Int,
Hi Experts,
My Spark Job is failing with below error.
From the logs I can see that input-3-1424842351600 was added at 5:32:32 and
was never purged out of memory. Also the available free memory for the
executor is *2.1G*.
Please help me figure out why executors cannot fetch this input.
Txz for
RDD.foreach runs on the executors. You should use `collect` to fetch the data
to the driver. E.g.,
myRdd.collect().foreach {
  node => {
    mp(node) = 1
  }
}
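A related sketch (an alternative, not part of the original reply): the same map can be built in one step on the driver without a mutable Map:

val mp = myRdd.map(node => (node, 1)).collectAsMap()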
Best Regards,
Shixiong Zhu
2015-02-25 4:00 GMT+08:00 Vijayasarathy Kannan kvi...@vt.edu:
Thanks, but it still doesn't seem to
Hi Denny,
yes the user has all the rights to HDFS. I am running all the spark
operations with this user.
and my hive-site.xml looks like this
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the
I'm also facing the same issue.
I tried the configurations but it seems the executors' Spark
log4j.properties override the passed values, so you have to change
/etc/spark/conf/log4j.properties.
Let me know if any of you have managed to get this fixed programmatically.
I am planning to
bq. depend on missing fastutil classes like Long2LongOpenHashMap
Looks like Long2LongOpenHashMap should be added to the shaded jar.
Cheers
On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com wrote:
Spark includes the clearspring analytics package but intentionally excludes
Hi all,
The Hadoop Summit uses community choice voting to decide which talks to
feature. It would be great if the community could help vote for Spark talks
so that Spark has a good showing at this event. You can make three votes on
each track. Below I've listed 3 talks that are important to
The error message you have is:
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory or
unable to create one)
Could you verify that you (the user you are running under) have the rights
to create
Hi All,
I am running a simple word count example of Spark (standalone cluster). In the
UI it is showing:
For each worker the number of available cores is 32, but while running the jobs only
5 cores are being used.
What should I do to increase the number of cores used, or is it selected based on the jobs?
Thanks
That's all you should need to do. Saying this, I did run into an issue
similar to this when I was switching Spark versions which were tied to
different default Hive versions (eg Spark 1.3 by default works with Hive
0.13.1). I'm wondering if you may be hitting this issue due to that?
On Tue, Feb
I have a strong suspicion that it was caused by a full disk on the executor.
I am not sure if the executor was supposed to recover from that.
I cannot be sure about it; I should have had enough disk space, but I think
I had some data skew which could have led some executor to run out
Try adding --total-executor-cores 5 , where 5 is the number of cores.
Thanks,
Vishnu
On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya
somnath_pand...@infosys.com wrote:
Hi All,
I am running a simple word count example of spark (standalone cluster) ,
In the UI it is showing
For each
You can set the following in the conf while creating the SparkContext (if
you are not using spark-submit)
.set("spark.cores.max", "32")
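For context, a fuller sketch of where that setting goes, assuming you construct the context yourself rather than relying on spark-submit flags (the app name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.cores.max", "32") // cap on the total cores the application may use
val sc = new SparkContext(conf)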
Thanks
Best Regards
On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya
somnath_pand...@infosys.com wrote:
Hi All,
I am running a simple word count
My Hadoop version is Hadoop 2.5.0-cdh5.3.0
From the Driver logs [3] I can see that SparkUI started on a specified
port, also my YARN app tracking URL[1] points to that port which is in turn
getting redirected to the proxy URL[2] which gives me
java.net.BindException: Cannot assign requested
No problem, Joe. There you go
https://issues.apache.org/jira/browse/SPARK-5081
And also there is this one https://issues.apache.org/jira/browse/SPARK-5715
which is marked as resolved
On 24 February 2015 at 21:51, Joe Wass jw...@crossref.org wrote:
Thanks everyone.
Yiannis, do you know if
Hi Sathish,
The current implementation of countByKey uses reduceByKey:
https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332
It seems that countByKey is mostly deprecated:
https://issues.apache.org/jira/browse/SPARK-3994
-Jey
On Tue,
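To make the comparison concrete, here is a minimal sketch of the two approaches with illustrative data, assuming sc is available (e.g. in the spark-shell):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// Action: counts every key and returns the whole result to the driver.
val viaCountByKey = pairs.countByKey() // Map(a -> 2, b -> 1)

// Transformation: the counts stay distributed as an RDD, so this scales to many distinct keys.
val viaReduceByKey = pairs.mapValues(_ => 1L).reduceByKey(_ + _)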
The official documentation will be posted when 1.3 is released (early
March).
Right now, you can build the docs yourself by running `jekyll build` in
docs/. Alternatively, just look at dataframe.py as Ted pointed out.
On Tue, Feb 24, 2015 at 6:56 AM, Ted Yu yuzhih...@gmail.com wrote:
Have you
Hi Sean,
I launched the spark-shell on the same machine as I started YARN service. I
don't think port will be an issue.
I am new to spark. I checked the HDFS web UI and the YARN web UI. But I
don't know how to check the AM. Can you help?
Thanks,
David
On Tue, Feb 24, 2015 at 8:37 PM Sean
Btw, the correct syntax for alias should be
`df.select($"image.data".as("features"))`.
On Tue, Feb 24, 2015 at 3:35 PM, Xiangrui Meng men...@gmail.com wrote:
If you make `Image` a case class, then select("image.data") should work.
On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa jaon...@gmail.com
Thanks for sharing, Chris.
On Tue, Feb 24, 2015 at 4:39 AM, Christian Betz
christian.b...@performance-media.de wrote:
Hi all,
Maybe some of you are interested: I wrote a new guide on how to start
using Spark from Clojure. The tutorial covers
- setting up a project,
- doing REPL-
Hi,
I am trying to use the fair scheduler pools
(http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools)
to schedule two jobs at the same time.
In my simple example, I have configured spark in local mode with 2 cores
(local[2]). I have also configured two pools in
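A minimal sketch of how jobs are usually assigned to pools from separate threads; the pool names are illustrative and must match the ones declared in the fair scheduler XML file:

new Thread {
  override def run(): Unit = {
    sc.setLocalProperty("spark.scheduler.pool", "pool1")
    sc.parallelize(1 to 1000000).count() // this job is scheduled in pool1
  }
}.start()

new Thread {
  override def run(): Unit = {
    sc.setLocalProperty("spark.scheduler.pool", "pool2")
    sc.parallelize(1 to 1000000).count() // this one runs concurrently in pool2
  }
}.start()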
Added - thanks! I trimmed it down a bit to fit our normal description length.
On Mon, Jan 5, 2015 at 8:24 AM, Thomas Stone tho...@prediction.io wrote:
Please can we add PredictionIO to
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
PredictionIO
http://prediction.io/
By doing so, I got the following error :
Exception in thread "main" org.apache.spark.sql.AnalysisException: GetField
is not valid on fields
Seems that it doesn't like the image.data expression.
On Wed, Feb 25, 2015 at 12:37 AM, Xiangrui Meng men...@gmail.com wrote:
Btw, the correct syntax for alias
Another way to see the Python docs:
$ export PYTHONPATH=$SPARK_HOME/python
$ pydoc pyspark.sql
On Tue, Feb 24, 2015 at 2:01 PM, Reynold Xin r...@databricks.com wrote:
The official documentation will be posted when 1.3 is released (early
March).
Right now, you can build the docs yourself by
I've added it, thanks!
On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
Could you please add Big Industries to the Powered by Spark page at
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ?
Company Name: Big Industries
URL:
Hello,
Thanks for adding, but the URL seems to have a typo: when I click it, it tries
to open
http//www.bigindustries.be/
But it should be:
http://www.bigindustries.be/
Kind regards,
Emre Sevinç
On Feb 25, 2015 12:29 AM, Patrick Wendell pwend...@gmail.com wrote:
Hello,
Quick question. I am trying to understand the difference between reduceByKey and
countByKey. Which one gives better performance, reduceByKey or countByKey?
While we can perform the same count operation using reduceByKey, why do we need
countByKey/countByValue?
Thanks
Sathish
Hi all,
I have a DataFrame that contains a user defined type. The type is an image
with the following attribute
*class Image(w: Int, h: Int, data: Vector)*
In my DataFrame, images are stored in column named image that corresponds
to the following case class
*case class LabeledImage(label: Int,
It may have to do with the akka heartbeat interval per SPARK-3923 -
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ?
On Tue, Feb 24, 2015 at 16:40 Xi Shen davidshe...@gmail.com wrote:
Hi Sean,
I launched the spark-shell on the same machine as I started YARN service.
I
Hi ,
I have placed my hive-site.xml inside spark/conf and I am trying to execute
some Hive queries given in the documentation.
Can you please suggest what I am doing wrong here.
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext:
Spark includes the clearspring analytics package but intentionally excludes
the dependencies of the fastutil package (see below).
Spark includes parquet-column which includes fastutil and relocates it under
parquet/
but creates a shaded jar file which is incomplete because it shades out some
of
Hi,
I run into a Task not Serializable exception with the following code. When I
remove the threads and run, it works, but with threads I run into the Task not
serializable exception.
object SparkKart extends Serializable {
def parseVector(line: String): Vector[Double] = {
DenseVector(line.split('
Hi Akhil
I guess it slipped my attention. I will definitely give it a try.
While I would still like to know what the issue is with the way I have
created the schema.
On Tue, Feb 24, 2015 at 4:35 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Did you happen to have a look at
Check out the FAQ in the link by Deepak Vohra.
The main differences are that the desktop installation includes common
user packages, such as LibreOffice, while the server installation doesn't. But
the server includes server packages, such as apache2.
Also, the Desktop has a GUI (a graphical
Hello,
For the past days I have been trying to process and analyse with Spark a
Cassandra eventLog table similar to the one shown here.
Basically what I want to calculate is the delta time epoch between each
event type for all the device id's in the table. Currently it's working as
expected but I