These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209
The orbit errors are quite possibly due to using https instead of http,
whether or not the SSL cert was bad. Let us know if they go away after
reverting to http.
On Sun, Mar 23, 2014 at 11:48 AM,
Hello,
I have a weird error showing up when I run a job on my Spark cluster.
The version of spark is 0.9 and I have 3+ GB free on the disk when this
error shows up. Any ideas what I should be looking for?
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
167.0:3 failed
On some systems, /tmp is an in-memory tmpfs file system with its own size
limit. It's possible that this limit has been exceeded. You might try
running the df command to check the free space of /tmp (or of the root
filesystem, if /tmp isn't listed separately).
3 GB also seems pretty low for the remaining free space of a disk.
Hey All,
I think the old thread is here:
https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
The method proposed in that thread is to create a utility class for
doing single-pass aggregations. Using Algebird is a pretty good way to
do this and is a bit more flexible since
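For what it's worth, here is a minimal sketch of the single-pass idea using plain
RDD.aggregate (it assumes sc is an existing SparkContext; Algebird would let you
compose the monoids instead of hand-writing the two functions):
// sum, count, min and max computed in a single pass over the data
val nums = sc.parallelize(Seq(3.0, 1.0, 4.0, 1.0, 5.0))
val (sum, count, min, max) = nums.aggregate((0.0, 0L, Double.MaxValue, Double.MinValue))(
  (acc, x) => (acc._1 + x, acc._2 + 1, math.min(acc._3, x), math.max(acc._4, x)),
  (a, b) => (a._1 + b._1, a._2 + b._2, math.min(a._3, b._3), math.max(a._4, b._4))
)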
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp
if /tmp is full. Actually it's recommended to have multiple local
disks and set it to a comma-separated list of directories, one per disk.
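(For reference, a hedged sketch of setting spark.local.dir programmatically via
SparkConf; the master URL and directory paths below are placeholders, not from
this thread:)
import org.apache.spark.{SparkConf, SparkContext}
// one local directory per physical disk
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("my-app")
  .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")
val sc = new SparkContext(conf)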
Matei, does the number of tasks/partitions
Hi
Thanks for reporting this. It'll be great if you can check a couple of
things:
1. Are you trying to use this with Hadoop2 by any chance ? There was an
incompatible ASM version bug that we fixed for Hadoop2
https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it,
but I just
Hello,
In Spark we can use *newAPIHadoopRDD* to access different distributed
systems like HDFS, HBase, and MongoDB via different InputFormats.
Is it possible to access the *InputSplit* in Spark directly? Spark can
cache data in local memory.
Perform local computation/aggregation on the local
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your
SparkContext? It tries to serialize that many objects together at a time, which
might be too much. By default the batchSize is 1024.
Matei
On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi
I don’t see the errors anymore. Thanks Aaron.
On 24-Mar-2014, at 12:52 am, Aaron Davidson ilike...@gmail.com wrote:
These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209
The orbit errors are quite possibly due to using https instead of http,
Aaron, thanks for replying. I am very much puzzled as to what is going
on. A job that used to run on the same cluster is failing with this
mysterious message about not having enough disk space when in fact I can
see through watch df -h that the free space is always hovering around
3+GB on the
Bleh, strike that, one of my slaves was at 100% inode utilization on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.
Thanks
Thanks for bringing this up. 100% inode utilization is an issue I haven't
seen raised before, and it points to another issue which is not on our
current roadmap for state cleanup (cleaning up data which was not fully
cleaned up from a crashed process).
On Sun, Mar 23, 2014 at 7:57 PM, Ognen
I would love to work on this (and other) stuff if I can bother someone
with questions offline or on a dev mailing list.
Ognen
On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up, 100% inode utilization is an issue I
haven't seen raised before and this raises another issue
Hey there fellow Dukes of Data,
How can I tell how many partitions my RDD is split into?
I'm interested in knowing because, from what I gather, having a good number
of partitions is good for performance. If I'm looking to understand how my
pipeline is performing, say for a parallelized write out
It's much simpler: rdd.partitions.size
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Hey there fellow Dukes of Data,
How can I tell how many partitions my RDD is split into?
I'm interested in knowing because, from what I gather, having a good
number
As Mark said you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work but the performance will
improve if you coalesce the partitions using rdd.coalesce().
This can happen for example if
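A minimal sketch of that fix (the partition count is illustrative; it assumes sc
is an existing SparkContext):
// an RDD that ended up with thousands of tiny partitions
val fineGrained = sc.textFile("hdfs:///data/many-small-files/*")
// merge them into a more reasonable number before the expensive stages
val compacted = fineGrained.coalesce(100)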
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote:
Hi Tsai,
Could you share more information about the machine you used and the
training parameters (runs, k, and iterations)? That will help in tracking
down your issue. Thanks!
Best,
Xiangrui
On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
At the reduceByKey stage, it takes
Hi,
This is on a 4-node cluster, each node with 32 cores/256GB RAM.
Spark (0.9.0) is deployed in standalone mode.
Each worker is configured with 192GB. Spark executor memory is also 192GB.
This is on the first iteration. K=50. Here’s the code I use:
http://pastebin.com/2yXL3y8i , which is a
Hi,
I have a few questions regarding log file management in Spark:
1. Currently I did not find any way to modify the log file name for
executors/drivers. It is hardcoded as stdout and stderr. Also there is no log
rotation.
In the case of a streaming application this will grow forever and become
Hi All !! I am getting the following error in interactive spark-shell
[0.8.1]
*org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more than
0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit
exceeded*
But I had set the following in spark-env.sh and
K = 50 is certainly a large number for k-means. If there is no
particular reason to have 50 clusters, could you try to reduce it
to, e.g., 100 or 1000? Also, the example code is not for large-scale
problems. You should use the KMeans implementation in MLlib's clustering
package for your problem.
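For reference, a rough sketch of calling MLlib's KMeans on 0.9-era Spark (the
input path and parameters are placeholders; newer releases take an RDD[Vector]
instead of RDD[Array[Double]]):
import org.apache.spark.mllib.clustering.KMeans
// parse whitespace-separated numeric features; assumes sc is an existing SparkContext
val data = sc.textFile("hdfs:///data/points.txt")
  .map(line => line.split("\\s+").map(_.toDouble))
  .cache()
val model = KMeans.train(data, 50, 20) // k = 50, maxIterations = 20
println("Cost: " + model.computeCost(data))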
Thanks, let me try with a smaller K.
Does the size of the input data matter for the example? Currently I have 50M
rows. What is a reasonable size to demonstrate the capability of Spark?
On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng men...@gmail.com wrote:
K = 50 is certainly a large
To be clear on what your configuration will do:
- SPARK_DAEMON_MEMORY=8g will make your standalone master and worker
schedulers have a lot of memory. These do not impact the actual amount of
useful memory given to executors or your driver, however, so you probably
don't need to set this.
-
PS: you have a typo in DEAMON - it's DAEMON. Thanks, Latin.
On Mar 24, 2014 7:25 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi All !! I am getting the following error in interactive spark-shell
[0.8.1]
*org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more
than 0 times;
I am also facing the same problem. I have implemented Serializable for my
code, but the exception is thrown from third-party libraries over which I
have no control.
Exception in thread main org.apache.spark.SparkException: Job aborted:
Task not serializable: java.io.NotSerializableException: (lib
Can someone answer this question please?
Specifically about the Serializable implementation of dependent jars?
Yes my input data is partitioned in a completely random manner, so each
worker that produces shuffle data processes only a part of it. The way I
understand it is that before each stage each worker needs to distribute
correct partitions (based on hash key ranges?) to other workers. And this
is
Hello,
Has anyone got any ideas? I am not quite sure if my problem is an exact fit
for Spark, since in reality in this
section of my program I am not really doing a reduce job, simply a group-by
and partition.
Would calling pipe on the partitioned JavaRDD do the trick? Are there any
examples
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16,
the inode usage was about 50%. On two of the slaves, one had inode usage
of 96% and on the other it was 100%. When I went into /tmp on these two
nodes - there were a bunch of /tmp/spark* subdirectories which I
deleted. This
Oh, glad to know it's that simple!
Patrick, in your last comment did you mean filter in? As in I start with
one year of data and filter it so I have one day left? I'm assuming in that
case the empty partitions would be for all the days that got filtered out.
Nick
On Monday, March 24, 2014, Patrick
We now have a way to work around this.
For classes that can't easily implement Serializable, we wrap the class in a
Scala object.
For example:
class A {} // This class is not serializable,
object AHolder {
private val a: A = new A()
def get: A = a
}
This
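The point of the holder is that only the singleton wrapper is referenced from the
closure, so the non-serializable instance never has to be shipped from the driver.
A hypothetical usage sketch (doSomethingWith stands in for A's real API, and rdd
for an existing RDD):
// A is constructed lazily on each executor JVM the first time AHolder is touched there
val results = rdd.map { record =>
  AHolder.get.doSomethingWith(record)
}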
Dear all,
Sorry for asking such a basic question, but can someone explain when one
should use mapPartitions instead of map?
Thanks
Jaonary
What does this error mean:
@hadoop-s2.oculus.local:45186]: Error [Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]
Caused by:
Another thing I have noticed is that out of my master+15 slaves, two
slaves always carry a higher inode load. So for example right now I am
running an intensive job that takes about an hour to finish and two
slaves have been showing an increase in inode consumption (they are
about 10% above
I've seen two cases most commonly:
The first is when I need to create some processing object to process each
record. If that object creation is expensive, creating one per record
becomes prohibitive. So instead, we use mapPartitions, and create one per
partition, and use it on each record in the
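A minimal sketch of that pattern (HeavyweightParser and lines are illustrative
names, not from the thread; lines is assumed to be an existing RDD[String]):
// stand-in for an object that is expensive to construct
class HeavyweightParser { def parse(s: String): Array[String] = s.split(",") }
val parsed = lines.mapPartitions { iter =>
  val parser = new HeavyweightParser() // created once per partition, not once per record
  iter.map(record => parser.parse(record))
}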
Hi All,
I found out why this problem exists. Consider the following scenario:
- a DStream is created from any source. (I've checked with file and socket)
- No actions are applied to this DStream
- Sliding Window operation is applied to this DStream and an action is applied
to the sliding window.
Mark,
This appears to be a Scala-only feature. :(
Patrick,
Are we planning to add this to PySpark?
Nick
On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra m...@clearstorydata.comwrote:
It's much simpler: rdd.partitions.size
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
Oh, I also forgot to mention:
I start the master and workers (call ./sbin/start-all.sh), and then start
the shell:
MASTER=spark://localhost:7077 ./bin/spark-shell
Then I get the exceptions...
Thanks
I have a DStream like this:
..RDD[a,b],RDD[b,c]..
Is there a way to remove duplicates across the entire DStream? Ie: I would like
the output to be (by removing one of the b's):
..RDD[a],RDD[b,c].. or ..RDD[a,b],RDD[c]..
Thanks
-Adrian
After a long time (maybe a month) I could do a fresh build for
2.0.0-mr1-cdh4.5.0... I was using the cached files in .ivy2/cache.
My case is especially painful since I have to build behind a firewall...
@Sean thanks for the fix...I think we should put a test for https/firewall
compilation as
Number of rows doesn't matter much as long as you have enough workers
to distribute the work. K-means has complexity O(n * d * k), where n
is number of points, d is the dimension, and k is the number of
clusters. If you use the KMeans implementation from MLlib, the
initialization stage is done on
Hi,
I have a very simple use case:
I have an RDD as follows:
d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]
Now, I want to remove all the duplicates from a column and return the
remaining frame.
For example:
If I want to remove the duplicates based on column 1,
then basically I would remove either row
1. Not sure on this; I don't believe we change the defaults from Java.
2. SPARK_JAVA_OPTS can be used to set the various Java properties (other
than memory heap size itself)
3. If you want to have 8 GB executors then, yes, only two can run on each
16 GB node. (In fact, you should also keep a
Hello,
I'm interested in extending the comparison between GraphX and GraphLab
presented in Xin et al. (2013). The evaluation presented there is rather
limited as it only compares the frameworks for one algorithm (PageRank) on
a cluster with a fixed number of nodes. Are there any graph algorithms
Niko,
Comparing some other components will be very useful as well: svd++ from
GraphX vs. the same algorithm in GraphLab; also mllib.recommendation.ALS
(implicit/explicit) compared to the collaborative filtering toolkit in
GraphLab...
To stress test, what's the biggest sparse dataset that you
Just found this issue and wanted to link it here, in case somebody finds this
thread later:
https://spark-project.atlassian.net/browse/SPARK-939
On Thursday, March 20, 2014 at 11:14 AM, Matei Zaharia wrote:
Hi Jaka,
I’d recommend rebuilding Spark with a new version of the HTTPClient
Has anyone successfully followed the instructions on the Quick Start page
of the Spark home page to run a standalone Scala application? I can't,
and I figure I must be missing something obvious!
I'm trying to follow the instructions here as close to word for word as
possible:
Hi, Diana,
See my inlined answer
--
Nan Zhu
On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote:
Has anyone successfully followed the instructions on the Quick Start page of
the Spark home page to run a standalone Scala application? I can't, and I
figure I must be missing
I am able to run standalone apps. I think you are making one mistake
that throws you off from there onwards. You don't need to put your app
under SPARK_HOME. I would create it in its own folder somewhere, it
follows the rules of any standalone scala program (including the
layout). In the guide,
Hi Niko,
The GraphX team recently wrote a longer paper with more benchmarks and
optimizations: http://arxiv.org/abs/1402.2394
Regarding the performance of GraphX vs. GraphLab, I believe GraphX
currently outperforms GraphLab only in end-to-end benchmarks of pipelines
involving both graph-parallel
Yana: Thanks. Can you give me a transcript of the actual commands you are
running?
Thanks!
Diana
On Mon, Mar 24, 2014 at 3:59 PM, Yana Kadiyska yana.kadiy...@gmail.comwrote:
I am able to run standalone apps. I think you are making one mistake
that throws you off from there onwards. You
I have what I would call unexpected behaviour when using window on a stream.
I have 2 windowed streams with a 5s batch interval. One windowed stream is
(5s,5s)=smallWindow and the other is (10s,5s)=bigWindow.
What I've noticed is that the 1st RDD produced by bigWindow is incorrect and is
of the size
Diana,
Anywhere on the filesystem you have read/write access (you need not be
in your spark home directory):
mkdir myproject
cd myproject
mkdir project
mkdir target
mkdir -p src/main/scala
cp $mypath/$mymysource.scala src/main/scala/
cp $mypath/myproject.sbt .
Make sure that myproject.sbt
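For reference, myproject.sbt for the Quick Start app is roughly the following
(versions match the 0.9.0 release discussed in this thread; adjust to your setup):

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"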
Thanks, Nan Zhu.
You say that my problems are because "you are in Spark directory, don't
need to do that actually, the dependency on Spark is resolved by sbt".
I did try it initially in what I thought was a much more typical place,
e.g. ~/mywork/sparktest1. But as I said in my email:
(Just for
Thanks Ognen.
Unfortunately I'm not able to follow your instructions either. In
particular:
sbt compile
sbt run arguments if any
This doesn't work for me because there's no program on my path called
sbt. The instructions in the Quick Start guide are specific that I
should call
There is no direct way to get this in pyspark, but you can get it from the
underlying java rdd. For example
a = sc.parallelize([1,2,3,4], 2)
a._jrdd.splits().size()
On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Mark,
This appears to be a Scala-only
Hi,
Quick question about partitions. If my RDD is partitioned into 5
partitions, does that mean that I am constraining it to exist on at most 5
machines?
Thanks
Yeah, that's exactly what I did. Unfortunately it doesn't work:
$SPARK_HOME/sbt/sbt package
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for
reading (No such file or directory)
Attempting to fetch sbt
/usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or
Ah we should just add this directly in pyspark - it's as simple as the
code Shivaram just wrote.
- Patrick
On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
shivaram.venkatara...@gmail.com wrote:
There is no direct way to get this in pyspark, but you can get it from the
underlying java
Ah crud, I guess you are right, I am using the sbt I installed manually
with my Scala installation.
Well, here is what you can do:
mkdir ~/bin
cd ~/bin
wget
http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.1/sbt-launch.jar
vi sbt
Put the following contents into
Hi, Diana,
You don't need to use the Spark-distributed sbt;
just download sbt from its official website and set your PATH to the right place.
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 4:30 PM, Diana Carroll wrote:
Yeah, that's exactly what I did. Unfortunately it doesn't work:
For instance, I need to work with an RDD in terms of N parts. Will calling
RDD.coalesce(N) possibly cause processing bottlenecks?
On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat walrusthe...@gmail.comwrote:
Hi,
Quick question about partitions. If my RDD is partitioned into 5
partitions,
Thanks for your help, everyone. Several folks have explained that I can
surely solve the problem by installing sbt.
But I'm trying to get the instructions working *as written on the Spark
website*. The instructions not only don't have you install sbt
separately...they actually specifically have
RDD.coalesce should be fine for rebalancing data across all RDD partitions.
Coalesce is pretty handy in situations where you have sparse data and want
to compact it (e.g. data after applying a strict filter) OR you know the
magic number of partitions according to your cluster which will be
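A small sketch of both cases (the partition counts are illustrative; events is
assumed to be an existing RDD[String]):
// compact sparse data left over after a strict filter, without a shuffle
val survivors = events.filter(line => line.contains("ERROR")).coalesce(20)
// or force an even rebalance at the cost of a shuffle
val rebalanced = events.coalesce(20, shuffle = true)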
Ognen:
I don't know why your process is hanging, sorry. But I do know that the
way saveAsTextFile works is that you give it a path to a directory, not a
file. The file is saved in multiple parts, corresponding to the
partitions. (part-0, part-1 etc.)
(Presumably it does this because it
Diana, thanks. I am not very well acquainted with HDFS. I use hdfs -put
to put things as files into the filesystem (and sc.textFile to get stuff
out of them in Spark) and I see that they appear to be saved as files
that are replicated across 3 out of the 16 nodes in the hdfs cluster
(which is
Thanks Matei, unfortunately doesn't seem to fix it. I tried batchSize = 10,
100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls
at the same point in each case.
-- Jeremy
-
jeremy freeman, phd
neuroscientist
@thefreemanlab
On Mar 23, 2014, at 9:56
Hello Sanjay,
Yes, your understanding of lazy semantics is correct. But ideally
every batch should read based on the batch interval provided in the
StreamingContext. Can you open a JIRA on this?
On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani
sanjay_a...@yahoo.com wrote:
Hi All,
I found
Yes, I believe that is current behavior. Essentially, the first few
RDDs will be partial windows (assuming window duration sliding
interval).
TD
On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu
amoc...@verticalscope.com wrote:
I have what I would call unexpected behaviour when using window on a
Syed,
Thanks for the tip. I'm not sure if coalesce is doing what I'm intending
to do, which is, in effect, to subdivide the RDD into N parts (by calling
coalesce and doing operations on the partitions.) It sounds like, however,
this won't bottleneck my processing power. If this sets off any
Just so I can close this thread (in case anyone else runs into this
stuff) - I did sleep through the basics of Spark ;). The answer on why
my job is in waiting state (hanging) is here:
http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling
Ognen
On 3/24/14,
Hi guys,
thanks for the information, I'll give it a try with Algebird,
thanks again,
Richard
@Patrick, thanks for the release calendar
On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell pwend...@gmail.comwrote:
Hey All,
I think the old thread is here:
Hi, I have a large data set of numbers, i.e. an RDD, and want to perform a
computation only on groups of two values at a time. For
example, 1,2,3,4,5,6,7... is an RDD. Can I group the RDD into
(1,2),(3,4),(5,6)...? And perform the respective computations in an
efficient manner? As we don't have a way to index
Diana, I think you are correct - I just installed
wget
http://mirror.symnds.com/software/Apache/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-cdh4.tgz
and indeed I see the same error that you see.
It looks like in previous versions sbt-launch used to just come down
in the
I realize that I never read the document carefully, and I can't find anywhere
that the Spark documentation suggests using the Spark-distributed sbt……
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote:
Thanks for your help, everyone. Several folks have explained that I can
We need someone who can walk us through the given example with a short code
snippet, so that we get a clear-cut idea of RDD indexing.
Guys, please help us.
Hi,
sc.parallelize(Array.tabulate(100)(i => i)).filter( _ % 20 == 0
).coalesce(5,true).glom.collect yields
Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
Array(), Array())
How do I get something more like:
Array(Array(0), Array(20), Array(40), Array(60), Array(80))
There is also https://github.com/apache/spark/pull/18 against the current
repo which may be easier to apply.
On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh a...@adatao.com wrote:
Hi Jaonary,
You can find the code for k-fold CV in
https://github.com/apache/incubator-spark/pull/448. I have
It is suggested implicitly in giving you the command ./sbt/sbt. The
separately installed sbt isn't in a folder called sbt, whereas Spark's
version is. And more relevantly, just a few paragraphs earlier in the
tutorial you execute the command sbt/sbt assembly which definitely refers
to the spark
Partition your input into an even number of partitions,
then use mapPartitions to operate on each Iterator[Int].
Maybe there are more efficient ways….
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:
Hi, I have large data set of numbers ie RDD and wanted to perform a
Yes, actually even for Spark, I mostly use the sbt I installed…..so I always
miss this issue….
If you can reproduce the problem with the Spark-distributed sbt…I suggest
proposing a PR to fix the document before 0.9.1 is officially released.
Best,
--
Nan Zhu
On Monday, March 24, 2014
Ognen, can you comment if you were actually able to run two jobs
concurrently with just restricting spark.cores.max? I run Shark on the
same cluster and was not able to see a standalone job get in (since
Shark is a long running job) until I restricted both spark.cores.max
_and_
I didn't group the integers, but processed them in groups of two within each
partition:
scala> val a = sc.parallelize(List(1, 2, 3, 4), 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at
console:12
Then process each partition, handling the elements of the partition in groups
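Continuing that approach, a minimal sketch of the per-partition grouping (it
assumes each partition holds an even number of elements, and sum stands in for
whatever pairwise computation you need):
val a = sc.parallelize(List(1, 2, 3, 4, 5, 6), 3) // 3 partitions of 2 elements each
val pairResults = a.mapPartitions(_.grouped(2).map(_.sum))
pairResults.collect() // Array(3, 7, 11)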
points.foreach(p => p.y = another_value) will return a new modified RDD.
2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw:
Dear all,
I have a question about the usage of RDD.
I implemented a class called AppDataPoint, it looks like:
case class AppDataPoint(input_y : Double,
Hi hequn, a related question: does that mean the memory usage will be doubled? And
furthermore, if the compute function in an RDD is not idempotent, the RDD will
change while the job is running, is that right?
-----Original Message-----
From: hequn cheng chenghe...@gmail.com
Sent: 2014/3/25 9:35
To:
No, it won't. The type of RDD#foreach is Unit, so it doesn't return an
RDD. The utility of foreach is purely for the side effects it generates,
not for its return value -- and modifying an RDD in place via foreach is
generally not a very good idea.
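A small sketch of the functional alternative (Point and the values are made up
for illustration; assumes sc is an existing SparkContext):
case class Point(x: Double, y: Double)
val points = sc.parallelize(Seq(Point(1.0, 0.0), Point(2.0, 0.0)))
// map returns a new RDD with the updated field; the original RDD is untouched
val updated = points.map(p => p.copy(y = 42.0))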
On Mon, Mar 24, 2014 at 6:35 PM, hequn cheng
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui
On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
Thanks again.
If you use the KMeans implementation from MLlib, the
initialization stage is done on master,
The master here is the
First question:
If you save your modified RDD like this:
points.foreach(p => p.y = another_value).collect() or
points.foreach(p => p.y = another_value).saveAsTextFile(...)
the modified RDD will be materialized and this will not use any worker's
memory.
If you have more transformations after the map(), the
Hi hequn, I dug into the source of Spark a bit deeper, and I got some ideas.
Firstly, immutability is a feature of RDDs but not a hard rule; there are ways to
change it, for example an RDD with a non-idempotent compute function, though
it is really a bad design to make that function non-idempotent
My thought would be to key by the first item in each array, then take just
one array for each key. Something like the below:
val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5)))
val col = 0
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values
On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu
Reopening this thread because I encountered this problem again.
Here is my env:
scala 2.10.3
spark 0.9.0, standalone mode
shark 0.9.0, downloaded the source code and built it myself
hive hive-shark-0.11
I have copied hive-site.xml from my hadoop cluster; its hive version is
0.12. After copying, I
This worked great. Thanks a lot
Is there a way to see the resource usage of each spark-shell command — say time
taken and memory used?
I checked the WebUI of spark-shell and of the master and I don’t see any such
breakdown. I see the time taken in the INFO logs but nothing about memory
usage. It would also be nice to track
Hi Qingyang Li,
Shark-0.9.0 uses a patched version of hive-0.11 and using
configuration/metastore of hive-0.12 could be incompatible.
May I know the reason you are using hive-site.xml from the previous hive version (to
use an existing metastore?). Otherwise, you might just leave hive-site.xml blank.
Possibly one of your executors is in the middle of a large stop-the-world
GC and doesn't respond to network traffic during that period? If you
shared some information about how each node in your cluster is set up (heap
size, memory, CPU, etc) that might help with debugging.
Andrew
On Mon, Mar
Hi,
I have been following Aniket's spork github repository.
https://github.com/aniket486/pig
I have done all the changes mentioned in recently modified pig-spark file.
I am using:
hadoop 2.0.5 alpha
spark-0.8.1-incubating
mesos 0.16.0
##PIG variables
export
Hi Guys,
I think that I already asked this question, but I don't remember if anyone
answered me. I would like to change, in the print() function, the
number of words and the frequency counts that are sent to the driver's
screen. The default value is 10.
Could anyone help me with this?
Best
You can extend DStream and override the print() method. Then you can create
your own DStream extending from this.
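If subclassing feels heavy, another option is foreachRDD, where you control the
count yourself (a sketch; wordCounts stands for whatever DStream you currently
call print() on):
wordCounts.foreachRDD { rdd =>
  rdd.take(100).foreach(println) // print the first 100 elements instead of the default 10
}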
On Tue, Mar 25, 2014 at 6:07 PM, Eduardo Costa Alfaia
e.costaalf...@unibs.it wrote:
Hi Guys,
I think that I already did this question, but I don't remember if anyone
has answered
Hi,
I have an example where I use a tuple of (int, int) in Python as the key for an
RDD. When I do a reduceByKey(...), sometimes the tuples turn up with the
two ints reversed in order (which is problematic, as the ordering is part
of the key).
Here is an IPython notebook that has some code and