This codec does require native libraries to be installed, IIRC, but
they are installed with CDH 5.
The error you show does not look related, though. Are you sure your HA
setup is working, and that you have configured it correctly in whatever
configuration Spark is seeing?
--
Sean Owen | Director, Data
File as a stream?
I think you are confusing Spark Streaming with a buffered reader. Spark
Streaming is meant to process batches of data (files, packets, messages) as
they come in, in fact using the time of packet reception to create
windows, etc.
In your case you are better off reading the
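(For reference, a minimal sketch of the Spark Streaming windowing described above; the socket source, host/port, and durations are assumptions:)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowSketch")
val ssc = new StreamingContext(conf, Seconds(1))     // 1-second batches
val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source
// Group records by reception time: 30-second windows, sliding every 10 seconds.
lines.window(Seconds(30), Seconds(10)).count().print()
ssc.start()
ssc.awaitTermination()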
Most likely none of the items in the PairRDD match your input; hence the error.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, May 1, 2014 at 2:06 PM, vivek.ys vivek...@gmail.com wrote:
Hi All,
I am facing an issue
A broadcast variable is meant to be shared across nodes, not per map task.
The process you are using should work; however, a 6 GB broadcast
variable could be an issue. Does the broadcast variable eventually
transfer, or does it always stay stuck?
Mayur Rustagi
Ph: +1 (760) 203 3257
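(A minimal sketch of the broadcast pattern under discussion; the lookup table and the RDD named records are assumptions:)

// Broadcast a read-only structure once per node instead of shipping it per task.
val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)  // stand-in for the large data
val bcLookup = sc.broadcast(lookup)
val enriched = records.map(r => (r, bcLookup.value.getOrElse(r, -1)))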
No, I am sure the items match, because userCluster and productCluster are
prepared from the data. The cross product of userCluster and productCluster
is a superset of the data.
On Thu, May 1, 2014 at 3:41 PM, Mayur Rustagi mayur.rust...@gmail.comwrote:
Most likely none of the items in the PairRDD match your input.
RDDs are immutable, so they cannot be updated. You can create a new RDD
containing the updated entries (though that is often not what you want to do).
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
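(A minimal sketch of deriving a new RDD with updated entries; pairs is a hypothetical RDD[(String, Int)] and the key is an assumption:)

// RDDs are immutable: an "update" is a transformation producing a new RDD.
val updated = pairs.map { case (k, v) =>
  if (k == "keyToUpdate") (k, v + 1) else (k, v)
}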
On Thu, May 1, 2014 at 4:42 AM, narayanabhatla
Thanks a lot for the very prompt response. My next questions are the following:
1. Can we conclude that Spark is NOT the solution for our requirement? Or
2. Is there a design approach to meet such requirements using Spark?
From: Mayur Rustagi [mailto:mayur.rust...@gmail.com]
Graph.subgraph() allows you to apply a filter to edges and/or vertices.
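(A minimal sketch, assuming a Graph[VD, ED] named graph and a hypothetical set of vertex IDs to drop:)

import org.apache.spark.graphx._
val toRemove: Set[VertexId] = Set(1L, 2L)
// Drop the listed vertices; their incident edges are removed as well.
val pruned = graph.subgraph(vpred = (id, attr) => !toRemove.contains(id))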
On Thu, May 1, 2014 at 8:52 AM, Николай Кинаш peroksi...@gmail.com wrote:
Hello.
How do I remove vertices or edges from a graph in GraphX?
Cool intro, thanks! One question: slide 23 says "Standalone (local mode)".
That sounds a bit confusing without hearing the talk.
Standalone mode is not local. It just does not depend on external cluster
management software. I think it's the best mode for EC2/GCE, because they
provide a distributed filesystem
Thanks for the clarification. I'll fix the slide. I've done a lot of
Scalding/Cascading programming, where the two concepts are synonymous, but
clearly I was imposing my prejudices here ;)
dean
On Thu, May 1, 2014 at 8:18 AM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Cool intro,
Very useful material. Currently I am trying to persuade my client to choose
Spark instead of Hadoop MapReduce. Your slides give me more evidence to
support my opinion.
--
ZhangYi (张逸)
Developer
tel: 15023157626
blog: agiledon.github.com
weibo: tw张逸
Sent with Sparrow
That's great! Thanks. Let me know if it works ;) or what I could improve to
make it work.
dean
On Thu, May 1, 2014 at 8:45 AM, ZhangYi yizh...@thoughtworks.com wrote:
Very useful material. Currently I am trying to persuade my client to choose
Spark instead of Hadoop MapReduce. Your slides give
Hi, I have a very simple spark program written in Scala:
/*** testApp.scala ***/
object testApp {
  def main(args: Array[String]) {
    println("Hello! World!")
  }
}
Then I use the following command to compile it:
$ sbt/sbt package
The compilation finished successfully and I got a JAR file.
But
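(For reference, a minimal build.sbt that would produce such a JAR; the name and the Scala/Spark versions are assumptions for that era:)

name := "testApp"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"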
Here's how I configure SBT, which I think is the usual way:
export SBT_OPTS="-XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256m -Xmx1g"
See if that takes. But your error is that you're already asking for
too much memory for your machine. So maybe you are setting the value
successfully, but it's
So Seq[V] contains only new tuples. I initially thought that whenever a new
tuple was found, it would be added to the Seq and the update function called
immediately, so there wouldn't be more than one update to the Seq per
function call.
Say I want to sum tuples with the same key in an RDD using
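(A minimal sketch of summing by key, assuming an RDD[(String, Int)] named pairs:)

// ("a",2) and ("a",3) in pairs combine into ("a",5).
val sums = pairs.reduceByKey(_ + _)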
There are many freely-available resources for the enterprising individual
to use if they want to Spark up their life.
For others, some structured training is in order. Say I want everyone from
my department at my company to get something like the AMP Camp
(http://ampcamp.berkeley.edu/) experience,
Hi all,
I am thinking of starting work on a profiler for Spark clusters. The current
idea is that it would collect jstacks from executor nodes and put them into
a central index (either a database or elasticsearch), and it would present
them to people in a UI that would let people slice and dice
If you're in the Bay Area, the Spark Summit would be a great source of
information.
http://spark-summit.org/2014
-Roger
From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
Sent: Thursday, May 01, 2014 10:12 AM
To: u...@spark.incubator.apache.org
Subject: Spark Training
There are many
Something like Twitter Ambrose would be lovely to integrate :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, May 1, 2014 at 8:44 PM, Punya Biswal pbis...@palantir.com wrote:
Hi all,
I am thinking of starting
You may also want to check out Paco Nathan's Introduction to Spark courses:
http://liber118.com/pxn/
On May 1, 2014, at 8:20 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Hi Nicholas,
We provide hands-on training on Spark and the associated ecosystem.
We gave it recently at a
Hi
I am using Spark to distribute computationally intensive tasks across the
cluster. Currently I partition my RDD of tasks randomly. There is a large
variation in how long each of the jobs takes to complete, leading to most
partitions being processed quickly while a couple of partitions take
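(One common mitigation, as a sketch; tasks, numCores, and the multiplier are assumptions: use many more partitions than cores so long-running tasks interleave with short ones instead of straggling at the end.)

val balanced = tasks.repartition(numCores * 10)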
I'm working on a 1-day workshop that I'm giving in Australia next week and
a few other conferences later in the year. I'll post a link when it's ready.
dean
On Thu, May 1, 2014 at 10:30 AM, Denny Lee denny.g@gmail.com wrote:
You may also want to check out Paco Nathan's Introduction to
Hi Sai,
I honestly can't figure out where you are using the RDDs (the split method
isn't defined on them). In any case, you should use the map function instead
of foreach, because foreach is NOT idempotent and some partitions could be
recomputed, executing the function multiple times.
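(A minimal sketch of the suggested change; lines is a hypothetical RDD[String]:)

// foreach exists only for side effects and may re-run if a partition is
// recomputed; map returns a new RDD, so recomputation is safe.
val fields = lines.map(line => line.split(","))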
Thank you Patrick.
I took a quick stab at it:
val s3Client = new AmazonS3Client(...)
val copyObjectResult = s3Client.copyObject(upload, outputPrefix +
  "/part-0", "rolled-up-logs", "2014-04-28.csv")
val objectListing = s3Client.listObjects(upload, outputPrefix)
Hi,
I guess Spark uses "streaming" in the context of streaming live data, but
what I mean is something more along the lines of Hadoop Streaming, where one
can code in any programming language.
Or is something along those lines on the cards?
Thanks
--
Mohit
When you want success as badly as you
Take a look at the RDD.pipe() operation. That allows you to pipe the data
in an RDD to any external shell command (just like a Unix shell pipe).
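(A minimal sketch of RDD.pipe; lines is a hypothetical RDD[String] and the command is just an example:)

// Each element is written to the command's stdin, one per line;
// the command's stdout lines become the new RDD[String].
val upper = lines.pipe("tr '[:lower:]' '[:upper:]'")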
On May 1, 2014 10:46 AM, Mohit Singh mohit1...@gmail.com wrote:
Hi,
I guess Spark uses "streaming" in the context of streaming live data, but
what I mean
I'm working with Spark 0.9.0 on CDH5.
I'm running a Spark application written in Java in yarn-client mode.
Because of the permissions setup on the cluster, I need to run the
application as the hdfs user; otherwise I have a permission problem and get
the following error:
Yeah, actually it's hdfs that has superuser privileges on HDFS, not
root. It looks like you're trying to access a nonexistent user
directory like /user/foo, and it fails because root can't create it;
you inherit root's privileges since that is what your app runs as.
I don't think you want to
The fastest way to save to S3 should be to leave the RDD with many
partitions, because all partitions will be written out in parallel.
Then, once the various parts are in S3, somehow concatenate the files
together into one file.
If this can be done within S3 (I don't know if this is possible),
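(A minimal sketch of the parallel write; the bucket and path are assumptions:)

// Each partition is written in parallel as its own part-XXXXX object.
rdd.saveAsTextFile("s3n://my-bucket/rolled-up-logs/2014-04-28")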
Hi, I am getting the following error. How could I fix this problem?
Joe
14/05/02 03:51:48 WARN TaskSetManager: Lost TID 12 (task 2.0:1)
14/05/02 03:51:48 INFO TaskSetManager: Loss was due to
java.lang.ClassNotFoundException:
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4
Hi dear all,
When I tried to build Spark 0.9.1 on my Mac OS X 10.9.2 with Java 8, I
found the following errors:
[error] error while loading CharSequence, class file
'/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/lib/rt.jar(java/lang/CharSequence.class)'
is broken
[error]
Hi Zhige,
I had the same issue and reverted to using JDK 1.7.0_55.
From: Zhige Xin xinzhi...@gmail.com
Reply-To: user@spark.apache.org
Date: Thursday, May 1, 2014 at 12:32 PM
To: user@spark.apache.org
Subject: Can't be built on MAC
Hi dear all,
When I tried to build Spark 0.9.1 on my Mac OS X
Thank you! Ian.
Zhige
On Thu, May 1, 2014 at 12:35 PM, Ian Ferreira ianferre...@hotmail.comwrote:
Hi Zhige,
I had the same issue and reverted to using JDK 1.7.0_55.
From: Zhige Xin xinzhi...@gmail.com
Reply-To: user@spark.apache.org
Date: Thursday, May 1, 2014 at 12:32 PM
To:
I'm trying to understand updateStateByKey.
Here's an example I'm testing with:
Input data: DStream( RDD( ("a",2) ), RDD( ("a",3) ), RDD( ("a",4) ), RDD(
("a",5) ), RDD( ("a",6) ), RDD( ("a",7) ) )
Code:
val updateFunc = (values: Seq[Int], state: Option[StateClass]) => {
  val previousState =
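(For reference, a completed sketch of such an update function; the StateClass fields and the DStream named pairs are assumptions:)

case class StateClass(sum: Int, count: Int)
val updateFunc = (values: Seq[Int], state: Option[StateClass]) => {
  val previousState = state.getOrElse(StateClass(0, 0))
  // Fold this batch's values into the running state; returning None would drop the key.
  Some(StateClass(previousState.sum + values.sum, previousState.count + values.size))
}
val stateStream = pairs.updateStateByKey(updateFunc)  // pairs: DStream[(String, Int)]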
Hello Spark Fans,
I am trying to run a spark job via oozie as a java action. The spark code
is packaged as a MySparkJob.jar compiled using sbt assembly (excluding
spark and hadoop dependencies).
I am able to invoke the spark job from any client using
java -cp
Is this possible? It is very annoying to have such a great script but still
have to manually update stuff afterwards.
The problem is that equally-sized partitions take variable time to complete
based on their contents?
Sent from my mobile phone
On May 1, 2014 8:31 AM, deenar.toraskar deenar.toras...@db.com wrote:
Hi
I am using Spark to distribute computationally intensive tasks across the
cluster. Currently
If I use a range partitioner, will this make updateStateByKey take the tuples
in order?
Right now I see them not being taken in order (most of them are ordered, but
not all).
-Adrian
Hello. I followed the "A Standalone App in Java" part of the tutorial:
https://spark.apache.org/docs/0.8.1/quick-start.html
The Spark standalone cluster looks like it's running without a problem:
http://i.stack.imgur.com/7bFv8.png
I have built a fat jar for running this JavaApp on the cluster. Before maven
Thanks, Rustagi. Yes, the global data is read-only and persists from the
beginning to the end of the whole Spark job. Actually, it is not only
identical for one Map/Reduce task but used by many of my map/reduce tasks.
That's why I intend to put the data onto each node of my cluster, and
hope
Hi,
I have the following code structure. It compiles OK, but at runtime it aborts
with the error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task not serializable: java.io.NotSerializableException:
I am running in local (standalone) mode.
trait A {
  def input(...):
Have you tried making A extend Serializable?
On Thu, May 1, 2014 at 3:47 PM, SK skrishna...@gmail.com wrote:
Hi,
I have the following code structure. It compiles OK, but at runtime it aborts
with the error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted:
Task not
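(A minimal sketch of that suggestion; the method signature is an assumption:)

// Members referenced from Spark closures must be serializable so they can
// be shipped to executors.
trait A extends Serializable {
  def input(path: String): org.apache.spark.rdd.RDD[String]
}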
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted
such a comparative study as a Master's thesis:
http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf
According to this snapshot (c. 2013), Stratosphere differs from Spark
in not having an explicit concept of
Hi,
I am a newbie to Spark. I looked for documentation or examples to answer my
question but came up empty-handed.
I don't know whether I am using the right terminology, but here goes.
I have a file of records. Initially, I had the following Spark program (I
am omitting all the surrounding code
Hi,
I have installed Spark 1.0 from branch-1.0; the build went fine, and I have
tried running the example in YARN client mode. Here is my command:
/home/hadoop/spark-branch-1.0/bin/spark-submit
/home/hadoop/spark-branch-1.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop2.2.0.jar
--master
Can anyone say something about this?
Hi,
I'm trying to connect to a YARN cluster by running these commands:
export HADOOP_CONF_DIR=/hadoop/var/hadoop/conf/
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export SPARK_YARN_MODE=true
export
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
export
Depends on your code. Referring to the earlier example, if you do
words.map(x => (x, 1)).updateStateByKey(...)
then for a particular word, if a batch contains 6 occurrences of that word,
the Seq[V] will be [1, 1, 1, 1, 1, 1].
Instead, if you do
words.map(x => (x, 1)).reduceByKey(_ +
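(A side-by-side sketch of the two variants; words and updateFunc are assumptions:)

// Without pre-aggregation: Seq[V] has one 1 per occurrence, e.g. [1,1,1,1,1,1].
val v1 = words.map(x => (x, 1)).updateStateByKey(updateFunc)
// With pre-aggregation: each batch is reduced first, so Seq[V] holds at most
// one element, the per-batch sum, e.g. [6].
val v2 = words.map(x => (x, 1)).reduceByKey(_ + _).updateStateByKey(updateFunc)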
Ordered by what? Arrival order? Sort order?
TD
On Thu, May 1, 2014 at 2:35 PM, Adrian Mocanu amoc...@verticalscope.comwrote:
If I use a range partitioner, will this make updateStateByKey take the
tuples in order?
Right now I see them not being taken in order (most of them are ordered
I have a custom app that was compiled with Scala 2.10.3, which I believe is
what the latest spark-ec2 script installs. However, running it on the master
yields this cryptic error, which according to the web implies incompatible
jar versions.
Exception in thread "main" java.lang.NoClassDefFoundError: