Option 1 should be fine; Option 2 would be bound heavily by the network as the data
increases over time.
Thanks
Best Regards
On Mon, Jun 22, 2015 at 5:59 PM, Ashish Soni asoni.le...@gmail.com wrote:
Hi All ,
What is the best way to install a Spark cluster alongside a Hadoop
cluster? Any
Have a look at
http://s3.thinkaurelius.com/docs/titan/0.5.0/titan-io-format.html You could
use those Input/Output formats with the newAPIHadoopRDD API call.
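For reference, a minimal sketch of the newAPIHadoopRDD call shape, shown here with a stock Hadoop TextInputFormat since the exact Titan format classes come from the docs linked above (the input path is an assumption):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val hadoopConf = new Configuration()
// assumed input path; substitute the Titan input format and its key/value classes here
hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///some/input")
val records = sc.newAPIHadoopRDD(hadoopConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])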
Thanks
Best Regards
On Sun, Jun 21, 2015 at 8:50 PM, Madabhattula Rajesh Kumar
mrajaf...@gmail.com wrote:
Hi,
How to connect Titan
Not sure, but try removing the provided scope, or create a lib directory in the
project home and put that jar there.
On 20 Jun 2015 18:08, Ritesh Kumar Singh riteshoneinamill...@gmail.com
wrote:
Hi,
I'm using IntelliJ ide for my spark project.
I've compiled spark 1.3.0 for scala 2.11.4 and
restarting from checkpoint even for graceful shutdown.
I think the file is usually expected to be processed only once. Maybe this
is a bug in fileStream? Or do you know any approach to work around it?
Much thanks!
--
*From:* Akhil Das [mailto:ak
This is how i used to build a assembly jar with sbt:
Your build.sbt file would look like this:
import AssemblyKeys._
assemblySettings
name := "FirstScala"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
libraryDependencies +=
You can try setting these properties:
.set("spark.local.dir", "/mnt/spark/")
.set("java.io.tmpdir", "/mnt/spark/")
Thanks
Best Regards
On Fri, Jun 19, 2015 at 8:28 AM, yuemeng (A) yueme...@huawei.com wrote:
hi, all
if I want to change the /tmp folder to any other folder for Spark UT use
Like this?
val add_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](
ssc, kafkaParams, Array("add").toSet)
val delete_msgs = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](
ssc, kafkaParams, Array("delete").toSet)
val
.setMaster("local"): set it to "local[2]" or "local[*]".
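For example, a minimal sketch (the app name is an assumption):

val conf = new SparkConf()
  .setAppName("kafka-streaming-example") // assumed name
  .setMaster("local[2]") // at least two threads: one for the receiver, one for processing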
Thanks
Best Regards
On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski bar...@scalaric.com
wrote:
hi,
I'm trying to run simple kafka spark streaming example over spark-shell:
sc.stop
import org.apache.spark.SparkConf
import
Why not something like this: your mobile app pushes data to your web server,
which pushes the data to Kafka or Cassandra or any other database, and a
Spark Streaming job runs all the time operating on the incoming data and
pushes the calculated values back. This way, you don't have to start a
spark
Which version of Spark? And what is your data source? For some reason, your
processing delay is exceeding the batch duration. And it's strange that you
are not seeing any scheduling delay.
Thanks
Best Regards
On Thu, Jun 18, 2015 at 7:29 AM, Mike Fang chyfan...@gmail.com wrote:
Hi,
I have a
This might give you a good start
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
it's a bit old though.
Thanks
Best Regards
On Thu, Jun 18, 2015 at 2:33 PM, texol t.rebo...@gmail.com wrote:
Hi,
I'm new to GraphX and I'd like to use Machine Learning
You could possibly open up a JIRA and shoot an email to the dev list.
Thanks
Best Regards
On Wed, Jun 17, 2015 at 11:40 PM, jcai jonathon@yale.edu wrote:
Hi,
I am running this in Spark standalone mode. I find that when I examine the
web UI, a couple of bugs arise:
1. There is a
Can you try repartitioning the RDD after creating the K,V pairs? Also, while
calling rdd1.join(rdd2, ...), pass the number-of-partitions argument too, as in the sketch below.
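Something like this sketch (the keying function and the partition count of 200 are assumptions):

val left = rdd1.keyBy(_._1).repartition(200) // repartition after creating the K,V pairs
val right = rdd2.keyBy(_._1)
val joined = left.join(right, 200) // join(other, numPartitions)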
Thanks
Best Regards
On Wed, Jun 17, 2015 at 12:15 PM, Al M alasdair.mcbr...@gmail.com wrote:
I have 2 RDDs I want to Join. We will call them RDD A and RDD
Not sure why spark-submit isn't shipping your project jar (maybe try with
--jars). You can also do sc.addJar("/path/to/your/project.jar"); that should
solve it.
Thanks
Best Regards
On Wed, Jun 17, 2015 at 6:37 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi folks,
running into a pretty
Not quite sure, but try pointing the spark.history.fs.logDirectory to your
s3
Thanks
Best Regards
On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera
gianluca.privite...@studio.unibo.it wrote:
On the Spark website it's stated in the View After the Fact section (
, thanks again!
--
*From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
*Sent:* Monday, June 15, 2015 3:48 PM
*To:* Haopu Wang
*Cc:* user
*Subject:* Re: If not stop StreamingContext gracefully, will checkpoint
data be consistent?
I think it should be fine
Did you look inside all logs? Mesos logs and executor logs?
Thanks
Best Regards
On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden gog...@gmail.com wrote:
My Mesos cluster has 1.5 CPU and 17GB free. If I set:
conf.set("spark.mesos.coarse", "true");
conf.set("spark.cores.max", "1");
in the SparkConf
am
not experiencing any performance benefit from it.
Is it something related to the bottleneck of MQ or Reliable Receiver?
*From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
*Sent:* Saturday, June 13, 2015 1:10 AM
*To:* Chaudhary, Umesh
*Cc:* user@spark.apache.org
*Subject:* Re
You can also look into https://spark.apache.org/docs/latest/tuning.html for
performance tuning.
Thanks
Best Regards
On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote:
Thanks very much, Akhil.
That solved my problem.
Best,
Rex
On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das ak
What's in your executor's (that .tgz file) conf/spark-defaults.conf file?
Thanks
Best Regards
On Mon, Jun 15, 2015 at 7:14 PM, Gary Ogden gog...@gmail.com wrote:
I'm loading these settings from a properties file:
spark.executor.memory=256M
spark.cores.max=1
spark.shuffle.consolidateFiles=true
In the conf/slaves file, do you have the IP addresses or the hostnames?
Thanks
Best Regards
On Sat, Jun 13, 2015 at 9:51 PM, Sea 261810...@qq.com wrote:
In Spark 1.4.0, I find that the Address is an IP (it was a hostname in v1.3.0).
Why? Who changed it?
I'm assuming by spark-client you mean the Spark driver program. In that
case you can pick any machine (say Node 7), create your driver program on
it and use spark-submit to submit it to the cluster; or, if you create the
SparkContext within your driver program (specifying all the properties),
then
Something like this?
val huge_data = sc.textFile("/path/to/first.csv").map(x =>
  (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv").map(x =>
  (x.split("\t")(0), x))
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)
It's Scala btw; the Python API should
Have a look here https://spark.apache.org/docs/latest/tuning.html
Thanks
Best Regards
On Mon, Jun 15, 2015 at 11:27 AM, Proust GZ Feng pf...@cn.ibm.com wrote:
Hi, Spark Experts
I have played with Spark for several weeks; after some testing, a reduce
operation on a DataFrame costs 40s on a
I think it should be fine; that's the whole point of checkpointing (in
case of driver failure etc.).
Thanks
Best Regards
On Mon, Jun 15, 2015 at 6:54 AM, Haopu Wang hw...@qilinsoft.com wrote:
Hi, can someone help to confirm the behavior? Thank you!
-Original Message-
From: Haopu
Yes, if you have enabled WAL and checkpointing then after the store, you
can simply delete the SQS Messages from your receiver.
Thanks
Best Regards
On Sat, Jun 13, 2015 at 6:14 AM, Michal Čizmazia mici...@gmail.com wrote:
I would like to have a Spark Streaming SQS Receiver which deletes SQS
This is a good start, if you haven't seen this already
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
Thanks
Best Regards
On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71
sreenivas.raghav...@gmail.com wrote:
Hi everyone,
I am interested to
Are you looking for something like filter? See a similar example here
https://spark.apache.org/examples.html
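For instance, a sketch along these lines (the path and the severity format are assumptions):

val logs = sc.textFile("hdfs:///logs/*.log")
val errors = logs.filter(line => line.contains("ERROR")) // keep only ERROR-severity lines
errors.take(10).foreach(println)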
Thanks
Best Regards
On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang bill...@gmail.com wrote:
Hi,
I have a bunch of large log files on Hadoop. Each line contains a log and
its severity. Is
I think the straight answer would be no, but yes, you can actually hardcode
these parameters if you want. Look in SparkContext.scala
https://github.com/apache/spark/blob/master/core%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2FSparkContext.scala#L364
where all these properties are being
Looks like your Spark is not able to pick up the HADOOP_CONF. To fix this,
you can actually add jets3t-0.9.0.jar to the classpath
(sc.addJar("/path/to/jets3t-0.9.0.jar")).
Thanks
Best Regards
On Thu, Jun 11, 2015 at 6:44 PM, shahab shahab.mok...@gmail.com wrote:
Hi,
I tried to read a csv file
This is a good start, if you haven't read it already
http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
Thanks
Best Regards
On Thu, Jun 11, 2015 at 8:17 PM, 唐思成 jadetan...@qq.com wrote:
Hi all:
We are trying to use Spark to do some real
You can disable shuffle spill (spark.shuffle.spill
http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior)
if you have enough memory to hold that much data. Otherwise, I believe adding
more resources would be your only choice.
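A sketch of the setting (keeping shuffle data entirely in memory can OOM if it doesn't fit):

val conf = new SparkConf()
  .set("spark.shuffle.spill", "false") // shuffle data stays in memory instead of spilling to disk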
Thanks
Best Regards
On Thu, Jun 11, 2015 at 9:46 PM, Al M
You can verify if the jars are shipped properly by looking at the driver UI
(running on 4040) Environment tab.
Thanks
Best Regards
On Sat, Jun 13, 2015 at 12:43 AM, Jonathan Coveney jcove...@gmail.com
wrote:
Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos
0.19.0)...
How many cores are you allocating for your job? And how many receivers do
you have? It would be good if you can post your custom receiver code; it
will help people to understand it better and shed some light.
Thanks
Best Regards
On Fri, Jun 12, 2015 at 12:58 PM, Chaudhary, Umesh
4040 is your driver port; you need to run some application. Log in to your
cluster, start a spark-shell and try accessing 4040.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 3:51 PM, mrm ma...@skimlinks.com wrote:
Hi,
I am using Spark 1.3.1 standalone and I have a problem where my cluster is
RDDs are immutable; why not join two DStreams?
Not sure, but you can try something like this also:
kvDstream.foreachRDD(rdd => {
  val file = ssc.sparkContext.textFile("/sigmoid/")
  val kvFile = file.map(x => (x.split(",")(0), x))
  rdd.join(kvFile)
})
Thanks
Best Regards
On
Opening your 4040 manually or ssh tunneling (ssh -L 4040:127.0.0.1:4040
master-ip, and then open localhost:4040 in a browser) will work for you
then.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 5:10 PM, mrm ma...@skimlinks.com wrote:
Hi Akhil,
Thanks for your reply! I still cannot see port
If you look at the maven repo, you can see it's from Typesafe only
http://mvnrepository.com/artifact/org.spark-project.akka/akka-actor_2.10/2.3.4-spark
For sbt, you can download the sources by adding withSources() like:
libraryDependencies += "org.spark-project.akka" % "akka-actor_2.10" %
"2.3.4-spark"
Maybe you should update your Spark version to the latest one.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 11:04 AM, Chandrashekhar Kotekar
shekhar.kote...@gmail.com wrote:
Hi,
I have configured Spark to run on YARN. Whenever I start the Spark shell using
the 'spark-shell' command, it
Hope SWIG http://www.swig.org/index.php and JNA
https://github.com/twall/jna/ might help for accessing C++ libraries from
Java.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 11:50 AM, mahesht mahesh.s.tup...@gmail.com wrote:
There is a C++ component which uses some model which we want to replace
standalone mode.
Any ideas?
Thanks
Dong Lei
*From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
*Sent:* Tuesday, June 9, 2015 4:46 PM
*To:* Dong Lei
*Cc:* user@spark.apache.org
*Subject:* Re: ClassNotDefException when using spark-submit with
multiple jars and files located
Delete the checkpoint directory, you might have modified your driver
program.
Thanks
Best Regards
On Wed, Jun 10, 2015 at 9:44 PM, Ashish Nigam ashnigamt...@gmail.com
wrote:
Hi,
If checkpoint data is already present in HDFS, driver fails to load as it
is performing lookup on previous
Looks like the libphp version is 5.6 now; which version of Spark are you using?
Thanks
Best Regards
On Thu, Jun 11, 2015 at 3:46 AM, barmaley o...@solver.com wrote:
Launching using spark-ec2 script results in:
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
...
Shutting down GANGLIA
This might help
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_installing-kerb-spark-quickstart.html
Thanks
Best Regards
On Wed, Jun 10, 2015 at 6:49 PM, kazeborja kazebo...@gmail.com wrote:
Hello all.
I've been reading some old mails and
+ "/pgs/sample/samplechr" + i + ".txt " +
"/opt/data/shellcompare/data/user" + j + "/pgs/intermediateResult/result" +
i + ".txt 600")
pipeModify2.collect()
sc.stop()
}
}
Thanks & Best regards!
San.Luo
----- Original Message -----
From: Akhil Das ak
?
On Tue, Jun 9, 2015 at 1:07 PM, amit tewari amittewar...@gmail.com
wrote:
Thanks Akhil, as you suggested, I have to go with keyBy(route) as I need the
columns intact.
But will keyBy() accept multiple fields (e.g. x(0), x(1))?
Thanks
Amit
On Tue, Jun 9, 2015 at 12:26 PM, Akhil Das ak
downloaded the application jar but not
the other jars specified by "--jars".
Or do I misunderstand the usage of "--jars", and the jars should already be
on every worker, so the driver will not download them?
Are there some useful docs?
Thanks
Dong Lei
*From:* Akhil Das [mailto:ak
Regards
On Tue, Jun 9, 2015 at 2:09 PM, luohui20...@sina.com wrote:
Only 1 minor GC, 0.07s.
Thanks & Best regards!
San.Luo
----- Original Message -----
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: How
Try it this way:
scala> val input1 = sc.textFile("/test7").map(line =>
  line.split(",").map(_.trim));
scala> val input2 = sc.textFile("/test8").map(line =>
  line.split(",").map(_.trim));
scala> val input11 = input1.map(x => ((x(0) + x(1)), x(2), x(3)))
scala> val input22 = input2.map(x => ((x(0) + x(1)), x(2), x(3)))
scala>
like this?
myDStream.foreachRDD(rdd => rdd.saveAsTextFile("/sigmoid/", codec))
Thanks
Best Regards
On Mon, Jun 8, 2015 at 8:06 PM, Bob Corsaro rcors...@gmail.com wrote:
It looks like saveAsTextFiles doesn't support the compression parameter of
RDD.saveAsTextFile. Is there a way to add the
Maybe you should check in your driver UI and see if there's any GC time
involved, etc.
Thanks
Best Regards
On Mon, Jun 8, 2015 at 5:45 PM, luohui20...@sina.com wrote:
hi there
I am trying to decrease my app's running time on the worker node. I
checked the log and found the most
Once you submit the application, you can check in the driver UI (running
on port 4040) Environment tab to see whether the jars you added got
shipped or not. If they are shipped and you are still getting NoClassDef
exceptions, it means you have a jar conflict, which you can
resolve
Can you look in your worker logs for a more detailed stack-trace? If it's
about winutils.exe, you can look at these links to get it resolved.
- http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7
- https://issues.apache.org/jira/browse/SPARK-2356
Thanks
Best Regards
On Mon, Jun 8,
it just lets me straight in.
On Mon, Jun 8, 2015 at 11:58 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Can you do ssh -v 192.168.1.16 from the Master machine and make sure
it's able to log in without a password?
Thanks
Best Regards
On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin
Can you do ssh -v 192.168.1.16 from the Master machine and make sure it's
able to log in without a password?
Thanks
Best Regards
On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin...@gmail.com wrote:
I have two hosts 192.168.1.15 (Master) and 192.168.1.16 (Worker)
These two hosts have
Are you seeing the same behavior on the driver UI (that runs on port
4040)? If you click on the stage id header you can sort the stages based on
their IDs.
Thanks
Best Regards
On Fri, Jun 5, 2015 at 10:21 PM, Mike Hynes 91m...@gmail.com wrote:
Hi folks,
When I look at the output logs for an
It could be a CPU, IO, or network bottleneck; you need to figure out where
exactly it's choking. You can use certain monitoring utilities (like top)
to understand it better.
Thanks
Best Regards
On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:
Hi All,
I have a Spark
Another approach would be to use ZooKeeper. If you have ZooKeeper
running somewhere in the cluster you can simply create a path like
/dynamic-list in it and then write objects/values to it; you can even
create/access nested objects.
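A minimal sketch using the Apache Curator client (one option among several; the ensemble address is an assumption):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

val zk = CuratorFrameworkFactory.newClient("zk-host:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()
// write a value under /dynamic-list, creating parent nodes as needed
zk.create().creatingParentsIfNeeded().forPath("/dynamic-list/item-1", "some-value".getBytes("UTF-8"))
// read back the current members of the list
val items = zk.getChildren.forPath("/dynamic-list")
zk.close()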
Thanks
Best Regards
On Fri, Jun 5, 2015 at 7:06 PM,
Which consumer are you using? If you can paste the complete code then maybe
I can try reproducing it.
Thanks
Best Regards
On Sun, Jun 7, 2015 at 1:53 AM, EH eas...@gmail.com wrote:
And here is the Thread Dump, where it seems every worker is waiting for
Executor
#6 Thread 95:
You could try adding the configuration in the spark-defaults.conf file. And
once you run the application you can actually check on the driver UI (runs
on 4040) Environment tab to see if the configuration is set properly.
Thanks
Best Regards
On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel
You can simply do rdd.repartition(1).saveAsTextFile(...); it might not be
efficient if your output data is huge, since one task will do the
whole writing.
Thanks
Best Regards
On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo ole...@gmail.com wrote:
Hi all
I'm running spark in a single local
Hi
Here's a working example: https://gist.github.com/akhld/b10dc491aad1a2007183
Thanks
Best Regards
On Wed, Jun 3, 2015 at 10:09 PM, dgoldenberg dgoldenberg...@gmail.com
wrote:
Hi,
I've got a Spark Streaming driver job implemented and in it, I register a
streaming
Replace this line:
img_data = sc.parallelize( list(im.getdata()) )
With:
img_data = sc.parallelize( list(im.getdata()), 3 * numCores )  # numCores: the number of cores you have
Thanks
Best Regards
On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur jmspar...@gmail.com wrote:
Hi all,
I'm playing around with
That's because you need to add the master's public key (~/.ssh/id_rsa.pub)
to the newly added slave's ~/.ssh/authorized_keys.
I add slaves this way:
- Launch a new instance by clicking on the slave instance and choose launch
more like this
- Once it's launched, ssh into it and add the master
is, is there an
alternate API through which a Spark application can be launched, which can
return an exit status back to the caller as opposed to initiating a JVM halt.
On Wed, Jun 3, 2015 at 12:58 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Run it as a standalone application. Create an sbt project
You need to look into your executor/worker logs to see what's going on.
Thanks
Best Regards
On Wed, Jun 3, 2015 at 12:01 PM, patcharee patcharee.thong...@uni.no
wrote:
Hi,
What can be the cause of this ERROR cluster.YarnScheduler: Lost executor?
How can I fix it?
Best,
Patcharee
I think when you do an ssc.stop it will stop your entire application, and by
"update a transformation function" do you mean modifying the driver program?
In that case, even if you checkpoint your application, it won't be able to
recover from its previous state.
A simpler approach would be to add certain
Run it as a standalone application. Create an sbt project and do sbt run?
Thanks
Best Regards
On Wed, Jun 3, 2015 at 11:36 AM, pavan kumar Kolamuri
pavan.kolam...@gmail.com wrote:
Hi guys, I am new to Spark. I am using spark-submit to submit Spark jobs.
But for my use case I don't want it
)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
Best,
Patcharee
On 3 June 2015 at 09:21, Akhil Das wrote:
You need to look into your executor/worker logs to see what's going on.
Thanks
Best Regards
On Wed, Jun 3, 2015 at 12:01 PM, patcharee
If you want to submit applications to a remote cluster where your port 7077
is opened publicly, then you would need to set spark.driver.host (with
the public IP of your laptop) and spark.driver.port (optional, if there's
no firewall between your laptop and the remote cluster). Keeping
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...); I think
StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in
memory and, if there isn't sufficient space, spill the rest to
disk.
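A sketch, assuming an RDD of text lines:

import org.apache.spark.storage.StorageLevel

val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK) // partitions spill to disk only when memory fills up
val words = cached.flatMap(line => line.split(" "))
println(words.count())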
Thanks
Best Regards
On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea
It says your namenode is down (connection refused on 8020); you can restart
your HDFS by going into the hadoop directory and typing sbin/stop-dfs.sh and
then sbin/start-dfs.sh
Thanks
Best Regards
On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote:
Hello All,
A bit scared I did
I found an interesting presentation
http://www.slideshare.net/colorant/spark-shuffle-introduction and also go
through this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
Thanks
Best Regards
On Tue, Jun 2, 2015 at 3:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
You can try to skip the tests, try with:
mvn -Dhadoop.version=2.4.0 -Pyarn -DskipTests clean package
Thanks
Best Regards
On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch java...@gmail.com wrote:
I downloaded the 1.3.1 distro tarball
$ll ../spark-1.3.1.tar.gz
-rw-r-@ 1 steve staff
You can run/submit your code from one of the workers which has access to the
file system, and it should be fine, I think. Give it a try.
Thanks
Best Regards
On Tue, Jun 2, 2015 at 3:22 PM, Pradyumna Achar pradyumna.ac...@gmail.com
wrote:
Hello!
I have Spark running in standalone mode, and there
Basically, you need to convert it to a serializable format before doing the
collect/take.
You can fire up a spark shell and paste this:
val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid")
  .map(_._2.toString)
sFile.take(5).foreach(println)
Use the
This thread
http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application
has various methods on accessing S3 from spark, it might help you.
Thanks
Best Regards
On Sun, May 24, 2015 at 8:03 AM, ogoh oke...@gmail.com wrote:
Hello,
I am
Maybe you can make use of the window operations
https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#window-operations
(see the sketch below). Another approach would be to keep your incoming data
in an HBase/Redis/Cassandra kind of database and then, whenever you need to
average it, you just query the
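For the window route, a minimal sketch (assuming a DStream of (key, value) pairs named pairs):

import org.apache.spark.streaming.{Minutes, Seconds}

// recompute a 10-minute aggregate every 30 seconds
val windowedSums = pairs.reduceByKeyAndWindow(
  (a: Double, b: Double) => a + b, Minutes(10), Seconds(30))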
Here's more detailed documentation
https://github.com/datastax/spark-cassandra-connector from Datastax. You
can also shoot an email directly to their mailing list
http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
since it's more related to their code.
Thanks
Best
I do it this way:
- Launch a new instance by clicking on the slave instance and choose launch
more like this
- Once it's launched, ssh into it and add the master public key to
.ssh/authorized_keys
- Add the slave's internal IP to the master's conf/slaves file
- do sbin/start-all.sh and it will
MEMORY_AND_DISK_SER_2. Besides, the driver's memory is also growing. I
don't think Kafka messages will be cached in the driver.
On Thu, May 28, 2015 at 12:24 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Are you using the createStream or createDirectStream API? If it's the
former, you can try setting
aggressive? Let's say I have 20
slaves up, and I want to add one more, why should we stop the entire
cluster for this?
thanks, nizan
On Thu, May 28, 2015 at 10:19 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I do it this way:
- Launch a new instance by clicking on the slave instance
You can use the Python boto library for that; in fact the spark-ec2 script
uses it underneath. Here's the
https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L706 call
spark-ec2 makes to get all machines under a given security group.
Thanks
Best Regards
On Thu, May 28, 2015 at 2:22 PM,
=
rdd.foreachPartition { iter =>
  iter.foreach(count => logger.info(count.toString))
}
}
It receives messages from Kafka, parses the JSON, filters and counts the
records, and then prints them to the logs.
Thanks.
On Thu, May 28, 2015 at 3:07 PM, Akhil Das ak
After submitting the job, if you do a ps aux | grep spark-submit you
can see all the JVM params. Are you using the high-level consumer (receiver-
based) for receiving data from Kafka? In that case, if your throughput is
high and the processing delay exceeds the batch interval, then you will hit this
How about creating two and unioning them [ sc.union(first, second) ]?
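A sketch of that, with assumed paths:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val first = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("/path/one")
val second = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("/path/two")
val both = sc.union(first, second)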
Thanks
Best Regards
On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have this piece
sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](
files like result1.txt, result2.txt ... result21.txt.
Sorry for not adding comments to my code.
Thanks & Best regards!
San.Luo
----- Original Message -----
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: Re
Try these:
- Disable shuffle: spark.shuffle.spill=false (it might end up in OOM)
- Enable log rotation:
sparkConf.set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", "1024")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
You can also look into
Best regards!
San.Luo
----- Original Message -----
From: madhu phatak phatak@gmail.com
To: luohui20...@sina.com
Cc: Akhil Das ak...@sigmoidanalytics.com, user user@spark.apache.org
Subject: Re: Re: how to distributed run a bash shell in spark
Date: May 25, 2015, 14:11
Hi,
You can use the pipe operator, if you
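A minimal sketch of rdd.pipe (the input path and script path are assumptions):

// each partition's elements are written to the command's stdin, one per line;
// the command's stdout lines become the resulting RDD
val out = sc.textFile("/sigmoid/input.txt").pipe("/path/to/script.sh")
out.collect().foreach(println)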
Try it this way:
object Holder extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}
val someRdd = spark.parallelize(List(1, 2, 3))
someRdd.map { element =>
  Holder.log.info(s"$element will be processed")
  element + 1
}
If you want to notify after every batch is completed, then you can simply
implement the StreamingListener
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
interface, which has methods like onBatchCompleted, onBatchStarted etc in
which
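A minimal sketch of such a listener (assumes a StreamingContext named ssc):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchNotifier extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // fires once per finished batch; batchInfo carries the timing details
    println(s"Batch done, processing took ${batch.batchInfo.processingDelay.getOrElse(-1L)} ms")
  }
}
ssc.addStreamingListener(new BatchNotifier)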
Hi Kevin,
Did you try adding a host name for the IPv6 address? I have a few IPv6
boxes; Spark failed for me when I used just the IPv6 addresses, but it works
fine when I use the host names.
Here's an entry in my /etc/hosts:
2607:5300:0100:0200::::0a4d hacked.work
My spark-env.sh file:
You mean you want to execute some shell commands from Spark? Here's
something I tried a while back. https://github.com/akhld/spark-exploit
Thanks
Best Regards
On Sun, May 24, 2015 at 4:53 PM, luohui20...@sina.com wrote:
hello there
I am trying to run an app in which part of it needs to
I used to hit an NPE when I didn't add all the dependency jars to my context
while running it in standalone mode. Can you try adding all these
dependencies to your context?
sc.addJar("/home/akhld/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.3.1.jar")
You can look at the logic for offloading data from Memory by looking at
ensureFreeSpace
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L416
call.
And dropFromMemory
without problem.
Best Regards,
Allan
On 21 May 2015 at 01:30, Akhil Das ak...@sigmoidanalytics.com wrote:
This is more like an issue with your HDFS setup, can you check in the
datanode logs? Also try putting a new file in HDFS and see if that works.
Thanks
Best Regards
On Wed, May 20
How many part files do you have? Did you try repartitioning to a
smaller number so that you get fewer, bigger files?
Thanks
Best Regards
On Wed, May 20, 2015 at 3:06 AM, Richard Grossman richie...@gmail.com
wrote:
Hi
I'm using spark 1.3.1 and now I can't set the size of
textFile does read all files in a directory.
We have modified the Spark Streaming code base to read nested files from S3;
you can check this function
This thread happened a year back; can you please share what issue you are
facing? Which version of Spark are you using? What is your system
environment? Exception stack-trace?
Thanks
Best Regards
On Thu, May 21, 2015 at 12:19 PM, Keerthi keerthi.reddy1...@gmail.com
wrote:
Hi ,
I had tried
Yes Peter that's correct, you need to identify the processes and with that
you can pull the actual usage metrics.
Thanks
Best Regards
On Thu, May 21, 2015 at 2:52 PM, Peter Prettenhofer
peter.prettenho...@gmail.com wrote:
Thanks Akhil, Ryan!
@Akhil: YARN can only tell me how many vcores my