Hi,
I'm using Spark Streaming 1.0.
Say I have a source of website click stream, like the following:
('2014-09-19 00:00:00', '192.168.1.1', 'home_page')
('2014-09-19 00:00:01', '192.168.1.2', 'list_page')
...
And I want to calculate the page views (PV, number of logs) and unique
users (UV,
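A minimal sketch of computing per-batch PV and UV from such a stream (the socket source, comma-separated record format and 10-second batch interval below are assumptions, not part of the question):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ClickStreamPvUv")
val ssc = new StreamingContext(conf, Seconds(10))
// each line is assumed to look like: 2014-09-19 00:00:00,192.168.1.1,home_page
val records = ssc.socketTextStream("localhost", 9999).map(_.split(","))
// PV: number of log lines in each batch
val pv = records.count()
// UV: number of distinct IPs in each batch
val uv = records.map(_(1)).transform(_.distinct()).count()
pv.print()
uv.print()
ssc.start()
ssc.awaitTermination()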
So sorry about teasing you with the Scala. But the method is there in Java
too, I just checked.
On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen v...@paxata.com wrote:
It might not be the same as a real Hadoop reducer, but I think it would
accomplish the same. Take a look at:
import
Hi,
I'm developing an application with spark-streaming-kafka, which
depends on spark-streaming and kafka. Since spark-streaming is
provided in runtime, I want to exclude the jars from the assembly. I
tried the following configuration:
libraryDependencies ++= {
val sparkVersion = "1.0.2"
Seq(
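One common way to compile against spark-streaming but keep it out of the assembly is to mark it "provided" (a sketch; whether spark-streaming-kafka itself should stay in the fat jar depends on the cluster):

libraryDependencies ++= {
  val sparkVersion = "1.0.2"
  Seq(
    // provided at runtime by the cluster, so sbt-assembly leaves it out of the fat jar
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    // the Kafka connector is usually not on the cluster, so it stays in the assembly
    "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion
  )
}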
Thanks Andrew, that helps.
On Fri, Sep 19, 2014 at 5:47 PM, Andrew Or-2 [via Apache Spark User List]
ml-node+s1001560n14708...@n3.nabble.com wrote:
Hey just a minor clarification, you _can_ use SparkFiles.get in your
application only if it runs on the executors, e.g. in the following way:
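A minimal sketch of that pattern, i.e. calling SparkFiles.get inside a closure that runs on the executors (the file name and local path below are made up):

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("sparkfiles-demo"))
sc.addFile("/local/path/data.txt")   // ship the file with the job
val lengths = sc.parallelize(1 to 4).map { _ =>
  // resolves the file's local path on whichever executor runs this task
  scala.io.Source.fromFile(SparkFiles.get("data.txt")).mkString.length
}
lengths.collect()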
object Nizoz {
  def connect(): Unit = {
    val conf = new SparkConf().setAppName("nizoz").setMaster(master)
    val spark = new SparkContext(conf)
    val lines =
      spark.textFile("file:///home/moshe/store/frameworks/spark-1.1.0-bin-hadoop1/README.md")
    val lineLengths = lines.map(s => s.length)
Hi Moshe,
Spark needs a Hadoop 2.x/YARN cluster. Otherwise you can run it without
Hadoop in standalone mode.
Manu
On Sat, Sep 20, 2014 at 12:55 AM, Moshe Beeri moshe.be...@gmail.com wrote:
object Nizoz {
def connect(): Unit = {
val conf = new
Thanks Andrew.
I understand the problem a little better now. There was a typo in my earlier
mail and a bug in the code (causing the NPE in SparkFiles). I am using
--master yarn-cluster (not local). And in this mode,
com.test.batch.modeltrainer.ModelTrainerMain - my main class - will run on the
Thanks Manu,
I just saw that I had included the Hadoop client 2.x in my pom.xml; removing it
solved the problem.
Thanks for your help
It's probably slow as you say because it's actually also doing the map
phase that will do the RTree search and so on, and only then saving to HDFS
on 60 partitions. If you want to see the time spent in saving to HDFS, you
could do a count, for instance, before saving. Also, saving from 60 partitions
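For example, something like this rough split can separate the two phases when timing (the paths are placeholders and the map is a stand-in for the real RTree work):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("timing-demo"))
val results = sc.textFile("hdfs:///in").map(line => line.length)  // stand-in for the RTree search
results.cache()
results.count()                        // forces the map phase; time this step on its own
results.saveAsTextFile("hdfs:///out")  // this step then measures mostly the HDFS write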
Hi Manu/All,
Now I am facing another strange error (I am relatively new to this complex
framework).
I run ./sbin/start-all.sh (my computer is named after John Nash) and get the
connection message: Connecting to master spark://nash:7077
Running on my local machine yields
java.lang.ClassNotFoundException:
Now that Spark has a sort-based shuffle, can we expect a secondary sort
soon? There are some use cases where getting a sorted iterator of values
per key is helpful.
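In the meantime, one workaround is the classic composite-key trick: fold the secondary field into the key, partition only on the primary field, and sort within partitions. A rough sketch (repartitionAndSortWithinPartitions only appears in releases after the 1.0/1.1 discussed in this thread, and the key types here are made up):

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// partition only on the primary key so all of its values land in one partition
class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (primary: String, _) =>
      (primary.hashCode % numPartitions + numPartitions) % numPartitions
  }
}

val sc = new SparkContext(new SparkConf().setAppName("secondary-sort-sketch"))
val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
val sorted = pairs
  .map { case (k, v) => ((k, v), ()) }                                // composite key (primary, secondary)
  .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(2))   // sorts by (primary, secondary)
  .map { case ((k, v), _) => (k, v) }                                 // values now arrive sorted per key within a partition
sorted.collect()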
I'm actually surprised your memory is that high. Spark only allocates
spark.storage.memoryFraction for storing RDDs. This defaults to .6, so 32
GB * .6 * 10 executors should be a total of 192 GB.
-Sandy
On Sat, Sep 20, 2014 at 8:21 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:
There 128
Thanks Xiangrui and RJ for the responses.
RJ, I have created a Jira for the same. It would be great if you could look
into this. Following is the link to the improvement task,
https://issues.apache.org/jira/browse/SPARK-3614
Let me know if I can be of any help and please keep me posted!
Thanks,
I downloaded the spark-workshop in Scala from
https://github.com/deanwampler/spark-workshop. When I type sbt and then
compile, I get the following errors:
[warn] ::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn]
I am running the Thrift server in Spark SQL, and running it on the node I
compiled Spark on. When I run it, tasks only work if they land on that
node; executors started on nodes I didn't compile Spark on (and thus
don't have the compile directory) fail. Should Spark be distributed
OK, so in Java (pardon the verbosity) I might say something like the code
below, but I face the following issues:
1) I need to store all values in memory as I run combineByKey - if I could
return an RDD which consumed values, that would be great, but I don't know
how to do that -
2) In my version of
Hi,
I want to set the serializer for my spark-shell to Kryo, i.e. set
spark.serializer to org.apache.spark.serializer.KryoSerializer.
Can I do it without setting a new SparkConf?
Thanks
-Soumya
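Two ways to do that without building a SparkConf by hand (the --conf flag needs a recent enough spark-shell/spark-submit, so treat the second one as a sketch):

# 1) put it in conf/spark-defaults.conf, which spark-shell picks up
spark.serializer  org.apache.spark.serializer.KryoSerializer

# 2) on newer releases, pass it on the command line
./bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer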
Hi,
I am building a dictionary of RDD[(String, Long)] and after the dictionary
is built and cached, I find the key "almonds" at value 5187 using:
rdd.filter { case (product, index) => product == "almonds" }.collect
Output:
Debug product almonds index 5187
Now I take the same dictionary and write it out as:
I did not persist / cache it as I assumed zipWithIndex will preserve
order...
There is also zipWithUniqueId...I am trying that...If that also shows the
same issue, we should make it clear in the docs...
On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen so...@cloudera.com wrote:
From offline question
I changed zipWithIndex to zipWithUniqueId and that seems to be working...
What's the difference between zipWithIndex and zipWithUniqueId?
Is it that for zipWithIndex we don't need to run the count to compute the
offset, which is needed for zipWithUniqueId, and so zipWithIndex is more
efficient? It's not very
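For what it's worth, if I'm reading the RDD API docs right the relationship is the other way around: zipWithIndex assigns consecutive indexes in partition order and has to launch a job to learn each partition's size, while zipWithUniqueId gives partition k the ids k, k+n, k+2n, ... (n = number of partitions) without launching a job, so its ids are unique but not consecutive. Neither guarantees anything if the upstream computation does not produce a deterministic ordering. A tiny illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("zip-demo"))
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)
rdd.zipWithIndex().collect()    // Array((a,0), (b,1), (c,2), (d,3)) - consecutive, extra job
rdd.zipWithUniqueId().collect() // Array((a,0), (b,2), (c,1), (d,3)) - unique, no extra job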
Hi,
Can somebody help me with adding library dependencies in my build.sbt so
that the java.lang.NoClassDefFoundError issue can be resolved.
My sbt (only the dependencies part) -
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.1",
  "org.apache.spark" %% "spark-streaming" %
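For reference, a complete block of that shape (the quoting matters in sbt; the spark-streaming version is guessed to match spark-core here) would be:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.0.1",
  "org.apache.spark" %% "spark-streaming" % "1.0.1"
)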
Hi chinchu,
Where does the code trying to read the file run? Is it running on the
driver or on some executor?
If it's running on the driver, in yarn-cluster mode, the file should
have been copied to the application's work directory before the driver
is started. So hopefully just doing new
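A rough sketch of reading a file shipped with spark-submit --files from a yarn-cluster driver (the file name is made up, and whether this matches the poster's setup is an assumption):

// submitted e.g. with:
//   spark-submit --master yarn-cluster --files /local/path/model.properties ... app.jar
import java.io.File
import scala.io.Source

// in yarn-cluster mode the driver runs in a YARN container whose working
// directory already contains files shipped with --files, so a relative path works
val f = new File("model.properties")
val contents = if (f.exists()) Source.fromFile(f).mkString else ""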
Thanks, Evan and Andy:
Here is a very functional version. I need to improve the syntax, but this works
very well: the initial version takes around 36 hours on 9 machines with 8
cores, and this version takes 36 minutes on a cluster of 7 machines with 8
cores:
object SimpleApp {
def
Thanks Marcelo. The code trying to read the file always runs in the driver. I
understand the problem with other master deployments, but will it work in
local, yarn-client and yarn-cluster deployments? That's all I care about for now
:-)
Also, what is the suggested way to do something like this? Put the
1. Actually, I disagree that combineByKey requires that all values be
held in memory for a key. Only the use case groupByKey does that, whereas
reduceByKey, foldByKey, and the generic combineByKey do not necessarily
make that requirement. If your combine logic really shrinks the result
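As an illustration of that point, a per-key average with combineByKey only keeps a running (sum, count) pair per key rather than buffering all the values (a Scala sketch, not the poster's actual Java code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("combine-demo"))
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
// createCombiner, mergeValue, mergeCombiners: each value is folded into a small
// (sum, count) accumulator, so no key ever needs all of its values in memory at once
val sumCounts = pairs.combineByKey(
  (v: Double) => (v, 1L),
  (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),
  (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)
)
val averages = sumCounts.mapValues { case (sum, count) => sum / count }
averages.collect()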
Organization Name: Vectorum Inc.
URL: http://www.vectorum.com/
List of Spark Components: Tachyon, Spark 1.1, Spark SQL, MLlib (in the works)
with Hadoop and Play Framework. Working on digital fingerprint search.
Use Case: Using machine data to predict machine failures. We