Spark Streaming: Calculate PV/UV by Minute and by Day?

2014-09-20 Thread Ji ZHANG
Hi, I'm using Spark Streaming 1.0. Say I have a source of website click stream, like the following: ('2014-09-19 00:00:00', '192.168.1.1', 'home_page') ('2014-09-19 00:00:01', '192.168.1.2', 'list_page') ... And I want to calculate the page views (PV, number of logs) and unique user (UV,
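
A minimal sketch of one way the per-minute PV/UV might be computed with the Spark Streaming API of that era. The socket source, the comma-separated field layout and the 60-second batch interval are assumptions, not from the original message; per-day numbers would additionally need state across batches (e.g. updateStateByKey) or an external store.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair DStream functions in Spark 1.x

    val conf = new SparkConf().setAppName("pv-uv-sketch")
    val ssc  = new StreamingContext(conf, Seconds(60))

    // Assumed input: comma-separated "timestamp,ip,page" lines from a socket.
    val logs = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(time, ip, page) = line.split(",")
      (time.take(16), ip, page)          // key by minute, e.g. "2014-09-19 00:00"
    }

    // PV per minute: one count per log line.
    val pvPerMinute = logs.map { case (minute, _, _) => (minute, 1L) }.reduceByKey(_ + _)

    // UV per minute: count distinct IPs within each minute.
    val uvPerMinute = logs.map { case (minute, ip, _) => (minute, ip) }
      .transform(_.distinct())
      .map { case (minute, _) => (minute, 1L) }
      .reduceByKey(_ + _)

    pvPerMinute.print()
    uvPerMinute.print()
    ssc.start()
    ssc.awaitTermination()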

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Victor Tso-Guillen
So sorry about teasing you with the Scala. But the method is there in Java too, I just checked. On Fri, Sep 19, 2014 at 2:02 PM, Victor Tso-Guillen v...@paxata.com wrote: It might not be the same as a real hadoop reducer, but I think it would accomplish the same. Take a look at: import

How to Exclude Spark Dependencies from spark-streaming-kafka?

2014-09-20 Thread Ji ZHANG
Hi, I'm developing an application with spark-streaming-kafka, which depends on spark-streaming and kafka. Since spark-streaming is provided at runtime, I want to exclude its jars from the assembly. I tried the following configuration: libraryDependencies ++= { val sparkVersion = "1.0.2" Seq(
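
A hedged build.sbt sketch of the usual approach: mark spark-core and spark-streaming as "provided" so sbt-assembly leaves them out of the fat jar, and also exclude the transitive spark-streaming copy pulled in by spark-streaming-kafka. The Scala-version suffix _2.10 is an assumption appropriate to Spark 1.0.2.

    libraryDependencies ++= {
      val sparkVersion = "1.0.2"
      Seq(
        "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
        "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
        ("org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
          .exclude("org.apache.spark", "spark-streaming_2.10")
      )
    }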

Re: spark-submit command-line with --files

2014-09-20 Thread chinchu
Thanks Andrew. that helps On Fri, Sep 19, 2014 at 5:47 PM, Andrew Or-2 [via Apache Spark User List] ml-node+s1001560n14708...@n3.nabble.com wrote: Hey just a minor clarification, you _can_ use SparkFiles.get in your application only if it runs on the executors, e.g. in the following way:
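
A hedged sketch of that kind of executor-side usage (the original example is cut off above). The file name lookup.txt and the filtering logic are made up for illustration; `sc` is the usual SparkContext as in the shell.

    import org.apache.spark.SparkFiles
    import scala.io.Source

    // Submitted with: spark-submit --files lookup.txt ...
    val keep = sc.parallelize(1 to 100).mapPartitions { iter =>
      val path  = SparkFiles.get("lookup.txt")          // resolved on the executor
      val lines = Source.fromFile(path).getLines().toSet
      iter.filter(i => lines.contains(i.toString))
    }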

Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Moshe Beeri
object Nizoz { def connect(): Unit = { val conf = new SparkConf().setAppName("nizoz").setMaster(master); val spark = new SparkContext(conf) val lines = spark.textFile("file:///home/moshe/store/frameworks/spark-1.1.0-bin-hadoop1/README.md") val lineLengths = lines.map(s => s.length)

Re: Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Manu Suryavansh
Hi Moshe, Spark needs a Hadoop 2.x/YARN cluster; otherwise you can run it without Hadoop in standalone mode. Manu On Sat, Sep 20, 2014 at 12:55 AM, Moshe Beeri moshe.be...@gmail.com wrote: object Nizoz { def connect(): Unit = { val conf = new

Re: spark-submit command-line with --files

2014-09-20 Thread chinchu
Thanks Andrew. I understand the problem a little better now. There was a typo in my earlier mail and a bug in the code (causing the NPE in SparkFiles). I am using --master yarn-cluster (not local), and in this mode com.test.batch.modeltrainer.ModelTrainerMain - my main class - will run on the

Re: Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Moshe Beeri
Thanks Manu, I just saw that I had included hadoop-client 2.x in my pom.xml; removing it solved the problem. Thanks for your help -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fails-to-run-simple-Spark-Hello-World-scala-program-tp14718p14721.html Sent from

Re: Example of Geoprocessing with Spark

2014-09-20 Thread andy petrella
It's probably slow, as you say, because it's actually also doing the map phase that will do the RTree search and so on, and only then saving to HDFS with 60 partitions. If you want to see the time spent saving to HDFS, you could do a count, for instance, before saving. Also, saving from 60 partitions
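
A tiny hedged sketch of that suggestion, assuming an RDD called results and an output path that are not from the original code: cache, force the map/RTree work with a count, then measure the save on its own.

    results.cache()
    results.count()                                    // forces the map/RTree phase
    results.saveAsTextFile("hdfs:///tmp/geo-output")   // now times mostly the HDFS write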

Re: Fails to run simple Spark (Hello World) scala program

2014-09-20 Thread Moshe Beeri
Hi Manu/All, now I'm hitting another strange error (relative to this new, complex framework). I ran ./sbin/start-all.sh (my computer is named after John Nash) and got the connection message "Connecting to master spark://nash:7077"; running on my local machine yields java.lang.ClassNotFoundException:

secondary sort

2014-09-20 Thread Koert Kuipers
Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.
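
For context, a hedged in-memory workaround (not the shuffle-based secondary sort the message asks for): group and then sort each key's values, which only works when the per-key value lists fit in memory. Data and types are illustrative; `sc` is an existing SparkContext.

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
    val sortedPerKey = pairs.groupByKey().mapValues(_.toSeq.sorted)
    // sortedPerKey: RDD[(String, Seq[Int])], values sorted within each key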

Re: Problem with giving memory to executors on YARN

2014-09-20 Thread Sandy Ryza
I'm actually surprised your memory is that high. Spark only allocates spark.storage.memoryFraction for storing RDDs. This defaults to .6, so 32 GB * .6 * 10 executors should be a total of 192 GB. -Sandy On Sat, Sep 20, 2014 at 8:21 AM, Soumya Simanta soumya.sima...@gmail.com wrote: There 128
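
The arithmetic in the message, written out as a quick check; the 32 GB per executor and 10 executors are the figures being discussed in the thread.

    val executorMemoryGb = 32.0
    val memoryFraction   = 0.6    // default spark.storage.memoryFraction
    val executors        = 10
    val storageTotalGb   = executorMemoryGb * memoryFraction * executors  // = 192.0 GB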

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-20 Thread jatinpreet
Thanks Xiangrui and RJ for the responses. RJ, I have created a JIRA for the same. It would be great if you could look into this. Following is the link to the improvement task: https://issues.apache.org/jira/browse/SPARK-3614 Let me know if I can be of any help, and please keep me posted! Thanks,

org.eclipse.jetty.orbit#javax.transaction;working@localhost: not found

2014-09-20 Thread jinilover
I downloaded the spark-workshop in Scala from https://github.com/deanwampler/spark-workshop. When I run sbt and then compile, I get the following errors: [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn]

SparkSQL Thriftserver in Mesos

2014-09-20 Thread John Omernik
I am running the Thrift server in SparkSQL, and running it on the node I compiled Spark on. When I run it, tasks only work if they landed on that node; executors started on nodes I didn't compile Spark on (and thus don't have the compile directory) fail. Should Spark be distributed

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Steve Lewis
OK, so in Java - pardon the verbosity - I might say something like the code below, but I face the following issues: 1) I need to store all values in memory as I run combineByKey - if I could return an RDD which consumed values, that would be great, but I don't know how to do that - 2) In my version of

Setting serializer to KryoSerializer from command line for spark-shell

2014-09-20 Thread Soumya Simanta
Hi, I want to set the serializer for my spark-shell to Kryo, i.e. spark.serializer to org.apache.spark.serializer.KryoSerializer. Can I do it without setting a new SparkConf? Thanks -Soumya

Distributed dictionary building

2014-09-20 Thread Debasish Das
Hi, I am building a dictionary of RDD[(String, Long)] and after the dictionary is built and cached, I find key "almonds" at value 5187 using: rdd.filter { case (product, index) => product == "almonds" }.collect Output: Debug product almonds index 5187 Now I take the same dictionary and write it out as:

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
I did not persist / cache it as I assumed zipWithIndex will preserve order... There is also zipWithUniqueId...I am trying that...If that also shows the same issue, we should make it clear in the docs... On Sat, Sep 20, 2014 at 1:44 PM, Sean Owen so...@cloudera.com wrote: From offline question

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
I changed zipWithIndex to zipWithUniqueId and that seems to be working... What's the difference between zipWithIndex and zipWithUniqueId? For zipWithIndex we don't need to run the count to compute the offset which is needed for zipWithUniqueId, and so zipWithIndex is efficient? It's not very
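
A hedged sketch contrasting the two calls. What I'm relying on: zipWithIndex assigns consecutive indices 0..n-1 and has to launch a job first to count the elements in each partition, while zipWithUniqueId assigns k, k+n, k+2n, ... (n = number of partitions) without an extra job, so its ids are unique but not consecutive. Data and partition count are illustrative; `sc` is an existing SparkContext.

    val words = sc.parallelize(Seq("almonds", "apples", "beans", "corn"), 2)

    val withIndex    = words.zipWithIndex()     // ("almonds",0), ("apples",1), ("beans",2), ("corn",3)
    val withUniqueId = words.zipWithUniqueId()  // ("almonds",0), ("apples",2), ("beans",1), ("corn",3)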

Spark streaming twitter exception

2014-09-20 Thread Maisnam Ns
Hi, can somebody help me with adding library dependencies in my build.sbt so that the java.lang.NoClassDefFoundError issue can be resolved? My sbt (only the dependencies part): libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.0.1", "org.apache.spark" %% "spark-streaming" %
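
A hedged build.sbt sketch with the dependencies the message mentions; the spark-streaming-twitter connector (which brings in twitter4j) is an assumption about what the job actually uses.

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"              % "1.0.1",
      "org.apache.spark" %% "spark-streaming"         % "1.0.1",
      "org.apache.spark" %% "spark-streaming-twitter" % "1.0.1"
    )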

Re: spark-submit command-line with --files

2014-09-20 Thread Marcelo Vanzin
Hi chinchu, Where does the code trying to read the file run? Is it running on the driver or on some executor? If it's running on the driver, in yarn-cluster mode, the file should have been copied to the application's work directory before the driver is started. So hopefully just doing new
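
A hedged sketch of what that suggestion looks like in driver code for yarn-cluster mode, where files passed via --files land in the container's working directory; my.properties is a hypothetical file name.

    import java.io.File
    import scala.io.Source

    // spark-submit --master yarn-cluster --files my.properties ...
    val f = new File("my.properties")   // resolved against the driver container's working directory
    if (f.exists()) {
      Source.fromFile(f).getLines().foreach(println)
    }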

Re: Example of Geoprocessing with Spark

2014-09-20 Thread Abel Coronado Iruegas
Thanks, Evan and Andy: Here is a very functional version. I need to improve the syntax, but this works very well: the initial version takes around 36 hours on 9 machines with 8 cores each, and this version takes 36 minutes on a cluster of 7 machines with 8 cores: object SimpleApp { def

Re: spark-submit command-line with --files

2014-09-20 Thread chinchu
Thanks Marcelo. The code trying to read the file always runs in the driver. I understand the problem with other master deployments, but will it work in local, yarn-client and yarn-cluster deployments? That's all I care about for now :-) Also, what is the suggested way to do something like this? Put the

Re: Reproducing the function of a Hadoop Reducer

2014-09-20 Thread Victor Tso-Guillen
1. Actually, I disagree that combineByKey requires that all values be held in memory for a key. Only the use case groupByKey does that, whereas reduceByKey, foldByKey, and the generic combineByKey do not necessarily make that requirement. If your combine logic really shrinks the result
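
A hedged sketch of the point being made: combineByKey can fold each key's values into a small accumulator instead of materializing them all, e.g. a running (sum, count) for per-key averages. The data and names are illustrative; `sc` is an existing SparkContext.

    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
    val sumCounts = pairs.combineByKey(
      (v: Double)                            => (v, 1L),                     // createCombiner
      (acc: (Double, Long), v: Double)       => (acc._1 + v, acc._2 + 1L),   // mergeValue
      (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)   // mergeCombiners
    )
    val averages = sumCounts.mapValues { case (sum, count) => sum / count }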

Could you please add us to 'Powered by Spark' List

2014-09-20 Thread US Office Admin
Organization Name: Vectorum Inc. URL: http://www.vectorum.com/ List of Spark Components: Tachyon, Spark 1.1, Spark SQL, MLlib (in the works) with Hadoop and Play Framework. Working on digital fingerprint search. Use Case: Using machine data to predict machine failures. We