Spark Streaming from Kafka

2014-10-29 Thread Harold Nguyen
Hi, Just wondering if you've seen the following error when reading from Kafka: ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest at kafka.utils.Log4jController$.init(Log4jController.scala:29) at

How to retrieve spark context when hiveContext is used in sparkstreaming

2014-10-29 Thread critikaled
Hi, I'm trying to get hold of the Spark context from a HiveContext or StreamingContext. I have two pieces of code, one in core Spark and one in Spark Streaming. Plain Spark with Hive gives me the context. The Spark Streaming code with Hive prints null. Please help me figure out how to make this code
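
No reply is archived in this digest, but for reference: both HiveContext and StreamingContext keep a handle on the SparkContext they were built from, so a sketch along the following lines should work on Spark 1.1 (the master URL, app name and batch interval are placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[2]", "context-example")   // placeholder master/app name
    val hiveContext = new HiveContext(sc)
    val ssc = new StreamingContext(sc, Seconds(10))

    // Both contexts expose the SparkContext they wrap.
    val fromHive: SparkContext = hiveContext.sparkContext
    val fromStreaming: SparkContext = ssc.sparkContext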

Re: FileNotFoundException in appcache shuffle files

2014-10-29 Thread Shaocun Tian
Hi Ryan, we have seen similar errors, and increasing executor memory solved them. Though I am not sure about the detailed reason, it might be worth a try. On Wed, Oct 29, 2014 at 1:34 PM, Ryan Williams [via Apache Spark User List] ml-node+s1001560n17605...@n3.nabble.com wrote: My job is failing

Fwd: sampling in spark

2014-10-29 Thread Chengi Liu
-- Forwarded message -- From: Chengi Liu chengi.liu...@gmail.com Date: Tue, Oct 28, 2014 at 11:23 PM Subject: Re: sampling in spark To: Davies Liu dav...@databricks.com Any suggestions.. Thanks On Tue, Oct 28, 2014 at 12:53 AM, Chengi Liu chengi.liu...@gmail.com wrote: Is

Re: Submitting Spark application through code

2014-10-29 Thread Akhil Das
And the Scala way of doing it would be: val sc = new SparkContext(conf); sc.addJar("/full/path/to/my/application/jar/myapp.jar") On Wed, Oct 29, 2014 at 1:44 AM, Shailesh Birari sbir...@wynyardgroup.com wrote: Yes, this is doable. I am submitting the Spark job using JavaSparkContext
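
A slightly fuller sketch of the same idea, for readers skimming the archive; the jar path and app name are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("MyApp")        // placeholder app name
    val sc = new SparkContext(conf)

    // Ship the application jar to the executors after the context is created...
    sc.addJar("/full/path/to/my/application/jar/myapp.jar")
    // ...or, equivalently, list it up front on the SparkConf:
    // new SparkConf().setJars(Seq("/full/path/to/my/application/jar/myapp.jar"))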

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-29 Thread Prashant Sharma
Yes, we shade Akka to change its protobuf version (if I am not wrong). Yes, binary compatibility with other Akka modules is compromised. One thing you can try is to use Akka from org.spark-project.akka. I have not tried this and am not sure if it's going to help you, but maybe you could exclude the akka

Re: install sbt

2014-10-29 Thread Akhil Das
1. Download https://dl.bintray.com/sbt/native-packages/sbt/0.13.6/sbt-0.13.6.zip 2. Extract 3. export PATH=$PATH:/path/to/sbt/bin/sbt Now you can do all the sbt commands (sbt package etc.) Thanks Best Regards On Tue, Oct 28, 2014 at 9:49 PM, Soumya Simanta soumya.sima...@gmail.com wrote: sbt

Re: Saving to Cassandra from Spark Streaming

2014-10-29 Thread Akhil Das
You need to set the following jar (cassandra-connector http://central.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.10/1.1.0-alpha3/spark-cassandra-connector_2.10-1.1.0-alpha3.jar) in the classpath like:
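
The message is truncated, but one way to get the connector jar onto the executors' classpath programmatically is via SparkConf.setJars; a sketch, where the jar path and Cassandra host are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("StreamingToCassandra")                                       // placeholder
      .setJars(Seq("/path/to/spark-cassandra-connector_2.10-1.1.0-alpha3.jar")) // shipped to executors
      .set("spark.cassandra.connection.host", "127.0.0.1")                      // connector setting
    val sc = new SparkContext(conf)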

Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-10-29 Thread Akhil Das
What are you trying to do? Connecting to a remote cluster from your local Windows Eclipse environment? Just make sure you meet the following: 1. Set spark.driver.host to your local IP (where you run your code; it should be accessible from the cluster) 2. Make sure no firewall/router
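
A minimal sketch of the driver-side settings being described; the master URL, IP and port are placeholders, and the key point is that the cluster must be able to reach them:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://unix-master:7077")        // placeholder cluster URL
      .setAppName("RemoteDriverFromWindows")
      .set("spark.driver.host", "192.168.1.50")     // the local (Windows) machine's IP
      .set("spark.driver.port", "51000")            // fix the port so it can be opened in the firewall
    val sc = new SparkContext(conf)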

Re: unsubscribe

2014-10-29 Thread Akhil Das
Send it to user-unsubscr...@spark.apache.org. Read the community page https://spark.apache.org/community.html for more info Thanks Best Regards On Wed, Oct 29, 2014 at 3:32 AM, Ricky Thomas ri...@truedash.io wrote:

Re: Batch of updates

2014-10-29 Thread Sean Owen
I don't think accumulators come into play here. Use foreachPartition, not mapPartitions. On Wed, Oct 29, 2014 at 12:43 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Sorry but I wasn't able to code my stuff using accumulators as you suggested :( In my use case I have to add elements to
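
A sketch of the suggested pattern, batching writes per partition; the Store class here is only a stand-in for whatever external system is being updated:

    import org.apache.spark.rdd.RDD

    // Placeholder client for the external store being updated.
    class Store { def write(batch: Seq[String]): Unit = (); def close(): Unit = () }

    def saveInBatches(rdd: RDD[String]): Unit =
      rdd.foreachPartition { iter =>
        val store = new Store()                  // one client per partition, created on the executor
        iter.grouped(100).foreach(store.write)   // push updates in batches of 100 elements
        store.close()
      }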

Re: Spark Streaming from Kafka

2014-10-29 Thread Akhil Das
Looks like the kafka jar that you are using isn't compatible with your scala version. Thanks Best Regards On Wed, Oct 29, 2014 at 11:50 AM, Harold Nguyen har...@nexgate.com wrote: Hi, Just wondering if you've seen the following error when reading from Kafka: ERROR ReceiverTracker:
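
For readers hitting the same thing: with Spark 1.1 on Scala 2.10, the usual route is to depend on the matching spark-streaming-kafka artifact and let it pull in a compatible Kafka build. An sbt sketch (versions shown are only illustrative for a Scala 2.10 / Spark 1.1.0 setup):

    // build.sbt (sketch)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
    )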

Re: Use RDD like a Iterator

2014-10-29 Thread Sean Owen
Call RDD.toLocalIterator()? https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html On Wed, Oct 29, 2014 at 4:15 AM, Dai, Kevin yun...@ebay.com wrote: Hi, ALL I have an RDD[T]; can I use it like an iterator? That means I can compute every element of this RDD lazily.
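
A small sketch of the call being suggested; toLocalIterator pulls one partition at a time to the driver instead of materializing the whole RDD at once:

    import org.apache.spark.rdd.RDD

    def consumeLazily(rdd: RDD[Int]): Unit = {
      // Each partition is computed and shipped to the driver only when the iterator reaches it.
      val it: Iterator[Int] = rdd.toLocalIterator
      it.take(10).foreach(println)
    }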

Re: Spark Streaming from Kafka

2014-10-29 Thread harold
Thanks! How do I find out which Kafka jar to use for scala 2.10.4? — Sent from Mailbox On Wed, Oct 29, 2014 at 12:26 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Looks like the kafka jar that you are using isn't compatible with your scala version. Thanks Best Regards On Wed, Oct 29,

Re: Spark Streaming from Kafka

2014-10-29 Thread harold
I am using kafka_2.10-1.1.0.jar on Spark 1.1.0 — Sent from Mailbox On Wed, Oct 29, 2014 at 12:31 AM, null har...@nexgate.com wrote: Thanks! How do I find out which Kafka jar to use for scala 2.10.4? — Sent from Mailbox On Wed, Oct 29, 2014 at 12:26 AM, Akhil Das ak...@sigmoidanalytics.com

Spark streaming and save to cassandra and elastic search

2014-10-29 Thread aarthi
Hi, I've written a Spark Streaming job which streams data from Flume to Kafka, which is received by Spark. I've used updateStateByKey and then for each RDD I'm saving the data into Cassandra and Elasticsearch (by calling 2 different methods). The above parts are working fine when the streaming job is

Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Csaba Ragany
Thank you Holden, it works! infile = sc.wholeTextFiles(sys.argv[1]) rdd = sc.parallelize(infile.collect()) rdd.saveAsSequenceFile(sys.argv[2]) Csaba 2014-10-28 17:56 GMT+01:00 Holden Karau hol...@pigscanfly.ca: Hi Csaba, It sounds like the API you are looking for is sc.wholeTextFiles :)

Re: Use RDD like a Iterator

2014-10-29 Thread Yanbo Liang
RDD.toLocalIterator() is the suitable solution. But I doubt whether it conforms with the design principles of Spark and RDDs. All RDD transformations are lazily computed until they end with some action. 2014-10-29 15:28 GMT+08:00 Sean Owen so...@cloudera.com: Call RDD.toLocalIterator()?

sbt/sbt compile error [FATAL]

2014-10-29 Thread HansPeterS
Hi, I have cloned Spark as: git clone g...@github.com:apache/spark.git; cd spark; sbt/sbt compile. Apparently http://repo.maven.apache.org/maven2 is no longer valid. See the error further below. Is this correct? And what should it be changed to? Everything seems to go smoothly until:

Re: sbt/sbt compile error [FATAL]

2014-10-29 Thread Soumya Simanta
Are you trying to compile the master branch ? Can you try branch-1.1 ? On Wed, Oct 29, 2014 at 6:55 AM, HansPeterS hanspeter.sl...@gmail.com wrote: Hi, I have cloned sparked as: git clone g...@github.com:apache/spark.git cd spark sbt/sbt compile Apparently

Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread arthur.hk.c...@gmail.com
Hi, my Hive is 0.13.1; how do I make Spark 1.1.0 run on Hive 0.13? Please advise. Or, any news about when Spark 1.1.0 on Hive 0.13.1 will be available? Regards, Arthur

RE: FileNotFoundException in appcache shuffle files

2014-10-29 Thread Ganelin, Ilya
Hi Ryan - I've been fighting the exact same issue for well over a month now. I initially saw the issue in 1.0.2 but it persists in 1.1. Jerry - I believe you are correct that this happens during a pause on long-running jobs on a large data set. Are there any parameters that you suggest tuning

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread Cheng Lian
Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and related PRs are already merged or being merged to the master branch. On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote: Hi, My Hive is 0.13.1, how to make Spark 1.1.0 run on Hive 0.13? Please advise. Or, any news

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread arthur.hk.c...@gmail.com
Hi, Thanks for your update. Any idea when will Spark 1.2 be GA? Regards Arthur On 29 Oct, 2014, at 8:22 pm, Cheng Lian lian.cs@gmail.com wrote: Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and related PRs are already merged or being merged to the master

Re: Spark streaming and save to cassandra and elastic search

2014-10-29 Thread aarthi
Yes, the data is getting processed. I printed the data and the RDD count. The point where the data gets saved is not invoked. I am using the same class and jar for submitting by both methods. The only difference is that one is launched from Tomcat and the other directly via PuTTY.

Streaming Question regarding lazy calculations

2014-10-29 Thread sivarani
Hi all, I am using Spark Streaming with Kafka streaming 24/7. My code is something like: JavaDStream<String> data = messages.map(new MapData()); JavaPairDStream<String, Iterable<String>> records = data.mapToPair(new dataPair()).groupByKey(100); records.print(); JavaPairDStream<String, Double

CANNOT FIND ADDRESS

2014-10-29 Thread akhandeshi
The Spark application UI shows that one of the executors CANNOT FIND ADDRESS. Aggregated Metrics by Executor: Executor ID | Address | Task Time | Total Tasks | Failed Tasks | Succeeded Tasks | Input | Shuffle Read | Shuffle

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-29 Thread Chester At Work
We used both Spray and Akka. To avoid compatibility issues, we used Spark's shaded Akka version. It works for us. This is the 1.1.0 branch; I have not tried the master branch. Chester Sent from my iPad On Oct 28, 2014, at 11:48 PM, Prashant Sharma scrapco...@gmail.com wrote: Yes we shade akka to

Re: problem with start-slaves.sh

2014-10-29 Thread Yana Kadiyska
I see this when I start a worker and then try to start it again forgetting it's already running (I don't use start-slaves, I start the slaves individually with start-slave.sh). All this is telling you is that there is already a running process on that machine. You can see it if you do a ps

Spark Performance

2014-10-29 Thread akhandeshi
I am relatively new to Spark. I am using the Spark Java API to process data. I am having trouble processing a data set that I don't think is significantly large. It is joining datasets that are around 3-4 GB each (around 12 GB of data in total). The workflow is: x=RDD1.KeyBy(x).partitionBy(new

Re: CANNOT FIND ADDRESS

2014-10-29 Thread Yana Kadiyska
CANNOT FIND ADDRESS occurs when your executor has crashed. I'd look further down where it shows each task and see if any tasks failed. Then you can examine the error log of that executor and see why it died. On Wed, Oct 29, 2014 at 9:35 AM, akhandeshi ami.khande...@gmail.com wrote:

Re: Java api overhead?

2014-10-29 Thread Koert Kuipers
Since Spark holds data structures on the heap (and by default tries to work with all data in memory) and is written in Scala, seeing lots of Scala Tuple2 instances is not unexpected. How do these numbers relate to your data size? On Oct 27, 2014 2:26 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, I wanted

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread Mark Hamstra
Sometime after Nov. 15: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage On Wed, Oct 29, 2014 at 5:28 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, Thanks for your update. Any idea when will Spark 1.2 be GA? Regards Arthur On 29 Oct, 2014, at 8:22 pm,

Re: CANNOT FIND ADDRESS

2014-10-29 Thread akhandeshi
Thanks... hmm, it seems to be a timeout issue perhaps? Not sure what is causing it or how to debug. I see the following error message... 14/10/29 13:26:04 ERROR ContextCleaner: Error cleaning broadcast 9 akka.pattern.AskTimeoutException: Timed out at

XML Utilities for Apache Spark

2014-10-29 Thread Darin McBeath
I developed the spark-xml-utils library because we have a large amount of XML in big datasets and I felt this data could be better served by providing some helpful xml utilities. This includes the ability to filter documents based on an xpath/xquery expression, return specific nodes for an

Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Davies Liu
Without the second line, it will be much faster: infile = sc.wholeTextFiles(sys.argv[1]) infile.saveAsSequenceFile(sys.argv[2]) On Wed, Oct 29, 2014 at 3:31 AM, Csaba Ragany rag...@gmail.com wrote: Thank you Holden, it works! infile = sc.wholeTextFiles(sys.argv[1]) rdd =

PySpark and Cassandra 2.1 Examples

2014-10-29 Thread Mike Sukmanowsky
Hey all, Just thought I'd share this with the list in case anyone else would benefit. Currently working on a proper integration of PySpark and DataStax's new Cassandra-Spark connector, but that's ongoing. In the meantime, I've basically updated the cassandra_inputformat.py and

Re: PySpark and Cassandra 2.1 Examples

2014-10-29 Thread Helena Edelson
Nice! - Helena @helenaedelson On Oct 29, 2014, at 12:01 PM, Mike Sukmanowsky mike.sukmanow...@gmail.com wrote: Hey all, Just thought I'd share this with the list in case any one else would benefit. Currently working on a proper integration of PySpark and DataStax's new

Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi all, I followed the guide here: http://spark.apache.org/docs/latest/streaming-kinesis-integration.html But got this error: Exception in thread main java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider Would you happen to know what dependency or jar is needed ? Harold
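
The missing class comes from the AWS SDK, which the Kinesis integration pulls in transitively; a hedged sbt sketch of the dependency that usually resolves this (version shown is illustrative for Spark 1.1.0):

    // build.sbt (sketch) -- the kinesis-asl module brings in the AWS SDK, including AWSCredentialsProvider
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.1.0"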

Re: Unit Testing (JUnit) with Spark

2014-10-29 Thread touchdown
Add this to your dependencies: "io.netty" % "netty" % "3.6.6.Final", and add exclude("io.netty", "netty-all") to the end of the Spark and Hadoop dependencies. Reference: https://spark-project.atlassian.net/browse/SPARK-1138. I am using Spark 1.1, so the Akka issue is already fixed.

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Gerard Maas
Hi TD, Thanks a lot for the comprehensive answer. I think this explanation deserves some place in the Spark Streaming tuning guide. -kr, Gerard. On Thu, Oct 23, 2014 at 11:41 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Hey Gerard, This is a very good question! *TL;DR: *The

Re: Spark Streaming with Kinesis

2014-10-29 Thread Harold Nguyen
Hi again, After getting through several dependencies, I finally got to this non-dependency type error: Exception in thread main java.lang.NoSuchMethodError:

Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
I have a SchemaRDD with 100 records in 1 partition. We'll call this baseline. I have a SchemaRDD with 11 records in 1 partition. We'll call this daily. After a fairly basic join of these two tables JavaSchemaRDD results = sqlContext.sql("SELECT id, action, daily.epoch, daily.version FROM

RE: how to retrieve the value of a column of type date/timestamp from a Spark SQL Row

2014-10-29 Thread Mohammed Guller
Thanks, guys. Michael Armbrust also suggested the same two approaches. I believe “getAs[Date]” is available only in 1.2 branch and I have Spark 1.1, so I am using row(i).asInstanceOf[Date], which works. Mohammed From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, October 28, 2014
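
A small sketch of the two approaches mentioned, assuming column i really holds a java.sql.Date (swap in Timestamp if that is what the schema stores); note the Row import location varied across early releases:

    import java.sql.Date
    import org.apache.spark.sql.Row

    // Spark 1.1: plain cast, as described above.
    def dateAt(row: Row, i: Int): Date = row(i).asInstanceOf[Date]

    // On the 1.2 branch, per the message above, a typed accessor is available instead:
    // def dateAt(row: Row, i: Int): Date = row.getAs[Date](i)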

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Tathagata Das
Good idea, will do for 1.2 release. On Oct 29, 2014 9:50 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi TD, Thanks a lot for the comprehensive answer. I think this explanation deserves some place in the Spark Streaming tuning guide. -kr, Gerard. On Thu, Oct 23, 2014 at 11:41 PM,

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread jay vyas
Hi Tathagata, I actually had a few minor improvements to Spark Streaming in SPARK-4040. Possibly I could weave this in with my PR? On Wed, Oct 29, 2014 at 1:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Good idea, will do for 1.2 release. On Oct 29, 2014 9:50 AM, Gerard Maas

winutils

2014-10-29 Thread Ron Ayoub
Apparently Spark does require Hadoop even if you do not intend to use Hadoop. Is there a workaround for the below error I get when creating the SparkContext in Scala? I will note that I didn't have this problem yesterday when creating the Spark context in Java as part of the getting started

Re: winutils

2014-10-29 Thread Denny Lee
QQ - did you download the Spark 1.1 binaries that included the Hadoop one? Does this happen if you're using the Spark 1.1 binaries that do not include the Hadoop jars? On Wed, Oct 29, 2014 at 11:31 AM, Ron Ayoub ronalday...@live.com wrote: Apparently Spark does require Hadoop even if you do not

Re: CANNOT FIND ADDRESS

2014-10-29 Thread Akhil Das
Can you try setting the following while creating the sparkContext and see if the issue still exists? .set("spark.core.connection.ack.wait.timeout", "900") .set("spark.akka.frameSize", "50") .set("spark.akka.timeout", "900") Looks like your executor is stuck on GC pause. Thanks Best Regards On

Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
OK. After reading some documentation, it would appear the issue is the default number of partitions for a join (200). After doing something like the following, I was able to change the value. From: Darin McBeath ddmcbe...@yahoo.com.INVALID To: User user@spark.apache.org Sent: Wednesday,

Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
Sorry, hit the send key a bit too early. Anyway, this is the code I set: sqlContext.sql("set spark.sql.shuffle.partitions=10"); From: Darin McBeath ddmcbe...@yahoo.com To: Darin McBeath ddmcbe...@yahoo.com; User user@spark.apache.org Sent: Wednesday, October 29, 2014 2:47 PM Subject: Re:

RE: winutils

2014-10-29 Thread Ron Ayoub
Well. I got past this problem and the manner was in my own email. I did download the one with Hadoop since it was among the only ones you don't have to compile from source along with CDH and Map. It worked yesterday because I added 1.1.0 as a maven dependency from the repository. I just did the

Questions about serialization and SparkConf

2014-10-29 Thread Steve Lewis
Assume in my executor I say SparkConf sparkConf = new SparkConf(); sparkConf.set("spark.kryo.registrator", "com.lordjoe.distributed.hydra.HydraKryoSerializer"); sparkConf.set("mysparc.data", "Some user Data"); sparkConf.setAppName("Some App"); Now 1) Are there default

Re: Spark Streaming with Kinesis

2014-10-29 Thread Matt Chu
I haven't tried this myself yet, but this sounds relevant: https://github.com/apache/spark/pull/2535 Will be giving this a try today or so, will report back. On Wednesday, October 29, 2014, Harold Nguyen har...@nexgate.com wrote: Hi again, After getting through several dependencies, I

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-29 Thread Michael Armbrust
We are working on more helpful error messages, but in the meantime let me explain how to read this output. org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: 'p.name,'p.age, tree: Project ['p.name,'p.age] Filter ('location.number = 2300) Join Inner,

RE: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-29 Thread Mohammed Guller
I am not sure about that. Can you try a Spray version built with 2.2.x along with Spark 1.1 and include the Akka dependencies in your project’s sbt file? Mohammed From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Tuesday, October 28, 2014 8:58 PM To: Mohammed Guller Cc: user Subject:

Spark SQL - how to query dates stored as millis?

2014-10-29 Thread bkarels
I have been searching and have not found a solution as to how one might query on dates stored as UTC milliseconds from the epoch. The schema I have pulled in from a NoSQL datasource (JSON from MongoDB) has the target date as: |-- dateCreated: struct (nullable = true) |    |-- $date: long

Re: winutils

2014-10-29 Thread Sean Owen
cf. https://issues.apache.org/jira/browse/SPARK-2356 On Wed, Oct 29, 2014 at 7:31 PM, Ron Ayoub ronalday...@live.com wrote: Apparently Spark does require Hadoop even if you do not intend to use Hadoop. Is there a workaround for the below error I get when creating the SparkContext in Scala? I

Convert DStream to String

2014-10-29 Thread Harold Nguyen
Hi all, How do I convert a DStream to a string? For instance, I want to be able to: val myword = words.filter(word => word.startsWith("blah")) And use myword in other places, like tacking it onto (key, value) pairs, like so: val pairs = words.map(word => (myword + "_" + word, 1)) Thanks for any help,

what does DStream.union() do?

2014-10-29 Thread spr
The documentation at https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.dstream.DStream describes the union() method as Return a new DStream by unifying data of another DStream with this DStream. Can somebody provide a clear definition of what unifying means in

BUG: when running as extends App, closures don't capture variables

2014-10-29 Thread Michael Albert
Greetings! This might be a documentation issue as opposed to a coding issue, in that perhaps the correct answer is "don't do that", but as this is not obvious, I am writing. The following code produces output most would not expect: package misc import org.apache.spark.SparkConf import

Re: what does DStream.union() do?

2014-10-29 Thread Holden Karau
The union function simply returns a DStream with the elements from both. This is the same behavior as when we call union on RDDs :) (You can think of union as similar to the union operator on sets except without the unique element restrictions). On Wed, Oct 29, 2014 at 3:15 PM, spr

how to extract/combine elements of an Array in DStream element?

2014-10-29 Thread spr
I am processing a log file, from each line of which I want to extract the zeroth and 4th elements (and an integer 1 for counting) into a tuple. I had hoped to be able to index the Array for elements 0 and 4, but Arrays appear not to support vector indexing. I'm not finding a way to extract and
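
A minimal sketch of one way to do this, where lines stands for the DStream (or RDD) of raw, whitespace-separated log lines and the field positions are as described in the question:

    val counted = lines.map { line =>
      val fields = line.split("\\s+")
      (fields(0), fields(4), 1)        // zeroth field, 4th field, and a count of 1
    }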

Re: how to extract/combine elements of an Array in DStream element?

2014-10-29 Thread Holden Karau
On Wed, Oct 29, 2014 at 3:29 PM, spr s...@yarcdata.com wrote: I am processing a log file, from each line of which I want to extract the zeroth and 4th elements (and an integer 1 for counting) into a tuple. I had hoped to be able to index the Array for elements 0 and 4, but Arrays appear not

Spark related meet up on Nov 6th in SF

2014-10-29 Thread Alexis Roos
Hi all, We’re organizing a meet up on Nov 6th in our office downtown SF that might be of interest to the Spark community. We will be discussing our experience building our first production Spark based application. More details and sign up info here:

Re: Convert DStream to String

2014-10-29 Thread Sean Owen
What would it mean to make a DStream into a String? it's inherently a sequence of things over time, each of which might be a string but which are usually RDDs of things. On Wed, Oct 29, 2014 at 11:15 PM, Harold Nguyen har...@nexgate.com wrote: Hi all, How do I convert a DStream to a string ?

Re: what does DStream.union() do?

2014-10-29 Thread spr
I need more precision to understand. If the elements of one DStream/RDD are (String) and the elements of the other are (Time, Int), what does union mean? I'm hoping for (String, Time, Int) but that appears optimistic. :) Do the elements have to be of homogeneous type? Holden Karau wrote

Re: Convert DStream to String

2014-10-29 Thread Harold Nguyen
Hi Sean, I'd just like to take the first word of every line, and use it as a variable for later. Is there a way to do that? Here's the gist of what I want to do: val lines = KafkaUtils.createStream(ssc, "localhost:2181", "test", Map("test" -> 10)).map(_._2) val words = lines.flatMap(_.split(" "))

Re: what does DStream.union() do?

2014-10-29 Thread Holden Karau
On Wed, Oct 29, 2014 at 3:39 PM, spr s...@yarcdata.com wrote: I need more precision to understand. If the elements of one DStream/RDD are (String) and the elements of the other are (Time, Int), what does union mean? I'm hoping for (String, Time, Int) but that appears optimistic. :) It
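
To make the element-type point explicit: union requires both streams to carry the same element type, so heterogeneous streams first have to be mapped onto a common type. A sketch using Either (any common wrapper or case class would do):

    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    def unite(a: DStream[String], b: DStream[(Time, Int)]): DStream[Either[String, (Time, Int)]] = {
      val left  = a.map(s => Left(s): Either[String, (Time, Int)])
      val right = b.map(p => Right(p): Either[String, (Time, Int)])
      left.union(right)   // both sides now share one element type
    }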

Re: BUG: when running as extends App, closures don't capture variables

2014-10-29 Thread Matei Zaharia
Good catch! If you'd like, you can send a pull request changing the files in docs/ to do this (see https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark), otherwise maybe open an issue on

Re: Convert DStream to String

2014-10-29 Thread Sean Owen
Sure, that code looks like it does sort of what you describe, but it's mixed up in a few ways. It looks like you only want to operate on words that start with SECRETWORD, but then you are prepending acct and _ in the code while expecting something appended in the result. You also seem like you want
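
One hedged sketch of what the original question seems to be after: keep the first token of each line and attach it to every (word, 1) pair produced from that line (lines is the DStream from the snippet above):

    val pairs = lines.flatMap { line =>
      val tokens = line.split(" ")
      val first = tokens.headOption.getOrElse("")        // first word of this line
      tokens.map(word => (first + "_" + word, 1)).toSeq  // pair every word with that first word
    }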

Spark with HLists

2014-10-29 Thread Simon Hafner
I tried using shapeless HLists as data storage for data inside spark. Unsurprisingly, it failed. The deserialization isn't well-defined because of all the implicits used by shapeless. How could I make it work? Sample Code: /* SimpleApp.scala */ import org.apache.spark.SparkContext import

How does custom partitioning in PySpark work?

2014-10-29 Thread Def_Os
I want several RDDs (which are the result of my program's operations on existing RDDs) to match the partitioning of an existing RDD, since they will be joined together in the end. Do I understand correctly that I would benefit from using a custom partitioner that would be applied to all RDDs?

Re: Spark with HLists

2014-10-29 Thread Koert Kuipers
Looks like a missing class issue? What makes you think it's serialization? Shapeless does indeed have a lot of helper classes that get sucked in and are not serializable. See here: https://groups.google.com/forum/#!topic/shapeless-dev/05_DXnoVnI4 and for a project that uses shapeless in spark

use additional ebs volumes for hdfs storage with spark-ec2

2014-10-29 Thread Daniel Mahler
I started my EC2 Spark cluster with ./ec2/spark-ec2 --ebs-vol-{size=100,num=8,type=gp2} -t m3.xlarge -s 10 launch mycluster. I see the additional volumes attached, but they do not seem to be set up for HDFS. How can I check if they are being utilized on all workers, and how can I get all workers

SparkSQL: Nested Query error

2014-10-29 Thread SK
Hi, I am using Spark 1.1.0. I have the following SQL statement where I am trying to count the number of UIDs that are in the tusers table but not in the device table. val users_with_no_device = sql_cxt.sql("SELECT COUNT(u_uid) FROM tusers WHERE tusers.u_uid NOT IN (SELECT d_uid FROM device)") I

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-10-29 Thread Chris Fregly
jira created with comments/references to this discussion: https://issues.apache.org/jira/browse/SPARK-4144 On Tue, Aug 19, 2014 at 4:47 PM, Xiangrui Meng men...@gmail.com wrote: No. Please create one but it won't be able to catch the v1.1 train. -Xiangrui On Tue, Aug 19, 2014 at 4:22 PM,

Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-10-29 Thread Shailesh Birari
Thanks. By setting the driver host to the Windows machine and specifying some ports (driver, fileserver, broadcast, etc.), it worked perfectly. I need to specify those ports as not all ports are open on my machine. For the driver host name, I was assuming Spark should get it, as in the case of Linux we are not

Task Size Increases when using loops

2014-10-29 Thread nsareen
Hi, I'm new to Spark and am facing a peculiar problem. I'm writing a simple Java driver program where I'm creating a key/value data structure and collecting it once created. The problem I'm facing is that when I increase the iterations of a for loop which creates the ArrayList of Long values

GC Issues with randomSplit on large dataset

2014-10-29 Thread Ganelin, Ilya
Hey all – not writing to necessarily get a fix but more to get an understanding of what’s going on internally here. I wish to take a cross-product of two very large RDDs (using cartesian), the product of which is well in excess of what can be stored on disk. Clearly that is intractable, thus

spark-submit results in NoClassDefFoundError

2014-10-29 Thread Tobias Pfeiffer
Hi, I am trying to get my Spark application to run on YARN and by now I have managed to build a fat jar as described on http://markmail.org/message/c6no2nyaqjdujnkq (which is the only really usable manual on how to get such a jar file). My code runs fine using sbt test and sbt run, but when

Re: spark-submit results in NoClassDefFoundError

2014-10-29 Thread Tobias Pfeiffer
Hi again, On Thu, Oct 30, 2014 at 11:50 AM, Tobias Pfeiffer t...@preferred.jp wrote: Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NoClassDefFoundError: com/typesafe/scalalogging/slf4j/Logger It turned out scalalogging

Re: Questions about serialization and SparkConf

2014-10-29 Thread Ilya Ganelin
Hello Steve. 1) When you call new SparkConf you should get an object with the default config values. You can reference the Spark configuration and tuning pages for details on what those are. 2) Yes. Properties set in this configuration will be pushed down to worker nodes actually executing the

RE: problem with start-slaves.sh

2014-10-29 Thread Pagliari, Roberto
Hi Yana, in my case I did not start any Spark worker. However, Shark was definitely running. Do you think that might be a problem? I will take a look. Thank you, From: Yana Kadiyska [yana.kadiy...@gmail.com] Sent: Wednesday, October 29, 2014 9:45 AM To:

Re: SparkSQL: Nested Query error

2014-10-29 Thread Sanjiv Mittal
You may use - select count(u_uid) from tusers a left outer join device b on (a.u_uid = b.d_uid) where b.d_uid is null On Wed, Oct 29, 2014 at 5:45 PM, SK skrishna...@gmail.com wrote: Hi, I am using Spark 1.1.0. I have the following SQL statement where I am trying to count the number of
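
For completeness, the same anti-join wrapped in the SQL call from the original post, assuming tusers and device are registered as tables exactly as before (a sketch, not tested):

    val usersWithNoDevice = sql_cxt.sql(
      """SELECT COUNT(a.u_uid)
        |FROM tusers a LEFT OUTER JOIN device b ON (a.u_uid = b.d_uid)
        |WHERE b.d_uid IS NULL""".stripMargin)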

Algebird using spark-shell

2014-10-29 Thread bdev
I'm running into this error when I attempt to launch spark-shell passing in the algebird-core jar: $ ./bin/spark-shell --jars algebird-core_2.9.2-0.1.11.jar scala> import com.twitter.algebird._ import com.twitter.algebird._ scala> import HyperLogLog._ import HyperLogLog._ scala>

MLLib: libsvm - default value initialization

2014-10-29 Thread Sameer Tilak
Hi all, I have my sparse data in libsvm format. val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt") I am running linear regression. Let us say that my data has the following entry: 1 1:0 4:1. I think it will assume 0 for indices 2 and 3, right? I would

Re: Java api overhead?

2014-10-29 Thread Sonal Goyal
Thanks Koert. These numbers indeed tie back to our data and algorithms. Would going the Scala route save some memory, as the Java API creates a wrapper Tuple2 for all pair functions? On Wednesday, October 29, 2014, Koert Kuipers ko...@tresata.com wrote: since spark holds data structures on heap

Re: Spark Worker node accessing Hive metastore

2014-10-29 Thread ken
Thanks Akhil. So the Spark worker node doesn't need access to the metastore to run Hive queries? If so, which component accesses the metastore? For Hive, the Hive CLI accesses the metastore before submitting M/R jobs. Thanks, Ken