Re: Using data in RDD to specify HDFS directory to write to

2014-11-13 Thread Akhil Das
Why not something like:

lines.foreachRDD(rdd => {

  // Convert the RDD's JSON (here: its first record) into a Map
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  val myMap = mapper.readValue[Map[String, String]](rdd.first())

  val event = myMap.getOrElse("event", System.currentTimeMillis().toString)

  rdd.saveAsTextFile("hdfs://akhldz:9000/" + event)
})

You can use the fasterxml jackson parser. I haven't tested the above code,
but I'm sure it will work.
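
If each batch can contain a mix of event names, another option (an untested sketch, not something proposed in this thread; the output-format class name and the paths below are made up, the host is taken from the snippet above) is to key each record by its event name and let Hadoop's MultipleTextOutputFormat split the output into one sub-directory per key:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.dstream.DStream

// Routes each (eventName, json) pair into a sub-directory named after the key,
// and drops the key itself from the written lines.
class EventOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

def saveByEvent(events: DStream[(String, String)]): Unit = {
  events.foreachRDD { (rdd, time) =>
    // A fresh base directory per batch avoids "output directory already exists" errors.
    rdd.saveAsHadoopFile("hdfs://akhldz:9000/user/hdfs/batch-" + time.milliseconds,
      classOf[String], classOf[String], classOf[EventOutputFormat])
  }
}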


Thanks
Best Regards

On Thu, Nov 13, 2014 at 6:27 AM, jschindler 
wrote:

> I am having a problem trying to figure out how to solve a problem.  I would
> like to stream events from Kafka to my Spark Streaming app and write the
> contents of each RDD out to a HDFS directory.  Each event that comes into
> the app via kafka will be JSON and have an event field with the name of the
> event.  I would like to grab the event name and then write out the event to
> hdfs:///user/hdfs/.
>
> My first intuition was to grab the event name and put it into the rdd, then
> run a forEachRDD loop and call save as text file where I concatenate the
> event name into the directory path.  I have pasted the code below but it
> will not work since I cannot access the data inside an RDD inside a
> forEachRDD loop.  If I dump all the RDD data into an array using .collect I
> won't be able to use the .saveAsTextFile() method.  I'm at a loss for coming
> up with a way to do this.  Any ideas/help would be greatly appreciated,
> thanks!
>
>
> case class Event(EventName: String, Payload: org.json4s.JValue)
>
> object App {
>
>   def main(args: Array[String]) {
>
> val ssc = new StreamingContext("local[6]", "Data", Seconds(20))
> ssc.checkpoint("checkpoint")
>
> val eventMap = Map("Events" -> 1)
> val pipe = KafkaUtils.createStream(ssc,
> "dockerrepo,dockerrepo,dockerrepo", "Cons1",  eventMap).map(_._2)
>
> val eventStream = pipe.map(data => {
>   parse(data)
> }).map(json => {
>   implicit val formats = DefaultFormats
>   val eventName = (json \ "event").extractOpt[String]
>   (eventName, json)
>   Event(eventName.getOrElse("*** NO EVENT NAME ***"), json)
>
> })
>
> eventStream.foreachRDD(event => {
> //val eventName = event.EventName  //CAN'T ACCESS eventName!
> event.saveAsTextFile("hdfs://ip-here/user/hdfs/" + eventName +
> "/rdd="
> + pageHit.id)  //what I would like to do if I could access eventName
> })
>
>
> ssc.start()
> ssc.awaitTermination()
>
>   }
> }
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-data-in-RDD-to-specify-HDFS-directory-to-write-to-tp18789.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


StreamingContext does not stop

2014-11-13 Thread Tobias Pfeiffer
Hi,

I am processing a bunch of HDFS data using the StreamingContext (Spark
1.1.0) which means that all files that exist in the directory at start()
time are processed in the first batch. Now when I try to stop this stream
processing using `streamingContext.stop(false, false)` (that is, even with
stopGracefully = false), it has no effect. The stop() call blocks and data
processing continues (probably it would stop after the batch, but that
would be too long since all my data is in that batch).
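
For reference, the call in question with its two boolean flags spelled out as named parameters (just restating the code above):

// stopSparkContext = false keeps the underlying SparkContext alive;
// stopGracefully = false asks for an immediate stop instead of draining pending batches.
streamingContext.stop(stopSparkContext = false, stopGracefully = false)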

I am not exactly sure if this is generally true or only for the first
batch. Also I observed that stopping the stream processing during the first
batch does occasionally lead to a very long time until the stop takes place
(even if there is no data present at all).

Has anyone experienced something similar? In my processing code, do I have
to do something particular (like checking for the state of the
StreamingContext) to allow the interruption? It is quite important for me
that stopping the stream processing takes place rather quickly.

Thanks
Tobias


Re: Joined RDD

2014-11-13 Thread Mayur Rustagi
First of all, any action is only performed when you trigger it, e.g. a collect.
When you trigger collect, at that point Spark retrieves the data from disk,
joins the datasets together, and delivers it to you.
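
A minimal sketch of that laziness, with made-up HDFS paths (nothing is read until the count() at the end):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("LazyJoin").setMaster("local[2]"))
// Transformations only record lineage; no file is read yet.
val a = sc.textFile("hdfs:///data/a.txt").map(line => (line.split(",")(0), line))
val b = sc.textFile("hdfs:///data/b.txt").map(line => (line.split(",")(0), line))
val joined = a.join(b)
// The first action is what actually reads the files and runs the shuffle and join.
println(joined.count())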

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 


On Thu, Nov 13, 2014 at 12:26 PM, ajay garg  wrote:

> Hi,
>  I have two RDDs A and B which are created from reading file from HDFS.
> I have a third RDD C which is created by taking join of A and B. All three
> RDDs (A, B and C ) are not cached.
> Now if I perform any action on C (let say collect), action is served
> without
> reading any data from the disk.
> Since no data is cached in spark how is action on C is served without
> reading data from disk.
>
> Thanks
> --Ajay
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Joined-RDD-tp18820.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


basic twitter stream program not working.

2014-11-13 Thread jishnu.prathap
Hi
   I am trying to run a basic twitter stream program but getting blank 
output. Please correct me if I am missing something.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.streaming.Seconds
import org.apache.log4j.LogManager
import org.apache.log4j.Level

object Sparktwiter1 {
  def main(args: Array[String]) {
LogManager.getRootLogger().setLevel(Level.ERROR);
System.setProperty("http.proxyHost", "proxy4.wipro.com");
System.setProperty("http.proxyPort", "8080");
System.setProperty("twitter4j.oauth.consumerKey", "")
System.setProperty("twitter4j.oauth.consumerSecret", "")
System.setProperty("twitter4j.oauth.accessToken", "")
System.setProperty("twitter4j.oauth.accessTokenSecret", "")
val sparkConf = new 
SparkConf().setAppName("TwitterPopularTags").setMaster("local").set("spark.eventLog.enabled",
 "true")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val stream = TwitterUtils.createStream(ssc, None)//, filters)
stream.print
val s1 = stream.flatMap(status => status.getText)
s1.print
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
hashTags.print
 ssc.start()
ssc.awaitTermination()
  }
}

Output

---
Time: 1415869348000 ms
---

---
Time: 1415869348000 ms
---

---
Time: 1415869348000 ms
---



Re: Joined RDD

2014-11-13 Thread ajay garg
Yes, that is my understanding of how it should work.
But in my case, the first time I call collect it reads the data from the files
on disk.
Subsequent collect queries do not read the data files (verified from the logs).
On the Spark UI I see only shuffle read and no shuffle write.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Joined-RDD-tp18820p18829.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: basic twitter stream program not working.

2014-11-13 Thread Akhil Das
Change this line

val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
  .setMaster("local").set("spark.eventLog.enabled", "true")

to

val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
  .setMaster("local[4]").set("spark.eventLog.enabled", "true")

The receiver occupies one thread, so with a single-threaded "local" master there
is nothing left to process the received batches and the output stays empty.



Thanks
Best Regards

On Thu, Nov 13, 2014 at 2:58 PM,  wrote:

>   *Hi *
>
> *   I am trying to run a basic twitter stream program but getting
> blank output. Please correct me if I am missing something.*
>
>
>
> *import* org.apache.spark.SparkConf
>
> *import* org.apache.spark.streaming.StreamingContext
>
> *import* org.apache.spark.streaming.twitter.TwitterUtils
>
> *import* org.apache.spark.streaming.Seconds
>
> *import* org.apache.log4j.LogManager
>
> *import* org.apache.log4j.Level
>
>
>
> *object* Sparktwiter1 {
>
>   *def* main(args: Array[String]) {
>
> LogManager.getRootLogger().setLevel(Level.ERROR);
>
> System.setProperty("http.proxyHost", "proxy4.wipro.com");
>
> System.setProperty("http.proxyPort", "8080");
>
> System.setProperty("twitter4j.oauth.consumerKey", "")
>
> System.setProperty("twitter4j.oauth.consumerSecret", "")
>
> System.setProperty("twitter4j.oauth.accessToken", "")
>
> System.setProperty("twitter4j.oauth.accessTokenSecret", "")
>
> *val* sparkConf = *new* SparkConf().setAppName("TwitterPopularTags"
> ).setMaster("local").set("spark.eventLog.enabled", "true")
>
> *val* ssc = *new* StreamingContext(sparkConf, Seconds(2))
>
> *val* stream = TwitterUtils.createStream(ssc, None)//, filters)
>
> stream.print
>
> *val* s1 = stream.flatMap(status => *status.getText*)
>
> s1.print
>
> *val* hashTags = stream.flatMap(status => *status.getText.split(**" "*
> *).filter(_.startsWith(**"#"**))*)
>
> hashTags.print
>
>  ssc.start()
>
> ssc.awaitTermination()
>
>   }
>
> }
>
>
>
> Output
>
>
>
> ---
>
> Time: 1415869348000 ms
>
> ---
>
>
>
> ---
>
> Time: 1415869348000 ms
>
> ---
>
>
>
> ---
>
> Time: 1415869348000 ms
>
> ---
>
>
>
>
>
>
>
>
>
>


unable to run streaming

2014-11-13 Thread Niko Gamulin
Hi,

I have tried to run basic streaming example (
https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html)

I have established two ssh connections to the machine where spark is
installed. In one terminal, I have started netcat with command

nc -lk 

In other terminal I have run the command

./bin/run-example streaming.NetworkWordCount localhost 

I get the following error and haven't managed to diagnose the cause:

14/11/13 13:03:43 ERROR ReceiverTracker: Deregistered receiver for stream
0: Restarting receiver with delay 2000ms: Error connecting to
localhost: - java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:208)
at
org.apache.spark.streaming.dstream.SocketReceiver.receive(SocketInputDStream.scala:71)
at
org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.run(SocketInputDStream.scala:57)

---
Time: 1415880224000 ms
---

If anyone encountered the same problem and solved the issue, I would be
very thankful if you could describe how to solve the problem or what could
cause it.

Best regards,

Niko


Re: unable to run streaming

2014-11-13 Thread Sean Owen
I suppose it means what it says, that you it can't connect, but that's
strange to be unable to connect to a port on localhost.

What if you "telnet localhost " and type some text? does it show
up in the nc output? if not, it's some other problem locally, like a
firewall, or nc not running, or not actually running all this on one
host, etc.

On Thu, Nov 13, 2014 at 10:06 AM, Niko Gamulin  wrote:
> Hi,
>
> I have tried to run basic streaming example
> (https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html)
>
> I have established two ssh connections to the machine where spark is
> installed. In one terminal, I have started netcat with command
>
> nc -lk 
>
> In other terminal I have run the command
>
> ./bin/run-example streaming.NetworkWordCount localhost 
>
> I get the following error and haven't managed to diagnose the cause:
>
> 14/11/13 13:03:43 ERROR ReceiverTracker: Deregistered receiver for stream 0:
> Restarting receiver with delay 2000ms: Error connecting to localhost: -
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
> at java.net.Socket.connect(Socket.java:579)
> at java.net.Socket.connect(Socket.java:528)
> at java.net.Socket.<init>(Socket.java:425)
> at java.net.Socket.<init>(Socket.java:208)
> at
> org.apache.spark.streaming.dstream.SocketReceiver.receive(SocketInputDStream.scala:71)
> at
> org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.run(SocketInputDStream.scala:57)
>
> ---
> Time: 1415880224000 ms
> ---
>
> If anyone encountered the same problem and solved the issue, I would be very
> thankful if you could describe how to solve the problem or what could cause
> it.
>
> Best regards,
>
> Niko
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No module named pyspark - latest built

2014-11-13 Thread jamborta
it was built with 1.6 (tried 1.7, too)

On Thu, Nov 13, 2014 at 2:52 AM, Andrew Or-2 [via Apache Spark User
List]  wrote:
> Hey Jamborta,
>
> What java version did you build the jar with?
>
> 2014-11-12 16:48 GMT-08:00 jamborta <[hidden email]>:
>>
>> I have figured out that building the fat jar with sbt does not seem to
>> included the pyspark scripts using the following command:
>>
>> sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean
>> publish-local assembly
>>
>> however the maven command works OK:
>>
>> mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests
>> clean package
>>
>> am I running the correct sbt command?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18787.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18833.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Blind Faith
Let us say I have the following two RDDs, with the following key-pair
values.

rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]

and

rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ]

Now, I want to join them by key values, so for example I want to return the
following

ret = [ (key1, [value1, value2, value5, value6]), (key2, [value3,
value4, value7]) ]

How I can I do this, in spark using python or scala? One way is to use
join, but join would create a tuple inside the tuple. But I want to only
have one tuple per key value pair.


Re: unable to run streaming

2014-11-13 Thread Akhil Das
Try nc -lp 

Thanks
Best Regards

On Thu, Nov 13, 2014 at 3:36 PM, Niko Gamulin 
wrote:

> Hi,
>
> I have tried to run basic streaming example (
> https://spark.apache.org/docs/1.1.0/streaming-programming-guide.html)
>
> I have established two ssh connections to the machine where spark is
> installed. In one terminal, I have started netcat with command
>
> nc -lk 
>
> In other terminal I have run the command
>
> ./bin/run-example streaming.NetworkWordCount localhost 
>
> I get the following error and haven't managed to diagnose the cause:
>
> 14/11/13 13:03:43 ERROR ReceiverTracker: Deregistered receiver for stream
> 0: Restarting receiver with delay 2000ms: Error connecting to
> localhost: - java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
> at java.net.Socket.connect(Socket.java:579)
> at java.net.Socket.connect(Socket.java:528)
> at java.net.Socket.<init>(Socket.java:425)
> at java.net.Socket.<init>(Socket.java:208)
> at
> org.apache.spark.streaming.dstream.SocketReceiver.receive(SocketInputDStream.scala:71)
> at
> org.apache.spark.streaming.dstream.SocketReceiver$$anon$2.run(SocketInputDStream.scala:57)
>
> ---
> Time: 1415880224000 ms
> ---
>
> If anyone encountered the same problem and solved the issue, I would be
> very thankful if you could describe how to solve the problem or what could
> cause it.
>
> Best regards,
>
> Niko
>
>


Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Sonal Goyal
Check cogroup.
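
For example, a quick spark-shell sketch, flattening the two grouped sides so each key ends up with a single combined list:

val rdd1 = sc.parallelize(Seq(("key1", List("value1", "value2")), ("key2", List("value3", "value4"))))
val rdd2 = sc.parallelize(Seq(("key1", List("value5", "value6")), ("key2", List("value7"))))

// cogroup keeps one entry per key; flatten both sides into one list
val ret = rdd1.cogroup(rdd2).mapValues { case (left, right) =>
  (left.flatten ++ right.flatten).toList
}
ret.collect().foreach(println)
// (key1,List(value1, value2, value5, value6))
// (key2,List(value3, value4, value7))   -- ordering of the output rows may vary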

Best Regards,
Sonal
Founder, Nube Technologies 





On Thu, Nov 13, 2014 at 5:11 PM, Blind Faith 
wrote:

> Let us say I have the following two RDDs, with the following key-pair
> values.
>
> rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]
>
> and
>
> rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ]
>
> Now, I want to join them by key values, so for example I want to return
> the following
>
> ret = [ (key1, [value1, value2, value5, value6]), (key2, [value3,
> value4, value7]) ]
>
> How I can I do this, in spark using python or scala? One way is to use
> join, but join would create a tuple inside the tuple. But I want to only
> have one tuple per key value pair.
>


runexample TwitterPopularTags showing Class Not found error

2014-11-13 Thread jishnu.prathap
Hi
I am getting the following error while running the TwitterPopularTags example.
I am using spark-1.1.0-bin-hadoop2.4.

jishnu@getafix:~/spark/bin$ run-example TwitterPopularTags *** ** ** *** **

spark assembly has been built with Hive, including Datanucleus jars on classpath
java.lang.ClassNotFoundException: 
org.apache.spark.examples.org.apache.spark.streaming.examples.TwitterPopularTags
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I tried executing on three different machines but all showed the same error. I am
able to run other examples like SparkPi.


Thanks & Regards
Jishnu Menath Prathap
BAS EBI(Open Source)





RE: basic twitter stream program not working.

2014-11-13 Thread jishnu.prathap
Hi
Thanks Akhil, you saved the day…. It's working perfectly…

Regards
Jishnu Menath Prathap
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Thursday, November 13, 2014 3:25 PM
To: Jishnu Menath Prathap (WT01 - BAS)
Cc: Akhil [via Apache Spark User List]; user@spark.apache.org
Subject: Re: basic twitter stream program not working.

Change this line

val sparkConf = new 
SparkConf().setAppName("TwitterPopularTags").setMaster("local").set("spark.eventLog.enabled","true")

to

val sparkConf = new 
SparkConf().setAppName("TwitterPopularTags").setMaster("local[4]").set("spark.eventLog.enabled","true")



Thanks
Best Regards

On Thu, Nov 13, 2014 at 2:58 PM, <jishnu.prat...@wipro.com> wrote:
Hi
   I am trying to run a basic twitter stream program but getting blank 
output. Please correct me if I am missing something.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.streaming.Seconds
import org.apache.log4j.LogManager
import org.apache.log4j.Level

object Sparktwiter1 {
  def main(args: Array[String]) {
LogManager.getRootLogger().setLevel(Level.ERROR);
System.setProperty("http.proxyHost", 
"proxy4.wipro.com");
System.setProperty("http.proxyPort", "8080");
System.setProperty("twitter4j.oauth.consumerKey", "")
System.setProperty("twitter4j.oauth.consumerSecret", "")
System.setProperty("twitter4j.oauth.accessToken", "")
System.setProperty("twitter4j.oauth.accessTokenSecret", "")
val sparkConf = new 
SparkConf().setAppName("TwitterPopularTags").setMaster("local").set("spark.eventLog.enabled",
 "true")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val stream = TwitterUtils.createStream(ssc, None)//, filters)
stream.print
val s1 = stream.flatMap(status => status.getText)
s1.print
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
hashTags.print
 ssc.start()
ssc.awaitTermination()
  }
}

Output

---
Time: 1415869348000 ms
---

---
Time: 1415869348000 ms
---

---
Time: 1415869348000 ms
---



Re: runexample TwitterPopularTags showing Class Not found error

2014-11-13 Thread Akhil Das
Run this way:

bin/spark-submit --class
org.apache.spark.examples.streaming.TwitterPopularTags
lib/spark-examples-1.1.0-hadoop1.0.4.jar

Or this way:

 bin/run-example org.apache.spark.examples.streaming.TwitterPopularTags


Thanks
Best Regards

On Thu, Nov 13, 2014 at 5:02 PM,  wrote:

>  Hi
>
> I am getting the following error while running the
> TwitterPopularTags  example .I am using
>
> *spark-1.1.0-bin-hadoop2.4 . *
>
> jishnu@getafix:~/spark/bin$ run-example TwitterPopularTags *** ** ** ***
> **
>
>
>
> spark assembly has been built with Hive, including Datanucleus jars on
> classpath
>
> java.lang.ClassNotFoundException:
> org.apache.spark.examples.org.apache.spark.streaming.examples.TwitterPopularTags
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:270)
>
> at
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:318)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
>
> tried executing in three  different machines but all showed the same
> error.I am able to run other examples like SparkPi .
>
>
>
>
>
> Thanks & Regards
>
> Jishnu Menath Prathap
>
> BAS EBI(Open Source)
>
>
>
>
>
>


Spark GCLIB error

2014-11-13 Thread Naveen Kumar Pokala
Hi,

I am receiving the following error when I am trying to run a sample Spark program.


Caused by: java.lang.UnsatisfiedLinkError: 
/tmp/hadoop-npokala/nm-local-dir/usercache/npokala/appcache/application_1415881017544_0003/container_1415881017544_0003_01_01/tmp/snappy-1.0.5.3-ec0ee911-2f8c-44c0-ae41-5e0b2727d880-libsnappyjava.so:
 /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by 
/tmp/hadoop-npokala/nm-local-dir/usercache/npokala/appcache/application_1415881017544_0003/container_1415881017544_0003_01_01/tmp/snappy-1.0.5.3-ec0ee911-2f8c-44c0-ae41-5e0b2727d880-libsnappyjava.so)
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1929)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1814)
at java.lang.Runtime.load0(Runtime.java:809)
at java.lang.System.load(System.java:1083)
at org.xerial.snappy.SnappyNativeLoader.load(SnappyNativeLoader.java:39)
... 29 more


-Naveen.


Re: loading, querying schemaRDD using SparkSQL

2014-11-13 Thread vdiwakar.malladi
Thanks Michael.

I used Parquet files and they solve my initial problem to some extent (i.e.
loading data from one context and reading it from another context).

But I see another issue there. Every time I create a JavaSQLContext I need to
load the Parquet file again with the parquetFile method (to create the
JavaSchemaRDD) and register it as a temp table using registerTempTable before I
can query it. That seems to be a problem when our web application runs in
cluster mode, because I would need to load the Parquet files on each node.
Could you please advise me on this?
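
For reference, the Scala equivalent of that load-and-register pattern, as a sketch with a placeholder path and table name (each new SQLContext has its own catalog, so the registration step has to be repeated per context):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Load the Parquet data and register it in this context's catalog.
val events = sqlContext.parquetFile("hdfs:///user/hdfs/events.parquet")
events.registerTempTable("events")
val rows = sqlContext.sql("SELECT * FROM events LIMIT 10").collect()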

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/loading-querying-schemaRDD-using-SparkSQL-tp18052p18841.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: unable to run streaming

2014-11-13 Thread Sean Owen
nc returns an error if you do that. nc -lk is correct.

On Thu, Nov 13, 2014 at 11:46 AM, Akhil Das  wrote:
> Try nc -lp 
>
> Thanks
> Best Regards
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: unable to run streaming

2014-11-13 Thread Akhil Das
I think he's on an Ubuntu/Debian box

Thanks
Best Regards

On Thu, Nov 13, 2014 at 6:23 PM, Sean Owen  wrote:

> nc returns an error if you do that. nc -lk is correct.
>
> On Thu, Nov 13, 2014 at 11:46 AM, Akhil Das 
> wrote:
> > Try nc -lp 
> >
> > Thanks
> > Best Regards
> >
>


Re: Spark streaming cannot receive any message from Kafka

2014-11-13 Thread Helena Edelson
I encounter no issues with streaming from kafka to spark in 1.1.0. Do you
perhaps have a version conflict?

Helena
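
For reference, a minimal receiver setup that satisfies the local[n], n > 1 point discussed below; this is only a sketch assuming the spark-streaming-kafka artifact is on the classpath, and the ZooKeeper host, group and topic names are made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// At least two local threads: one for the Kafka receiver, one to process batches.
val conf = new SparkConf().setAppName("KafkaSmokeTest").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "test-group", Map("events" -> 1)).map(_._2)
messages.count().print()
ssc.start()
ssc.awaitTermination()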
On Nov 13, 2014 12:55 AM, "Jay Vyas"  wrote:

> Yup , very important that  n>1 for spark streaming jobs, If local use
> local[2]
>
> The thing to remember is that your spark receiver will take a thread to
> itself and produce data , so u need another thread to consume it .
>
> In a cluster manager like yarn or mesos, the word thread Is not used
> anymore, I guess has different meaning- you need 2 or more free compute
> slots, and that should be guaranteed by looking to see how many free node
> managers are running etc.
>
> On Nov 12, 2014, at 7:53 PM, "Shao, Saisai"  wrote:
>
>  Did you configure Spark master as local, it should be local[n], n > 1
> for local mode. Beside there’s a Kafka wordcount example in Spark Streaming
> example, you can try that. I’ve tested with latest master, it’s OK.
>
>
>
> Thanks
>
> Jerry
>
>
>
> *From:* Tobias Pfeiffer [mailto:t...@preferred.jp ]
> *Sent:* Thursday, November 13, 2014 8:45 AM
> *To:* Bill Jay
> *Cc:* u...@spark.incubator.apache.org
> *Subject:* Re: Spark streaming cannot receive any message from Kafka
>
>
>
> Bill,
>
>
>
>   However, when I am currently using Spark 1.1.0. the Spark streaming job
> cannot receive any messages from Kafka. I have not made any change to the
> code.
>
>
>
> Do you see any suspicious messages in the log output?
>
>
>
> Tobias
>
>
>
>


minimizing disk I/O

2014-11-13 Thread rok
I'm trying to understand the disk I/O patterns for Spark -- specifically, I'd
like to reduce the number of files that are being written during shuffle
operations. A couple questions: 

* is the amount of file I/O performed independent of the memory I allocate
for the shuffles? 

* if this is the case, what is the purpose of this memory and is there any
way to see how much of it is actually being used?
 
* how can I minimize the number of files being written? With 24 cores per
node, the filesystem can't handle the large amount of simultaneous I/O very
well so it limits the number of cores I can use... 

Thanks for any insight you might have! 
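
For reference, two shuffle-related settings that exist in Spark 1.1 and touch on the questions above; a hedged sketch with illustrative values rather than recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ShuffleTuning")
  // Consolidates map outputs so the number of shuffle files scales with
  // cores * reduce tasks rather than map tasks * reduce tasks.
  .set("spark.shuffle.consolidateFiles", "true")
  // Fraction of the heap used for in-memory aggregation buffers during
  // shuffles; data spills to disk when this is exceeded.
  .set("spark.shuffle.memoryFraction", "0.3")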



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/minimizing-disk-I-O-tp18845.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: using LogisticRegressionWithSGD.train in Python crashes with "Broken pipe"

2014-11-13 Thread rok
Hi, I'm using Spark 1.1.0. There is no error on the executors -- it appears
as if the job never gets properly dispatched -- the only message is the
"Broken Pipe" message in the driver. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182p18846.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Kafka examples

2014-11-13 Thread Eduardo Costa Alfaia
Hi guys,

Were the Kafka examples in the master branch removed?

Thanks
-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
Would it make sense to read each file in as a separate RDD? This way you
would be guaranteed the data is partitioned as you expected.

Possibly you could then repartition each of those RDDs into a single
partition and then union them. I think that would achieve what you expect.
But it would be easy to accidentally screw this up (have some operation
that causes a shuffle), so I think you're better off just leaving them as
separate RDDs.
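
Something along these lines (an untested sketch with placeholder paths; coalesce(1) is used instead of repartition(1) to avoid a shuffle):

// One RDD per input file; coalesce(1) keeps each file's data in a single
// partition, and the union preserves those partitions.
val paths = Seq("hdfs:///data/part-a.txt", "hdfs:///data/part-b.txt")
val perFile = paths.map(path => sc.textFile(path).coalesce(1))
val combined = sc.union(perFile)   // one partition per original file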

On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <
mchett...@rocketfuelinc.com> wrote:

> Hi,
>
> I have a set of input files for a spark program, with each file
> corresponding to a logical data partition. What is the API/mechanism to
> assign each input file (or a set of files) to a spark partition, when
> initializing RDDs?
>
> When i create a spark RDD pointing to the directory of files, my
> understanding is it's not guaranteed that each input file will be treated
> as separate partition.
>
> My job semantics require that the data is partitioned, and i want to
> leverage the partitioning that has already been done, rather than
> repartitioning again in the spark job.
>
> I tried to lookup online but haven't found any pointers so far.
>
>
> Thanks
> pala
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: Assigning input files to spark partitions

2014-11-13 Thread Rishi Yadav
If your data is in hdfs and you are reading as textFile and each file is
less than block size, my understanding is it would always have one
partition per file.

On Thursday, November 13, 2014, Daniel Siegmann 
wrote:

> Would it make sense to read each file in as a separate RDD? This way you
> would be guaranteed the data is partitioned as you expected.
>
> Possibly you could then repartition each of those RDDs into a single
> partition and then union them. I think that would achieve what you expect.
> But it would be easy to accidentally screw this up (have some operation
> that causes a shuffle), so I think you're better off just leaving them as
> separate RDDs.
>
> On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <
> mchett...@rocketfuelinc.com
> > wrote:
>
>> Hi,
>>
>> I have a set of input files for a spark program, with each file
>> corresponding to a logical data partition. What is the API/mechanism to
>> assign each input file (or a set of files) to a spark partition, when
>> initializing RDDs?
>>
>> When i create a spark RDD pointing to the directory of files, my
>> understanding is it's not guaranteed that each input file will be treated
>> as separate partition.
>>
>> My job semantics require that the data is partitioned, and i want to
>> leverage the partitioning that has already been done, rather than
>> repartitioning again in the spark job.
>>
>> I tried to lookup online but haven't found any pointers so far.
>>
>>
>> Thanks
>> pala
>>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 54 W 40th St, New York, NY 10018
> E: daniel.siegm...@velos.io
>  W: www.velos.io
>


-- 
- Rishi


Building a hash table from a csv file using yarn-cluster, and giving it to each executor

2014-11-13 Thread YaoPau
I built my Spark Streaming app on my local machine, and an initial step in
log processing is filtering out rows with spam IPs.  I use the following
code which works locally:

// Creates a HashSet for badIPs read in from file
val badIpSource = scala.io.Source.fromFile("wrongIPlist.csv")
val ipLines = badIpSource.getLines()


val set = new HashSet[String]()
val badIpSet = set ++ ipLines
badIpSource.close()

def isGoodIp(ip: String): Boolean = !badIpSet.contains(ip)

But when I try this using "--master yarn-cluster" I get "Exception in thread
"Thread-4" java.lang.reflect.InvocationTargetException ... Caused by:
java.io.FileNotFoundException: wrongIPlist.csv (No such file or directory)". 
The file is there (I wasn't sure which directory it was accessing so it's in
both my current client directory and my HDFS home directory), so now I'm
wondering if reading a file in parallel is just not allowed in general and
that's why I'm getting the error.

I'd like each executor to have access to this HashSet (not a huge file,
about 3000 IPs) instead of having to do a more expensive JOIN.  Any
recommendations on a better way to handle this?  
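
For what it's worth, one common pattern for this is to load the list once on the driver and broadcast it; the sketch below is illustrative only, and the HDFS path, CSV layout and log-line parsing are assumptions:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def filterSpam(sc: SparkContext, logs: RDD[String]): RDD[String] = {
  // Read the (small) IP list from HDFS on the driver, then broadcast it.
  val badIps = sc.textFile("hdfs:///user/me/wrongIPlist.csv")
    .map(_.split(",")(0).trim)
    .collect()
    .toSet
  val badIpsBc = sc.broadcast(badIps)
  // Assumes the IP is the first whitespace-separated field of each log line.
  logs.filter(line => !badIpsBc.value.contains(line.split(" ")(0)))
}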



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-a-hash-table-from-a-csv-file-using-yarn-cluster-and-giving-it-to-each-executor-tp18850.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Does Spark Streaming calculate during a batch?

2014-11-13 Thread Michael Campbell
I was running a proof of concept for my company with spark streaming, and
the conclusion I came to is that spark collects data for the
batch-duration, THEN starts the data-pipeline calculations.

My batch size was 5 minutes, and the CPU was all but dead for 5, then when
the 5 minutes were up the CPU's would spike for a while presumably doing
the calculations.

Is this presumption true, or is it running the data through the calculation
pipeline before the batch is up?

What could lead to the periodic CPU spike - I had a reduceByKey, so was it
doing that only after all the batch data was in?

Thanks


Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread jay vyas
1) You have a receiver thread. That thread might use a lot of CPU, or not,
depending on how you implement it in onStart.

2) Every 5 minutes, Spark will submit a job which processes
every RDD that was created (i.e. using the store() call) in the
receiver. That job runs asynchronously to the receiver, which
is still working to produce new RDDs for the next batch.


So maybe you're monitoring the CPU only on the
Spark workers which are running the batch jobs, and not
on the Spark worker which is doing the RDD ingestion?





On Thu, Nov 13, 2014 at 10:35 AM, Michael Campbell <
michael.campb...@gmail.com> wrote:

> I was running a proof of concept for my company with spark streaming, and
> the conclusion I came to is that spark collects data for the
> batch-duration, THEN starts the data-pipeline calculations.
>
> My batch size was 5 minutes, and the CPU was all but dead for 5, then when
> the 5 minutes were up the CPU's would spike for a while presumably doing
> the calculations.
>
> Is this presumption true, or is it running the data through the
> calculation pipeline before the batch is up?
>
> What could lead to the periodic CPU spike - I had a reduceByKey, so was it
> doing that only after all the batch data was in?
>
> Thanks
>



-- 
jay vyas


Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread Sean Owen
Yes. Data is collected for 5 minutes, then processing starts at the
end. The result may be an arbitrary function of the data in the
interval, so the interval has to finish before computation can start.

If you want more continuous processing, you can simply reduce the
batch interval to, say, 1 minute.
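
For example, a sketch (conf here stands for an existing SparkConf):

import org.apache.spark.streaming.{Minutes, StreamingContext}

// A 1-minute batch interval instead of 5 minutes.
val ssc = new StreamingContext(conf, Minutes(1))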

On Thu, Nov 13, 2014 at 3:35 PM, Michael Campbell
 wrote:
> I was running a proof of concept for my company with spark streaming, and
> the conclusion I came to is that spark collects data for the batch-duration,
> THEN starts the data-pipeline calculations.
>
> My batch size was 5 minutes, and the CPU was all but dead for 5, then when
> the 5 minutes were up the CPU's would spike for a while presumably doing the
> calculations.
>
> Is this presumption true, or is it running the data through the calculation
> pipeline before the batch is up?
>
> What could lead to the periodic CPU spike - I had a reduceByKey, so was it
> doing that only after all the batch data was in?
>
> Thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
I believe Rishi is correct. I wouldn't rely on that though - all it would
take is for one file to exceed the block size and you'd be setting yourself
up for pain. Also, if your files are small - small enough to fit in a
single record - you could use SparkContext.wholeTextFiles.
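
For example, a quick spark-shell sketch with a placeholder path:

// wholeTextFiles yields one record per file: (file path, entire file contents),
// so each file is guaranteed to stay together.
val files = sc.wholeTextFiles("hdfs:///data/input")
val lineCounts = files.mapValues(contents => contents.split("\n").length)
lineCounts.collect().foreach(println)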

On Thu, Nov 13, 2014 at 10:11 AM, Rishi Yadav  wrote:

> If your data is in hdfs and you are reading as textFile and each file is
> less than block size, my understanding is it would always have one
> partition per file.
>
>
> On Thursday, November 13, 2014, Daniel Siegmann 
> wrote:
>
>> Would it make sense to read each file in as a separate RDD? This way you
>> would be guaranteed the data is partitioned as you expected.
>>
>> Possibly you could then repartition each of those RDDs into a single
>> partition and then union them. I think that would achieve what you expect.
>> But it would be easy to accidentally screw this up (have some operation
>> that causes a shuffle), so I think you're better off just leaving them as
>> separate RDDs.
>>
>> On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <
>> mchett...@rocketfuelinc.com> wrote:
>>
>>> Hi,
>>>
>>> I have a set of input files for a spark program, with each file
>>> corresponding to a logical data partition. What is the API/mechanism to
>>> assign each input file (or a set of files) to a spark partition, when
>>> initializing RDDs?
>>>
>>> When i create a spark RDD pointing to the directory of files, my
>>> understanding is it's not guaranteed that each input file will be treated
>>> as separate partition.
>>>
>>> My job semantics require that the data is partitioned, and i want to
>>> leverage the partitioning that has already been done, rather than
>>> repartitioning again in the spark job.
>>>
>>> I tried to lookup online but haven't found any pointers so far.
>>>
>>>
>>> Thanks
>>> pala
>>>
>>
>>
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 54 W 40th St, New York, NY 10018
>> E: daniel.siegm...@velos.io W: www.velos.io
>>
>
>
> --
> - Rishi
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: Spark/HIVE Insert Into values Error

2014-11-13 Thread Vasu C
Hi Arthur,

May I know what is the solution., I have similar requirements.


Regards,
  Vasu C

On Sun, Oct 26, 2014 at 12:09 PM, arthur.hk.c...@gmail.com <
arthur.hk.c...@gmail.com> wrote:

> Hi,
>
> I have already found the way about how to “insert into HIVE_TABLE values
> (…..)
>
> Regards
> Arthur
>
> On 18 Oct, 2014, at 10:09 pm, Cheng Lian  wrote:
>
>  Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT
> INTO ... VALUES ... syntax.
>
> On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote:
>
> Hi,
>
>  When trying to insert records into HIVE, I got error,
>
>  My Spark is 1.1.0 and Hive 0.12.0
>
>  Any idea what would be wrong?
> Regards
> Arthur
>
>
>
>  hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);
>
> OK
>
>  hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
> NoViableAltException(26@[])
>  at
> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693)
>  at
> org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374)
>  at
> org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083)
>  at
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
>  at
> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
>  at
> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
>  at
> org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
>  at
> org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
>  at
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
>  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
>  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '('
> ''fred flintstone'' in select clause
>
>
>
>
>


how to convert System.currentTimeMillis to calendar time

2014-11-13 Thread spr
Apologies for what seems an egregiously simple question, but I can't find the
answer anywhere.  

I have timestamps from the Spark Streaming Time() interface, in milliseconds
since an epoch, and I want to print out a human-readable calendar date and
time.  How does one do that?  
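
For reference, one way to do this with the plain java.text API; nothing Spark-specific, and the format string is just an example:

import java.text.SimpleDateFormat
import java.util.Date

val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
def formatMillis(millis: Long): String = fmt.format(new Date(millis))

// With a Spark Streaming Time `t`: formatMillis(t.milliseconds)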



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-convert-System-currentTimeMillis-to-calendar-time-tp18856.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: data locality, task distribution

2014-11-13 Thread Nathan Kronenfeld
I am seeing skewed execution times.  As far as I can tell, they are
attributable to differences in data locality - tasks with locality
PROCESS_LOCAL run fast, NODE_LOCAL, slower, and ANY, slowest.

This seems entirely as it should be - the question is, why the different
locality levels?

I am seeing skewed caching, as I mentioned before - in the case I isolated,
with 4 nodes, they were distributed at about 42%, 31%, 20%, and 6%.
However, the total amount was significantly less than the memory of any
single node, so I don't think they could have overpopulated their cache.  I
am occasionally seeing task failures, but they re-execute and
work fine the next time.  Yet I'm still seeing incomplete caching (from 65%
cached up to 100%, depending on the run).

I shouldn't have much variance in task time - this is simply a foreach over
the data, adding to an accumulator, and the data is completely randomly
distributed, so should be pretty even overall.

I am seeing GC regressions occasionally - they slow a request from about 2
seconds to about 5 seconds.  The 8 minute slowdown seems to be solely
attributable to the data locality issue, as far as I can tell.  There was
some further confusion though in the times I mentioned - the list I gave
(3.1 min, 2 seconds, ... 8 min) was not from different runs with different
cache %s; those were iterations within a single run with 100% caching.

   -Nathan



On Thu, Nov 13, 2014 at 1:45 AM, Aaron Davidson  wrote:

> Spark's scheduling is pretty simple: it will allocate tasks to open cores
> on executors, preferring ones where the data is local. It even performs
> "delay scheduling", which means waiting a bit to see if an executor where
> the data resides locally becomes available.
>
> Are yours tasks seeing very skewed execution times? If some tasks are
> taking a very long time and using all the resources on a node, perhaps the
> other nodes are quickly finishing many tasks, and actually overpopulating
> their caches. If a particular machine were not overpopulating its cache,
> and there are no failures, then you should see 100% cached after the first
> run.
>
> It's also strange that running totally uncached takes 3.1 minutes, but
> running 80-90% cached may take 8 minutes. Does your workload produce
> nondeterministic variance in task times? Was it a single straggler, or many
> tasks, that was keeping the job from finishing? It's not too uncommon to
> see occasional performance regressions while caching due to GC, though 2
> seconds to 8 minutes is a bit extreme.
>
> On Wed, Nov 12, 2014 at 9:01 PM, Nathan Kronenfeld <
> nkronenf...@oculusinfo.com> wrote:
>
>> Sorry, I think I was not clear in what I meant.
>> I didn't mean it went down within a run, with the same instance.
>>
>> I meant I'd run the whole app, and one time, it would cache 100%, and the
>> next run, it might cache only 83%
>>
>> Within a run, it doesn't change.
>>
>> On Wed, Nov 12, 2014 at 11:31 PM, Aaron Davidson 
>> wrote:
>>
>>> The fact that the caching percentage went down is highly suspicious. It
>>> should generally not decrease unless other cached data took its place, or
>>> if unless executors were dying. Do you know if either of these were the
>>> case?
>>>
>>> On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld <
>>> nkronenf...@oculusinfo.com> wrote:
>>>
 Can anyone point me to a good primer on how spark decides where to send
 what task, how it distributes them, and how it determines data locality?

 I'm trying a pretty simple task - it's doing a foreach over cached
 data, accumulating some (relatively complex) values.

 So I see several inconsistencies I don't understand:

 (1) If I run it a couple times, as separate applications (i.e.,
 reloading, recaching, etc), I will get different %'s cached each time.
 I've got about 5x as much memory as I need overall, so it isn't running
 out.  But one time, 100% of the data will be cached; the next, 83%, the
 next, 92%, etc.

 (2) Also, the data is very unevenly distributed. I've got 400
 partitions, and 4 workers (with, I believe, 3x replication), and on my last
 run, my distribution was 165/139/25/71.  Is there any way to get spark to
 distribute the tasks more evenly?

 (3) If I run the problem several times in the same execution (to take
 advantage of caching etc.), I get very inconsistent results.  My latest
 try, I get:

- 1st run: 3.1 min
- 2nd run: 2 seconds
- 3rd run: 8 minutes
- 4th run: 2 seconds
- 5th run: 2 seconds
- 6th run: 6.9 minutes
- 7th run: 2 seconds
- 8th run: 2 seconds
- 9th run: 3.9 minutes
- 10th run: 8 seconds

 I understand the difference for the first run; it was caching that
 time.  Later times, when it manages to work in 2 seconds, it's because all
 the tasks were PROCESS_LOCAL; when it takes longer, the last 10-2

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Davies Liu
rdd1.union(rdd2).groupByKey()
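
Since the values here are already lists, flattening after the groupByKey, or simply concatenating with reduceByKey, keeps one flat list per key. A quick spark-shell sketch:

val rdd1 = sc.parallelize(Seq(("key1", List("value1", "value2")), ("key2", List("value3", "value4"))))
val rdd2 = sc.parallelize(Seq(("key1", List("value5", "value6")), ("key2", List("value7"))))

// union + groupByKey gives (key, Iterable[List[...]]); flatten it to a single list,
// or concatenate the lists directly with reduceByKey.
val ret1 = rdd1.union(rdd2).groupByKey().mapValues(_.flatten.toList)
val ret2 = rdd1.union(rdd2).reduceByKey(_ ++ _)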

On Thu, Nov 13, 2014 at 3:41 AM, Blind Faith  wrote:
> Let us say I have the following two RDDs, with the following key-pair
> values.
>
> rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ]
>
> and
>
> rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ]
>
> Now, I want to join them by key values, so for example I want to return the
> following
>
> ret = [ (key1, [value1, value2, value5, value6]), (key2, [value3,
> value4, value7]) ]
>
> How I can I do this, in spark using python or scala? One way is to use join,
> but join would create a tuple inside the tuple. But I want to only have one
> tuple per key value pair.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: using LogisticRegressionWithSGD.train in Python crashes with "Broken pipe"

2014-11-13 Thread Davies Liu
It seems that the JVM failed to start or crashed silently.

On Thu, Nov 13, 2014 at 6:06 AM, rok  wrote:
> Hi, I'm using Spark 1.1.0. There is no error on the executors -- it appears
> as if the job never gets properly dispatched -- the only message is the
> "Broken Pipe" message in the driver.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182p18846.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark and Play

2014-11-13 Thread Mohammed Guller
Hi Patrick,
Although we are able to use Spark 1.1.0 with Play 2.2.x, as you mentioned, Akka 
incompatibility prevents us from using Spark with the current stable releases 
of Play (2.3.6) and Akka (2.3.7). Are there any plans to address this issue in 
Spark 1.2?

Thanks,
Mohammed

From: John Meehan [mailto:jnmee...@gmail.com]
Sent: Tuesday, November 11, 2014 11:35 PM
To: Mohammed Guller
Cc: Patrick Wendell; Akshat Aranya; user@spark.apache.org
Subject: Re: Spark and Play

You can also build a Play 2.2.x + Spark 1.1.0 fat jar with sbt-assembly for, 
e.g. yarn-client support or using with spark-shell for debugging:

play.Project.playScalaSettings

libraryDependencies ~= { _ map {
  case m if m.organization == "com.typesafe.play" =>
m.exclude("commons-logging", "commons-logging")
  case m => m
}}

assemblySettings

test in assembly := {}

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.startsWith("META-INF") => MergeStrategy.discard
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
case PathList("org", "apache", xs @ _*) => MergeStrategy.first
case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
case PathList("org", "slf4j", xs @ _*) => MergeStrategy.discard
case "about.html"  => MergeStrategy.rename
case "reference.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
  }
}

On Tue, Nov 11, 2014 at 3:04 PM, Mohammed Guller <moham...@glassbeam.com> wrote:
Actually, it is possible to integrate Spark 1.1.0 with Play 2.2.x

Here is a sample build.sbt file:

name := """xyz"""

version := "0.1 "

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  jdbc,
  anorm,
  cache,
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "com.typesafe.akka" %% "akka-actor" % "2.2.3",
  "com.typesafe.akka" %% "akka-slf4j" % "2.2.3",
  "org.apache.spark" %% "spark-sql" % "1.1.0"
)

play.Project.playScalaSettings


Mohammed

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Tuesday, November 11, 2014 2:06 PM
To: Akshat Aranya
Cc: user@spark.apache.org
Subject: Re: Spark and Play

Hi There,

Because Akka versions are not binary compatible with one another, it might not 
be possible to integrate Play with Spark 1.1.0.

- Patrick

On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya <aara...@gmail.com> wrote:
> Hi,
>
> Sorry if this has been asked before; I didn't find a satisfactory
> answer when searching.  How can I integrate a Play application with
> Spark?  I'm getting into issues of akka-actor versions.  Play 2.2.x
> uses akka-actor 2.0, whereas Play 2.3.x uses akka-actor 2.3.4, neither
> of which work fine with Spark 1.1.0.  Is there something I should do
> with libraryDependencies in my build.sbt to make it work?
>
> Thanks,
> Akshat

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org For 
additional commands, e-mail: 
user-h...@spark.apache.org


-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org



Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Sadhan Sood
Thanks Cheng. Just one more question - does that mean that we still need
enough memory in the cluster to uncompress the data before it can be
compressed again, or does that just read the raw data as is?

On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian  wrote:

>  Currently there’s no way to cache the compressed sequence file directly.
> Spark SQL uses in-memory columnar format while caching table rows, so we
> must read all the raw data and convert them into columnar format. However,
> you can enable in-memory columnar compression by setting
> spark.sql.inMemoryColumnarStorage.compressed to true. This property is
> already set to true by default in master branch and branch-1.2.
>
> On 11/13/14 7:16 AM, Sadhan Sood wrote:
>
>   We noticed while caching data from our hive tables which contain data
> in compressed sequence file format that it gets uncompressed in memory when
> getting cached. Is there a way to turn this off and cache the compressed
> data as is ?
>
>   ​
>


suggest pyspark using 'with' for sparkcontext to be more 'pythonic'

2014-11-13 Thread freedafeng
It seems SparkContext is a good fit to be used with 'with' in Python. A context
manager will do.

example:

with SparkContext(conf=conf, batchSize=512) as sc:



Then it is no longer necessary to write sc.stop().



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/suggest-pyspark-using-with-for-sparkcontext-to-be-more-pythonic-tp18863.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread Michael Campbell
On Thu, Nov 13, 2014 at 11:02 AM, Sean Owen  wrote:

> Yes. Data is collected for 5 minutes, then processing starts at the
> end. The result may be an arbitrary function of the data in the
> interval, so the interval has to finish before computation can start.
>

Thanks everyone.
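
For illustration, a minimal sketch (assuming a local master and a socket source;
all names are placeholders) of how the batch interval drives when results appear:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// each job is submitted only after its full 5-minute interval of data
// has been collected
val conf = new SparkConf().setAppName("batch-interval-example").setMaster("local[2]")
val ssc = new StreamingContext(conf, Minutes(5))

val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()   // printed once per completed 5-minute batch

ssc.start()
ssc.awaitTermination()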


Re: data locality, task distribution

2014-11-13 Thread Aaron Davidson
You mentioned that the 3.1 min run was the one that did the actual caching,
so did that run before any data was cached, or after?

I would recommend checking the Storage tab of the UI, and clicking on the
RDD, to see both how full the executors' storage memory is (which may be
significantly less than the instance's memory). When a task completes over
data that should be cached, it will try to cache it, so it's pretty weird
that you're seeing <100% cache with memory to spare. It's possible that
some partitions are significantly larger than others, which may cause us to
not attempt to cache it (defined by spark.storage.unrollFraction).

You can also try increasing the spark.locality.wait flag to ensure that
Spark will wait longer for tasks to complete before running them
non-locally. One possible situation is that a node hits GC for a few
seconds, a task is scheduled non-locally, and then attempting to read from
the other executors' cache is much more expensive than computing the data
initially. Increasing the locality wait beyond the GC time would avoid this.
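
For reference, a minimal sketch of bumping that wait (the property names are real;
the 10-second value is only an illustration, and in Spark 1.1 the value is given
in milliseconds):

import org.apache.spark.SparkConf

// raise the delay-scheduling wait from the default (3000 ms) to 10 seconds
val conf = new SparkConf()
  .setAppName("locality-wait-example")
  .set("spark.locality.wait", "10000")
// finer-grained settings also exist: spark.locality.wait.process,
// spark.locality.wait.node and spark.locality.wait.rack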

On Thu, Nov 13, 2014 at 9:08 AM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:

> I am seeing skewed execution times.  As far as I can tell, they are
> attributable to differences in data locality - tasks with locality
> PROCESS_LOCAL run fast, NODE_LOCAL, slower, and ANY, slowest.
>
> This seems entirely as it should be - the question is, why the different
> locality levels?
>
> I am seeing skewed caching, as I mentioned before - in the case I
> isolated, with 4 nodes, they were distributed at about 42%, 31%, 20%, and
> 6%.  However, the total amount was significantly less than the memory of
> any single node, so I don't think they could have overpopulated their
> cache.  I am occasionally seeing task failures, but they re-execute
> themselves, and work fine the next time.  Yet I'm still seeing incomplete
> caching (from 65% cached up to 100%, depending on the run).
>
> I shouldn't have much variance in task time - this is simply a foreach
> over the data, adding to an accumulator, and the data is completely
> randomly distributed, so should be pretty even overall.
>
> I am seeing GC regressions occasionally - they slow a request from about 2
> seconds to about 5 seconds.  The 8 minute slowdown seems to be solely
> attributable to the data locality issue, as far as I can tell.  There was
> some further confusion though in the times I mentioned - the list I gave
> (3.1 min, 2 seconds, ... 8 min) were not different runs with different
> cache %s, they were iterations within a single run with 100% caching.
>
>-Nathan
>
>
>
> On Thu, Nov 13, 2014 at 1:45 AM, Aaron Davidson 
> wrote:
>
>> Spark's scheduling is pretty simple: it will allocate tasks to open cores
>> on executors, preferring ones where the data is local. It even performs
>> "delay scheduling", which means waiting a bit to see if an executor where
>> the data resides locally becomes available.
>>
>> Are yours tasks seeing very skewed execution times? If some tasks are
>> taking a very long time and using all the resources on a node, perhaps the
>> other nodes are quickly finishing many tasks, and actually overpopulating
>> their caches. If a particular machine were not overpopulating its cache,
>> and there are no failures, then you should see 100% cached after the first
>> run.
>>
>> It's also strange that running totally uncached takes 3.1 minutes, but
>> running 80-90% cached may take 8 minutes. Does your workload produce
>> nondeterministic variance in task times? Was it a single straggler, or many
>> tasks, that was keeping the job from finishing? It's not too uncommon to
>> see occasional performance regressions while caching due to GC, though 2
>> seconds to 8 minutes is a bit extreme.
>>
>> On Wed, Nov 12, 2014 at 9:01 PM, Nathan Kronenfeld <
>> nkronenf...@oculusinfo.com> wrote:
>>
>>> Sorry, I think I was not clear in what I meant.
>>> I didn't mean it went down within a run, with the same instance.
>>>
>>> I meant I'd run the whole app, and one time, it would cache 100%, and
>>> the next run, it might cache only 83%
>>>
>>> Within a run, it doesn't change.
>>>
>>> On Wed, Nov 12, 2014 at 11:31 PM, Aaron Davidson 
>>> wrote:
>>>
 The fact that the caching percentage went down is highly suspicious. It
 should generally not decrease unless other cached data took its place, or
 if unless executors were dying. Do you know if either of these were the
 case?

 On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld <
 nkronenf...@oculusinfo.com> wrote:

> Can anyone point me to a good primer on how spark decides where to
> send what task, how it distributes them, and how it determines data
> locality?
>
> I'm trying a pretty simple task - it's doing a foreach over cached
> data, accumulating some (relatively complex) values.
>
> So I see several inconsistencies I don't understand:
>
> (1) I

Re: Map output statuses exceeds frameSize

2014-11-13 Thread pouryas
Anyone experienced this before? Any help would be appreciated 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Map-output-statuses-exceeds-frameSize-tp18783p18866.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to convert System.currentTimeMillis to calendar time

2014-11-13 Thread Akhil Das
This way?

scala> val epoch = System.currentTimeMillis
epoch: Long = 1415903974545

scala> val date = new Date(epoch)
date: java.util.Date = Fri Nov 14 00:09:34 IST 2014
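
If you want a formatted string rather than Date's default toString, a minimal
sketch:

import java.text.SimpleDateFormat
import java.util.Date

val epoch = System.currentTimeMillis
val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
println(formatter.format(new Date(epoch)))   // e.g. 2014-11-14 00:09:34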



Thanks
Best Regards

On Thu, Nov 13, 2014 at 10:17 PM, spr  wrote:

> Apologies for what seems an egregiously simple question, but I can't find
> the
> answer anywhere.
>
> I have timestamps from the Spark Streaming Time() interface, in
> milliseconds
> since an epoch, and I want to print out a human-readable calendar date and
> time.  How does one do that?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-convert-System-currentTimeMillis-to-calendar-time-tp18856.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


serial data import from master node without leaving spark

2014-11-13 Thread aappddeevv
I have large files that need to be imported into HDFS for further Spark
processing. Obviously, I can import them using hadoop fs; however, there is
some minor processing that needs to be performed first, around a few
transformations, stripping the header line, and other such stuff.

I would like to stay in the Spark environment for doing this, rather than
switching to other tools either before or after loading the file.

sc.textFile() requires files to be on the nodes when running in cluster
mode, which defeats the purpose of importing the serial file into the
parallel world without any extra steps.

After searching, it was not obvious how to do this. Should I use spark
streaming and pretend that I am streaming the data in through the master
node from the file?
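
One possible approach, sketched under the assumption that the file is readable
from the driver and small enough to stream through it (the paths and the
transformation below are placeholders):

import scala.io.Source

// read on the driver, drop the header, then hand the lines to the cluster
val lines = Source.fromFile("/local/path/data.csv").getLines().drop(1).toVector
val cleaned = sc.parallelize(lines, numSlices = 32)   // sc is the SparkContext
  .map(_.trim)                                        // placeholder transformation
cleaned.saveAsTextFile("hdfs:///user/me/data-cleaned")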



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/serial-data-import-from-master-node-without-leaving-spark-tp18869.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to convert System.currentTimeMillis to calendar time

2014-11-13 Thread Jimmy McErlain
You could also use the jodatime library, which has a ton of great other
options in it.
J
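
A minimal joda-time sketch (assuming the joda-time jar is on the classpath):

import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat

val fmt = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
println(fmt.print(new DateTime(System.currentTimeMillis)))   // e.g. 2014-11-14 00:09:34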




*JIMMY MCERLAIN*

DATA SCIENTIST (NERD)

*. . . . . . . . . . . . . . . . . .*


*IF WE CAN’T DOUBLE YOUR SALES,*



*ONE OF US IS IN THE WRONG BUSINESS.*

*E*: ji...@sellpoints.com

*M*: *510.303.7751*

On Thu, Nov 13, 2014 at 10:40 AM, Akhil Das 
wrote:

> This way?
>
> scala> val epoch = System.currentTimeMillis
> epoch: Long = 1415903974545
>
> scala> val date = new Date(epoch)
> date: java.util.Date = Fri Nov 14 00:09:34 IST 2014
>
>
>
> Thanks
> Best Regards
>
> On Thu, Nov 13, 2014 at 10:17 PM, spr  wrote:
>
>> Apologies for what seems an egregiously simple question, but I can't find
>> the
>> answer anywhere.
>>
>> I have timestamps from the Spark Streaming Time() interface, in
>> milliseconds
>> since an epoch, and I want to print out a human-readable calendar date and
>> time.  How does one do that?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-convert-System-currentTimeMillis-to-calendar-time-tp18856.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Accessing RDD within another RDD map

2014-11-13 Thread Simone Franzini
The following code fails with NullPointerException in RDD class on the
count function:

val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(11 to 20)
rdd1.map{ i =>
 rdd2.count
}
.foreach(println(_))

The same goes for any other action I am trying to perform inside the map
statement. I am failing to understand what I am doing wrong.
Can anyone help with this?

Thanks,
Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini


Re: PySpark issue with sortByKey: "IndexError: list index out of range"

2014-11-13 Thread santon
Thanks for the thoughts. I've been testing on Spark 1.1 and haven't seen
the IndexError yet. I've run into some other errors ("too many open
files"), but these issues seem to have been discussed already. The dataset,
by the way, was about 40 Gb and 188 million lines; I'm running a sort on 3
worker nodes with a total of about 80 cores.

Thanks again for the tips!

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] <
ml-node+s1001560n18393...@n3.nabble.com> wrote:

> Could you tell how large is the data set? It will help us to debug this
> issue.
>
> On Thu, Nov 6, 2014 at 10:39 AM, skane <[hidden email]
> > wrote:
>
> > I don't have any insight into this bug, but on Spark version 1.0.0 I ran
> into
> > the same bug running the 'sort.py' example. On a smaller data set, it
> worked
> > fine. On a larger data set I got this error:
> >
> > Traceback (most recent call last):
> >   File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in
> > 
> > .sortByKey(lambda x: x)
> >   File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
> > bounds.append(samples[index])
> > IndexError: list index out of range
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18288.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: [hidden email]
> 
> > For additional commands, e-mail: [hidden email]
> 
> >
>
> -
> To unsubscribe, e-mail: [hidden email]
> 
> For additional commands, e-mail: [hidden email]
> 
>
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18393.html
>  To unsubscribe from PySpark issue with sortByKey: "IndexError: list index
> out of range", click here
> 
> .
> NAML
> 
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18871.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Darin McBeath
For one of my Spark jobs, my workers/executors are dying and leaving the 
cluster.

On the master, I see something like the following in the log file.  I'm 
surprised to see the '60' seconds in the master log below because I explicitly 
set it to '600' (or so I thought) in my spark job (see below).   This is 
happening at the end of my job when I'm trying to persist a large RDD (probably 
around 300+GB) back to S3 (in 256 partitions).  My cluster consists of 6 
r3.8xlarge machines.  The job successfully works when I'm outputting 100GB or 
200GB.

If  you have any thoughts/insights, it would be appreciated. 

Thanks.

Darin.

Here is where I'm setting the 'timeout' in my spark job.
SparkConf conf = new SparkConf()
    .setAppName("SparkSync Application")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.rdd.compress", "true")
    .set("spark.core.connection.ack.wait.timeout", "600");
On the master, I see the following in the log file.

4/11/13 17:20:39 WARN master.Master: Removing worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no heartbeat in 60 seconds
14/11/13 17:20:39 INFO master.Master: Removing worker worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on ip-10-35-184-232.ec2.internal:51877
14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2

On a worker, I see something like the following in the log file.

14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
  at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception (java.net.SocketException) caught when processing request: Broken pipe
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark
14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry count of 5
14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which exceeds the maximum retry count of 5
14/11/13 17:22:57 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]


Re: Accessing RDD within another RDD map

2014-11-13 Thread Daniel Siegmann
You cannot reference an RDD within a closure passed to another RDD. Your
code should instead look like this:

val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(11 to 20)
val rdd2Count = rdd2.count
rdd1.map{ i =>
 rdd2Count
}
.foreach(println(_))

You should also note that even if your original code did work, you would be
re-counting rdd2 for every single record in rdd1. Unless your RDD is cached
/ persisted, the count will be recomputed every time it is called. So that
would be extremely inefficient.


On Thu, Nov 13, 2014 at 2:28 PM, Simone Franzini 
wrote:

> The following code fails with NullPointerException in RDD class on the
> count function:
>
> val rdd1 = sc.parallelize(1 to 10)
> val rdd2 = sc.parallelize(11 to 20)
> rdd1.map{ i =>
>  rdd2.count
> }
> .foreach(println(_))
>
> The same goes for any other action I am trying to perform inside the map
> statement. I am failing to understand what I am doing wrong.
> Can anyone help with this?
>
> Thanks,
> Simone Franzini, PhD
>
> http://www.linkedin.com/in/simonefranzini
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: PySpark issue with sortByKey: "IndexError: list index out of range"

2014-11-13 Thread Davies Liu
The errors may happen because there is not enough memory in the
worker, so it keeps spilling to many small files. Could you verify
that the PR [1] fixes your problem?

[1] https://github.com/apache/spark/pull/3252

On Thu, Nov 13, 2014 at 11:28 AM, santon  wrote:
> Thanks for the thoughts. I've been testing on Spark 1.1 and haven't seen the
> IndexError yet. I've run into some other errors ("too many open files"), but
> these issues seem to have been discussed already. The dataset, by the way,
> was about 40 Gb and 188 million lines; I'm running a sort on 3 worker nodes
> with a total of about 80 cores.
>
> Thanks again for the tips!
>
> On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List]
> <[hidden email]> wrote:
>>
>> Could you tell how large is the data set? It will help us to debug this
>> issue.
>>
>> On Thu, Nov 6, 2014 at 10:39 AM, skane <[hidden email]> wrote:
>>
>> > I don't have any insight into this bug, but on Spark version 1.0.0 I ran
>> > into
>> > the same bug running the 'sort.py' example. On a smaller data set, it
>> > worked
>> > fine. On a larger data set I got this error:
>> >
>> > Traceback (most recent call last):
>> >   File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in
>> > 
>> > .sortByKey(lambda x: x)
>> >   File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
>> > bounds.append(samples[index])
>> > IndexError: list index out of range
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18288.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: [hidden email]
>> > For additional commands, e-mail: [hidden email]
>> >
>>
>> -
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>>
>> 
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18393.html
>> To unsubscribe from PySpark issue with sortByKey: "IndexError: list index
>> out of range", click here.
>> NAML
>
>
>
> 
> View this message in context: Re: PySpark issue with sortByKey: "IndexError:
> list index out of range"
>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Assigning input files to spark partitions

2014-11-13 Thread Pala M Muthaia
Thanks for the responses Daniel and Rishi.

No, I don't want separate RDDs, because each of these partitions is being
processed the same way (in my case, each partition corresponds to HBase
keys belonging to one region server, and I will do HBase lookups). After
that I have aggregations too, hence all these partitions should be in the
same RDD. The reason to follow the partition structure is to limit
concurrent HBase lookups targeting a single region server.

I am not sure what the block size is here (HDFS block size?), but my files
may get large over time, so I cannot depend on the block size assumption.
That said, from your description, it seems like I don't have to worry too
much, because Spark does assign files to partitions while maintaining
'locality' (i.e. a given file's data would fit in ceil(filesize/blocksize)
partitions, as opposed to being spread across numerous partitions).

Yes, I saw wholeTextFile(); it won't apply in my case because the input
file size can be quite large.

On Thu, Nov 13, 2014 at 8:04 AM, Daniel Siegmann 
wrote:

> I believe Rishi is correct. I wouldn't rely on that though - all it would
> take is for one file to exceed the block size and you'd be setting yourself
> up for pain. Also, if your files are small - small enough to fit in a
> single record - you could use SparkContext.wholeTextFile.
>
> On Thu, Nov 13, 2014 at 10:11 AM, Rishi Yadav 
> wrote:
>
>> If your data is in hdfs and you are reading as textFile and each file is
>> less than block size, my understanding is it would always have one
>> partition per file.
>>
>>
>> On Thursday, November 13, 2014, Daniel Siegmann 
>> wrote:
>>
>>> Would it make sense to read each file in as a separate RDD? This way you
>>> would be guaranteed the data is partitioned as you expected.
>>>
>>> Possibly you could then repartition each of those RDDs into a single
>>> partition and then union them. I think that would achieve what you expect.
>>> But it would be easy to accidentally screw this up (have some operation
>>> that causes a shuffle), so I think you're better off just leaving them as
>>> separate RDDs.
>>>
>>> On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <
>>> mchett...@rocketfuelinc.com> wrote:
>>>
 Hi,

 I have a set of input files for a spark program, with each file
 corresponding to a logical data partition. What is the API/mechanism to
 assign each input file (or a set of files) to a spark partition, when
 initializing RDDs?

 When i create a spark RDD pointing to the directory of files, my
 understanding is it's not guaranteed that each input file will be treated
 as separate partition.

 My job semantics require that the data is partitioned, and i want to
 leverage the partitioning that has already been done, rather than
 repartitioning again in the spark job.

 I tried to lookup online but haven't found any pointers so far.


 Thanks
 pala

>>>
>>>
>>>
>>> --
>>> Daniel Siegmann, Software Developer
>>> Velos
>>> Accelerating Machine Learning
>>>
>>> 54 W 40th St, New York, NY 10018
>>> E: daniel.siegm...@velos.io W: www.velos.io
>>>
>>
>>
>> --
>> - Rishi
>>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 54 W 40th St, New York, NY 10018
> E: daniel.siegm...@velos.io W: www.velos.io
>


Re: Building a hash table from a csv file using yarn-cluster, and giving it to each executor

2014-11-13 Thread aappddeevv
If the file is not present on each node, it may not find it.
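
One common way around that is to build the table on the driver and broadcast it.
A minimal sketch, assuming the CSV is readable from the driver (or shipped with
--files) and small enough to hold in memory, with someRdd as a hypothetical RDD
of keys:

import scala.io.Source

// build the lookup table once on the driver ...
val table: Map[String, String] =
  Source.fromFile("/local/path/lookup.csv").getLines()
    .map(_.split(","))
    .collect { case Array(k, v) => k -> v }
    .toMap

// ... and ship it to every executor exactly once
val lookup = sc.broadcast(table)

val enriched = someRdd.map(key => key -> lookup.value.getOrElse(key, "unknown"))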



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-a-hash-table-from-a-csv-file-using-yarn-cluster-and-giving-it-to-each-executor-tp18850p18877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Assigning input files to spark partitions

2014-11-13 Thread Daniel Siegmann
On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia  wrote

>
> No i don't want separate RDD because each of these partitions are being
> processed the same way (in my case, each partition corresponds to HBase
> keys belonging to one region server, and i will do HBase lookups). After
> that i have aggregations too, hence all these partitions should be in the
> same RDD. The reason to follow the partition structure is to limit
> concurrent HBase lookups targeting a single region server.
>

Neither of these is necessarily a barrier to using separate RDDs. You can
define the function you want to use and then pass it to multiple map
methods. Then you could union all the RDDs to do your aggregations. For
example, it might look something like this:

val paths: Seq[String] = ... // the paths to the files you want to load
def myFunc(t: T) = ... // the function to apply to every record
val rdds = paths.map { path =>
sc.textFile(path).map(myFunc)
}
val completeRdd = sc.union(rdds)

Does that make any sense?

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


GraphX / PageRank with edge weights

2014-11-13 Thread Ommen, Jurgen
Hi,

I'm using GraphX and playing around with its PageRank algorithm. However, I
can't see from the documentation how to use edge weights when running PageRank.
Is it possible to consider edge weights, and how would I do it?

Thank you very much for your help and my best regards,
Jürgen


GraphX: Get edges for a vertex

2014-11-13 Thread Daniil Osipov
Hello,

I'm attempting to implement a clustering algorithm on top of Pregel
implementation in GraphX, however I'm hitting a wall. Ideally, I'd like to
be able to get all edges for a specific vertex, since they factor into the
calculation. My understanding was that sendMsg function would receive all
relevant edges in participating vertices (all initially, declining as they
converge and stop changing state), and I was planning to keep vertex edges
associated to each vertex and propagate to other vertices that need to know
about these edges.

What I'm finding is that not all edges get iterated on by sendMsg before
sending messages to vprog. Even if I try to keep track of edges, I don't
account all of them, leading to incorrect results.

The graph I'm testing on has edges between all nodes, one for each
direction, and I'm using EdgeDirection.Both.

Anyone seen something similar, and have some suggestions?
Dan


Spark- How can I run MapReduce only on one partition in an RDD?

2014-11-13 Thread Tim Chou
Hi All,

I use textFile to create a RDD. However, I don't want to handle the whole
data in this RDD. For example, maybe I only want to solve the data in 3rd
partition of the RDD.

How can I do it? Here are some possible solutions that I'm thinking:
1. Create multiple RDDs when reading the file
2.  Run MapReduce functions with the specific partition for an RDD.

However, I cannot find any appropriate function.

Thank you and wait for your suggestions.

Best,
Tim


RE: Spark- How can I run MapReduce only on one partition in an RDD?

2014-11-13 Thread Ganelin, Ilya
Why do you only want the third partition? You can access individual partitions 
using the partitions() function. You can also filter your data using the 
filter() function to only contain the data you care about. Moreover, when you 
create your RDDs unless you define a custom partitioner you have no way of 
controlling what data is in partition #3. Therefore, there is almost no reason 
to want to operate on an individual partition.

-Original Message-
From: Tim Chou [timchou@gmail.com]
Sent: Thursday, November 13, 2014 06:01 PM Eastern Standard Time
To: user@spark.apache.org
Subject: Spark- How can I run MapReduce only on one partition in an RDD?

Hi All,

I use textFile to create a RDD. However, I don't want to handle the whole data 
in this RDD. For example, maybe I only want to solve the data in 3rd partition 
of the RDD.

How can I do it? Here are some possible solutions that I'm thinking:
1. Create multiple RDDs when reading the file
2.  Run MapReduce functions with the specific partition for an RDD.

However, I cannot find any appropriate function.

Thank you and wait for your suggestions.

Best,
Tim


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: Spark- How can I run MapReduce only on one partition in an RDD?

2014-11-13 Thread adrian
The direct answer you are looking for may be in RDD.mapPartitionsWithIndex()

The better question is, why are you looking into only the 3rd partition? To
analyze a random sample? Then look into RDD.sample(). Are you sure the data
you are looking for is in the 3rd partition? What if you end up with only 2
partitions after loading your data? Or you may want to filter() your RDD?
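
A minimal sketch of the mapPartitionsWithIndex approach (the input path is
hypothetical; partition indices start at 0, so the "3rd" partition is index 2):

val rdd = sc.textFile("hdfs:///user/me/input")
val thirdPartitionOnly = rdd.mapPartitionsWithIndex { (index, iter) =>
  if (index == 2) iter else Iterator.empty   // keep only partition #2
}
thirdPartitionOnly.count()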

Adrian


Tim Chou wrote
> Hi All,
> 
> I use textFile to create a RDD. However, I don't want to handle the whole
> data in this RDD. For example, maybe I only want to solve the data in 3rd
> partition of the RDD.
> 
> How can I do it? Here are some possible solutions that I'm thinking:
> 1. Create multiple RDDs when reading the file
> 2.  Run MapReduce functions with the specific partition for an RDD.
> 
> However, I cannot find any appropriate function.
> 
> Thank you and wait for your suggestions.
> 
> Best,
> Tim





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-How-can-I-run-MapReduce-only-on-one-partition-in-an-RDD-tp18882p18884.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark JDBC Thirft Server over HTTP

2014-11-13 Thread vs
Does Spark JDBC thrift server allow connections over HTTP?

http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server
doesn't seem to indicate this feature.

If the feature isn't there, is it planned? Is there a tracking JIRA?

Thank you,
Vinay



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-JDBC-Thirft-Server-over-HTTP-tp18885.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: GraphX: Get edges for a vertex

2014-11-13 Thread Takeshi Yamamuro
Hi,

I think that there are two solutions:

1. Have invalid edges send Iterator.empty messages in sendMsg of the Pregel API.
These messages have no effect on the corresponding vertices.

2. Use GraphOps.(collectNeighbors/collectNeighborIds), not the Pregel API,
so as to handle the active edge lists by yourself.
I think that it is hard to handle active edge lists in the Pregel API.

Thought?
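
A minimal sketch of option 2 on a toy graph (assuming an existing SparkContext
sc; EdgeDirection.Either collects neighbours over both incoming and outgoing
edges):

import org.apache.spark.graphx._

// toy graph with edges in both directions between all vertices
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 1L, 1.0),
  Edge(1L, 3L, 1.0), Edge(3L, 1L, 1.0),
  Edge(2L, 3L, 1.0), Edge(3L, 2L, 1.0)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// all neighbour ids of every vertex, regardless of edge direction
val neighborIds: VertexRDD[Array[VertexId]] =
  graph.collectNeighborIds(EdgeDirection.Either)

neighborIds.collect().foreach { case (id, nbrs) =>
  println(s"vertex $id -> [${nbrs.mkString(", ")}]")
}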

Best regards,
takeshi

On Fri, Nov 14, 2014 at 7:32 AM, Daniil Osipov 
wrote:

> Hello,
>
> I'm attempting to implement a clustering algorithm on top of Pregel
> implementation in GraphX, however I'm hitting a wall. Ideally, I'd like to
> be able to get all edges for a specific vertex, since they factor into the
> calculation. My understanding was that sendMsg function would receive all
> relevant edges in participating vertices (all initially, declining as they
> converge and stop changing state), and I was planning to keep vertex edges
> associated to each vertex and propagate to other vertices that need to know
> about these edges.
>
> What I'm finding is that not all edges get iterated on by sendMsg before
> sending messages to vprog. Even if I try to keep track of edges, I don't
> account all of them, leading to incorrect results.
>
> The graph I'm testing on has edges between all nodes, one for each
> direction, and I'm using EdgeDirection.Both.
>
> Anyone seen something similar, and have some suggestions?
> Dan
>


Spark Custom Receiver

2014-11-13 Thread Jacob Abraham
Hi Folks,

I have written a custom Spark receiver and in my testing I have found that
it's doing its job properly.

However, I am wondering if someone could shed some light on how the
"driver" could query the "receiver" for some information. In other words,
how can I make the driver talk to the receivers to get some information
from it ? This information could be something like the state of the
receiver or some statistics about the receiver itself...

Regards,
-Abe


Streaming: getting total count over all windows

2014-11-13 Thread SK
Hi,

I am using the following code to generate the (score, count) for each
window:

val score_count_by_window = topic.map(r => r._2)   // r._2 is the integer score
  .countByValue()

score_count_by_window.print()

E.g. output for a window is as follows, which means that within the DStream
for that window, there are 2 records with score 0, 3 with score 1, and 1 with
score -1.
(0, 2)
(1, 3)
(-1, 1)

I would like to get the aggregate count for each score over all windows
until program terminates. I tried countByValueAndWindow() but the result is
same as countByValue() (i.e. it is producing only per window counts). 
reduceByWindow also does not produce the result I am expecting. What is the
correct way to sum up the counts over multiple windows?

thanks










--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-getting-total-count-over-all-windows-tp1.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Custom Receiver

2014-11-13 Thread Tathagata Das
Haha, there is an actor-based messaging path that exists between the
driver (see ReceiverTracker) and the Receiver (see
ReceiverSupervisorImpl). But none of that is exposed to the public
API. So either you could hack the Spark Streaming code to expose that,
or it might be better to set up your own actor in the custom receiver
and communicate with another actor in the application driver.

Does that make sense?

On Thu, Nov 13, 2014 at 5:24 PM, Jacob Abraham  wrote:
> Hi Folks,
>
> I have written a custom Spark receiver and in my testing I have found that
> its doing its job properly.
>
> However, I am wondering if someone could shed some light on how the "driver"
> could query the "receiver" for some information. In other words, how can I
> make the driver talk to the receivers to get some information from it ? This
> information could be something like the state of the receiver or some
> statistics about the receiver itself...
>
> Regards,
> -Abe

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is there setup and cleanup function in spark?

2014-11-13 Thread Dai, Kevin
HI, all

Is there setup and cleanup function as in hadoop mapreduce in spark which does 
some initialization and cleanup work?

Best Regards,
Kevin.


Re: StreamingContext does not stop

2014-11-13 Thread Tobias Pfeiffer
Hi,

I guess I found part of the issue: I said
  dstream.transform(rdd => { rdd.foreachPartition(...); rdd })
instead of
  dstream.transform(rdd => { rdd.mapPartitions(...) }),
that's why stop() would not stop the processing.

Now with the new version a non-graceful shutdown works in the sense that
Spark does not wait for my processing to complete; job generator, job
scheduler, job executor etc. all seem to be shut down fine, just the
threads that do the actual processing are not. Even after
streamingContext.stop() is done, I see logging output from my processing
task.

Is there any way to signal to my processing tasks that they should stop the
processing?

Thanks
Tobias


Re: Streaming: getting total count over all windows

2014-11-13 Thread jay vyas
I would think this should be done at the application level.
After all, the core functionality of SparkStreaming is to capture RDDs in
some real time interval and process them -
not to aggregate their results.

But maybe there is a better way...
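
One way to keep a running total across batches is updateStateByKey. A minimal
sketch, assuming the topic DStream and the ssc StreamingContext from the
original post (checkpointing is required for stateful operations):

import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

ssc.checkpoint("checkpoint")

// (score, 1) for every record in the current batch
val scoreOnes = topic.map(r => (r._2, 1L))

// running count per score since the application started
val totalCounts = scoreOnes.updateStateByKey[Long] {
  (batchCounts: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + batchCounts.sum)
}

totalCounts.print()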

On Thu, Nov 13, 2014 at 8:28 PM, SK  wrote:

> Hi,
>
> I am using the following code to generate the (score, count) for each
> window:
>
> val score_count_by_window  = topic.map(r =>  r._2)   // r._2 is the integer
> score
>  .countByValue()
>
> score_count_by_window.print()
>
> E.g. output for a window is as follows, which means that within the Dstream
> for that window, there are 2 rdds with score 0; 3 with score 1, and 1 with
> score -1.
> (0, 2)
> (1, 3)
> (-1, 1)
>
> I would like to get the aggregate count for each score over all windows
> until program terminates. I tried countByValueAndWindow() but the result is
> same as countByValue() (i.e. it is producing only per window counts).
> reduceByWindow also does not produce the result I am expecting. What is the
> correct way to sum up the counts over multiple windows?
>
> thanks
>
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-getting-total-count-over-all-windows-tp1.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
jay vyas


Re: Spark JDBC Thirft Server over HTTP

2014-11-13 Thread Cheng Lian

HTTP is not supported yet, and I don't think there's a JIRA ticket for it.

On 11/14/14 8:21 AM, vs wrote:

Does Spark JDBC thrift server allow connections over HTTP?

http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server
doesn't see to indicate this feature.

If the feature isn't there it it planned? Is there a tracking JIRA?

Thank you,
Vinay



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-JDBC-Thirft-Server-over-HTTP-tp18885.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
No, the columnar buffer is built in a small batching manner; the batch
size is controlled by the spark.sql.inMemoryColumnarStorage.batchSize
property. The default value for this in master and branch-1.2 is 10,000
rows per batch.
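
For reference, a minimal sketch of the relevant settings (assuming an existing
HiveContext named hiveContext and a table called "events"; the property names
are as above):

hiveContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
hiveContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

hiveContext.cacheTable("events")   // registers the table for in-memory columnar caching
// the first scan materializes the compressed columnar buffers
hiveContext.sql("SELECT COUNT(*) FROM events").collect()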


On 11/14/14 1:27 AM, Sadhan Sood wrote:

Thanks Cheng. Just one more question - does that mean that we still 
need enough memory in the cluster to uncompress the data before it can 
be compressed again, or does that just read the raw data as is?


On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian wrote:


Currently there’s no way to cache the compressed sequence file
directly. Spark SQL uses in-memory columnar format while caching
table rows, so we must read all the raw data and convert them into
columnar format. However, you can enable in-memory columnar
compression by setting
spark.sql.inMemoryColumnarStorage.compressed to true. This
property is already set to true by default in master branch and
branch-1.2.

On 11/13/14 7:16 AM, Sadhan Sood wrote:


We noticed while caching data from our hive tables which contain
data in compressed sequence file format that it gets uncompressed
in memory when getting cached. Is there a way to turn this off
and cache the compressed data as is ?

​



​


Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
Where do you want the setup and cleanup functions to run? Driver or the
worker nodes?

Jianshi

On Fri, Nov 14, 2014 at 10:44 AM, Dai, Kevin  wrote:

>  HI, all
>
>
>
> Is there setup and cleanup function as in hadoop mapreduce in spark which
> does some initialization and cleanup work?
>
>
>
> Best Regards,
>
> Kevin.
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Cheng Lian
If you’re looking for executor side setup and cleanup functions, there
ain’t any yet, but you can achieve the same semantics via
RDD.mapPartitions.


Please check the “setup() and cleanup” section of this blog from 
Cloudera for details: 
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
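
A minimal sketch of that pattern, assuming an existing RDD named rdd, with
DbConnection and process as hypothetical placeholders; the iterator is drained
before the connection is closed, since mapPartitions iterators are lazy:

rdd.mapPartitions { iter =>
  val conn = DbConnection.open()          // per-partition "setup" (hypothetical API)
  val results =
    try {
      iter.map(record => process(conn, record)).toVector   // force evaluation
    } finally {
      conn.close()                        // per-partition "cleanup"
    }
  results.iterator
}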


On 11/14/14 10:44 AM, Dai, Kevin wrote:


HI, all

Is there setup and cleanup function as in hadoop mapreduce in spark 
which does some initialization and cleanup work?


Best Regards,

Kevin.


​


Using a compression codec in saveAsSequenceFile in Pyspark (Python API)

2014-11-13 Thread sahanbull
Hi, 

I am trying to save an RDD to an S3 bucket using
RDD.saveAsSequenceFile(self, path, CompressionCodec) function in python. I
need to save the RDD in GZIP. Can anyone tell me how to send the gzip codec
class as a parameter into the function. 

I tried

RDD.saveAsSequenceFile('{0}{1}'.format(outputFolder, datePath), compressionCodecClass=gzip.GzipFile)

but it hits me with: AttributeError: type object 'GzipFile' has no
attribute '_get_object_id'
which I think is because the JVM can't find the Scala mapping for gzip.

If you can guide me to any method of writing the RDD as a gzip (.gz) file to
disk, that would be very much appreciated.

Many thanks
SahanB



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-a-compression-codec-in-saveAsSequenceFile-in-Pyspark-Python-API-tp18899.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Communication between Driver and Executors

2014-11-13 Thread Tobias Pfeiffer
Hi,

(this is related to my previous question about stopping the
StreamingContext)

is there any way to send a message from the driver to the executors? There
is all this Akka machinery running, so it should be easy to have something
like

  sendToAllExecutors(message)

on the driver and

  handleMessage {
case _ => ...
  }

on the executors, right? Surely at least for Broadcast.unpersist() such a
thing must exist, so can I use it somehow (dirty way is also ok) to send a
message to my Spark nodes?

Thanks
Tobias


Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
So can I write it like this?

rdd.mapPartitions(i => { setup(); i }).map(...).mapPartitions(i => { cleanup(); i })

So I don't need to mess up the logic and still can use map, filter and
other transformations for RDD.

Jianshi

On Fri, Nov 14, 2014 at 12:20 PM, Cheng Lian  wrote:

>  If you’re looking for executor side setup and cleanup functions, there
> ain’t any yet, but you can achieve the same semantics via
> RDD.mapPartitions.
>
> Please check the “setup() and cleanup” section of this blog from Cloudera
> for details:
> http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
>
> On 11/14/14 10:44 AM, Dai, Kevin wrote:
>
>   HI, all
>
>
>
> Is there setup and cleanup function as in hadoop mapreduce in spark which
> does some initialization and cleanup work?
>
>
>
> Best Regards,
>
> Kevin.
>
>   ​
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


RE: Backporting spark 1.1.0 to CDH 5.1.3

2014-11-13 Thread Zalzberg, Idan (Agoda)
Thank you,
Recompiling spark was not as complicated as I feared and it seems to work.

Since then we have decided to migrate to 5.2.0, so the problem was mitigated, but 
if anyone else has this issue, I can verify that this method works.

-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Tuesday, November 11, 2014 2:52 AM
To: Zalzberg, Idan (Agoda)
Cc: user@spark.apache.org
Subject: Re: Backporting spark 1.1.0 to CDH 5.1.3

Hello,

CDH 5.1.3 ships with a version of Hive that's not entirely the same as the Hive 
Spark 1.1 supports. So when building your custom Spark, you should make sure 
you change all the dependency versions to point to the CDH versions.

IIRC Spark depends on org.spark-project.hive:0.12.0, you'd have to change it to 
something like org.apache.hive:0.12.0-cdh5.1.3. And you'll probably run into 
compilation errors at that point (you can check out cloudera's public repo for 
the patches needed to make Spark
1.0 compile against CDH's Hive 0.12 in [1]).

If you're still willing to go forward at this point, feel free to ask 
questions, although CDH-specific questions would probably be better asked on 
our mailing list instead (cdh-us...@cloudera.org).

[1] https://github.com/cloudera/spark/commits/cdh5-1.0.0_5.1.0

On Mon, Nov 10, 2014 at 3:58 AM, Zalzberg, Idan (Agoda) 
 wrote:
> Hello,
>
> I have a big cluster running CDH 5.1.3 which I can’t upgrade to 5.2.0
> at the current time.
>
> I would like to run Spark-On-Yarn in that cluster.
>
>
>
> I tried to compile spark with CDH-5.1.3 and I got HDFS to work but I
> am having problems with the connection to hive:
>
>
>
> java.sql.SQLException: Could not establish connection to
> jdbc:hive2://localhost.localdomain:1/: Required field
> 'serverProtocolVersion' is unset!
> Struct:TOpenSessionResp(status:TStatus(statusCode:SUCCESS_STATUS),
> serverProtocolVersion:null,
> sessionHandle:TSessionHandle(sessionId:THandleIdentifier(guid:C7 86 85
> 3D 38
> 91 41 A1 AF 02 83 DA 80 74 A5 B1, secret:62 80 00
>
> 99 D6 73 48 9B 81 13 FB D7 DB 32 32 26)), configuration:{})
>
> [info]   at
> org.apache.hive.jdbc.HiveConnection.openSession(HiveConnection.java:24
> 6)
>
> [info]   at
> org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:132)
>
> [info]   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
>
> [info]   at java.sql.DriverManager.getConnection(DriverManager.java:571)
>
> [info]   at java.sql.DriverManager.getConnection(DriverManager.java:215)
>
> [info]   at
> com.agoda.mse.hadooputils.HiveTools$.getHiveConnection(HiveTools.scala
> :135)
>
> [info]   at
> com.agoda.mse.hadooputils.HiveTools$.withConnection(HiveTools.scala:19
> )
>
> [info]   at
> com.agoda.mse.hadooputils.HiveTools$.withStatement(HiveTools.scala:30)
>
> [info]   at
> com.agoda.mse.hadooputils.HiveTools$.copyFileToHdfsThenRunQuery(HiveTo
> ols.scala:110)
>
> [info]   at
> SparkAssemblyTest$$anonfun$4.apply$mcV$sp(SparkAssemblyTest.scala:41)
>
>
>
> This happens when I try to create a hive connection myself, using the
> hive-jdbc-cdh5.1.3 package ( I can connect if I don’t have the spark
> in the
> classpath)
>
>
>
> How can I get spark jar to be consistent with hive-jdbc for CDH5.1.3?
>
> Thanks
>
>
> 
>
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by
> copyright or other legal rules. If you have received it by mistake
> please let us know by reply email and delete it from your system. It
> is prohibited to copy this message or disclose its content to anyone.
> Any confidentiality or privilege is not waived or lost by any mistaken
> delivery or unauthorized disclosure of the message. All messages sent
> to and from Agoda may be monitored to ensure compliance with company
> policies, to protect the company's interests and to remove potential
> malware. Electronic messages may be intercepted, amended, lost or deleted, or 
> contain viruses.



--
Marcelo


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.


Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Sonal Goyal
Hi Darin,

In our case, we were getting the error due to long GC pauses in our app.
Fixing the underlying code helped us remove this error. This is also
mentioned as point 1 in the link below:

http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3cca+-p3ah5aamgtke6viycwb24ohsnmaqm1q9x53abwb_arvw...@mail.gmail.com%3E


Best Regards,
Sonal
Founder, Nube Technologies 





On Fri, Nov 14, 2014 at 1:01 AM, Darin McBeath 
wrote:

> For one of my Spark jobs, my workers/executors are dying and leaving the
> cluster.
>
> On the master, I see something like the following in the log file.  I'm
> surprised to see the '60' seconds in the master log below because I
> explicitly set it to '600' (or so I thought) in my spark job (see below).
> This is happening at the end of my job when I'm trying to persist a large
> RDD (probably around 300+GB) back to S3 (in 256 partitions).  My cluster
> consists of 6 r3.8xlarge machines.  The job successfully works when I'm
> outputting 100GB or 200GB.
>
> If  you have any thoughts/insights, it would be appreciated.
>
> Thanks.
>
> Darin.
>
> Here is where I'm setting the 'timeout' in my spark job.
>
> SparkConf conf = new SparkConf()
> .setAppName("SparkSync Application")
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> .set("spark.rdd.compress","true")
> .set("spark.core.connection.ack.wait.timeout","600");
> ​
> On the master, I see the following in the log file.
>
> 4/11/13 17:20:39 WARN master.Master: Removing
> worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no
> heartbeat in 60 seconds
> 14/11/13 17:20:39 INFO master.Master: Removing worker
> worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on
> ip-10-35-184-232.ec2.internal:51877
> 14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2
>
> On a worker, I see something like the following in the log file.
>
> 14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
> 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception
> (java.net.SocketException) caught when processing request: Broken pipe
> 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception
> (java.net.SocketException) caught when processing request: Broken pipe
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which
> exceeds the maximum retry count of 5

pyspark and hdfs file name

2014-11-13 Thread Oleg Ruchovets
Hi,
  I am running a pyspark job.
I need to serialize the final result to *hdfs in binary files* and have the
ability to give a *name for the output files*.

I found this post:
http://stackoverflow.com/questions/25293962/specifying-the-output-file-name-in-apache-spark


but it explains how to do it using Scala.

Question:
 How can I do this using pyspark?

Thanks
Oleg.


Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Cheng Lian
If you’re just relying on the side effect of setup() and cleanup(), then I
think this trick is OK and pretty clean.

But if setup() returns, say, a DB connection, then the map(...) part and
cleanup() can’t get the connection object.
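
One common workaround is to keep the setup, the per-record work, and the
cleanup inside a single mapPartitions call so the connection stays in scope.
A rough, untested sketch (DbConnection is just a toy stand-in, and sc is
assumed to exist, e.g. in spark-shell):

// Toy stand-in for a real connection type; purely illustrative.
class DbConnection {
  def lookup(key: String): String = key.toUpperCase  // placeholder for a real query
  def close(): Unit = ()
}

val rdd = sc.parallelize(Seq("a", "b", "c"), 2)

val enriched = rdd.mapPartitions { iter =>
  val conn = new DbConnection()        // "setup()", once per partition
  try {
    // Materialize the partition so the connection is still open while every
    // record is processed; a simplification that is fine for small partitions.
    iter.map(conn.lookup).toList.iterator
  } finally {
    conn.close()                       // "cleanup()", once per partition
  }
}

enriched.collect().foreach(println)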


On 11/14/14 1:20 PM, Jianshi Huang wrote:


So can I write it like this?

rdd.mapPartitions { i => setup(); i }.map(...).mapPartitions { i => cleanup(); i }

So I don't need to mess up the logic and still can use map, filter and 
other transformations for RDD.


Jianshi

On Fri, Nov 14, 2014 at 12:20 PM, Cheng Lian > wrote:


If you’re looking for executor side setup and cleanup functions,
there ain’t any yet, but you can achieve the same semantics via
RDD.mapPartitions.

Please check the “setup() and cleanup” section of this blog from
Cloudera for details:

http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

On 11/14/14 10:44 AM, Dai, Kevin wrote:


HI, all

Is there setup and cleanup function as in hadoop mapreduce in
spark which does some initialization and cleanup work?

Best Regards,

Kevin.






--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/




Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
OK, then we need another trick.

Let's have an *implicit lazy var connection/context* around our code, and
setup() will trigger its evaluation and initialization.

The implicit lazy val/var trick was actually invented by Kevin. :)
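
One way to spell that idea is sketched below (untested, with illustrative
names; rdd is assumed to be an RDD[String], and the per-JVM lazy
initialization applies to a compiled application, where the object is
instantiated once per executor on first use):

class DbConnection {
  def lookup(key: String): String = key.toUpperCase  // toy stand-in for a real query
  def close(): Unit = ()
}

object Lazily {
  // "setup()" happens implicitly, once per executor JVM, on first access.
  lazy val conn: DbConnection = new DbConnection()
}

// Ordinary transformations can use the connection without shipping it from
// the driver; cleanup (closing conn) still needs e.g. a JVM shutdown hook.
val result = rdd.map(record => Lazily.conn.lookup(record))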

Jianshi

On Fri, Nov 14, 2014 at 1:41 PM, Cheng Lian  wrote:

>  If you’re just relying on the side effect of setup() and cleanup(), then
> I think this trick is OK and pretty clean.
>
> But if setup() returns, say, a DB connection, then the map(...) part and
> cleanup() can’t get the connection object.
>
> On 11/14/14 1:20 PM, Jianshi Huang wrote:
>
>   So can I write it like this?
>
>  rdd.mapPartitions { i => setup(); i }.map(...).mapPartitions { i => cleanup(); i }
>
>  So I don't need to mess up the logic and still can use map, filter and
> other transformations for RDD.
>
>  Jianshi
>
> On Fri, Nov 14, 2014 at 12:20 PM, Cheng Lian 
> wrote:
>
>>  If you’re looking for executor side setup and cleanup functions, there
>> ain’t any yet, but you can achieve the same semantics via
>> RDD.mapPartitions.
>>
>> Please check the “setup() and cleanup” section of this blog from Cloudera
>> for details:
>> http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
>>
>> On 11/14/14 10:44 AM, Dai, Kevin wrote:
>>
>>  HI, all
>>
>>
>>
>> Is there setup and cleanup function as in hadoop mapreduce in spark which
>> does some initialization and cleanup work?
>>
>>
>>
>> Best Regards,
>>
>> Kevin.
>>
>>  ​
>>
>
>
>
>  --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
>   ​
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


Re: Communication between Driver and Executors

2014-11-13 Thread Mayur Rustagi
I wonder whether SparkConf is dynamically updated on all worker nodes or
only during initialization; if it is, it could be used to piggyback
information. Otherwise I guess you are stuck with Broadcast.
Primarily I have had these issues when moving legacy MR operators to Spark,
where MR piggybacks on the Hadoop conf pretty heavily; in a native Spark
application it's rarely required. Do you have a use case like that?
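
For reference, a rough, untested sketch of the Broadcast route (assuming an
existing SparkContext sc). A broadcast is not a push-style message channel:
executors only see the value when a task that references it runs, and
"sending" a new value means broadcasting a new variable.

// Driver side: wrap the information to piggyback in a broadcast variable.
val settings = sc.broadcast(Map("mode" -> "fast", "threshold" -> "10"))

// Executor side: tasks read the broadcast value lazily when they run.
val data = sc.parallelize(1 to 100)
val kept = data.filter(_ > settings.value("threshold").toInt)
println(kept.count())

// Drop the cached copies on the executors; broadcast a new variable to
// "send" updated information to subsequent tasks.
settings.unpersist()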


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 


On Fri, Nov 14, 2014 at 10:28 AM, Tobias Pfeiffer  wrote:

> Hi,
>
> (this is related to my previous question about stopping the
> StreamingContext)
>
> is there any way to send a message from the driver to the executors? There
> is all this Akka machinery running, so it should be easy to have something
> like
>
>   sendToAllExecutors(message)
>
> on the driver and
>
>   handleMessage {
> case _ => ...
>   }
>
> on the executors, right? Surely at least for Broadcast.unpersist() such a
> thing must exist, so can I use it somehow (dirty way is also ok) to send a
> message to my Spark nodes?
>
> Thanks
> Tobias
>


toLocalIterator in Spark 1.0.0

2014-11-13 Thread Deep Pradhan
Hi,

I am using Spark 1.0.0 and Scala 2.10.3.

I want to use toLocalIterator in my code, but the Spark shell says

*not found: value toLocalIterator*

I also did import org.apache.spark.rdd, but even after this the shell says

*object toLocalIterator is not a member of package org.apache.spark.rdd*

Can anyone help me with this?

Thank You


Re: Streaming: getting total count over all windows

2014-11-13 Thread Mayur Rustagi
So if you want to aggregate from the beginning of the stream to the end of
time, the interface is updateStateByKey; if you only need it over a
particular set of windows, you can construct broader windows from smaller
windows/batches.
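
For example, a rough, untested sketch of the updateStateByKey route,
assuming a DStream[Int] of scores (like topic.map(_._2) in the code below)
and that a checkpoint directory has been set on the StreamingContext, which
updateStateByKey requires:

import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations
import org.apache.spark.streaming.dstream.DStream

def runningCounts(scores: DStream[Int]): DStream[(Int, Long)] = {
  val perBatch = scores.map(s => (s, 1L)).reduceByKey(_ + _)
  // Fold each batch's counts into the running total kept since the stream started.
  perBatch.updateStateByKey[Long] { (batchCounts: Seq[Long], total: Option[Long]) =>
    Some(total.getOrElse(0L) + batchCounts.sum)
  }
}

// runningCounts(scores).print()  // prints cumulative (score, count) pairs each batch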

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 


On Fri, Nov 14, 2014 at 9:17 AM, jay vyas 
wrote:

> I would think this should be done at the application level.
> After all, the core functionality of Spark Streaming is to capture RDDs in
> some real-time interval and process them -
> not to aggregate their results.
>
> But maybe there is a better way...
>
> On Thu, Nov 13, 2014 at 8:28 PM, SK  wrote:
>
>> Hi,
>>
>> I am using the following code to generate the (score, count) for each
>> window:
>>
>> val score_count_by_window = topic.map(r => r._2)  // r._2 is the integer score
>>  .countByValue()
>>
>> score_count_by_window.print()
>>
>> E.g. output for a window is as follows, which means that within the
>> DStream for that window, there are 2 records with score 0, 3 with score 1,
>> and 1 with score -1.
>> (0, 2)
>> (1, 3)
>> (-1, 1)
>>
>> I would like to get the aggregate count for each score over all windows
>> until the program terminates. I tried countByValueAndWindow() but the result
>> is the same as countByValue() (i.e. it is producing only per-window counts).
>> reduceByWindow also does not produce the result I am expecting. What is the
>> correct way to sum up the counts over multiple windows?
>>
>> thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-getting-total-count-over-all-windows-tp1.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> jay vyas
>


Re: Status of MLLib exporting models to PMML

2014-11-13 Thread Manish Amde
@Aris, we are closely following the PMML work that is going on and as
Xiangrui mentioned, it might be easier to migrate models such as logistic
regression and then migrate trees. Some of the models get fairly large (as
pointed out by Sung Chung) with deep trees as building blocks and we might
have to consider a distributed storage and prediction strategy.


On Tuesday, November 11, 2014, Xiangrui Meng  wrote:

> Vincenzo sent a PR and included k-means as an example. Sean is helping
> review it. PMML standard is quite large. So we may start with simple
> model export, like linear methods, then move forward to tree-based.
> -Xiangrui
>
> On Mon, Nov 10, 2014 at 11:27 AM, Aris  > wrote:
> > Hello Spark and MLLib folks,
> >
> > So a common problem in the real world of using machine learning is that
> > some data analysts use tools like R, but the more "data engineers" out
> > there will use more advanced systems like Spark MLlib or even Python
> > Scikit Learn.
> >
> > In the real world, I want to have "a system" where multiple different
> > modeling environments can learn from data / build models, represent the
> > models in a common language, and then have a layer which just takes the
> > model and run model.predict() all day long -- scores the models in other
> > words.
> >
> > It looks like the project openscoring.io and jpmml-evaluator are some
> > amazing systems for this, but they fundamentally use PMML as the model
> > representation here.
> >
> > I have read some JIRA tickets that Xiangrui Meng is interested in getting
> > PMML implemented to export MLlib models; is that happening? Further, would
> > something like Manish Amde's boosted ensemble tree methods be representable
> > in PMML?
> >
> > Thank you!!
> > Aris
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>


Re: toLocalIterator in Spark 1.0.0

2014-11-13 Thread Patrick Wendell
It looks like you are trying to directly import the toLocalIterator
function. You can't import functions; it should just appear as a
method on an existing RDD if you have one.
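
For example, in the spark-shell (where sc is predefined), something along
these lines:

val rdd = sc.parallelize(1 to 10, 4)
val it: Iterator[Int] = rdd.toLocalIterator  // a method on the RDD, nothing to import
it.foreach(println)                          // pulls one partition at a time to the driver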

- Patrick

On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan
 wrote:
> Hi,
>
> I am using Spark 1.0.0 and Scala 2.10.3.
>
> I want to use toLocalIterator in a code but the spark shell tells
>
> not found: value toLocalIterator
>
> I also did import org.apache.spark.rdd but even after this the shell tells
>
> object toLocalIterator is not a member of package org.apache.spark.rdd
>
> Can anyone help me in this?
>
> Thank You

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Reynold Xin
Darin,

You might want to increase these config options also:

spark.akka.timeout 300
spark.storage.blockManagerSlaveTimeoutMs 30
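
For example, added to the existing SparkConf (a Scala sketch; the two new
values here are illustrative rather than prescriptive, so tune them to your
job):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SparkSync Application")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
  .set("spark.core.connection.ack.wait.timeout", "600")
  .set("spark.akka.timeout", "300")                           // seconds
  .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")  // milliseconds (assumed value)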

On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath  wrote:

> For one of my Spark jobs, my workers/executors are dying and leaving the
> cluster.
>
> On the master, I see something like the following in the log file.  I'm
> surprised to see the '60' seconds in the master log below because I
> explicitly set it to '600' (or so I thought) in my spark job (see below).
> This is happening at the end of my job when I'm trying to persist a large
> RDD (probably around 300+GB) back to S3 (in 256 partitions).  My cluster
> consists of 6 r3.8xlarge machines.  The job successfully works when I'm
> outputting 100GB or 200GB.
>
> If  you have any thoughts/insights, it would be appreciated.
>
> Thanks.
>
> Darin.
>
> Here is where I'm setting the 'timeout' in my spark job.
>
> SparkConf conf = new SparkConf()
> .setAppName("SparkSync Application")
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> .set("spark.rdd.compress","true")
> .set("spark.core.connection.ack.wait.timeout","600");
> ​
> On the master, I see the following in the log file.
>
> 4/11/13 17:20:39 WARN master.Master: Removing
> worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 because we got no
> heartbeat in 60 seconds
> 14/11/13 17:20:39 INFO master.Master: Removing worker
> worker-20141113134801-ip-10-35-184-232.ec2.internal-51877 on
> ip-10-35-184-232.ec2.internal:51877
> 14/11/13 17:20:39 INFO master.Master: Telling app of lost executor: 2
>
> On a worker, I see something like the following in the log file.
>
> 14/11/13 17:20:58 WARN util.AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:362)
> 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: I/O exception
> (java.net.SocketException) caught when processing request: Broken pipe
> 14/11/13 17:21:11 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:32 INFO httpclient.HttpMethodDirector: I/O exception
> (java.net.SocketException) caught when processing request: Broken pipe
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark
> 14/11/13 17:21:34 INFO httpclient.HttpMethodDirector: Retrying request
> 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which
> exceeds the maximum retry count of 5
> 14/11/13 17:21:58 WARN utils.RestUtils: Retried connection 6 times, which
> exceeds the maximum retry count of 5
> 14/11/13 17:22:57 WARN util.AkkaUtils: Error sending message in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>
>


Re: pyspark and hdfs file name

2014-11-13 Thread Davies Liu
One option may be to call HDFS tools or a client to rename them after saveAsXXXFile().

On Thu, Nov 13, 2014 at 9:39 PM, Oleg Ruchovets  wrote:
> Hi ,
>   I am running pyspark job.
> I need serialize final result to hdfs in binary files and having ability to
> give a name for output files.
>
> I found this post:
> http://stackoverflow.com/questions/25293962/specifying-the-output-file-name-in-apache-spark
>
> but it explains how to do it using scala.
>
> Question:
>  How to do it using pyspark
>
> Thanks
> Oleg.
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using a compression codec in saveAsSequenceFile in Pyspark (Python API)

2014-11-13 Thread Davies Liu
You could use the following as compressionCodecClass:

DEFLATE  org.apache.hadoop.io.compress.DefaultCodec
gzip     org.apache.hadoop.io.compress.GzipCodec
bzip2    org.apache.hadoop.io.compress.BZip2Codec
LZO      com.hadoop.compression.lzo.LzopCodec

For gzip, compressionCodecClass should be
org.apache.hadoop.io.compress.GzipCodec



On Thu, Nov 13, 2014 at 8:28 PM, sahanbull  wrote:
> Hi,
>
> I am trying to save an RDD to an S3 bucket using
> RDD.saveAsSequenceFile(self, path, CompressionCodec) function in python. I
> need to save the RDD in GZIP. Can anyone tell me how to send the gzip codec
> class as a parameter into the function.
>
> I tried
> *RDD.saveAsSequenceFile('{0}{1}'.format(outputFolder,datePath),compressionCodecClass=gzip.GzipFile)*
>
> but it hits me with: *AttributeError: type object 'GzipFile' has no
> attribute '_get_object_id'*,
> which I think is because the JVM can't find the Scala mapping for gzip.
>
> *If you can guide me to any method for writing the RDD as gzip (.gz) to
> disk, that would be very much appreciated.*
>
> Many thanks
> SahanB
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-a-compression-codec-in-saveAsSequenceFile-in-Pyspark-Python-API-tp18899.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org