Re: Access Last Element of RDD
You can use rdd.takeOrdered(1)(reverseOrdering), where reverseOrdering is your Ordering[T] instance defining the ordering logic; you pass it to the method.

On Thu, Apr 24, 2014 at 11:21 AM, Frank Austin Nothaft fnoth...@berkeley.edu wrote: If you do this, you could simplify to RDD.collect().last. However, this has the problem of collecting all data to the driver. Is your data sorted? If so, you could reverse the sort and take the first. Alternatively, a hacky implementation might involve a mapPartitionsWithIndex that returns an empty iterator for all partitions except the last. For the last partition, you would filter out all elements except the last element in your iterator. This should leave one element, which is your last element.

On Apr 23, 2014, at 10:44 PM, Adnan Yaqoob nsyaq...@gmail.com wrote: This function will return a Scala List; you can use List's last function to get the last element. For example: RDD.take(RDD.count()).last

On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD. I want only to access the last element.

On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Oh ya, thanks Adnan.

On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote: You can use the following code: RDD.take(RDD.count())

On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi All, some help! RDD.first or RDD.take(1) gives the first item; is there a straightforward way to access the last element in a similar way? I couldn't find a tail/last method for RDD.

-- Sourav Chandra, Senior Software Engineer, Livestream
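For reference, a minimal sketch of the takeOrdered/top approach suggested above (the RDD contents here are illustrative, not from the thread):

val rdd = sc.parallelize(1 to 100)
// takeOrdered(1) with a reversed Ordering returns the single largest element,
// i.e. the "last" element under the natural ordering.
val last = rdd.takeOrdered(1)(Ordering[Int].reverse).head
// rdd.top(1).head is equivalent, since top applies the reversed ordering internally.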
Re: Access Last Element of RDD
The same thing can also be done using rdd.top(1)(reverseOrdering).

-- Sourav Chandra, Senior Software Engineer, Livestream
Re: Access Last Element of RDD
Thanks Guys !
Re: how to set spark.executor.memory and heap size
Thank you. I added setJars, but nothing changes:

val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Simple App")
  .set("spark.executor.memory", "1g")
  .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar"))
val sc = new SparkContext(conf)
Re: Re: how to set spark.executor.memory and heap size
Try the complete path.

qinwei
Re: Need help about how hadoop works.
Thanks Mayur. So without Hadoop or any other distributed file system, by running

val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count

we can only get parallelization within the computer where the file is loaded, not parallelization across the computers in the cluster (Spark cannot automatically duplicate the file to the other computers in the cluster). Is this understanding correct? Thank you.
Re: Need help about how hadoop works.
Spark will not distribute that file for you to other systems; however, if the file (/home/scalatest.txt) is present at the same path on all systems, it will be processed on all nodes. We generally use HDFS, which takes care of this distribution.

Prashant Sharma
Re: Re: how to set spark.executor.memory and heap size
I tried, but no effect.
Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN
Good to know, thanks for pointing this out to me!

On 23/04/2014 19:55, Sandy Ryza wrote: Ah, you're right about SPARK_CLASSPATH and ADD_JARS. My bad. SPARK_YARN_APP_JAR is going away entirely: https://issues.apache.org/jira/browse/SPARK-1053

On Wed, Apr 23, 2014 at 8:07 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hi Sandy, thanks for your reply! I thought adding the jars to both SPARK_CLASSPATH and ADD_JARS was only required as a temporary workaround in Spark 0.9.0 (see https://issues.apache.org/jira/browse/SPARK-1089), and that it was not necessary anymore in 0.9.1. As for SPARK_YARN_APP_JAR, is it really useful, or is it planned to be removed in future versions of Spark? I personally always set it to /dev/null when launching a spark-shell in yarn-client mode. Thanks again for your time! Christophe.

On 21/04/2014 19:16, Sandy Ryza wrote: Hi Christophe, adding the jars to both SPARK_CLASSPATH and ADD_JARS is required. The former makes them available to the spark-shell driver process, and the latter tells Spark to make them available to the executor processes running on the cluster. -Sandy

On Wed, Apr 16, 2014 at 9:27 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hi, I am running Spark 0.9.1 on a YARN cluster, and I am wondering what the correct way is to add external jars when running a spark-shell on a YARN cluster. Packaging all these dependencies in an assembly whose path is then set in SPARK_YARN_APP_JAR (as described in the doc: http://spark.apache.org/docs/latest/running-on-yarn.html) does not work in my case: it pushes the jar to HDFS in .sparkStaging/application_XXX, but the spark-shell is still unable to find it (unless ADD_JARS and/or SPARK_CLASSPATH is defined). Defining all the dependencies (either in an assembly, or separately) in ADD_JARS or SPARK_CLASSPATH works (even if SPARK_YARN_APP_JAR is set to /dev/null), but defining some dependencies in ADD_JARS and the rest in SPARK_CLASSPATH does not! Hence I'm still wondering what the differences are between ADD_JARS and SPARK_CLASSPATH, and the purpose of SPARK_YARN_APP_JAR. Thanks for any insights! Christophe.
Re: Need help about how hadoop works.
Thank you very much for your help Prashant. Sorry, I still have another question about your answer: "however if the file (/home/scalatest.txt) is present on the same path on all systems it will be processed on all nodes." When placing the file at the same path on all nodes, do we just copy the same file to all nodes, or do we need to split the original file into different parts (each part still with the same file name scalatest.txt) and copy each part to a different node for parallelization? Thank you very much.
Re: Need help about how hadoop works.
It is the same file; the Hadoop library that we use for splitting takes care of assigning the right split to each node.

Prashant Sharma
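To make this concrete, a small sketch of the scenario under discussion (the path is taken from the thread; count is just an example action). Each of the 5 splits becomes a task, and a task reads only its own byte range of the copy on whichever node it runs, which is why the whole file must exist at the same path everywhere when no HDFS is used:

val doc = sc.textFile("/home/scalatest.txt", 5)
println(doc.count())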
Re: how to set spark.executor.memory and heap size
I think maybe it's a problem with reading a local file:

val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
val logData = sc.textFile(logFile).cache()

If I replace the above code with

val logData = sc.parallelize(Array(1, 2, 3, 4)).cache()

the job completes successfully. Can't I read a file located on the local file system? Does anyone know the reason?
Re: how to set spark.executor.memory and heap size
You need to use the proper URL format: file://home/wxhsdp/spark/example/standalone/README.md
Re: how to set spark.executor.memory and heap size
Sorry, wrong format: file:///home/wxhsdp/spark/example/standalone/README.md. An extra / is needed at the start.
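Putting the correction together, a hedged sketch of the read with the full URI (path from the thread):

val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md" // scheme + absolute path = three slashes
val logData = sc.textFile(logFile).cache()
println(logData.count())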
Re: how to set spark.executor.memory and heap size
Thanks for your reply, Adnan. I tried

val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"

(I think there need to be three slashes after "file:"); it behaves just the same as

val logFile = "/home/wxhsdp/spark/example/standalone/README.md"

and the error remains :(
Re: how to set spark.executor.memory and heap size
Hi, you should be able to read it; file:// or file:/// is not even required for reading locally, just the path is enough. What error message are you getting in spark-shell while reading locally? Also try reading the same file from HDFS: put your README file there and read it, it works both ways:

val a = sc.textFile("hdfs://localhost:54310/t/README.md")

Also, post the stack trace from your spark-shell.
Re: how to set spark.executor.memory and heap size
Hi Arpit, in spark-shell I can read a local file properly, but when I use sbt run, the error occurs. The sbt error message is at the beginning of the thread.
RE: Need help about how hadoop works.
Thank you very much Prashant.
Re: how to set spark.executor.memory and heap size
OK fine, try it like this; I tried it and it works. Specify the Spark path in the constructor as well, and also export SPARK_JAVA_OPTS="-Xms300m -Xmx512m -XX:MaxPermSize=1g":

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/var/log/auth.log" // read any file
    val sc = new SparkContext("spark://localhost:7077", "Simple App",
      "/home/ubuntu/spark-0.9.1-incubating/",
      List("target/scala-2.10/simple-project_2.10-2.0.jar"))
    val tr = sc.textFile(logFile).cache()
    tr.take(100).foreach(println)
  }
}

This will work.
Re: error in mllib lr example code
Also try out these examples, all of them work: http://docs.sigmoidanalytics.com/index.php/MLlib. If you spot any problems in those, let us know. Regards, Arpit

On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent build of the docs; if you spot any problems in those, let us know. Matei

On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote: The doc is for 0.9.1. You are running a later snapshot, which added sparse vectors. Try LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble))). The examples are updated in the master branch; you can also check the examples there. -Xiangrui

On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote: I am trying to run the example linear regression code from http://spark.apache.org/docs/latest/mllib-guide.html, but I am getting the following error... am I missing an import?

Code:

import org.apache.spark._
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

object ModelLR {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "SparkLR", System.getenv("SPARK_HOME"),
      SparkContext.jarOfClass(this.getClass).toSeq)
    // Load and parse the data
    val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
    }
    ...snip...
  }
}

Error:

- polymorphic expression cannot be instantiated to expected type;
  found   : [U : Double]Array[U]
  required: org.apache.spark.mllib.linalg.Vector
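For completeness, a sketch of the parsing block with Xiangrui's fix applied (everything else as in the original snippet; only the feature array is wrapped in a dense Vector):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val parsedData = data.map { line =>
  val parts = line.split(',')
  // Wrap the features in a Vector instead of passing a raw Array[Double]
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}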
Re: Access Last Element of RDD
Hi All, I finally wrote the following code, which I feel is optimal if not the most optimal: using file pointers, seeking the byte after the last \n, but backwards! This is memory efficient, and I expect even the Unix tail implementation is something similar.

import java.io.RandomAccessFile
import java.io.IOException

var FILEPATH = "/home/sparkcluster/hadoop-2.3.0/temp"
var fileHandler = new RandomAccessFile(FILEPATH, "r")
var fileLength = fileHandler.length() - 1
var cond = 1
var filePointer = fileLength - 1
var toRead = -1
while (filePointer != -1 && cond != 0) {
  fileHandler.seek(filePointer)
  var readByte = fileHandler.readByte()
  if (readByte == 0xA && filePointer != fileLength) cond = 0
  else if (readByte == 0xD && filePointer != fileLength - 1) cond = 0
  filePointer = filePointer - 1
  toRead = toRead + 1
}
filePointer = filePointer + 2
var bytes: Array[Byte] = new Array[Byte](toRead)
fileHandler.seek(filePointer)
fileHandler.read(bytes)
var bdd = new String(bytes)
/* bdd contains the last line */
Re: SparkPi performance-3 cluster standalone mode
Hi, I'm relatively new to Spark and have tried running the SparkPi example on a standalone 12-core, three-machine cluster. What I'm failing to understand is that running this example with a single slice gives better performance than using 12 slices. The same was the case when I was using the parallelize function. The time scales almost linearly with each added slice. Please let me know if I'm doing anything wrong.

Regards, Ahsan Ijaz
Re: how to set spark.executor.memory and heap size
It seems it's nothing about settings. I tried a take action and found it's OK, but errors occur when I try count and collect:

val a = sc.textFile("any file")
a.take(n).foreach(println) // ok
a.count()                  // failed
a.collect()                // failed

val b = sc.parallelize(Array(1, 2, 3, 4))
b.take(n).foreach(println) // ok
b.count()                  // ok
b.collect()                // ok

It's so weird.
Re: Access Last Element of RDD
You may try this:

val lastOption = sc.textFile("input").mapPartitions { iterator =>
  if (iterator.isEmpty) {
    iterator
  } else {
    Iterator
      .continually((iterator.next(), iterator.hasNext))
      .collect { case (value, false) => value }
      .take(1)
  }
}.collect().lastOption

Iterator-based data access ensures O(1) space complexity, and it runs faster because different partitions are processed in parallel. lastOption is used instead of last to deal with an empty file.
Re: Is Spark a good choice for geospatial/GIS applications? Is a community volunteer needed in this area?
Thanks for the info. It seems like the JTS library is exactly what I need (I'm not doing any raster processing at this point). So, once they successfully finish the Scala wrappers for JTS, I would theoretically be able to use Scala to write a Spark job that includes the JTS library, and then run it across a Spark cluster? That is absolutely fantastic! I'll have to look into contributing to the JTS wrapping effort. Thanks again!

On 14-04-2014 at 2:25 PM, Josh Marcus wrote: Hey there, I'd encourage you to check out the development currently going on with the GeoTrellis project (http://github.com/geotrellis/geotrellis) or talk to the developers on IRC (freenode, #geotrellis), as they're currently developing raster processing capabilities with Spark as a backend, as well as Scala wrappers for JTS (for calculations about geometry features). --j

On Wed, Apr 23, 2014 at 2:00 PM, neveroutgunned wrote: Greetings Spark users/devs! I'm interested in using Spark to process large volumes of data with a geospatial component, and I haven't been able to find much information on Spark's ability to handle this kind of operation. I don't need anything too complex; just distance between two points, point-in-polygon and the like. Does Spark (or possibly Shark) support this kind of query? Has anyone written a plugin/extension along these lines? If there isn't anything like this so far, then it seems like I have two options. I can either abandon Spark and fall back on Hadoop and Hive with the ESRI Tools extension, or I can stick with Spark and try to write/port a GIS toolkit. Which option do you think I should pursue? How hard is it for someone that's new to the Spark codebase to write an extension? Is there anyone else in the community that would be interested in having geospatial capability in Spark? Thanks for your help!
Re: Deploying a python code on a spark EC2 cluster
Moreover, it seems all the workers are registered and have sufficient memory (2.7 GB, whereas I have asked for 512 MB). The UI also shows the jobs are running on the slaves. But in the terminal it is still the same error: "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory". Please see the screenshot: http://apache-spark-user-list.1001560.n3.nabble.com/file/n4761/33.png

Thanks
reduceByKeyAndWindow - spark internals
If I have this code:

val stream1 = doublesInputStream.window(Seconds(10), Seconds(2))
val stream2 = stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10))

does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10-second window? For example, in the first 10 seconds stream1 will have 5 RDDs. Does reduceByKeyAndWindow merge these 5 RDDs into 1 RDD and remove duplicates?

-Adrian
Re: How do I access the SPARK SQL
Did you build it with SPARK_HIVE=true?

On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru diplomaticg...@gmail.com wrote: Hi Matei, I checked out the git repository and built it. However, I'm still getting the error below; it couldn't find those SQL packages. Please advise.

package org.apache.spark.sql.api.java does not exist
[ERROR] /home/VirtualBoxImages.com/Documents/projects/errCount/src/main/java/errorCount/TransDriverSQL.java:[49,8] cannot find symbol
[ERROR] symbol : class JavaSchemaRDD

On 23 April 2014 22:09, Matei Zaharia matei.zaha...@gmail.com wrote: It's currently in the master branch, on https://github.com/apache/spark. You can check that out from git, build it with sbt/sbt assembly, and then try it out. We're also going to post some release candidates soon that will be pre-built. Matei

On Apr 23, 2014, at 1:30 PM, diplomatic Guru diplomaticg...@gmail.com wrote: Hello Team, I'm new to Spark and just came across Spark SQL, which appears to be interesting, but I'm not sure how I could get it. I know it's an alpha version, but I'm not sure if it's available to the community yet. Many thanks. Raj.
Re: How do I access the SPARK SQL
You shouldn't need to set SPARK_HIVE=true unless you want to use the JavaHiveContext. You should be able to access org.apache.spark.sql.api.java.JavaSQLContext with the default build. How are you building your application?

Michael
Re: How do I access the SPARK SQL
Looks like you're depending on Spark 0.9.1, which doesn't have Spark SQL. Assuming you've downloaded Spark, just run 'mvn install' to publish Spark locally, and depend on Spark version 1.0.0-SNAPSHOT.

On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru diplomaticg...@gmail.com wrote: It's a simple application based on the People example. I'm using Maven for building, and below is the pom.xml. Perhaps I need to change the version?

<project>
  <groupId>Uthay.Test.App</groupId>
  <artifactId>test-app</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TestApp</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <!-- Spark dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.1</version>
    </dependency>
  </dependencies>
</project>
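Since the application is based on the People example, here is a minimal Scala sketch of that example against the 1.0.0-SNAPSHOT Spark SQL API being discussed (class and method names follow the alpha docs of that snapshot; treat the file path as a placeholder):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit conversion from RDD of case classes to SchemaRDD

case class Person(name: String, age: Int)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)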
Re: How do I access the SPARK SQL
Oh, and you'll also need to add a dependency on spark-sql_2.10.

On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust mich...@databricks.com wrote: Yeah, you'll need to run `sbt publish-local` to push the jars to your local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT.
Re: How do I access the SPARK SQL
Many thanks for your prompt reply. I'll try your suggestions and will get back to you.
Re: IDE for sparkR
RStudio should be fine.
Re: Access Last Element of RDD
Thanks Cheng !!
Spark mllib throwing error
./spark-shell: line 153: 17654 Killed  $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"

Any ideas?
Re: Deploying a python code on a spark EC2 cluster
Did you launch this using our EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set up the daemons? My guess is that their hostnames are not being resolved properly on all nodes, so executor processes can't connect back to your driver app. This error message indicates that:

14/04/24 09:00:49 WARN util.Utils: Your hostname, spark-node resolves to a loopback address: 127.0.0.1; using 10.74.149.251 instead (on interface eth0)
14/04/24 09:00:49 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address

If you launch with our EC2 scripts, or don't manually change the hostnames, this should not happen.

Matei

On Apr 24, 2014, at 11:36 AM, John King usedforprinting...@gmail.com wrote: Same problem.
Re: SparkPi performance-3 cluster standalone mode
The problem is that SparkPi uses Math.random(), which is a synchronized method, so it can't scale to multiple cores. In fact it will be slower on multiple cores due to lock contention. Try another example and you'll see better scaling. I think we'll have to update SparkPi to create a new Random in each task to avoid this. Matei On Apr 24, 2014, at 4:43 AM, Adnan nsyaq...@gmail.com wrote: Hi, Relatively new to Spark, I have tried running the SparkPi example on a standalone 12-core, three-machine cluster. What I'm failing to understand is that running this example with a single slice gives better performance than using 12 slices. The same was the case when I was using the parallelize function. The time scales almost linearly with each added slice. Please let me know if I'm doing anything wrong. The code snippet is given below: Regards, Ahsan Ijaz -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkPi-performance-3-cluster-standalone-mode-tp4530p4751.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
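A minimal sketch of the fix Matei describes, creating one Random per task instead of calling the synchronized Math.random(); the slice and sample counts are illustrative and this is not the actual SparkPi source:

import scala.util.Random

// Assumes an existing SparkContext `sc`; each task builds its own Random,
// so tasks no longer contend on the global Math.random() lock.
val slices = 12
val n = 100000 * slices
val count = sc.parallelize(1 to n, slices).mapPartitions { iter =>
  val rand = new Random()
  iter.map { _ =>
    val x = rand.nextDouble() * 2 - 1
    val y = rand.nextDouble() * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)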
Re: Spark mllib throwing error
Could you share the command you used and more of the error message? Also, is it an MLlib specific problem? -Xiangrui On Thu, Apr 24, 2014 at 11:49 AM, John King usedforprinting...@gmail.com wrote: ./spark-shell: line 153: 17654 Killed $FWDIR/bin/spark-class org.apache.spark.repl.Main $@ Any ideas?
Re: Trying to use pyspark mllib NaiveBayes
Is your Spark cluster running? Try to start with generating simple RDDs and counting. -Xiangrui On Thu, Apr 24, 2014 at 11:38 AM, John King usedforprinting...@gmail.com wrote: I receive this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py", line 178, in train
    ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd, lambda_)
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 535, in __call__
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 368, in send_command
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 361, in send_command
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 317, in _get_connection
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 324, in _create_connection
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 431, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server
Re: Spark mllib throwing error
Last command was: val model = new NaiveBayes().run(points) On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng men...@gmail.com wrote: Could you share the command you used and more of the error message? Also, is it an MLlib specific problem? -Xiangrui On Thu, Apr 24, 2014 at 11:49 AM, John King usedforprinting...@gmail.com wrote: ./spark-shell: line 153: 17654 Killed $FWDIR/bin/spark-class org.apache.spark.repl.Main $@ Any ideas?
Re: Trying to use pyspark mllib NaiveBayes
Yes, I got it running for a large RDD (~7 million lines) and mapping. I just received this error when trying to classify. On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng men...@gmail.com wrote: Is your Spark cluster running? Try to start with generating simple RDDs and counting. -Xiangrui
Re: Deploying a python code on a spark EC2 cluster
This happens to me when using the EC2 scripts for the recent v1.0.0-rc2 release. The Master connects and then disconnects immediately, eventually saying "Master disconnected from cluster". On Thu, Apr 24, 2014 at 4:01 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Did you launch this using our EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set up the daemons? My guess is that their hostnames are not being resolved properly on all nodes, so executor processes can't connect back to your driver app.
Re: How do I access the SPARK SQL
It worked!! Many thanks for your brilliant support. On 24 April 2014 18:20, diplomatic Guru diplomaticg...@gmail.com wrote: Many thanks for your prompt reply. I'll try your suggestions and will get back to you. On 24 April 2014 18:17, Michael Armbrust mich...@databricks.com wrote: Oh, and you'll also need to add a dependency on spark-sql_2.10. On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust mich...@databricks.com wrote: Yeah, you'll need to run `sbt publish-local` to push the jars to your local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT. On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru diplomaticg...@gmail.com wrote: It's a simple application based on the People example. I'm using Maven for building and below is the pom.xml. Perhaps I need to change the version?

<project>
  <groupId>Uthay.Test.App</groupId>
  <artifactId>test-app</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TestApp</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <!-- Spark dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.1</version>
    </dependency>
  </dependencies>
</project>

On 24 April 2014 17:47, Michael Armbrust mich...@databricks.com wrote: You shouldn't need to set SPARK_HIVE=true unless you want to use the JavaHiveContext. You should be able to access org.apache.spark.sql.api.java.JavaSQLContext with the default build. How are you building your application? Michael On Thu, Apr 24, 2014 at 9:17 AM, Andrew Or and...@databricks.com wrote: Did you build it with SPARK_HIVE=true? On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru diplomaticg...@gmail.com wrote: Hi Matei, I checked out the git repository and built it. However, I'm still getting the error below. It couldn't find those SQL packages. Please advise. package org.apache.spark.sql.api.java does not exist [ERROR] /home/VirtualBoxImages.com/Documents/projects/errCount/src/main/java/errorCount/TransDriverSQL.java:[49,8] cannot find symbol [ERROR] symbol: class JavaSchemaRDD Kind regards, Raj. On 23 April 2014 22:09, Matei Zaharia matei.zaha...@gmail.com wrote: It's currently in the master branch, on https://github.com/apache/spark. You can check that out from git, build it with sbt/sbt assembly, and then try it out. We're also going to post some release candidates soon that will be pre-built. Matei On Apr 23, 2014, at 1:30 PM, diplomatic Guru diplomaticg...@gmail.com wrote: Hello Team, I'm new to Spark and just came across Spark SQL, which appears interesting, but I'm not sure how I could get it. I know it's an alpha version, but I'm not sure if it's available to the community yet. Many thanks. Raj.
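A sketch of the extra dependency Michael mentions, assuming the 1.0.0-SNAPSHOT artifacts were published locally with `sbt publish-local` as described above; it would sit alongside the spark-core entry in the pom:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.0.0-SNAPSHOT</version>
</dependency>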
Re: error in mllib lr example code
Thanks Xiangrui, Matei and Arpit. It does work fine after adding Vectors.dense. I have a follow-up question, which I will post on a new thread. On Thu, Apr 24, 2014 at 2:49 AM, Arpit Tak arpi...@sigmoidanalytics.com wrote: Also try out these examples, all of them work: http://docs.sigmoidanalytics.com/index.php/MLlib If you spot any problems in those, let us know. Regards, Arpit On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent build of the docs; if you spot any problems in those, let us know. Matei On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote: The doc is for 0.9.1. You are running a later snapshot, which added sparse vectors. Try LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble))). The examples are updated in the master branch. You can also check the examples there. -Xiangrui On Wed, Apr 23, 2014 at 9:34 AM, Mohit Jaggi mohitja...@gmail.com wrote: sorry... added a subject now On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote: I am trying to run the example linear regression code from http://spark.apache.org/docs/latest/mllib-guide.html but I am getting the following error... am I missing an import?

code:

import org.apache.spark._
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

object ModelLR {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "SparkLR", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass).toSeq)
    // Load and parse the data
    val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
    }
    ...snip...
  }
}

error:

- polymorphic expression cannot be instantiated to expected type; found : [U >: Double]Array[U] required: org.apache.spark.mllib.linalg.Vector
- polymorphic expression cannot be instantiated to expected type; found : [U >: Double]Array[U] required: org.apache.spark.mllib.linalg.Vector
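A compact sketch of the corrected parsing Xiangrui suggests, assuming the lpsa.data layout used above ("label,feature1 feature2 ..."):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Build dense Vectors so the types match the snapshot's LabeledPoint API.
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}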
spark mllib to jblas calls..and comparison with VW
Folks, I am wondering how mllib interacts with jblas and lapack. Does it make copies of data from my RDD format to jblas's format? Does jblas copy it again before passing to lapack native code? I also saw some comparisons with VW and it seems mllib is slower on a single node but scales better and outperforms VW on 16 nodes. Any idea why? Are improvements in the pipeline? Mohit.
Re: Trying to use pyspark mllib NaiveBayes
I tried locally with the example described in the latest guide: http://54.82.157.211:4000/mllib-naive-bayes.html, and it worked fine. Do you mind sharing the code you used? -Xiangrui On Thu, Apr 24, 2014 at 1:57 PM, John King usedforprinting...@gmail.com wrote: Yes, I got it running for a large RDD (~7 million lines) and mapping. I just received this error when trying to classify.
Re: Spark mllib throwing error
Do you mind sharing more code and error messages? The information you provided is too little to identify the problem. -Xiangrui On Thu, Apr 24, 2014 at 1:55 PM, John King usedforprinting...@gmail.com wrote: Last command was: val model = new NaiveBayes().run(points)
Re: Trying to use pyspark mllib NaiveBayes
I was able to run simple examples as well. Which version of Spark? Did you use the most recent commit or build from branch-1.0? Some background: I tried to build both on Amazon EC2, but the master kept disconnecting from the client and the executors failed after connecting. So I tried to just use one machine with a lot of RAM. I can set up a cluster on the released 0.9.1, but I need the sparse vector representations, as my data is very sparse. Is there any way I can access a version of 1.0 that doesn't have to be compiled and is proven to work on EC2? My code:

import numpy
from numpy import array, dot, shape
from pyspark import SparkContext
from math import exp, log
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint, LinearModel

def isSpace(line):
    if line.isspace() or not line.strip():
        return True

sizeOfDict = 2357815

def parsePoint(line):
    values = line.split('\t')
    feat = values[1].split(' ')
    features = {}
    for f in feat:
        f = f.split(':')
        if len(f) > 1:
            features[f[0]] = f[1]
    return LabeledPoint(float(values[0]), SparseVector(sizeOfDict, features))

data = sc.textFile(".../data.txt", 6)
empty = data.filter(lambda x: not isSpace(x))  # I had an extra new line between each line
points = empty.map(parsePoint)
model = NaiveBayes.train(points)

On Thu, Apr 24, 2014 at 6:55 PM, Xiangrui Meng men...@gmail.com wrote: I tried locally with the example described in the latest guide: http://54.82.157.211:4000/mllib-naive-bayes.html, and it worked fine. Do you mind sharing the code you used? -Xiangrui
Re: Spark mllib throwing error
In the other thread I had an issue with Python. In this issue, I tried switching to Scala. The code is:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.classification.NaiveBayes
import scala.collection.mutable.ArrayBuffer

def isEmpty(a: String): Boolean = a != null && !a.replaceAll("(?m)\\s+$", "").isEmpty()

def parsePoint(a: String): LabeledPoint = {
  val values = a.split('\t')
  val feat = values(1).split(' ')
  val indices = ArrayBuffer.empty[Int]
  val featValues = ArrayBuffer.empty[Double]
  for (f <- feat) {
    val q = f.split(':')
    if (q.length == 2) {
      indices += q(0).toInt
      featValues += q(1).toDouble
    }
  }
  val vector = new SparseVector(2357815, indices.toArray, featValues.toArray)
  return LabeledPoint(values(0).toDouble, vector)
}

val data = sc.textFile("data.txt")
val empty = data.filter(isEmpty)
val points = empty.map(parsePoint)
points.cache()
val model = new NaiveBayes().run(points)

On Thu, Apr 24, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote: Do you mind sharing more code and error messages? The information you provided is too little to identify the problem. -Xiangrui
Re: Trying to use pyspark mllib NaiveBayes
Also, when will the official 1.0 be released? On Thu, Apr 24, 2014 at 7:04 PM, John King usedforprinting...@gmail.com wrote: I was able to run simple examples as well. Which version of Spark? Did you use the most recent commit or build from branch-1.0?
Re: Spark mllib throwing error
I don't see anything wrong with your code. Could you do points.count() to see how many training examples you have? Also, make sure you don't have negative feature values. The error message you sent did not say NaiveBayes went wrong, but the Spark shell was killed. -Xiangrui On Thu, Apr 24, 2014 at 4:05 PM, John King usedforprinting...@gmail.com wrote: In the other thread I had an issue with Python. In this issue, I tried switching to Scala.
Re: Spark mllib throwing error
It just displayed this error and stopped on its own. Do the lines of code mentioned in the error have anything to do with it? On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng men...@gmail.com wrote: I don't see anything wrong with your code. Could you do points.count() to see how many training examples you have? Also, make sure you don't have negative feature values. The error message you sent did not say NaiveBayes went wrong, but the Spark shell was killed. -Xiangrui
Re: how to set spark.executor.memory and heap size
Does anyone know the reason? I've googled a bit and found some people with the same problem, but with no replies... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4796.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: how to set spark.executor.memory and heap size
I noticed that the error occurs at:

org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

Is it related to the warning below?

14/04/25 08:38:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/25 08:38:36 WARN snappy.LoadSnappy: Snappy native library not loaded

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4798.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: compile spark 0.9.1 in hadoop 2.2 above exception
Try running sbt/sbt clean and re-compiling. Any luck? On Thu, Apr 24, 2014 at 5:33 PM, martin.ou martin...@orchestrallinc.cn wrote: An exception occurs when compiling Spark 0.9.1 using sbt; env: Hadoop 2.3. 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly 2. The exception: [error] found: org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) Best Regards, 欧大鹏 Martin.Ou, Delivery Center (交付中心), 新世基(太仓)科技服务有限公司 Orchestrall, Inc. http://www.orchestrallinc.com/ +86 18962621057 (Mobile) 0512-53866019 (Office) This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.
Re: Finding bad data
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it. Look at the stderr file of the executor on that machine, and you’ll see lines like this: 14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000 This says what file it was reading, as well as what byte offset (that’s the 0+2000 part). Unfortunately, because the executor is running multiple tasks at the same time, this message will be hard to associate with a particular task unless you only configure one core per executor. But it may help you spot the file. The other way you might do it is a map() on the data before you process it that checks for error conditions. In that one you could print out the original input line. I realize that neither of these is ideal. I’ve opened https://issues.apache.org/jira/browse/SPARK-1622 to try to expose this information somewhere else, ideally in the UI. The reason it wasn’t done so far is because some tasks in Spark can be reading from multiple Hadoop InputSplits (e.g. if you use coalesce(), or zip(), or similar), so it’s tough to do it in a super general setting. Matei On Apr 24, 2014, at 6:15 PM, Jim Blomo jim.bl...@gmail.com wrote: I'm using PySpark to load some data and getting an error while parsing it. Is it possible to find the source file and line of the bad data? I imagine that this would be extremely tricky when dealing with multiple derived RDDs, so an answer with the caveat of "this only works when running .map() on a textFile() RDD" is totally fine. Perhaps if the line number and file were available in pyspark I could catch the exception and output it with the context? Any way to narrow down the problem input would be great. Thanks!
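A minimal sketch of the defensive map() Matei suggests, catching parse failures and printing the offending input; `parse` here is a hypothetical stand-in for whatever per-line parsing the job actually does:

// Keep successfully parsed records; print the raw line for any failure.
def parse(line: String): Int = line.trim.toInt // hypothetical parser

val parsed = sc.textFile("input.txt").flatMap { line =>
  try {
    Some(parse(line))
  } catch {
    case e: Exception =>
      System.err.println("Bad record: " + line + " (" + e + ")")
      None
  }
}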
parallelize for a large Seq is extremely slow.
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping") This line is too slow. There are about 2 million elements in word_mapping. Is there a good style for writing a large collection to HDFS?

import org.apache.spark._
import SparkContext._
import scala.io.Source

object WFilter {
  def main(args: Array[String]) {
    val spark = new SparkContext("yarn-standalone", "word filter", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
    val stopset = Source.fromURL(this.getClass.getResource("stoplist.txt")).getLines.map(_.trim).toSet
    val file = spark.textFile("hdfs://ns1/nlp/wiki.splited")
    val tf_map = spark broadcast file.flatMap(_.split("\t")).map((_, 1)).countByKey
    val df_map = spark broadcast file.map(x => Set(x.split("\t"): _*)).flatMap(_.map(_ -> 1)).countByKey
    val word_mapping = spark broadcast Map(df_map.value.keys.zipWithIndex.toBuffer: _*)
    def w_filter(w: String) = if (tf_map.value(w) < 8 || df_map.value(w) < 4 || (stopset contains w)) false else true
    val mapped = file.map(_.split("\t").filter(w_filter).map(w => word_mapping.value(w)).mkString("\t"))
    spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
    mapped.saveAsTextFile("hdfs://ns1/nlp/lda/wiki.docs")
    spark.stop()
  }
}

many thx:) -- ~ Perfection is achieved not when there is nothing more to add but when there is nothing left to take away
Re: parallelize for a large Seq is extremely slow.
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see http://spark.apache.org/docs/0.9.1/tuning.html); it should be considerably faster. Matei On Apr 24, 2014, at 8:01 PM, Earthson Lu earthson...@gmail.com wrote: spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping") This line is too slow. There are about 2 million elements in word_mapping. Is there a good style for writing a large collection to HDFS?
Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark
Hi All, I have a problem with the Item-Based Collaborative Filtering Recommendation Algorithms in Spark. The basic flow is as below:

RDD1 ==>
(Item1, (User1, Score1))
(Item2, (User2, Score2))
(Item1, (User2, Score3))
(Item2, (User1, Score4))

RDD1.groupByKey ==> RDD2
(Item1, ((User1, Score1), (User2, Score3)))
(Item2, ((User1, Score4), (User2, Score2)))

The similarity of the vectors ((User1, Score1), (User2, Score3)) and ((User1, Score4), (User2, Score2)) is the similarity of Item1 and Item2. In my situation, RDD2 contains 20 million records, and my Spark program is extremely slow. The source code is as below:

val conf = new SparkConf().setMaster("spark://211.151.121.184:7077").setAppName("Score Calcu Total").set("spark.executor.memory", "20g").setJars(Seq("/home/deployer/score-calcu-assembly-1.0.jar"))
val sc = new SparkContext(conf)
val mongoRDD = sc.textFile(args(0).toString, 400)
val jsonRDD = mongoRDD.map(arg => new JSONObject(arg))
val newRDD = jsonRDD.map(arg => {
  var score = haha(arg.get("a").asInstanceOf[JSONObject])
  // set score to 0.5 for testing
  arg.put("score", 0.5)
  arg
})
val resourceScoresRDD = newRDD.map(arg => (arg.get("rid").toString.toLong, (arg.get("zid").toString, arg.get("score").asInstanceOf[Number].doubleValue))).groupByKey().cache()
val resourceScores = resourceScoresRDD.collect()
val bcResourceScores = sc.broadcast(resourceScores)
val simRDD = resourceScoresRDD.mapPartitions({ iter =>
  val m = bcResourceScores.value
  for {
    (r1, v1) <- iter
    (r2, v2) <- m
    if r1 > r2
  } yield (r1, r2, cosSimilarity(v1, v2))
}, true).filter(arg => arg._3 > 0.1)
println(simRDD.count)

And I saw this in the Spark Web UI: http://apache-spark-user-list.1001560.n3.nabble.com/file/n4808/QQ%E6%88%AA%E5%9B%BE20140424204018.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n4808/QQ%E6%88%AA%E5%9B%BE20140424204001.png My standalone cluster has 3 worker nodes (16 cores and 32G RAM each), and the workload of the machines in my cluster is heavy when the Spark program is running. Is there any better way to do the algorithm? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-the-Item-Based-Collaborative-Filtering-Recommendation-Algorithms-in-spark-tp4808.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: parallelize for a large Seq is extremely slow.
Kryo fails with the exception below:

com.esotericsoftware.kryo.KryoException (com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1)
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:153)
com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:146)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:79)
com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:17)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:124)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:223)
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:722)

~~~

package com.semi.nlp

import org.apache.spark._
import SparkContext._
import scala.io.Source
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Map[String, Int]])
  }
}

object WFilter2 {
  def initspark(name: String) = {
    val conf = new SparkConf()
      .setMaster("yarn-standalone")
      .setAppName(name)
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setJars(SparkContext.jarOfClass(this.getClass))
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator")
    new SparkContext(conf)
  }

  def main(args: Array[String]) {
    val spark = initspark("word filter mapping")
    val stopset = Source.fromURL(this.getClass.getResource("/stoplist.txt")).getLines.map(_.trim).toSet
    val file = spark.textFile("hdfs://ns1/nlp/wiki.splited")
    val tf_map = spark broadcast file.flatMap(_.split("\t")).map((_, 1)).countByKey
    val df_map = spark broadcast file.map(x => Set(x.split("\t"): _*)).flatMap(_.map(_ -> 1)).countByKey
    val word_mapping = spark broadcast Map(df_map.value.keys.zipWithIndex.toBuffer: _*)
    def w_filter(w: String) = if (tf_map.value(w) < 8 || df_map.value(w) < 4 || (stopset contains w)) false else true
    val mapped = file.map(_.split("\t").filter(w_filter).map(w => word_mapping.value(w)).mkString("\t"))
    spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
    mapped.saveAsTextFile("hdfs://ns1/nlp/lda/wiki.docs")
    spark.stop()
  }
}

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4809.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
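The "Buffer overflow. Available: 0, required: 1" suggests Kryo's fixed-size output buffer is too small for the serialized map. A hedged sketch of one thing to try, assuming the spark.kryoserializer.buffer.mb setting of this Spark generation (the 64 MB value is purely illustrative):

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator")
  .set("spark.kryoserializer.buffer.mb", "64") // enlarge Kryo's serialization buffer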
Re: Spark running slow for small hadoop files of 10 mb size
Thanks for the reply. It indeed increased the usage. There was another issue we found: we were broadcasting the Hadoop configuration by writing a wrapper class over it, but then found the proper way in the Spark code: sc.broadcast(new SerializableWritable(conf)) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-running-slow-for-small-hadoop-files-of-10-mb-size-tp4526p4811.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark mllib throwing error
I only see one risk: if your feature indices are not sorted, it might have undefined behavior. Other than that, I don't see anything suspicious. -Xiangrui On Thu, Apr 24, 2014 at 4:56 PM, John King usedforprinting...@gmail.com wrote: It just displayed this error and stopped on its own. Do the lines of code mentioned in the error have anything to do with it?
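A short sketch of guarding against the unsorted-indices risk Xiangrui flags, sorting the (index, value) pairs from the parsePoint shown earlier before constructing the SparseVector:

import org.apache.spark.mllib.linalg.SparseVector

// `indices` and `featValues` are the ArrayBuffers built in parsePoint above.
val pairs = indices.zip(featValues).sortBy(_._1)
val vector = new SparseVector(2357815, pairs.map(_._1).toArray, pairs.map(_._2).toArray)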
Re: how to set spark.executor.memory and heap size
Hi, I am also curious about this question. Isn't the textFile function supposed to read an HDFS file? In this case the file was taken from the local filesystem. Is there a recognized way to distinguish the local filesystem from HDFS in the textFile function? Besides, the OOM exception is really strange. Keeping an eye on this. 2014-04-25 13:10 GMT+08:00 Sean Owen so...@cloudera.com: On Fri, Apr 25, 2014 at 2:20 AM, wxhsdp wxh...@gmail.com wrote: 14/04/25 08:38:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/04/25 08:38:36 WARN snappy.LoadSnappy: Snappy native library not loaded Since this comes up regularly -- these warnings from Hadoop are entirely safe to ignore for development and testing.
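On the scheme question, a brief illustration (the paths are made up): textFile resolves paths through Hadoop's filesystem URIs, so the scheme prefix selects the filesystem:

val localRDD = sc.textFile("file:///home/user/data.txt") // local filesystem
val hdfsRDD = sc.textFile("hdfs://ns1/nlp/wiki.splited")  // HDFS
// With no scheme, the path falls back to Hadoop's configured default filesystem.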