Re: Access Last Element of RDD
You can use rdd.takeOrdered(1)(reverseOrdering), where reverseOrdering is your Ordering[T] instance defining the ordering logic; you pass it to the method.

On Thu, Apr 24, 2014 at 11:21 AM, Frank Austin Nothaft fnoth...@berkeley.edu wrote: If you do this, you could simplify to RDD.collect().last. However, this has the problem of collecting all data to the driver. Is your data sorted? If so, you could reverse the sort and take the first. Alternatively, a hacky implementation might involve a mapPartitionsWithIndex that returns an empty iterator for all partitions except the last. For the last partition, you would filter out all elements except the last element in your iterator. This should leave one element, which is your last element.

On Apr 23, 2014, at 10:44 PM, Adnan Yaqoob nsyaq...@gmail.com wrote: This function will return a Scala List; you can use List's last function to get the last element. For example: RDD.take(RDD.count()).last

On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD. I want only to access the last element.

On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Oh ya, thanks Adnan.

On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote: You can use the following code: RDD.take(RDD.count())

On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi All, some help! RDD.first or RDD.take(1) gives the first item; is there a straightforward way to access the last element in a similar way? I couldn't find a tail/last method for RDD.

-- Sourav Chandra, Senior Software Engineer, Livestream
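For reference, a minimal sketch of the takeOrdered/top approach suggested above (the RDD contents here are illustrative, not from the thread):

val rdd = sc.parallelize(1 to 100)
// takeOrdered(1) with a reversed Ordering returns the single largest element,
// i.e. the "last" element under the natural ordering.
val last = rdd.takeOrdered(1)(Ordering[Int].reverse).head
// rdd.top(1).head is equivalent, since top applies the reversed ordering internally.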
Re: Access Last Element of RDD
The same thing can also be done using rdd.top(1)(reverseOrdering).

-- Sourav Chandra, Senior Software Engineer, Livestream
Re: Access Last Element of RDD
Thanks Guys !
Re: how to set spark.executor.memory and heap size
Thank you. I added setJars, but nothing changes:

val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Simple App")
  .set("spark.executor.memory", "1g")
  .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar"))
val sc = new SparkContext(conf)
Re: Re: how to set spark.executor.memory and heap size
Try the complete path.

qinwei
Re: Need help about how hadoop works.
Thanks Mayur. So without Hadoop or any other distributed file system, by running

val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count

we can only get parallelization within the computer where the file is loaded, not parallelization across the computers in the cluster (Spark cannot automatically duplicate the file to the other computers in the cluster). Is this understanding correct? Thank you.
Re: Need help about how hadoop works.
Spark will not distribute that file for you to other systems; however, if the file (/home/scalatest.txt) is present at the same path on all systems, it will be processed on all nodes. We generally use HDFS, which takes care of this distribution.

Prashant Sharma
Re: Re: how to set spark.executor.memory and heap size
I tried, but no effect.
Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN
Good to know, thanks for pointing this out to me!

On 23/04/2014 19:55, Sandy Ryza wrote: Ah, you're right about SPARK_CLASSPATH and ADD_JARS. My bad. SPARK_YARN_APP_JAR is going away entirely: https://issues.apache.org/jira/browse/SPARK-1053

On Wed, Apr 23, 2014 at 8:07 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hi Sandy, thanks for your reply! I thought adding the jars to both SPARK_CLASSPATH and ADD_JARS was only required as a temporary workaround in Spark 0.9.0 (see https://issues.apache.org/jira/browse/SPARK-1089), and that it was not necessary anymore in 0.9.1. As for SPARK_YARN_APP_JAR, is it really useful, or is it planned to be removed in future versions of Spark? I personally always set it to /dev/null when launching a spark-shell in yarn-client mode. Thanks again for your time! Christophe.

On 21/04/2014 19:16, Sandy Ryza wrote: Hi Christophe, adding the jars to both SPARK_CLASSPATH and ADD_JARS is required. The former makes them available to the spark-shell driver process, and the latter tells Spark to make them available to the executor processes running on the cluster. -Sandy

On Wed, Apr 16, 2014 at 9:27 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hi, I am running Spark 0.9.1 on a YARN cluster, and I am wondering what the correct way is to add external jars when running a spark-shell on a YARN cluster. Packaging all these dependencies in an assembly whose path is then set in SPARK_YARN_APP_JAR (as described in the doc: http://spark.apache.org/docs/latest/running-on-yarn.html) does not work in my case: it pushes the jar to HDFS in .sparkStaging/application_XXX, but the spark-shell is still unable to find it (unless ADD_JARS and/or SPARK_CLASSPATH is defined). Defining all the dependencies (either in an assembly, or separately) in ADD_JARS or SPARK_CLASSPATH works (even if SPARK_YARN_APP_JAR is set to /dev/null), but defining some dependencies in ADD_JARS and the rest in SPARK_CLASSPATH does not! Hence I'm still wondering what the differences are between ADD_JARS and SPARK_CLASSPATH, and the purpose of SPARK_YARN_APP_JAR. Thanks for any insights! Christophe.
Re: Need help about how hadoop works.
Thank you very much for your help Prashant. Sorry, I still have another question about your answer: "however if the file (/home/scalatest.txt) is present on the same path on all systems it will be processed on all nodes." When placing the file at the same path on all nodes, do we just copy the same file to all nodes, or do we need to split the original file into different parts (each part still with the same file name scalatest.txt) and copy each part to a different node for parallelization? Thank you very much.
Re: Need help about how hadoop works.
It is the same file; the Hadoop library that we use for splitting takes care of assigning the right split to each node.

Prashant Sharma
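To make this concrete, a small sketch of the scenario under discussion (the path is taken from the thread; count is just an example action). Each of the 5 splits becomes a task, and a task reads only its own byte range of the copy on whichever node it runs, which is why the whole file must exist at the same path everywhere when no HDFS is used:

val doc = sc.textFile("/home/scalatest.txt", 5)
println(doc.count())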
Re: how to set spark.executor.memory and heap size
I think maybe it's a problem with reading a local file:

val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
val logData = sc.textFile(logFile).cache()

If I replace the above code with

val logData = sc.parallelize(Array(1, 2, 3, 4)).cache()

the job completes successfully. Can't I read a file located on the local file system? Does anyone know the reason?
Re: how to set spark.executor.memory and heap size
You need to use the proper URL format: file://home/wxhsdp/spark/example/standalone/README.md
Re: how to set spark.executor.memory and heap size
Sorry, wrong format: file:///home/wxhsdp/spark/example/standalone/README.md. An extra / is needed at the start.
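Putting the correction together, a hedged sketch of the read with the full URI (path from the thread):

val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md" // scheme + absolute path = three slashes
val logData = sc.textFile(logFile).cache()
println(logData.count())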
Re: how to set spark.executor.memory and heap size
Thanks for your reply, Adnan. I tried

val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"

(I think there need to be three slashes after "file:"); it behaves just the same as

val logFile = "/home/wxhsdp/spark/example/standalone/README.md"

and the error remains :(
Re: how to set spark.executor.memory and heap size
Hi, you should be able to read it; file:// or file:/// is not even required for reading locally, just the path is enough. What error message are you getting in spark-shell while reading locally? Also try reading the same file from HDFS: put your README file there and read it, it works both ways:

val a = sc.textFile("hdfs://localhost:54310/t/README.md")

Also, post the stack trace from your spark-shell.
Re: how to set spark.executor.memory and heap size
Hi Arpit, in spark-shell I can read a local file properly, but when I use sbt run, the error occurs. The sbt error message is at the beginning of the thread.
RE: Need help about how hadoop works.
Thank you very much Prashant.
Re: how to set spark.executor.memory and heap size
OK fine, try it like this; I tried it and it works. Specify the Spark path in the constructor as well, and also export SPARK_JAVA_OPTS="-Xms300m -Xmx512m -XX:MaxPermSize=1g":

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/var/log/auth.log" // read any file
    val sc = new SparkContext("spark://localhost:7077", "Simple App",
      "/home/ubuntu/spark-0.9.1-incubating/",
      List("target/scala-2.10/simple-project_2.10-2.0.jar"))
    val tr = sc.textFile(logFile).cache()
    tr.take(100).foreach(println)
  }
}

This will work.
Re: error in mllib lr example code
Also try out these examples, all of them work: http://docs.sigmoidanalytics.com/index.php/MLlib. If you spot any problems in those, let us know. Regards, Arpit

On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent build of the docs; if you spot any problems in those, let us know. Matei

On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote: The doc is for 0.9.1. You are running a later snapshot, which added sparse vectors. Try LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble))). The examples are updated in the master branch; you can also check the examples there. -Xiangrui

On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote: I am trying to run the example linear regression code from http://spark.apache.org/docs/latest/mllib-guide.html, but I am getting the following error... am I missing an import?

Code:

import org.apache.spark._
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

object ModelLR {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "SparkLR", System.getenv("SPARK_HOME"),
      SparkContext.jarOfClass(this.getClass).toSeq)
    // Load and parse the data
    val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
    }
    ...snip...
  }
}

Error:

- polymorphic expression cannot be instantiated to expected type;
  found   : [U : Double]Array[U]
  required: org.apache.spark.mllib.linalg.Vector
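For completeness, a sketch of the parsing block with Xiangrui's fix applied (everything else as in the original snippet; only the feature array is wrapped in a dense Vector):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val parsedData = data.map { line =>
  val parts = line.split(',')
  // Wrap the features in a Vector instead of passing a raw Array[Double]
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}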
Re: Access Last Element of RDD
Hi All, I finally wrote the following code, which I feel is optimal if not the most optimal: using file pointers, seeking the byte after the last \n, but backwards! This is memory efficient, and I expect even the Unix tail implementation is something similar.

import java.io.RandomAccessFile
import java.io.IOException

var FILEPATH = "/home/sparkcluster/hadoop-2.3.0/temp"
var fileHandler = new RandomAccessFile(FILEPATH, "r")
var fileLength = fileHandler.length() - 1
var cond = 1
var filePointer = fileLength - 1
var toRead = -1
while (filePointer != -1 && cond != 0) {
  fileHandler.seek(filePointer)
  var readByte = fileHandler.readByte()
  if (readByte == 0xA && filePointer != fileLength) cond = 0
  else if (readByte == 0xD && filePointer != fileLength - 1) cond = 0
  filePointer = filePointer - 1
  toRead = toRead + 1
}
filePointer = filePointer + 2
var bytes: Array[Byte] = new Array[Byte](toRead)
fileHandler.seek(filePointer)
fileHandler.read(bytes)
var bdd = new String(bytes)
/* bdd contains the last line */
Re: SparkPi performance-3 cluster standalone mode
Hi, I'm relatively new to Spark and have tried running the SparkPi example on a standalone 12-core, three-machine cluster. What I'm failing to understand is that running this example with a single slice gives better performance than using 12 slices. The same was the case when I was using the parallelize function. The time scales almost linearly with each added slice. Please let me know if I'm doing anything wrong.

Regards, Ahsan Ijaz
Re: how to set spark.executor.memory and heap size
It seems it's nothing about settings. I tried a take action and found it's OK, but errors occur when I try count and collect:

val a = sc.textFile("any file")
a.take(n).foreach(println) // ok
a.count()                  // failed
a.collect()                // failed

val b = sc.parallelize(Array(1, 2, 3, 4))
b.take(n).foreach(println) // ok
b.count()                  // ok
b.collect()                // ok

It's so weird.
Re: Access Last Element of RDD
You may try this:

val lastOption = sc.textFile("input").mapPartitions { iterator =>
  if (iterator.isEmpty) {
    iterator
  } else {
    Iterator
      .continually((iterator.next(), iterator.hasNext))
      .collect { case (value, false) => value }
      .take(1)
  }
}.collect().lastOption

Iterator-based data access ensures O(1) space complexity, and it runs faster because different partitions are processed in parallel. lastOption is used instead of last to deal with an empty file.
Re: Is Spark a good choice for geospatial/GIS applications? Is a community volunteer needed in this area?
Thanks for the info. It seems like the JTS library is exactly what I need (I'm not doing any raster processing at this point). So, once they successfully finish the Scala wrappers for JTS, I would theoretically be able to use Scala to write a Spark job that includes the JTS library, and then run it across a Spark cluster? That is absolutely fantastic! I'll have to look into contributing to the JTS wrapping effort. Thanks again!

On 14-04-2014 at 2:25 PM, Josh Marcus wrote: Hey there, I'd encourage you to check out the development currently going on with the GeoTrellis project (http://github.com/geotrellis/geotrellis) or talk to the developers on IRC (freenode, #geotrellis), as they're currently developing raster processing capabilities with Spark as a backend, as well as Scala wrappers for JTS (for calculations about geometry features). --j

On Wed, Apr 23, 2014 at 2:00 PM, neveroutgunned wrote: Greetings Spark users/devs! I'm interested in using Spark to process large volumes of data with a geospatial component, and I haven't been able to find much information on Spark's ability to handle this kind of operation. I don't need anything too complex; just distance between two points, point-in-polygon and the like. Does Spark (or possibly Shark) support this kind of query? Has anyone written a plugin/extension along these lines? If there isn't anything like this so far, then it seems like I have two options. I can either abandon Spark and fall back on Hadoop and Hive with the ESRI Tools extension, or I can stick with Spark and try to write/port a GIS toolkit. Which option do you think I should pursue? How hard is it for someone that's new to the Spark codebase to write an extension? Is there anyone else in the community that would be interested in having geospatial capability in Spark? Thanks for your help!
Re: Deploying a python code on a spark EC2 cluster
Moreover, it seems all the workers are registered and have sufficient memory (2.7 GB, whereas I have asked for 512 MB). The UI also shows the jobs are running on the slaves. But in the terminal it is still the same error: "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory". Please see the screenshot: http://apache-spark-user-list.1001560.n3.nabble.com/file/n4761/33.png

Thanks
reduceByKeyAndWindow - spark internals
If I have this code:

val stream1 = doublesInputStream.window(Seconds(10), Seconds(2))
val stream2 = stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10))

does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10-second window? For example, in the first 10 seconds stream1 will have 5 RDDs. Does reduceByKeyAndWindow merge these 5 RDDs into 1 RDD and remove duplicates?

-Adrian
Re: How do I access the SPARK SQL
Did you build it with SPARK_HIVE=true?

On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru diplomaticg...@gmail.com wrote: Hi Matei, I checked out the git repository and built it. However, I'm still getting the error below; it couldn't find those SQL packages. Please advise.

package org.apache.spark.sql.api.java does not exist
[ERROR] /home/VirtualBoxImages.com/Documents/projects/errCount/src/main/java/errorCount/TransDriverSQL.java:[49,8] cannot find symbol
[ERROR] symbol : class JavaSchemaRDD

On 23 April 2014 22:09, Matei Zaharia matei.zaha...@gmail.com wrote: It's currently in the master branch, on https://github.com/apache/spark. You can check that out from git, build it with sbt/sbt assembly, and then try it out. We're also going to post some release candidates soon that will be pre-built. Matei

On Apr 23, 2014, at 1:30 PM, diplomatic Guru diplomaticg...@gmail.com wrote: Hello Team, I'm new to Spark and just came across Spark SQL, which appears to be interesting, but I'm not sure how I could get it. I know it's an alpha version, but I'm not sure if it's available to the community yet. Many thanks. Raj.
Re: How do I access the SPARK SQL
You shouldn't need to set SPARK_HIVE=true unless you want to use the JavaHiveContext. You should be able to access org.apache.spark.sql.api.java.JavaSQLContext with the default build. How are you building your application?

Michael
Re: How do I access the SPARK SQL
Looks like you're depending on Spark 0.9.1, which doesn't have Spark SQL. Assuming you've downloaded Spark, just run 'mvn install' to publish Spark locally, and depend on Spark version 1.0.0-SNAPSHOT.

On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru diplomaticg...@gmail.com wrote: It's a simple application based on the People example. I'm using Maven for building, and below is the pom.xml. Perhaps I need to change the version?

<project>
  <groupId>Uthay.Test.App</groupId>
  <artifactId>test-app</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TestApp</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <!-- Spark dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.1</version>
    </dependency>
  </dependencies>
</project>
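Since the application is based on the People example, here is a minimal Scala sketch of that example against the 1.0.0-SNAPSHOT Spark SQL API being discussed (class and method names follow the alpha docs of that snapshot; treat the file path as a placeholder):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit conversion from RDD of case classes to SchemaRDD

case class Person(name: String, age: Int)
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)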
Re: How do I access the SPARK SQL
Oh, and you'll also need to add a dependency on spark-sql_2.10.

On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust mich...@databricks.com wrote: Yeah, you'll need to run `sbt publish-local` to push the jars to your local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT.
Re: How do I access the SPARK SQL
Many thanks for your prompt reply. I'll try your suggestions and will get back to you.
Re: IDE for sparkR
RStudio should be fine.
Re: Access Last Element of RDD
Thanks Cheng !!
Spark mllib throwing error
./spark-shell: line 153: 17654 Killed  $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"

Any ideas?
Re: Deploying a python code on a spark EC2 cluster
Did you launch this using our EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set up the daemons? My guess is that their hostnames are not being resolved properly on all nodes, so executor processes can't connect back to your driver app. This error message indicates that:

14/04/24 09:00:49 WARN util.Utils: Your hostname, spark-node resolves to a loopback address: 127.0.0.1; using 10.74.149.251 instead (on interface eth0)
14/04/24 09:00:49 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address

If you launch with our EC2 scripts, or don't manually change the hostnames, this should not happen.

Matei

On Apr 24, 2014, at 11:36 AM, John King usedforprinting...@gmail.com wrote: Same problem.
Re: SparkPi performance-3 cluster standalone mode
The problem is that SparkPi uses Math.random(), which is a synchronized method, so it can't scale to multiple cores. In fact it will be slower on multiple cores due to lock contention. Try another example and you'll see better scaling. I think we'll have to update SparkPi to create a new Random in each task to avoid this. Matei On Apr 24, 2014, at 4:43 AM, Adnan nsyaq...@gmail.com wrote: Hi, Relatively new to Spark, I have tried running the SparkPi example on a standalone 12-core, three-machine cluster. What I'm failing to understand is that running this example with a single slice gives better performance than using 12 slices. The same was the case when I was using the parallelize function. The time scales almost linearly with each added slice. Please let me know if I'm doing anything wrong. The code snippet is given below: Regards, Ahsan Ijaz -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkPi-performance-3-cluster-standalone-mode-tp4530p4751.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
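A minimal sketch of the fix Matei describes, creating one Random per task instead of calling the synchronized Math.random(); the slice and sample counts are illustrative and this is not the actual SparkPi source:

import scala.util.Random

// Assumes an existing SparkContext `sc`; each task builds its own Random,
// so tasks no longer contend on the global Math.random() lock.
val slices = 12
val n = 100000 * slices
val count = sc.parallelize(1 to n, slices).mapPartitions { iter =>
  val rand = new Random()
  iter.map { _ =>
    val x = rand.nextDouble() * 2 - 1
    val y = rand.nextDouble() * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)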
Re: Spark mllib throwing error
Could you share the command you used and more of the error message? Also, is it an MLlib specific problem? -Xiangrui On Thu, Apr 24, 2014 at 11:49 AM, John King usedforprinting...@gmail.com wrote: ./spark-shell: line 153: 17654 Killed $FWDIR/bin/spark-class org.apache.spark.repl.Main $@ Any ideas?
Re: Trying to use pyspark mllib NaiveBayes
Is your Spark cluster running? Try to start with generating simple RDDs and counting. -Xiangrui On Thu, Apr 24, 2014 at 11:38 AM, John King usedforprinting...@gmail.com wrote: I receive this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py", line 178, in train
    ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd, lambda_)
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 535, in __call__
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 368, in send_command
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 361, in send_command
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 317, in _get_connection
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 324, in _create_connection
  File "/home/ubuntu/spark-1.0.0-rc2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 431, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server
Re: Spark mllib throwing error
Last command was: val model = new NaiveBayes().run(points) On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng men...@gmail.com wrote: Could you share the command you used and more of the error message? Also, is it an MLlib specific problem? -Xiangrui On Thu, Apr 24, 2014 at 11:49 AM, John King usedforprinting...@gmail.com wrote: ./spark-shell: line 153: 17654 Killed $FWDIR/bin/spark-class org.apache.spark.repl.Main $@ Any ideas?
Re: Trying to use pyspark mllib NaiveBayes
Yes, I got it running for a large RDD (~7 million lines) and mapping. I just received this error when trying to classify. On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng men...@gmail.com wrote: Is your Spark cluster running? Try to start with generating simple RDDs and counting. -Xiangrui
Re: Deploying a python code on a spark EC2 cluster
This happens to me when using the EC2 scripts for the recent v1.0.0-rc2 release. The Master connects and then disconnects immediately, eventually saying "Master disconnected from cluster". On Thu, Apr 24, 2014 at 4:01 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Did you launch this using our EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set up the daemons? My guess is that their hostnames are not being resolved properly on all nodes, so executor processes can't connect back to your driver app.
Re: How do I access the SPARK SQL
It worked!! Many thanks for your brilliant support. On 24 April 2014 18:20, diplomatic Guru diplomaticg...@gmail.com wrote: Many thanks for your prompt reply. I'll try your suggestions and will get back to you. On 24 April 2014 18:17, Michael Armbrust mich...@databricks.com wrote: Oh, and you'll also need to add a dependency on spark-sql_2.10. On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust mich...@databricks.com wrote: Yeah, you'll need to run `sbt publish-local` to push the jars to your local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT. On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru diplomaticg...@gmail.com wrote: It's a simple application based on the People example. I'm using Maven for building and below is the pom.xml. Perhaps I need to change the version?

<project>
  <groupId>Uthay.Test.App</groupId>
  <artifactId>test-app</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>TestApp</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <!-- Spark dependency -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>0.9.1</version>
    </dependency>
  </dependencies>
</project>

On 24 April 2014 17:47, Michael Armbrust mich...@databricks.com wrote: You shouldn't need to set SPARK_HIVE=true unless you want to use the JavaHiveContext. You should be able to access org.apache.spark.sql.api.java.JavaSQLContext with the default build. How are you building your application? Michael On Thu, Apr 24, 2014 at 9:17 AM, Andrew Or and...@databricks.com wrote: Did you build it with SPARK_HIVE=true? On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru diplomaticg...@gmail.com wrote: Hi Matei, I checked out the git repository and built it. However, I'm still getting the error below. It couldn't find those SQL packages. Please advise. package org.apache.spark.sql.api.java does not exist [ERROR] /home/VirtualBoxImages.com/Documents/projects/errCount/src/main/java/errorCount/TransDriverSQL.java:[49,8] cannot find symbol [ERROR] symbol: class JavaSchemaRDD Kind regards, Raj. On 23 April 2014 22:09, Matei Zaharia matei.zaha...@gmail.com wrote: It's currently in the master branch, on https://github.com/apache/spark. You can check that out from git, build it with sbt/sbt assembly, and then try it out. We're also going to post some release candidates soon that will be pre-built. Matei On Apr 23, 2014, at 1:30 PM, diplomatic Guru diplomaticg...@gmail.com wrote: Hello Team, I'm new to Spark and just came across Spark SQL, which appears interesting, but I'm not sure how I could get it. I know it's an alpha version, but I'm not sure if it's available to the community yet. Many thanks. Raj.
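A sketch of the extra dependency Michael mentions, assuming the 1.0.0-SNAPSHOT artifacts were published locally with `sbt publish-local` as described above; it would sit alongside the spark-core entry in the pom:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.0.0-SNAPSHOT</version>
</dependency>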
Re: error in mllib lr example code
Thanks Xiangrui, Matei and Arpit. It does work fine after adding Vectors.dense. I have a follow-up question, which I will post on a new thread. On Thu, Apr 24, 2014 at 2:49 AM, Arpit Tak arpi...@sigmoidanalytics.com wrote: Also try out these examples, all of them work: http://docs.sigmoidanalytics.com/index.php/MLlib If you spot any problems in those, let us know. Regards, Arpit On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent build of the docs; if you spot any problems in those, let us know. Matei On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote: The doc is for 0.9.1. You are running a later snapshot, which added sparse vectors. Try LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(x => x.toDouble))). The examples are updated in the master branch. You can also check the examples there. -Xiangrui On Wed, Apr 23, 2014 at 9:34 AM, Mohit Jaggi mohitja...@gmail.com wrote: sorry... added a subject now On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote: I am trying to run the example linear regression code from http://spark.apache.org/docs/latest/mllib-guide.html but I am getting the following error... am I missing an import?

code:

import org.apache.spark._
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

object ModelLR {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "SparkLR", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass).toSeq)
    // Load and parse the data
    val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(x => x.toDouble).toArray)
    }
    ...snip...
  }
}

error:

- polymorphic expression cannot be instantiated to expected type; found : [U >: Double]Array[U] required: org.apache.spark.mllib.linalg.Vector
- polymorphic expression cannot be instantiated to expected type; found : [U >: Double]Array[U] required: org.apache.spark.mllib.linalg.Vector
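A compact sketch of the corrected parsing Xiangrui suggests, assuming the lpsa.data layout used above ("label,feature1 feature2 ..."):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Build dense Vectors so the types match the snapshot's LabeledPoint API.
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}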
spark mllib to jblas calls..and comparison with VW
Folks, I am wondering how mllib interacts with jblas and lapack. Does it make copies of data from my RDD format to jblas's format? Does jblas copy it again before passing to lapack native code? I also saw some comparisons with VW and it seems mllib is slower on a single node but scales better and outperforms VW on 16 nodes. Any idea why? Are improvements in the pipeline? Mohit.
Re: Trying to use pyspark mllib NaiveBayes
I tried locally with the example described in the latest guide: http://54.82.157.211:4000/mllib-naive-bayes.html, and it worked fine. Do you mind sharing the code you used? -Xiangrui On Thu, Apr 24, 2014 at 1:57 PM, John King usedforprinting...@gmail.com wrote: Yes, I got it running for a large RDD (~7 million lines) and mapping. I just received this error when trying to classify.
Re: Spark mllib throwing error
Do you mind sharing more code and error messages? The information you provided is too little to identify the problem. -Xiangrui On Thu, Apr 24, 2014 at 1:55 PM, John King usedforprinting...@gmail.com wrote: Last command was: val model = new NaiveBayes().run(points)
Re: Trying to use pyspark mllib NaiveBayes
I was able to run simple examples as well. Which version of Spark? Did you use the most recent commit or build from branch-1.0? Some background: I tried to build both on Amazon EC2, but the master kept disconnecting from the client and the executors failed after connecting. So I tried to just use one machine with a lot of RAM. I can set up a cluster on the released 0.9.1, but I need the sparse vector representations, as my data is very sparse. Is there any way I can access a version of 1.0 that doesn't have to be compiled and is proven to work on EC2? My code:

import numpy
from numpy import array, dot, shape
from pyspark import SparkContext
from math import exp, log
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint, LinearModel

def isSpace(line):
    if line.isspace() or not line.strip():
        return True

sizeOfDict = 2357815

def parsePoint(line):
    values = line.split('\t')
    feat = values[1].split(' ')
    features = {}
    for f in feat:
        f = f.split(':')
        if len(f) > 1:
            features[f[0]] = f[1]
    return LabeledPoint(float(values[0]), SparseVector(sizeOfDict, features))

data = sc.textFile(".../data.txt", 6)
empty = data.filter(lambda x: not isSpace(x))  # I had an extra new line between each line
points = empty.map(parsePoint)
model = NaiveBayes.train(points)

On Thu, Apr 24, 2014 at 6:55 PM, Xiangrui Meng men...@gmail.com wrote: I tried locally with the example described in the latest guide: http://54.82.157.211:4000/mllib-naive-bayes.html, and it worked fine. Do you mind sharing the code you used? -Xiangrui
Re: Spark mllib throwing error
In the other thread I had an issue with Python. In this issue, I tried switching to Scala. The code is:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.classification.NaiveBayes
import scala.collection.mutable.ArrayBuffer

def isEmpty(a: String): Boolean = a != null && !a.replaceAll("(?m)\\s+$", "").isEmpty()

def parsePoint(a: String): LabeledPoint = {
  val values = a.split('\t')
  val feat = values(1).split(' ')
  val indices = ArrayBuffer.empty[Int]
  val featValues = ArrayBuffer.empty[Double]
  for (f <- feat) {
    val q = f.split(':')
    if (q.length == 2) {
      indices += q(0).toInt
      featValues += q(1).toDouble
    }
  }
  val vector = new SparseVector(2357815, indices.toArray, featValues.toArray)
  return LabeledPoint(values(0).toDouble, vector)
}

val data = sc.textFile("data.txt")
val empty = data.filter(isEmpty)
val points = empty.map(parsePoint)
points.cache()
val model = new NaiveBayes().run(points)

On Thu, Apr 24, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote: Do you mind sharing more code and error messages? The information you provided is too little to identify the problem. -Xiangrui
Re: Trying to use pyspark mllib NaiveBayes
Also, when will the official 1.0 be released? On Thu, Apr 24, 2014 at 7:04 PM, John King usedforprinting...@gmail.com wrote: I was able to run simple examples as well. Which version of Spark? Did you use the most recent commit or build from branch-1.0?
Re: Spark mllib throwing error
I don't see anything wrong with your code. Could you do points.count() to see how many training examples you have? Also, make sure you don't have negative feature values. The error message you sent did not say NaiveBayes went wrong, but the Spark shell was killed. -Xiangrui On Thu, Apr 24, 2014 at 4:05 PM, John King usedforprinting...@gmail.com wrote: In the other thread I had an issue with Python. In this issue, I tried switching to Scala.
Re: Spark mllib throwing error
It just displayed this error and stopped on its own. Do the lines of code mentioned in the error have anything to do with it? On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng men...@gmail.com wrote: I don't see anything wrong with your code. Could you do points.count() to see how many training examples you have? Also, make sure you don't have negative feature values. The error message you sent did not say NaiveBayes went wrong, but the Spark shell was killed. -Xiangrui
Re: how to set spark.executor.memory and heap size
Does anyone know the reason? I've googled a bit and found some people with the same problem, but with no replies... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4796.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: how to set spark.executor.memory and heap size
I noticed that the error occurs at:

org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:77)
at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

Is it related to the warning below?

14/04/25 08:38:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/25 08:38:36 WARN snappy.LoadSnappy: Snappy native library not loaded

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4798.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: compile spark 0.9.1 in hadoop 2.2 above exception
Try running sbt/sbt clean and re-compiling. Any luck? On Thu, Apr 24, 2014 at 5:33 PM, martin.ou martin...@orchestrallinc.cn wrote: An exception occurs when compiling Spark 0.9.1 using sbt; env: Hadoop 2.3. 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly 2. The exception: [error] found: org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) Best Regards, 欧大鹏 Martin.Ou, Delivery Center (交付中心), 新世基(太仓)科技服务有限公司 Orchestrall, Inc. http://www.orchestrallinc.com/ +86 18962621057 (Mobile) 0512-53866019 (Office) This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.
Re: Finding bad data
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it. Look at the stderr file of the executor on that machine, and you’ll see lines like this: 14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000 This says what file it was reading, as well as what byte offset (that’s the 0+2000 part). Unfortunately, because the executor is running multiple tasks at the same time, this message will be hard to associate with a particular task unless you only configure one core per executor. But it may help you spot the file. The other way you might do it is a map() on the data before you process it that checks for error conditions. In that one you could print out the original input line. I realize that neither of these is ideal. I’ve opened https://issues.apache.org/jira/browse/SPARK-1622 to try to expose this information somewhere else, ideally in the UI. The reason it wasn’t done so far is because some tasks in Spark can be reading from multiple Hadoop InputSplits (e.g. if you use coalesce(), or zip(), or similar), so it’s tough to do it in a super general setting. Matei On Apr 24, 2014, at 6:15 PM, Jim Blomo jim.bl...@gmail.com wrote: I'm using PySpark to load some data and getting an error while parsing it. Is it possible to find the source file and line of the bad data? I imagine that this would be extremely tricky when dealing with multiple derived RDDs, so an answer with the caveat of "this only works when running .map() on a textFile() RDD" is totally fine. Perhaps if the line number and file were available in pyspark I could catch the exception and output it with the context? Any way to narrow down the problem input would be great. Thanks!
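A minimal sketch of the defensive map() Matei suggests, catching parse failures and printing the offending input; `parse` here is a hypothetical stand-in for whatever per-line parsing the job actually does:

// Keep successfully parsed records; print the raw line for any failure.
def parse(line: String): Int = line.trim.toInt // hypothetical parser

val parsed = sc.textFile("input.txt").flatMap { line =>
  try {
    Some(parse(line))
  } catch {
    case e: Exception =>
      System.err.println("Bad record: " + line + " (" + e + ")")
      None
  }
}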
parallelize for a large Seq is extremely slow.
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping") This line is too slow. There are about 2 million elements in word_mapping. Is there a good style for writing a large collection to HDFS?

import org.apache.spark._
import SparkContext._
import scala.io.Source

object WFilter {
  def main(args: Array[String]) {
    val spark = new SparkContext("yarn-standalone", "word filter", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
    val stopset = Source.fromURL(this.getClass.getResource("stoplist.txt")).getLines.map(_.trim).toSet
    val file = spark.textFile("hdfs://ns1/nlp/wiki.splited")
    val tf_map = spark broadcast file.flatMap(_.split("\t")).map((_, 1)).countByKey
    val df_map = spark broadcast file.map(x => Set(x.split("\t"): _*)).flatMap(_.map(_ -> 1)).countByKey
    val word_mapping = spark broadcast Map(df_map.value.keys.zipWithIndex.toBuffer: _*)
    def w_filter(w: String) = if (tf_map.value(w) < 8 || df_map.value(w) < 4 || (stopset contains w)) false else true
    val mapped = file.map(_.split("\t").filter(w_filter).map(w => word_mapping.value(w)).mkString("\t"))
    spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
    mapped.saveAsTextFile("hdfs://ns1/nlp/lda/wiki.docs")
    spark.stop()
  }
}

many thx:) -- ~ Perfection is achieved not when there is nothing more to add but when there is nothing left to take away
Re: parallelize for a large Seq is extremely slow.
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see http://spark.apache.org/docs/0.9.1/tuning.html); it should be considerably faster. Matei On Apr 24, 2014, at 8:01 PM, Earthson Lu earthson...@gmail.com wrote: spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping") This line is too slow. There are about 2 million elements in word_mapping. Is there a good style for writing a large collection to HDFS?
Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark
Hi All, I have a problem with the Item-Based Collaborative Filtering Recommendation Algorithms in Spark. The basic flow is as below:

RDD1 ==>
(Item1, (User1, Score1))
(Item2, (User2, Score2))
(Item1, (User2, Score3))
(Item2, (User1, Score4))

RDD1.groupByKey ==> RDD2
(Item1, ((User1, Score1), (User2, Score3)))
(Item2, ((User1, Score4), (User2, Score2)))

The similarity of the vectors ((User1, Score1), (User2, Score3)) and ((User1, Score4), (User2, Score2)) is the similarity of Item1 and Item2. In my situation, RDD2 contains 20 million records, and my Spark program is extremely slow. The source code is as below:

val conf = new SparkConf().setMaster("spark://211.151.121.184:7077").setAppName("Score Calcu Total").set("spark.executor.memory", "20g").setJars(Seq("/home/deployer/score-calcu-assembly-1.0.jar"))
val sc = new SparkContext(conf)
val mongoRDD = sc.textFile(args(0).toString, 400)
val jsonRDD = mongoRDD.map(arg => new JSONObject(arg))
val newRDD = jsonRDD.map(arg => {
  var score = haha(arg.get("a").asInstanceOf[JSONObject])
  // set score to 0.5 for testing
  arg.put("score", 0.5)
  arg
})
val resourceScoresRDD = newRDD.map(arg => (arg.get("rid").toString.toLong, (arg.get("zid").toString, arg.get("score").asInstanceOf[Number].doubleValue))).groupByKey().cache()
val resourceScores = resourceScoresRDD.collect()
val bcResourceScores = sc.broadcast(resourceScores)
val simRDD = resourceScoresRDD.mapPartitions({ iter =>
  val m = bcResourceScores.value
  for {
    (r1, v1) <- iter
    (r2, v2) <- m
    if r1 > r2
  } yield (r1, r2, cosSimilarity(v1, v2))
}, true).filter(arg => arg._3 > 0.1)
println(simRDD.count)

And I saw this in the Spark Web UI: http://apache-spark-user-list.1001560.n3.nabble.com/file/n4808/QQ%E6%88%AA%E5%9B%BE20140424204018.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n4808/QQ%E6%88%AA%E5%9B%BE20140424204001.png My standalone cluster has 3 worker nodes (16 cores and 32G RAM each), and the workload of the machines in my cluster is heavy when the Spark program is running. Is there any better way to do the algorithm? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-the-Item-Based-Collaborative-Filtering-Recommendation-Algorithms-in-spark-tp4808.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: parallelize for a large Seq is extremely slow.
Kryo fails with the exception below:

com.esotericsoftware.kryo.KryoException (com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1)
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:153)
com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:146)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:79)
com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:17)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:124)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:223)
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
java.lang.Thread.run(Thread.java:722)

~~~

package com.semi.nlp

import org.apache.spark._
import SparkContext._
import scala.io.Source
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Map[String, Int]])
  }
}

object WFilter2 {
  def initspark(name: String) = {
    val conf = new SparkConf()
      .setMaster("yarn-standalone")
      .setAppName(name)
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setJars(SparkContext.jarOfClass(this.getClass))
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator")
    new SparkContext(conf)
  }

  def main(args: Array[String]) {
    val spark = initspark("word filter mapping")
    val stopset = Source.fromURL(this.getClass.getResource("/stoplist.txt")).getLines.map(_.trim).toSet
    val file = spark.textFile("hdfs://ns1/nlp/wiki.splited")
    val tf_map = spark broadcast file.flatMap(_.split("\t")).map((_, 1)).countByKey
    val df_map = spark broadcast file.map(x => Set(x.split("\t"): _*)).flatMap(_.map(_ -> 1)).countByKey
    val word_mapping = spark broadcast Map(df_map.value.keys.zipWithIndex.toBuffer: _*)
    def w_filter(w: String) = if (tf_map.value(w) < 8 || df_map.value(w) < 4 || (stopset contains w)) false else true
    val mapped = file.map(_.split("\t").filter(w_filter).map(w => word_mapping.value(w)).mkString("\t"))
    spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
    mapped.saveAsTextFile("hdfs://ns1/nlp/lda/wiki.docs")
    spark.stop()
  }
}

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4809.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
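The "Buffer overflow. Available: 0, required: 1" suggests Kryo's fixed-size output buffer is too small for the serialized map. A hedged sketch of one thing to try, assuming the spark.kryoserializer.buffer.mb setting of this Spark generation (the 64 MB value is purely illustrative):

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator")
  .set("spark.kryoserializer.buffer.mb", "64") // enlarge Kryo's serialization buffer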
Re: Spark running slow for small hadoop files of 10 mb size
Thanks for the reply. It indeed increased the usage. There was another issue we found: we were broadcasting the Hadoop configuration by writing a wrapper class over it, but then found the proper way in the Spark code: sc.broadcast(new SerializableWritable(conf)) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-running-slow-for-small-hadoop-files-of-10-mb-size-tp4526p4811.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark mllib throwing error
I only see one risk: if your feature indices are not sorted, it might have undefined behavior. Other than that, I don't see anything suspicious. -Xiangrui On Thu, Apr 24, 2014 at 4:56 PM, John King usedforprinting...@gmail.com wrote: It just displayed this error and stopped on its own. Do the lines of code mentioned in the error have anything to do with it?
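A short sketch of guarding against the unsorted-indices risk Xiangrui flags, sorting the (index, value) pairs from the parsePoint shown earlier before constructing the SparseVector:

import org.apache.spark.mllib.linalg.SparseVector

// `indices` and `featValues` are the ArrayBuffers built in parsePoint above.
val pairs = indices.zip(featValues).sortBy(_._1)
val vector = new SparseVector(2357815, pairs.map(_._1).toArray, pairs.map(_._2).toArray)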
Re: how to set spark.executor.memory and heap size
Hi, I am also curious about this question. Isn't the textFile function supposed to read an HDFS file? In this case the file was taken from the local filesystem. Is there a recognized way to distinguish the local filesystem from HDFS in the textFile function? Besides, the OOM exception is really strange. Keeping an eye on this. 2014-04-25 13:10 GMT+08:00 Sean Owen so...@cloudera.com: On Fri, Apr 25, 2014 at 2:20 AM, wxhsdp wxh...@gmail.com wrote: 14/04/25 08:38:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/04/25 08:38:36 WARN snappy.LoadSnappy: Snappy native library not loaded Since this comes up regularly -- these warnings from Hadoop are entirely safe to ignore for development and testing.
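On the scheme question, a brief illustration (the paths are made up): textFile resolves paths through Hadoop's filesystem URIs, so the scheme prefix selects the filesystem:

val localRDD = sc.textFile("file:///home/user/data.txt") // local filesystem
val hdfsRDD = sc.textFile("hdfs://ns1/nlp/wiki.splited")  // HDFS
// With no scheme, the path falls back to Hadoop's configured default filesystem.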