Complex transformation on a dataframe column

2015-10-15 Thread Hao Wang
(nullable = true) The function for formatting addresses: def formatAddress(address: String): String Best regards, Hao Wang …
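
Not from the thread, but a minimal sketch of how such a String => String function could be applied to a DataFrame column as a UDF, assuming a Spark 1.5-era spark-shell where sc and sqlContext are predefined; the body of formatAddress below is a made-up stand-in, since the real one is not shown in the archive:

    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._

    // Hypothetical stand-in body; the thread's actual formatAddress is not shown.
    def formatAddress(address: String): String =
      address.trim.replaceAll("\\s+", " ")

    val df = sc.parallelize(Seq(
      (1, "12  Main St "),
      (2, " 34  Elm Ave")
    )).toDF("id", "address")

    // Wrap the plain Scala function as a UDF and replace the column with its formatted version.
    val formatAddressUdf = udf(formatAddress _)
    df.withColumn("address", formatAddressUdf($"address")).show()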

Re: How to convert dataframe to a nested StructType schema

2015-09-17 Thread Hao Wang
…Schema) > Thanks! > - Terry > On Wed, Sep 16, 2015 at 2:10 AM, Hao Wang wrote: >> Hi, I created a dataframe with 4 string columns (city, state, country, zipcode). I then applied the following nested schema to it by creating a cust…

How to convert dataframe to a nested StructType schema

2015-09-15 Thread Hao Wang
Hi, I created a dataframe with 4 string columns (city, state, country, zipcode). I then applied the following nested schema to it by creating a custom StructType. When I run df.take(5), it gives the exception below as expected. The question is how I can convert the Rows in the dataframe to conform …
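
The thread's StructType is not reproduced in the archive, but assuming the goal is to group the four flat string columns under a single nested struct, here is a sketch using the struct() function, again assuming a Spark 1.5-era spark-shell with sc and sqlContext predefined:

    import org.apache.spark.sql.functions.struct
    import sqlContext.implicits._

    val flat = sc.parallelize(Seq(
      ("Shanghai", "SH", "CN", "200240")
    )).toDF("city", "state", "country", "zipcode")

    // Build the nested column from the flat ones instead of forcing flat Rows
    // onto a nested StructType after the fact.
    val nested = flat.select(
      struct($"city", $"state", $"country", $"zipcode").as("address"))

    nested.printSchema()
    nested.take(5)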

Re: How to split log data into different files according to severity

2015-06-14 Thread Hao Wang
…<https://www.mail-archive.com/user@spark.apache.org/msg30204.html> > On June 13, 2015, at 5:41 AM, Hao Wang wrote: >> Hi, I have a bunch of large log files on Hadoop. Each line contains a log entry and its severity. Is there a way that I can …

Re: How to split log data into different files according to severity

2015-06-13 Thread Hao Wang
I am currently using filter inside a loop over all severity levels to do this, which I think is pretty inefficient: it has to read the entire data set once for each severity. I wonder if there is a more efficient way that takes just one pass over the data? Thanks. Best, Hao Wang > On Jun 13, 2015, …
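
One commonly suggested single-pass approach (a sketch, not from this thread) is to key every line by its severity and write through a Hadoop MultipleTextOutputFormat, so each severity lands in its own output directory after one scan of the data. The paths and the severity-parsing line are assumptions; sc is an existing SparkContext:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Route each record to a subdirectory named after its key (the severity).
    class SeverityOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()   // do not repeat the key inside the output lines
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
    }

    val logs = sc.textFile("hdfs:///logs/input")

    // "[ERROR] log1" -> ("ERROR", "[ERROR] log1"); assumes the bracketed prefix format.
    val bySeverity = logs.map { line =>
      (line.takeWhile(_ != ']').stripPrefix("["), line)
    }

    bySeverity.saveAsHadoopFile("hdfs:///logs/by-severity",
      classOf[String], classOf[String], classOf[SeverityOutputFormat])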

How to split log data into different files according to severity

2015-06-13 Thread Hao Wang
…[ERROR] log1 [INFO] log2 [ERROR] log3 [INFO] log4 Output: error_file [ERROR] log1 [ERROR] log3 info_file [INFO] log2 [INFO] log4 Best, Hao Wang
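
A direct (if multi-pass) way to produce that output, sketched with made-up paths and assuming sc is an existing SparkContext; the reply higher up on this page discusses why one filter per severity level re-reads the input:

    val logs = sc.textFile("hdfs:///logs/input")

    // One filter and one output directory per severity level; each filter
    // triggers a separate pass over the data.
    for (severity <- Seq("ERROR", "INFO")) {
      logs.filter(_.startsWith("[" + severity + "]"))
          .saveAsTextFile("hdfs:///logs/" + severity.toLowerCase + "_file")
    }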

Does spark-shell return results while the job is executing?

2014-09-01 Thread Hao Wang
Hi, all I am wondering, if I use spark-shell to scan a large file for lines containing "error", whether the shell returns results while the job is still executing, or only after the job has completely finished. Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University
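
For reference, a sketch of such a scan in spark-shell (the path is made up): an action like take() comes back as soon as it has gathered enough matching lines, so it may not touch the whole file, while collect() or count() only return after the entire job has finished.

    // sc is predefined in spark-shell
    val errors = sc.textFile("hdfs:///logs/big-file").filter(_.contains("error"))

    // Blocks until 10 matches are found (possibly scanning only a few partitions).
    errors.take(10).foreach(println)

    // Blocks until every partition has been processed.
    println(errors.count())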

Re: Kryo deserialisation error

2014-07-17 Thread Hao Wang
Hi, all Yes, it's the name of a Wikipedia article. I am running the WikipediaPageRank example from Spark's Bagel examples. I am wondering whether there is any relation to the Kryo buffer size. The PageRank sometimes finishes successfully and sometimes does not, because this kind of Kryo exception happens too many times, which …
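
A sketch of the Kryo settings under discussion, using the Spark 1.0-era property names; the buffer value and the master URL are illustrative, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("WikipediaPageRank")
      .setMaster("spark://master-host:7077")   // placeholder master URL
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Raise the per-object Kryo buffer from its small default if records are large.
      .set("spark.kryoserializer.buffer.mb", "64")

    val sc = new SparkContext(conf)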

Re: Kryo deserialisation error

2014-07-16 Thread Hao Wang
… On Thu, Jul 17, 2014 at 2:58 AM, Tathagata Das wrote: > Is the class that is not found in the wikipediapagerank jar? > TD > On Wed, Jul 16, 2014 at 12:32 AM, Hao Wang wrote: >> Thanks for your …

Re: Kryo deserialisation error

2014-07-16 Thread Hao Wang
Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com On Wed, Jul 16, 2014 at 1:41 PM, Tathagata Das wrote: > Are you using classes from external libraries that …
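
If external or custom classes are indeed the issue, the Spark 1.0-era way to register them with Kryo is a KryoRegistrator; a sketch, with PageLinks as a hypothetical stand-in for whatever classes the job actually ships:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical class standing in for the job's own or external types.
    case class PageLinks(title: String, outLinks: Array[String])

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[PageLinks])
      }
    }

    // On the SparkConf used to create the context:
    //   conf.set("spark.kryo.registrator", "MyKryoRegistrator")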

Re: Kryo deserialisation error

2014-07-15 Thread Hao Wang
I am running the WikipediaPageRank example in Spark and have the same problem as you: 14/07/16 11:31:06 DEBUG DAGScheduler: submitStage(Stage 6) 14/07/16 11:31:06 ERROR TaskSetManager: Task 6.0:450 failed 4 times; aborting job 14/07/16 11:31:06 INFO DAGScheduler: Failed to run foreach at Bagel.scala …

Akka listens on hostname while user may spark-submit with master as an IP URL

2014-06-15 Thread Hao Wang
Hi, all In Spark, "spark.driver.host" defaults to the driver's hostname, so the Akka actor system will listen on a URL like akka.tcp://hostname:port. However, when a user runs an application with spark-submit, the user may set "--master spark://192.168.1.12:7077". Then, the *AppClient* in *S…
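
A sketch of the kind of consistency being discussed (host names are placeholders): use the exact spark:// URL the master announces, and, if necessary, pin the address that other actors should use to reach the driver:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("example")
      // Same form (hostname vs. IP) that the master itself announces in its logs/UI.
      .setMaster("spark://master-host:7077")
      // Address the master and executors should use to call back into the driver.
      .set("spark.driver.host", "driver-host")

    val sc = new SparkContext(conf)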

Re: long GC pause during file.cache()

2014-06-15 Thread Hao Wang
Hi, Wei You may try to set JVM opts in *spark-env.sh* as follows to prevent or mitigate GC pauses: export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m" There are more options you could add, please just Google :) Regards, Wang Hao(王灏) CloudTeam | S…
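
Besides the JVM flags, a storage-level change can also shorten the pauses; a sketch assuming sc is an existing SparkContext and the path is made up:

    import org.apache.spark.storage.StorageLevel

    // Serialized caching keeps far fewer long-lived objects on the heap than a
    // plain cache(), which usually means shorter GC pauses at some extra CPU cost.
    val file = sc.textFile("hdfs:///data/large-file")
      .persist(StorageLevel.MEMORY_ONLY_SER)

    file.count()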

Re: how to set spark.executor.memory and heap size

2014-06-13 Thread Hao Wang
Hi, Laurent You could set spark.executor.memory and the heap size with the following methods: 1. in your conf/spark-env.sh: *export SPARK_WORKER_MEMORY=38g* *export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m"* 2. you could also add modifications to …
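
Another option (a sketch; sizes and the master URL are illustrative, not from the thread) is to set the values programmatically on the SparkConf before the context is created:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("memory-tuning-sketch")
      .setMaster("spark://master-host:7077")   // placeholder
      .set("spark.executor.memory", "2g")      // per-executor heap; do not pass -Xmx here
      .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -XX:MaxPermSize=256m")

    val sc = new SparkContext(conf)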

BUG? Why does MASTER have to be set to spark://hostname:port?

2014-06-13 Thread Hao Wang
Hi, all When I try to run Spark PageRank using: ./bin/spark-submit \ --master spark://192.168.1.12:7077 \ --class org.apache.spark.examples.bagel.WikipediaPageRank \ ~/Documents/Scala/WikiPageRank/target/scala-2.10/wikipagerank_2.10-1.0.jar \ hdfs://192.168.1.12:9000/freebase-13G 0.05 100 True …

Re: Spark 1.0.0 Standalone AppClient cannot connect to Master

2014-06-12 Thread Hao Wang
…This is not removed. We moved it here: > http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html > If you're building with SBT, and you don't specify the SPARK_HADOOP_VERSION, then it defaults to 1.0.4. > Andrew > 2014-06-12 6:24 GMT-07:00 …

Spark 1.0.0 Standalone AppClient cannot connect to Master

2014-06-12 Thread Hao Wang
Hi, all Why does the Spark 1.0.0 official doc no longer describe how to build Spark against a specific Hadoop version? Does it mean that I don't need to specify the Hadoop version when I build Spark 1.0.0 with `sbt/sbt assembly`? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai …

Re: com.google.protobuf out of memory

2014-05-25 Thread Hao Wang
Hi, Zuhair In my experience, you could try the following steps to avoid Spark OOM: 1. Increase JVM memory by adding export SPARK_JAVA_OPTS="-Xmx2g" 2. Use .persist(storage.StorageLevel.MEMORY_AND_DISK) instead of .cache() 3. Have you set the spark.executor.memory value? It's 512m by default …

OutOfMemory: Failed on spark/examples/bagel/WikipediaPageRank.scala

2014-05-19 Thread Hao Wang
Hi, all I am running a 30GB Wikipedia dataset on a 7-server cluster, using WikipediaPageRank under example/Bagel. My Spark version is bae07e3 [behind 1] fix different versions of commons-lang dependency and apache/spark#746 addendum The problem is that the job will fail after several stages because …

Re: java.lang.NoClassDefFoundError: org/apache/spark/deploy/worker/Worker

2014-05-19 Thread Hao Wang
… On Sun, May 18, 2014 at 1:52 PM, Hao Wang wrote: > Hi, all > *Spark version: bae07e3 [behind 1] fix different versions of commons-lang dependency and apache/spark#746 addendum* > I have six worker nodes and four of them have this NoClassDefFoundError when I use …

How to compile the examples directory?

2014-05-19 Thread Hao Wang
Hi, I am running some examples of Spark on a cluster. Because I need to modify some source code, I have to re-compile the whole of Spark using `sbt/sbt assembly`, which takes a long time. I have tried `mvn package` under the examples directory, but it failed because of some dependency problems. Any way …

For performance, does Spark prefer Oracle JDK or OpenJDK?

2014-05-18 Thread Hao Wang
Hi, Oracle JDK and OpenJDK, which one is better or preferred for Spark? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com

java.lang.NoClassDefFoundError: org/apache/spark/deploy/worker/Worker

2014-05-17 Thread Hao Wang
Hi, all *Spark version: bae07e3 [behind 1] fix different versions of commons-lang dependency and apache/spark#746 addendum* I have six worker nodes and four of them have this NoClassDefFoundError when I run start-slaves.sh on my driver node. However, running ./bin/spark-class org.apache.spark.deploy.worker.Worker …