Nested Complex Type Data Parsing and Transforming to table

2014-11-12 Thread luohui20001
Hi, I got a problem when reading a text file which contains nested complex-type data and ran into a type mismatch problem. Any hint will be appreciated. The problem takes place at map(s = s.map as type mismatch; found :

Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread luohui20001
To Saisai: it works after I corrected some of them following your advice, as below. Furthermore, I am not quite clear about which code runs on the driver and which code runs on the executors, so I wrote my understanding in comments. Would you help check? Thank you. To Akhil:
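A minimal sketch of one common way to write a DStream back to Kafka (not necessarily the exact fix discussed in this thread; the broker address, topic and helper name are hypothetical). The code that builds the DStream runs on the driver, while everything inside foreachPartition, including the KafkaProducer, runs on the executors.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical helper: write every record of a DStream back to Kafka.
    def writeToKafka(out: DStream[String], brokers: String, topic: String): Unit = {
      // This block is evaluated on the driver when the streaming job is set up...
      out.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // ...but everything inside foreachPartition runs on an executor,
          // so the (non-serializable) producer is created there, once per partition.
          val props = new Properties()
          props.put("bootstrap.servers", brokers)
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          records.foreach(r => producer.send(new ProducerRecord[String, String](topic, r)))
          producer.close()
        }
      }
    }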

Fwd: How SparkStreaming output messages to Kafka?

2015-03-29 Thread luohui20001
Hi guys, I am using SparkStreaming to receive messages from Kafka, process them, and then send them back to Kafka. However, the Kafka consumer cannot receive any messages. Anyone have ideas to share? Here is my code: object SparkStreamingSampleDirectApproach { def main(args: Array[String]): Unit =

Re: Re: Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread luohui20001
BTW, what's the matter with the warning below? I'm not quite clear about KafkaRDD: WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset skipping topic1 0. Does this warning occur because I started the consumer without the --from-beginning param?

Re: Re: Re: Re: Re: How SparkStreaming output messages to Kafka?

2015-03-30 Thread luohui20001
Got it. Thank you. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Saisai Shao sai.sai.s...@gmail.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: Re: Re: Re: How SparkStreaming output messages to Kafka? Date: 2015-03-30 17:05

SparkStreaming batch processing time question

2015-04-01 Thread luohui20001
Hi guys: I got a question when reading http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval. What will happen to the streaming data if the batch processing time is bigger than the batch interval? Will the next batch's data be

Re: Re: SparkStreaming batch processing time question

2015-04-01 Thread luohui20001
Hmm, got it. Thank you Akhil. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: SparkStreaming batch processing time question Date: 2015-04-01

Re: How to learn Spark?

2015-04-02 Thread luohui20001
The best way of learning Spark is to use Spark. You may follow the instructions on the Apache Spark website: http://spark.apache.org/docs/latest/ Download it, deploy it in standalone mode, run some examples, try cluster deploy mode, then try to develop your own app and deploy it in your Spark cluster.

sparksql support hive view

2015-05-04 Thread luohui20001
Guys, just to confirm: Spark SQL supports the Hive view feature; is that the LateralView one in the Hive language manual? Thanks. Thanks & Best regards! 罗辉 San.Luo

Re: Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-06 Thread luohui20001
Status update after I did some tests. I modified some other parameters and found 2 parameters that may be relevant: SPARK_WORKER_INSTANCES and spark.sql.shuffle.partitions. Before today I used the default settings of SPARK_WORKER_INSTANCES and spark.sql.shuffle.partitions, whose values are 1 and 200. At that time,

Re: RE: Re: Re: RE: Re: Re:_sparksql_running_slow_while_joining_2_tables.

2015-05-07 Thread luohui20001
I checked the data again; there is no skewed data in it, just txt files with several string and int fields. That's it. I also followed the suggestions on the tuning guide page; refer to http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning I will keep on inspecting why those left

(no subject)

2015-05-07 Thread luohui20001
Hi guys, I got a PhoenixParserException: ERROR 601 (42P00): Syntax error. Encountered FORMAT at line 21, column 141. when creating a table using ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'. As I remember, a previous version of Phoenix supported this grammar,

Re: Re: Re: RE: Re: Re: RE: Re: Re:_sparksql_running_slow_while_joining_2_tables.

2015-05-07 Thread luohui20001
I tried a small spark.sql.shuffle.partitions = 16, so that every task will fetch a roughly equal amount of data; however every task still runs slowly. Thanks & Best regards! San.Luo - Original Message - From: luohui20...@sina.com To: luohui20001 luohui20...@sina.com

Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread luohui20001
As I know, broadcast join is automatically enabled by spark.sql.autoBroadcastJoinThreshold; refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options. And how can I check my app's physical plan, and other things like the optimized plan, executable plan, etc.? Thanks
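Regarding the second question, a small sketch (the query and table names are illustrative, and sqlContext is the usual spark-shell context): explain(true) prints the parsed, analyzed, optimized and physical plans, so you can see whether a BroadcastHashJoin was chosen.

    // Illustrative join query; run in spark-shell where sqlContext is available.
    val joined = sqlContext.sql(
      "select a.*, b.* from big_table a join small_table b on a.id = b.id")
    joined.explain(true)            // prints parsed, analyzed, optimized and physical plans
    println(joined.queryExecution)  // the same information as a QueryExecution object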

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-04 Thread luohui20001
You may need to copy hive-site.xml to your Spark conf directory and check your Hive metastore warehouse setting, and also check whether you are authorized to access the Hive metastore warehouse. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: 鹰

Re: Re: how to delete data from table in sparksql

2015-05-15 Thread luohui20001
Got it, thank you. Thanks & Best regards! San.Luo - Original Message - From: Michael Armbrust mich...@databricks.com To: Denny Lee denny.g@gmail.com Cc: 罗辉 luohui20...@sina.com, user user@spark.apache.org Subject: Re: how to delete data from table in sparksql

how to delete data from table in sparksql

2015-05-14 Thread luohui20001
Hi guys, I tried to delete some data from a table with delete from table where name = xxx; however delete is not functioning like the DML operation in Hive. I got info like below: Usage: delete [FILE|JAR|ARCHIVE] value [value]* 15/05/14 18:18:24 ERROR processors.DeleteResourceProcessor:
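Judging from the quoted usage message, the delete the CLI recognized is the resource command (FILE|JAR|ARCHIVE), not DML; Spark SQL of that era had no DELETE statement. A hedged sketch of the usual workaround, with illustrative names and assuming a HiveContext hc and Spark 1.4+, is to rewrite the data without the unwanted rows:

    // Keep everything except the rows to be "deleted", then write it out as a new table.
    val kept = hc.sql("select * from mytable where name <> 'xxx'")
    kept.write.mode("overwrite").saveAsTable("mytable_filtered")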

Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-13 Thread luohui20001
Hi Hao: I tried broadcast join with the following steps, and found that my query is still running slowly; I am not very sure whether I'm doing broadcast join right: 1. add spark.sql.autoBroadcastJoinThreshold 104857600 (100MB) in conf/spark-defaults.conf; 100MB is larger than either of my 2 tables. 2. start

Re: Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread luohui20001
Hi Akhil, not exactly: the task took 54s to finish, starting at 16:14:02 and ending at 16:14:56. Within these 54s it needed 19s to store values in memory, from 16:14:23 to 16:14:42. I think this is the most time-wasting part of this task, and also unreasonable. You may check

Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread luohui20001
Only 1 minor GC, 0.07s. Thanks & Best regards! San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: How to decrease the time of storing block in memory Date: 2015-06-09 15:02

How to decrease the time of storing block in memory

2015-06-08 Thread luohui20001
Hi there, I am trying to decrease my app's running time on the worker node. I checked the log and found the most time-wasting part is below: 15/06/08 16:14:23 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.1 KB, free 353.3 MB) 15/06/08 16:14:42 INFO

Re: Re: Re: Re: How to decrease the time of storing block in memory

2015-06-10 Thread luohui20001
Thanks Akhil, thanks for your idea. I had tried using Spark to do what the shell did; however it is not as fast as I expected, and not very easy. Thanks & Best regards! San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉

Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread luohui20001
Hi guys: I added a parameter spark.worker.cleanup.appDataTtl 3 * 24 * 3600 in my conf/spark-defaults.conf, then I started my Spark cluster. However I got an exception: 15/06/16 14:25:14 INFO util.Utils: Successfully started service 'sparkWorker' on port 43344. 15/06/16 14:25:14 ERROR
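The follow-up in this thread suggests the arithmetic expression is not evaluated; the property expects a literal number of seconds. A minimal sketch with 3 days written out as 259200 seconds (worker-side properties can also be passed through SPARK_WORKER_OPTS in spark-env.sh):

    # conf/spark-defaults.conf
    spark.worker.cleanup.enabled    true
    spark.worker.cleanup.appDataTtl 259200

    # or in conf/spark-env.sh
    SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=259200"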

Re: Re: Spark Configuration of spark.worker.cleanup.appDataTtl

2015-06-16 Thread luohui20001
Thanks Saisai, I should try more. I thought it would be calculated automatically like the default. Thanks & Best regards! San.Luo - Original Message - From: Saisai Shao sai.sai.s...@gmail.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org

How many APPs does spark support to run simultaneously in one cluster?

2015-06-11 Thread luohui20001
Hi guys: I'm trying to run 24 APPs simultaneously in one Spark cluster. However, every time my cluster can only run 17 APPs at the same time. The other APPs disappeared, with no logs and no failures. Any ideas will be appreciated. Here is my code: object GeneCompare5 { def
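One thing worth checking in a situation like this (a hedged guess, not necessarily the cause here): on a standalone cluster an application claims all available cores by default, so later applications queue up or never get resources. Capping per-application cores and executor memory leaves room for more applications to run at once:

    # conf/spark-defaults.conf (illustrative values, not from the thread)
    spark.cores.max       4
    spark.executor.memory 2g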

Re: Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-27 Thread luohui20001
Hi Akhil and all, my previous code has some problems: all the executors are looping and running the same command. That's not what I am expecting. Previous code: val shellcompare = List(run,sort.sh) val shellcompareRDD = sc.makeRDD(shellcompare) val result = List(aggregate,result)

Re: Re: got java.lang.reflect.UndeclaredThrowableException when running multiple APPs in spark

2015-06-30 Thread luohui20001
Thanks Akhil. This problem has been solved. I used nmon to monitor the system I/O and CPU pressure and found there is a very sharp peak, and after that peak many processes stop running, so I corrected my code and this issue is gone. The previous code looks like this:

All masters are unresponsive issue

2015-07-02 Thread luohui20001
Hi there: I got a problem: Application has been killed. Reason: All masters are unresponsive! Giving up. I checked the network I/O and found it is sometimes really high when running my app. Please refer to the attached pic for more info. I also checked

got java.lang.reflect.UndeclaredThrowableException when running multiple APPs in spark

2015-06-29 Thread luohui20001
Hi there, I am running 30 APPs in my Spark cluster, and some of the APPs got an exception like below: [root@slave3 0]# cat stderr 15/06/29 17:20:08 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 15/06/29 17:20:09 WARN util.NativeCodeLoader: Unable

How to shut down spark web UI?

2015-07-06 Thread luohui20001
Hello there, I heard that there is some way to shut down the Spark web UI; is there a configuration to support this? Thank you. Thanks & Best regards! San.Luo

Re: All masters are unresponsive issue

2015-07-02 Thread luohui20001
Hi there, I checked the source code and found that in org.apache.spark.deploy.client.AppClient there is a parameter (line 52): val REGISTRATION_TIMEOUT = 20.seconds val REGISTRATION_RETRIES = 3. As I understand it, if I want to increase the retry count, must I modify this value and rebuild the

Re: Re: How to shut down spark web UI?

2015-07-06 Thread luohui20001
Got it, thanks. Thanks & Best regards! San.Luo - Original Message - From: Shixiong Zhu zsxw...@gmail.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: How to shut down spark web UI? Date: 2015-07-06 17:31 You can set spark.ui.enabled to false
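A minimal sketch of both ways to apply that setting; the app name, master URL and jar path are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    // Per application, when building the context:
    val conf = new SparkConf()
      .setAppName("no-ui-app")
      .set("spark.ui.enabled", "false")   // disables the application web UI (port 4040)
    val sc = new SparkContext(conf)

    // Or at submit time:
    //   spark-submit --conf spark.ui.enabled=false --class com.example.MyApp myapp.jar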

How many APPs does spark support to run simultaneously in one cluster?

2015-06-11 Thread luohui20001
Hi there, if I set spark.scheduler.mode to FAIR, how many APPs does Spark support running simultaneously in one cluster? Is there an upper limit? Thanks. Thanks & Best regards! San.Luo

how to distributed run a bash shell in spark

2015-05-24 Thread luohui20001
Hello there, I am trying to run an app, part of which needs to run a shell script. How do I run a shell script distributed across the Spark cluster? Thanks. Here's my code: import java.io.IOException; import java.util.ArrayList; import java.util.List; import org.apache.spark.SparkConf; import

Re: Re: how to distributed run a bash shell in spark

2015-05-24 Thread luohui20001
Thanks Akhil, your code is a big help to me, because a Perl script is exactly the thing I want to try to run in Spark. I will have a try. Thanks & Best regards! San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉

Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread luohui20001
Thanks, Madhu and Akhil. I modified my code like below; however I think it is not very distributed. Do you guys have a better idea to run this app more efficiently and in a more distributed way? So I added some comments with my understanding: import org.apache.spark._ import www.celloud.com.model._ object GeneCompare3 {

Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread luohui20001
I am indeed trying to run some shell scripts in my Spark app, hoping they run more concurrently in my Spark cluster. However I am not sure whether my code will run concurrently on my executors. Diving into my code, you can see that I am trying to 1. split both db and sample into 21 small files. That

Re: Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-26 Thread luohui20001
Thanks Akhil, I checked the job UI again; my app is running concurrently on all the executors, but some of the tasks got an I/O exception. I will continue inspecting this. java.io.IOException: Failed to create local dir in

Hibench build fail

2015-07-07 Thread luohui20001
Hi Grace, recently I have been trying HiBench to evaluate my Spark cluster; however I got a problem building HiBench. Would you help take a look? Thanks. It fails at building Sparkbench, and you may check the attached pic for more info. My Spark version: 1.3.1, Hadoop version: 2.7.0

Re: Re: HiBench test for hadoop/hive/spark cluster

2015-07-15 Thread luohui20001
Hi Ted, thanks for your advice. I found that there is something wrong with the hadoop fs -get command, because I believe the localization of hdfs://spark-study:9000/HiBench/Aggregation/temp/user_agents to /tmp/hadoop-root/mapred/local/1437016615898/user_agents is a behaviour like hadoop fs -get

akka.event.Logging$LoggerInitializationException

2015-10-09 Thread luohui20001
Hi there: when my colleague runs multiple Spark APPs simultaneously, some of them fail with akka.event.Logging$LoggerInitializationException. Caused by: akka.event.Logging$LoggerInitializationException: Logger log1-Slf4jLogger did not respond with LoggerInitialized, sent instead

Re: RE: Hibench build fail

2015-07-08 Thread luohui20001
Hi Ted and Grace, I retried with Spark 1.4.0 and it still failed with the same phenomenon; here is a log, FYI. What other details may help? BTW, is building it a necessary step to run the HiBench test for my Spark cluster? I also tried skipping the HiBench build and executing bin/run-all.sh, but also got

Re: Re: RE: Hibench build fail

2015-07-08 Thread luohui20001
Should I add dependencies for spark-core_2.10, spark-yarn_2.10, spark-streaming_2.10, org.apache.spark:spark-mllib_2.10, :spark-hive_2.10, :spark-graphx_2.10 in pom.xml? If yes, there are 7 pom.xml files in HiBench, listed below; which one should I modify? [root@spark-study HiBench-master]# find ./ -name

a document for JDK version testing status

2015-09-17 Thread luohui20001
Hi there, I remember there was a document showing the usage and testing status of most JDK versions in global companies' Spark clusters. However I couldn't find it on the Spark website or at Databricks. Does anyone still remember that document and wouldn't mind providing a link? Thanks.

MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread luohui20001
Hi guys, when I am trying to connect Hive with spark-sql, I get a problem like below: [root@master spark]# bin/spark-shell --master local[4] log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please initialize the log4j system

How to export a project to a JAR in Scala IDE for Eclipse correctly?

2016-07-27 Thread luohui20001
Hi there: I export a project into a jar like this: "right click my project -> choose Export -> Java -> JAR file -> Next -> choose "src/main/resources" and "src/main/scala" -> click Browse and choose a jar file export location -> choose overwrite it", and this jar is unable to run with "java -jar

Re: Re: question about UDAF

2016-07-12 Thread luohui20001
Hi Pedro, thanks for your advice. I got my code working as below. Code in main: val hc = new org.apache.spark.sql.hive.HiveContext(sc) val hiveTable = hc.sql("select lp_location_id,id,pv from house_id_pv_location_top50") val jsonArray = new JsonArray val middleResult =

Re: How to get recommend result for users in a kafka SparkStreaming Application

2016-08-03 Thread luohui20001
PS: I am using Spark 1.6.1, Kafka 0.10.0.0. Thanks & Best regards! San.Luo - Original Message - From: To: "user" Subject: How to get recommend result for users in a kafka SparkStreaming Application Date: 2016-08-03 15:01 hello

How to get recommend result for users in a kafka SparkStreaming Application

2016-08-03 Thread luohui20001
Hello guys: I have an app which consumes JSON messages from Kafka and recommends movies for the users in those messages. The code is like this: conf.setAppName("KafkaStreaming") val storageLevel = StorageLevel.DISK_ONLY val ssc = new StreamingContext(conf,

Re: Re: run spark apps in linux crontab

2016-07-20 Thread luohui20001
Thank you focus, and all. This problem was solved by adding a line ". /etc/profile" in my shell script. Thanks & Best regards! San.Luo - Original Message - From: "focus" <focushe...@qq.com> To: "luohui20001" <luohui20...@si
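A sketch of the setup described here, with hypothetical paths: cron starts jobs with an almost empty environment, so the wrapper script sources /etc/profile to pick up JAVA_HOME, SPARK_HOME and PATH before calling spark-submit.

    #!/bin/bash
    # /opt/jobs/run_myapp.sh (hypothetical path)
    . /etc/profile                      # load the environment that cron does not provide
    $SPARK_HOME/bin/spark-submit \
      --master spark://master:7077 \
      --class com.example.MyApp \
      /opt/jobs/myapp.jar >> /var/log/myapp.log 2>&1

    # crontab entry (crontab -e): run every day at 02:00
    # 0 2 * * * /bin/bash /opt/jobs/run_myapp.sh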

Re: Re: run spark apps in linux crontab

2016-07-21 Thread luohui20001
Thank you focus, and all. This problem was solved by adding a line ". /etc/profile" in my shell script. Thanks & Best regards! San.Luo - Original Message - From: "focus" <focushe...@qq.com> To: "luohui20001" <luohui20...@sina.com>, "user@spark.apa

run spark apps in linux crontab

2016-07-20 Thread luohui20001
Hi guys: I added a spark-submit job to my Linux crontab by the means below; however none of them works. If I run it as a normal shell script, it is OK. I don't quite understand why. I checked the 8080 web UI of my Spark cluster: no job was submitted, and there are no messages in

how to select first 50 value of each group after group by?

2016-07-06 Thread luohui20001
Hi there, I have a DF with 3 columns: id, pv, location (the rows are already grouped by location and sorted by pv in descending order). I want to get the first 50 id values for each location. I checked the API of DataFrame, GroupedData and PairRDD, and found no match. Is there a way to do this naturally?

Re: Re: how to select first 50 value of each group after group by?

2016-07-07 Thread luohui20001
Hi Anton: I checked the docs you mentioned and wrote code accordingly; however I met an exception like "org.apache.spark.sql.AnalysisException: Window function row_number does not take a frame specification.;" It seems that the row_number API is giving a global row number to every row

question about UDAF

2016-07-11 Thread luohui20001
Hello guys: I have a DF and a UDAF. This DF has 2 columns, lp_location_id and id, both of Int type. I want to group by id and aggregate all values of id into 1 string, so I used a UDAF to do this transformation: multiple Int values to 1 String. However my UDAF returns empty values as the
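A minimal, hedged sketch of a UDAF with that shape (concatenate Int values into one String per group); the class and column names are illustrative, not the poster's code. Empty results often come from an initialize/merge pair that loses the partial buffers, so both are spelled out here.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    class ConcatInts extends UserDefinedAggregateFunction {
      def inputSchema: StructType  = StructType(StructField("value", IntegerType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("acc", StringType) :: Nil)
      def dataType: DataType       = StringType
      def deterministic: Boolean   = true

      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""

      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) {
          val cur = buffer.getString(0)
          buffer(0) = if (cur.isEmpty) input.getInt(0).toString else cur + "," + input.getInt(0)
        }

      // merge must combine two partial buffers, otherwise per-partition results are lost
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        val (a, b) = (buffer1.getString(0), buffer2.getString(0))
        buffer1(0) = if (a.isEmpty) b else if (b.isEmpty) a else a + "," + b
      }

      def evaluate(buffer: Row): Any = buffer.getString(0)
    }

    // Illustrative usage: df.groupBy("lp_location_id").agg(new ConcatInts()(col("id")))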

Re: Re: Re: how to select first 50 value of each group after group by?

2016-07-08 Thread luohui20001
Thank you Anton, I got my problem solved with the code below: val hivetable = hc.sql("select * from house_sale_pv_location") val overLocation = Window.partitionBy(hivetable.col("lp_location_id")) val sortedDF = hivetable.withColumn("rowNumber",
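The snippet is truncated in the archive; a hedged completion with the same table and column names, ordering by pv descending and keeping the first 50 rows per location (assuming hc is a HiveContext and Spark 1.6+, where functions.row_number is available):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val hivetable = hc.sql("select * from house_sale_pv_location")
    val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
                             .orderBy(hivetable.col("pv").desc)
    val sortedDF = hivetable.withColumn("rowNumber", row_number().over(overLocation))
    val top50 = sortedDF.filter(sortedDF("rowNumber") <= 50)   // first 50 rows per location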

saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
Hi there: I have a problem where saving a DF to HDFS in parquet format is very slow, and I attached a pic which shows that a lot of time is spent in getting the result. The code is: streamingData.write.mode(SaveMode.Overwrite).parquet("/data/streamingData") I don't quite understand why my app is so slow in

Re: saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
Maybe this problem is not so easy to understand, so I attached my full code. I hope this could help in solving the problem. Thanks & Best regards! San.Luo - Original Message - From: To: "user" Subject: saving DF to HDFS in

how to add a column according to an existing column of a dataframe?

2016-06-30 Thread luohui20001
Hi guys, I have a dataframe with 3 columns, id (int), type (string), price (string), and I want to add a column "price range" according to the value of price. I checked SPARK-15383; however in my code I just want to append a column which is transformed from the original dataframe
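A small sketch using when/otherwise to derive the new column from price; the thresholds and labels are made up, since the message does not give the actual ranges, and df stands for the original dataframe.

    import org.apache.spark.sql.functions.when

    val priceAsNum = df("price").cast("double")        // price is stored as a string
    val withRange = df.withColumn("price range",
      when(priceAsNum < 100, "low")
        .when(priceAsNum < 1000, "medium")
        .otherwise("high"))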

[SparkSQL+SparkStreaming]SparkStreaming APP can not load data into SparkSQL table

2016-09-05 Thread luohui20001
Hi guys: my problem is that my SparkStreaming APP cannot load data into a SparkSQL table. Here is my code: val conf = new SparkConf().setAppName("KafkaStreaming for " + topics).setMaster("spark://master60:7077") val storageLevel = StorageLevel.DISK_ONLY val ssc = new

Re: [SparkSQL+SparkStreaming]SparkStreaming APP can not load data into SparkSQL table

2016-09-05 Thread luohui20001
The data can be written as parquet into HDFS, but the data-loading process is not working as expected. Thanks & Best regards! San.Luo - Original Message - From: To: "user" Subject: [SparkSQL+SparkStreaming]SparkStreaming APP

Re: Re: Selecting the top 100 records per group by?

2016-09-12 Thread luohui20001
Hi Kevin, a window function is what you need, like below: val hivetable = hc.sql("select * from house_sale_pv_location") val overLocation = Window.partitionBy(hivetable.col("lp_location_id")) val sortedDF = hivetable.withColumn("rowNumber",

Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-17 Thread luohui20001
Hello guys: I have a problem loading a recommendation model. I have 2 models; one is good (able to get recommendation results) and the other is not working. I checked these 2 models; both are MatrixFactorizationModel objects, but in the metadata one is a PipelineModel and the other is a

SparkStreaming + Flume: org.jboss.netty.channel.ChannelException: Failed to bind to: master60/10.0.10.60:31001

2016-08-25 Thread luohui20001
Hi there, I have a Flume cluster sending messages to SparkStreaming. I got an exception like below: 16/08/25 23:00:54 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.jboss.netty.channel.ChannelException: Failed to bind to: master60/10.0.10.60:31001

transactional data in sparksql

2017-07-31 Thread luohui20001
Hello guys: I have some transactional data in the attached file 1.txt. A sequence of a single operation 1 followed by a few operations 0 is a transaction here. The transactions whose sum(Amount) of operation 0 is less than the sum(Amount) of operation 1 need to be found. There are

A tool to generate simulation data

2017-07-27 Thread luohui20001
Hello guys, is there a tool or an open source project that can mock a large amount of data quickly, and support the following: 1. transaction data 2. time series data 3. specified-format data like CSV files or JSON files 4. data generated at a changing speed 5. distributed data generation

Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread luohui20001
Hello guys: I have a simple RDD like: val userIDs = 1 to 1 val rdd1 = sc.parallelize(userIDs , 16) // this rdd has 1 user id. And I have a List[String] like below: scala> listForRule77 res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,

Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread luohui20001
Riccardo and Ryan, thank you for your ideas. It seems that crossJoin is a new Dataset API from Spark 2.x onward; my Spark version is 1.6.3. Is there an equivalent API to do a cross join? Thank you. Thanks & Best regards! San.Luo - Original Message - From: Riccardo
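For Spark 1.6.x, where Dataset.crossJoin does not exist, a hedged sketch of two equivalents (the range and names below are illustrative stand-ins for the thread's userIDs and listForRule77): RDD.cartesian, or a DataFrame join with no join condition, which also yields a Cartesian product.

    // RDD route: one (userId, rule) pair per combination
    val userIDs  = sc.parallelize(1 to 100, 16)             // illustrative range
    val rules    = sc.parallelize(Seq("rule-a", "rule-b"))  // stands in for listForRule77
    val expanded = userIDs.cartesian(rules)

    // DataFrame route on 1.6: a join without a condition is a Cartesian product
    // val crossed = uidDF.join(ruleDF)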

Re: Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread luohui20001
Thank you guys, I got my code working like below: val record75df = sc.parallelize(listForRule75, numPartitions).map(x => x.replace("|", ",")).map(_.split(",")).map(x => Mycaseclass4(x(0).toInt, x(1).toInt, x(2).toFloat, x(3).toInt)).toDF() val userids = 1 to 1 val uiddf =

Re: Re: A tool to generate simulation data

2017-07-27 Thread luohui20001
Thank you Suzen, I've given it a try and generated 1 billion records within 1.5 min. It is fast, and I will go on to try some other cases. Thanks & Best regards! San.Luo - Original Message - From: "Suzen, Mehmet" To: luohui20...@sina.com Cc: user