Hi,
I got a problem when reading a text file which contains nested complex
type data and hit a type mismatch error. Any hint will be appreciated.
The problem takes place at map(s = s.map, reported as type mismatch; found
:
To Saisai:
It works after I corrected some of it following your advice, like below.
Furthermore, I am not quite clear about which code runs on the driver
and which runs on the executors, so I wrote my understanding in comments.
Would you help check? Thank you.
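For what it's worth, a minimal, hypothetical sketch of which side the code of a streaming job runs on (the dstream variable is assumed):

dstream.foreachRDD { rdd =>
  // this block runs on the driver, once per batch
  rdd.foreachPartition { records =>
    // this block runs on an executor; create non-serializable resources
    // (e.g. a Kafka producer or a DB connection) here, not on the driver
    records.foreach { record =>
      // per-record processing, also on the executor
    }
  }
}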
To Akhil:
Hi guys,
I am using Spark Streaming to receive messages from Kafka, process them,
and then send them back to Kafka. However, the Kafka consumer cannot receive any
messages. Can anyone share ideas?
Here is my code:
object SparkStreamingSampleDirectApproach {
def main(args: Array[String]): Unit =
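The code is cut off here. As a reference only, a minimal sketch of the direct approach that reads from one Kafka topic and writes results back to another might look like the following (broker address, topic names and batch interval are assumptions):

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import java.util.Properties

val conf = new SparkConf().setAppName("DirectApproachSketch")
val ssc = new StreamingContext(conf, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("topic1"))
messages.map(_._2).foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // one producer per partition, created on the executor
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach(msg => producer.send(new ProducerRecord[String, String]("topic2", msg)))
    producer.close()
  }
}
ssc.start()
ssc.awaitTermination()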
BTW, what does the warning below mean? I am not quite clear about KafkaRDD:
WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset
skipping topic1 0.
Is this warning related to my starting the consumer without the
--from-beginning param?
Got it. Thank you.
Thanks & Best regards!
罗辉 San.Luo
- Original Message -
From: Saisai Shao sai.sai.s...@gmail.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: Re: Re: Re: How SparkStreaming output messages to Kafka?
Date: 2015-03-30 17:05
Hi guys:
I got a question when reading
http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval.
What will happen to the streaming data if the batch processing time is
longer than the batch interval? Will the next batch data be
Hmmm, got it. Thank you Akhil.
Thanks & Best regards!
罗辉 San.Luo
- Original Message -
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: SparkStreaming batch processing time question
Date: 2015-04-01
The best way of learning Spark is to use Spark.
You may follow the instructions on the Apache Spark
website: http://spark.apache.org/docs/latest/
Download it, deploy it in standalone mode, run some examples, try cluster deploy
mode, then try to develop your own app and deploy it in your Spark cluster.
Guys, just to confirm: does Spark SQL support the Hive "lateral view" feature, i.e. the one
documented as LateralView in the Hive language manual?
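For reference, a minimal sketch of a LATERAL VIEW query through a HiveContext (table and column names are made up):

val exploded = hc.sql(
  "SELECT id, item FROM my_table LATERAL VIEW explode(items) itemTable AS item")
exploded.show()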
thanks
Thanks & Best regards!
罗辉 San.Luo
An update after I did some tests: I modified some other parameters and found 2
parameters that may be relevant, SPARK_WORKER_INSTANCES and spark.sql.shuffle.partitions.
Before today I used the default settings of SPARK_WORKER_INSTANCES and
spark.sql.shuffle.partitions, whose values are 1 and 200. At that time,
I checked the data again; there is no skewed data in it, just txt files with
several string and int fields. That's it. I also followed the suggestions on the
tuning guide page, refer to
http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
I will keep on inspecting why those left
Hi guys, I got a PhoenixParserException: ERROR 601
(42P00): Syntax error. Encountered FORMAT at line 21, column 141.
when creating a table with ROW FORMAT DELIMITED FIELDS TERMINATED
BY '\t' LINES TERMINATED BY '\n'. As I remember, a previous
version of Phoenix supported this grammar,
I tried a small spark.sql.shuffle.partitions = 16, so that every task fetches
a roughly equal amount of data; however every task still runs slowly.
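For reference, a minimal sketch of changing the setting per session, assuming a SQLContext or HiveContext named sqlContext (the value 16 is just the one tried above):

sqlContext.setConf("spark.sql.shuffle.partitions", "16")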
Thanks & Best regards!
San.Luo
- Original Message -
From: luohui20...@sina.com
To: luohui20001 luohui20...@sina.com
As I understand it, broadcast join is automatically enabled by
spark.sql.autoBroadcastJoinThreshold; refer to
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
And how can I check my app's physical plan, and other things like the optimized
plan, executable plan, etc.?
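A minimal sketch of inspecting the plans, assuming a HiveContext hc and a hypothetical query:

val df = hc.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id")   // hypothetical tables
df.explain(true)   // prints the parsed, analyzed, optimized and physical plans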
thanks
you may need to copy hive-site.xml to your spark conf directory and check your
hive metastore warehouse setting, and also check if you are authenticated to
access hive metastore warehouse.
Thanks & Best regards!
罗辉 San.Luo
- Original Message -
From: 鹰
Got it, thank you.
Thanks & Best regards!
San.Luo
- Original Message -
From: Michael Armbrust mich...@databricks.com
To: Denny Lee denny.g@gmail.com
Cc: 罗辉 luohui20...@sina.com, user user@spark.apache.org
Subject: Re: how to delete data from table in sparksql
Hi guys, I need to delete some data from a table with "delete from table
where name = xxx"; however delete is not functioning like the DML operation
in Hive. I got info like below:
Usage: delete [FILE|JAR|ARCHIVE] value
[value]*
15/05/14 18:18:24 ERROR processors.DeleteResourceProcessor:
Hi Hao: I tried broadcast join with the following steps, and found that my
query is still running slowly; I am not very sure if I'm doing the broadcast join right:
1. Add spark.sql.autoBroadcastJoinThreshold 104857600 (100MB)
in conf/spark-default.conf. 100MB is larger than either of my 2 tables.
2. Start
Hi Akhil,
Not exactly: the task took 54s to finish, starting at 16:14:02 and ending at
16:14:56.
Within these 54s, it needed 19s to store values in memory, starting at
16:14:23 and ending at 16:14:42. I think this is the most time-consuming part of
this task, and also unreasonable. You may check
Only 1 minor GC, 0.07s.
Thanks & Best regards!
San.Luo
- Original Message -
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: How to decrease the time of storing block in memory
Date: 2015-06-09 15:02
Hi there, I am trying to decrease my app's running time on the worker node. I
checked the log and found the most time-consuming part is below:
15/06/08 16:14:23 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory
(estimated size 2.1 KB, free 353.3 MB)
15/06/08 16:14:42 INFO
Thanks Ak, thanks for your idea. I had tried using Spark to do what the
shell did. However it is not as fast as I expected and not very easy.
Thanks & Best regards!
San.Luo
- Original Message -
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉
Hi guys: I added a parameter spark.worker.cleanup.appDataTtl 3 * 24 *
3600 in my conf/spark-default.conf, then started my spark cluster. However I
got an exception:
15/06/16 14:25:14 INFO util.Utils: Successfully started service 'sparkWorker'
on port 43344.
15/06/16 14:25:14 ERROR
Thanks Saisai, I should try more. I thought it would be calculated
automatically like the default.
Thanks & Best regards!
San.Luo
- Original Message -
From: Saisai Shao sai.sai.s...@gmail.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Hi guys: I'm trying to run 24 apps simultaneously in one spark
cluster.
However, every time my cluster can only run 17 apps at the same
time. The other apps disappear, with no logs, no failures. Any ideas will be
appreciated.
Here is my code:
object GeneCompare5 {
def
Hi Akhil and all,
My previous code has some problems: all the executors are looping and
running the same command. That's not what I am expecting. Previous code:
val shellcompare = List("run", "sort.sh")
val shellcompareRDD = sc.makeRDD(shellcompare)
val result = List("aggregate", "result")
Thanks Ak,
This problem has been solved. I used nmon to monitor the system I/O
and CPU pressure and found there is a very sharp peak. After that peak many
processes stop running, so I corrected my code and this issue is gone.
The previous code looks like this:
Hi there: I got a problem: "Application has been killed. Reason: All
masters are unresponsive! Giving up." I checked the network I/O and found
sometimes it is really high when running my app. Please refer to the attached pic
for more info. I also checked
Hi there,
I am running 30 apps in my spark cluster, and some of the apps got an
exception like below:
[root@slave3 0]# cat stderr
15/06/29 17:20:08 INFO executor.CoarseGrainedExecutorBackend: Registered signal
handlers for [TERM, HUP, INT]
15/06/29 17:20:09 WARN util.NativeCodeLoader: Unable
Hello there,
I heard that there is some way to shut down the Spark web UI; is there a
configuration to support this?
Thank you.
Thanks & Best regards!
San.Luo
Hi there, I checked the source code and found that in
org.apache.spark.deploy.client.AppClient there are these parameters (line 52):
val REGISTRATION_TIMEOUT = 20.seconds
val REGISTRATION_RETRIES = 3
As far as I know, if I want to increase the retry count,
must I modify this value and rebuild the
Got it, thanks.
Thanks & Best regards!
San.Luo
- Original Message -
From: Shixiong Zhu zsxw...@gmail.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: How to shut down spark web UI?
Date: 2015-07-06 17:31
You can set spark.ui.enabled to false
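For reference, a minimal sketch, assuming the property is set before the SparkContext is created:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("NoUiApp").set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)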
Hi there, if I set spark.scheduler.mode to FAIR, how many apps does
Spark support running simultaneously in one cluster? Is there an upper limit?
Thanks.
Thanks & Best regards!
San.Luo
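Not an authoritative answer, but in standalone mode each app grabs all available cores by default, so later apps wait for resources; capping the cores per app lets more run at once. A minimal sketch (the value 4 is just an example):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.cores.max", "4")   // per-app core cap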
Hello there, I am trying to run an app in which part of it needs to run a
shell script. How do I run a shell script distributed across the spark cluster? Thanks.
Here's my code:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import
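The code above is cut off. For reference, a minimal sketch of running an external script per partition with RDD.pipe, assuming the script exists at the same path on every worker node:

val input = sc.parallelize(Seq("line1", "line2", "line3"), 4)
val output = input.pipe("/path/to/sort.sh")   // hypothetical path; one process per partition
output.collect().foreach(println)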
Thanks Akhil,
Your code is a big help to me, because a perl script is exactly the
thing I want to run in Spark. I will give it a try.
Thanks & Best regards!
San.Luo
- Original Message -
From: Akhil Das ak...@sigmoidanalytics.com
To: 罗辉
Thanks, Madhu and Akhil.
I modified my code like below; however I think it is not very distributed. Do
you guys have a better idea to run this app more efficiently and in a more distributed way?
I added some comments with my understanding:
import org.apache.spark._
import www.celloud.com.model._
object GeneCompare3 {
I am trying to run some shell scripts in my spark app, hoping they run more
concurrently in my spark cluster. However I am not sure whether my code will
run concurrently on my executors. Diving into my code, you can see that I am
trying to:
1. Split both db and sample into 21 small files. That
Thanks Akhil, I checked the job UI again; my app is running
concurrently on all the executors. But some of the tasks got an I/O exception. I
will continue inspecting this.
java.io.IOException: Failed to create local dir in
Hi Grace, recently I have been trying HiBench to evaluate my spark cluster;
however I got a problem building HiBench, would you help take a look?
Thanks. It fails at building SparkBench, and you may check the attached pic
for more info. My Spark version: 1.3.1, Hadoop version: 2.7.0.
Hi Ted, thanks for your advice. I found that there is something wrong with the
hadoop fs -get command, because I believe the localization of
hdfs://spark-study:9000/HiBench/Aggregation/temp/user_agents to
/tmp/hadoop-root/mapred/local/1437016615898/user_agents is a behaviour like
hadoop fs -get
Hi there: when my colleague runs multiple spark apps simultaneously, some
of them fail with akka.event.Logging$LoggerInitializationException.
Caused by: akka.event.Logging$LoggerInitializationException: Logger
log1-Slf4jLogger did not respond with LoggerInitialized, sent instead
Hi Ted and Grace, I retried with Spark 1.4.0 and it still failed with the same
symptoms. Here is a log, FYI. What other details may help? BTW, is
building a necessary step before running the HiBench test for my spark cluster? I also tried to
skip building HiBench and execute bin/run-all.sh, but also got
Should I add dependencies for
spark-core_2.10, spark-yarn_2.10, spark-streaming_2.10,
org.apache.spark:spark-mllib_2.10, :spark-hive_2.10, :spark-graphx_2.10 in
pom.xml? If yes, there are 7 pom.xml files in HiBench, listed below; which one should I
modify?
[root@spark-study HiBench-master]# find ./ -name
Hi there, I remember there was a document showing which JDK versions are
in use and their testing status in companies' spark clusters worldwide. However I couldn't
find it on the spark website or databricks. Does anyone still remember that
document and wouldn't mind providing a link? Thanks.
Hi guys, when I try to connect Hive with spark-sql, I get a problem
like below:
[root@master spark]# bin/spark-shell --master local[4]
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system
Hi there: I exported a project into a jar like this: "right click my
project -> choose Export -> Java -> JAR file -> Next -> choose "src/main/resources" and
"src/main/scala" -> click Browse and choose a jar file export location ->
choose overwrite it", and this jar is unable to run with "java -jar
Hi Pedro, thanks for your advice. I got my code working as below. Code in
main:
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
val hiveTable = hc.sql("select lp_location_id,id,pv from house_id_pv_location_top50")
val jsonArray = new JsonArray
val middleResult =
PS: I am using Spark 1.6.1, Kafka 0.10.0.0.
Thanks & Best regards!
San.Luo
- Original Message -
From:
To: "user"
Subject: How to get recommend result for users in a kafka SparkStreaming Application
Date: 2016-08-03 15:01
hello
Hello guys: I have an app which consumes JSON messages from Kafka and
recommends movies for the users in those messages. The code is like this:
conf.setAppName("KafkaStreaming")
val storageLevel = StorageLevel.DISK_ONLY
val ssc = new StreamingContext(conf,
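The code is cut off here. As a rough, hypothetical sketch (model path, the Kafka DStream named messages, and the user-id parsing are all assumptions), the recommendation step might look like:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

val model = MatrixFactorizationModel.load(ssc.sparkContext, "/path/to/alsModel")
messages.map(_._2).foreachRDD { rdd =>
  // parseUserId is a hypothetical function that extracts the user id from the JSON message
  val userIds = rdd.map(parseUserId).distinct().collect()
  userIds.foreach { uid =>
    val recs = model.recommendProducts(uid, 10)   // runs on the driver
    // send recs back to Kafka or persist them
  }
}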
Thank you focus, and all. This problem was solved by adding a line ". /etc/profile"
to my shell script.
Thanks & Best regards!
San.Luo
- Original Message -
From: "focus" <focushe...@qq.com>
To: "luohui20001" <luohui20...@si
Hi guys: I added a spark-submit job to my Linux crontab list by the means
below; however none of them works. If I change it to a normal shell script, it
is ok. I don't quite understand why. I checked the 8080 web UI of my spark
cluster: no job submitted, and there are no messages in
Hi there, I have a DF with 3 columns: id, pv, location (the rows are already
grouped by location and sorted by pv in descending order). I want to get the first 50 id values
per location. I checked the APIs of DataFrame, GroupedData, and pairRDD, and
found no match. Is there a way to do this naturally?
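For reference, a sketch with a window function, which matches the solution quoted later in this thread (df is the DataFrame described above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy(df("location")).orderBy(df("pv").desc)
val top50 = df.withColumn("rowNumber", row_number().over(w)).filter("rowNumber <= 50")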
Hi Anton: I checked the docs you mentioned and wrote code accordingly;
however I met an exception like "org.apache.spark.sql.AnalysisException: Window
function row_number does not take a frame specification.;" It seems that
the row_number API is giving a global row number to every row
Hello guys: I have a DF and a UDAF. This DF has 2 columns, lp_location_id and
id, both of Int type. I want to group by id and aggregate all values of id
into 1 string. So I used a UDAF to do this transformation: multiple Int values to
1 String. However my UDAF returns empty values as the
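The message is cut off here. For reference, a minimal sketch of a Spark 1.x UDAF that concatenates Int values into one String (the class name and column names are made up); one common cause of empty results is forgetting to assign the new value back into the buffer in update/merge:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ConcatInts extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", IntegerType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("acc", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = "" }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val cur = buffer.getString(0)
      buffer(0) = if (cur.isEmpty) input.getInt(0).toString else cur + "," + input.getInt(0)
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getString(0)
    val b = buffer2.getString(0)
    buffer1(0) = if (a.isEmpty) b else if (b.isEmpty) a else a + "," + b
  }
  def evaluate(buffer: Row): String = buffer.getString(0)
}
// hypothetical usage: df.groupBy("id").agg(new ConcatInts()(df("lp_location_id")))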
Thank you Anton, I got my problem solved with the code below:
val hivetable = hc.sql("select * from house_sale_pv_location")
val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
val sortedDF = hivetable.withColumn("rowNumber",
Hi there: I have a problem with saving a DF to HDFS in parquet format being very
slow. I attached a pic which shows a lot of time is spent in getting the
result. The code is:
streamingData.write.mode(SaveMode.Overwrite).parquet("/data/streamingData")
I don't quite understand why my app is so slow in
maybe this problem is not so easy to understand, so I attached my full code.
Hope this could help in solving the problem.
Thanks & Best regards!
San.Luo
- Original Message -
From:
To: "user"
Subject: saving DF to HDFS in
Hi guys, I have a dataframe with 3 columns, id (int), type (string),
price (string), and I want to add a column "price range" according to the
value of price. I checked SPARK-15383; however in my code I just want
to append a column which is transformed from the original dataframe
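For reference, a minimal sketch with withColumn and when/otherwise, assuming the dataframe above is named df (the thresholds and labels are made up):

import org.apache.spark.sql.functions.{col, when}

val priced = df.withColumn("price range",
  when(col("price").cast("double") < 100, "low")
    .when(col("price").cast("double") < 1000, "medium")
    .otherwise("high"))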
Hi guys: I have a problem where my SparkStreaming app cannot load data
into a SparkSQL table. Here is my code:
val conf = new SparkConf().setAppName("KafkaStreaming for " +
topics).setMaster("spark://master60:7077")
val storageLevel = StorageLevel.DISK_ONLY
val ssc = new
The data can be written as parquet into HDFS, but the data-loading process is
not working as expected.
Thanks & Best regards!
San.Luo
- Original Message -
From:
To: "user"
Subject: [SparkSQL+SparkStreaming]SparkStreaming APP
Hi Kevin, a window function is what you need, like below:
val hivetable = hc.sql("select * from house_sale_pv_location")
val overLocation = Window.partitionBy(hivetable.col("lp_location_id"))
val sortedDF = hivetable.withColumn("rowNumber",
Hello guys: I have a problem loading a recommendation model. I have 2 models;
one is good (able to get recommendation results) and the other is not working. I checked
these 2 models, and both are MatrixFactorizationModel objects. But in the metadata,
one is a PipelineModel and the other is a
Hi there, I have a Flume cluster sending messages to SparkStreaming. I got
an exception like below:
16/08/25 23:00:54 ERROR ReceiverTracker: Deregistered
receiver for stream 0: Error starting receiver 0 -
org.jboss.netty.channel.ChannelException: Failed to bind to:
master60/10.0.10.60:31001
Hello guys: I have some transactional data in the attached file 1.txt. A
sequence of a single operation 1 followed by a few operations 0 is a transaction
here. The transactions for which sum(Amount) of operation 0 is less than the
sum(Amount) of operation 1 need to be found.
There are
Hello guys, is there a tool or an open source project that can mock a large
amount of data quickly and support the below:
1. transaction data
2. time series data
3. data in a specified format like CSV or JSON files
4. data generated at a changing speed
5. distributed data generation
Hello guys: I have a simple RDD like:
val userIDs = 1 to 1
val rdd1 = sc.parallelize(userIDs, 16) // this rdd has 1 user id
And I have a List[String] like below:
scala> listForRule77
res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,
Riccardo and Ryan, thank you for your ideas. It seems that crossJoin is a new
Dataset API in Spark 2.x. My Spark version is 1.6.3. Is there an equivalent
API to do a cross join? Thank you.
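For what it's worth, in Spark 1.6.x a DataFrame join with no join condition already produces a cartesian product, which is what crossJoin does in 2.x (df1 and df2 are hypothetical):

val crossed = df1.join(df2)   // no condition: cartesian / cross join in 1.6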
Thanks & Best regards!
San.Luo
- Original Message -
From: Riccardo
Thank you guys, I got my code working like below:
val record75df = sc.parallelize(listForRule75, numPartitions)
  .map(x => x.replace("|", ","))
  .map(_.split(","))
  .map(x => Mycaseclass4(x(0).toInt, x(1).toInt, x(2).toFloat, x(3).toInt))
  .toDF()
val userids = 1 to 1
val uiddf =
Thank you Suzen, I've had a try: it generated 1 billion records within 1.5 min. It
is fast. And I will go on to try some other cases.
Thanks & Best regards!
San.Luo
- Original Message -
From: "Suzen, Mehmet"
To: luohui20...@sina.com
Cc: user