To Saisai:
it works after I corrected some of them following your advice, as below.
Furthermore, I am not quite clear about which code runs on the driver
and which code runs on the executor, so I wrote my understanding in comments.
Would you help check? Thank you.
To Akhil:
Also, you could use a Producer singleton to improve performance. Since
you currently have to create a Producer for each partition in each batch
duration, you could create a singleton object and reuse it (the Producer is
thread safe, as far as I know).
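A rough sketch of what I mean (the broker list and topic are placeholders; this assumes the old Scala producer API and that `dstream` is the DStream of strings to write):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

// one Producer per executor JVM, created lazily and reused across partitions and batches
object KafkaProducerSingleton {
  lazy val producer: Producer[String, String] = {
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")               // placeholder broker list
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    new Producer[String, String](new ProducerConfig(props))
  }
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // the same producer instance is reused instead of being created per partition
    records.foreach(msg => KafkaProducerSingleton.producer.send(
      new KeyedMessage[String, String]("topic1", msg)))
  }
}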
-Jerry
2015-03-30 15:13 GMT+08:00 Saisai Shao
What happens when you do:
sc.textFile("hdfs://path/to/the_file.txt")
Thanks
Best Regards
On Mon, Mar 30, 2015 at 11:04 AM, Nick Travers n.e.trav...@gmail.com
wrote:
Hi List,
I'm following this example here
https://github.com/databricks/learning-spark/tree/master/mini-complete-example
I am able to connect to MySQL Hive metastore from the client cluster
machine.
-sh-4.1$ mysql --user=hiveuser --password=pass --host=
hostname.vip.company.com
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 9417286
Server version: 5.5.12-eb-5.5.12-log MySQL-eb
Do you have enough messages in Kafka to consume? Can you make sure your
Kafka setup is working with the console consumer? Also try this example:
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
Thanks
Yeah, after reviewing your code again: the reason you cannot receive
any data is that your previous code lacks an ACTION (output operation) on the DStream, so
the code doesn't actually execute. After you change to the style I
mentioned, `foreachRDD` will trigger and run the jobs as you wrote.
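For illustration, a minimal sketch (assuming `messages` is the (key, value) DStream from Kafka and `ssc` is the StreamingContext):

// transformations alone are lazy; nothing runs until an output operation is added
val lines = messages.map(_._2)
// foreachRDD is an output operation, so jobs are scheduled for every batch
lines.foreachRDD { rdd =>
  rdd.foreach(record => println(record))
}
ssc.start()
ssc.awaitTermination()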
Yes,
Hey,
Trying to build Spark 1.3 with Scala 2.11 supporting yarn hive (with
thrift server).
Running:
*mvn -e -DskipTests -Pscala-2.11 -Dscala-2.11 -Pyarn -Pmapr4 -Phive
-Phive-thriftserver clean install*
The build fails with:
[INFO] Compiling 9 Scala sources to
This warning is not related to --from-beginning. It means there is no new
data for the current partition in the current batch duration, which is acceptable. If
you push data into Kafka again, this warning will disappear.
Thanks
Saisai
2015-03-30 16:58 GMT+08:00 luohui20...@sina.com:
BTW,
Shuffle write will eventually spill the data to the file system as a bunch of
files. If you want to avoid disk writes, you can mount a ramdisk and
configure spark.local.dir to point to it. Shuffle output will then be written
to a memory-based FS and will not introduce disk IO.
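For example, something like this (a sketch only; the path is a placeholder and the ramdisk, e.g. a tmpfs mount, must already exist on every worker):

import org.apache.spark.{SparkConf, SparkContext}

// point Spark's local/shuffle scratch space at a memory-backed mount
val conf = new SparkConf()
  .setAppName("shuffle-on-ramdisk")
  .set("spark.local.dir", "/mnt/ramdisk")   // placeholder path
val sc = new SparkContext(conf)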
Thanks
Jerry
2015-03-30 17:15
Thanks for your answer! Unfortunately I can't use Spark SQL for some reason.
If anyone has experience in using ORC as hadoopFile, I'd be happy to read
some hints/thoughts about my issues.
Zsolt
2015-03-27 19:07 GMT+01:00 Xiangrui Meng men...@gmail.com:
This is a PR in review to support ORC
Hi,
I was looking at the Spark UI Executors tab, and I noticed that I have 597 MB of
Shuffle Write while I am using a cached temp table and Spark had 2 GB of free
memory (the number under Memory Used is 597 MB / 2.6 GB)?!
Shouldn't Shuffle Write be zero and all the (map/reduce) tasks be
done in
Hi,
thanks for your quick answers.
I looked at what was being written to disk, and a folder called
blockmgr-d0236c76-7f7c-4a60-a6ae-ffc622b2db84 was growing every
second. This folder contained shuffle data and was not being cleaned
(after 30 minutes of my application running it contained the
BTW, what's the matter with the warning below? I'm not quite clear about KafkaRDD.
WARN KafkaRDD: Beginning offset ${part.fromOffset} is the same as ending offset,
skipping topic1 0.
Does this warning occur because I started the consumer without the
--from-beginning param?
Got it. Thank you.
Thanks & best regards!
罗辉 San.Luo
- Original Message -
From: Saisai Shao sai.sai.s...@gmail.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: Re: Re: How SparkStreaming output messages to Kafka?
Date: 2015-03-30 17:05
The behavior is the same. I am not sure it's a problem as much as a
design decision. It does not require everything to stay in memory, but only
the values for one key at a time. Have a look at how the preceding
shuffle works.
Consider repartitionAndSortWithinPartitions to *partition* by hour and
then
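Roughly along these lines (a sketch; it assumes an RDD `records` of (timestampMillis, record) pairs and a hypothetical hourOf() helper returning the epoch hour):

import org.apache.spark.Partitioner

// composite key (hour, timestamp): partition on the hour, sort on the full key
class HourPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (hour: Long, _) => (hour % partitions).toInt
  }
}

val keyed = records.map { case (ts, rec) => ((hourOf(ts), ts), rec) }
val sorted = keyed.repartitionAndSortWithinPartitions(new HourPartitioner(24))
// each partition now holds whole hours in timestamp order, ready to be written one file per hour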
Hi all,
I am trying to understand how Spark lazy evaluation works, and I need some
help. I have noticed that creating an RDD once and using it many times
won't trigger recomputation of it every time it gets used, whereas creating
a new RDD every time a new operation is performed will trigger
I think that you get a sort of silent caching after shuffles, in
some cases, since the shuffle files are not immediately removed and
can be reused.
(This is the flip side to the frequent question/complaint that the
shuffle files aren't removed straight away.)
On Mon, Mar 30, 2015 at 9:43 AM,
We are building a wrapper that makes it possible to use reactive streams
(i.e. Observable, see reactivex.io) as input to Spark Streaming. We
therefore tried to create a custom receiver for Spark. However, the
Observable lives at the driver program and is generally not serializable.
Is it possible
we are experiencing some problems with the groupBy operations when used
to group together data that will be written in the same file. The
operation that we want to do is the following: given some data with a
timestamp, we want to sort it by timestamp, group it by hour and write
one file per
Note that even the Facebook four degrees of separation paper went down to a
single machine running WebGraph (http://webgraph.di.unimi.it/) for the final
steps, after running jobs in their Hadoop cluster to build the dataset for that
final operation.
The computations were performed on a
Try this in Spark shell:
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
val jsc = new JavaSparkContext(sc)
val hc = new HiveContext(jsc.sc)
(I never mentioned that JavaSparkContext extends SparkContext…)
Cheng
On 3/30/15 8:28 PM,
Taking out the complexity of the ARIMA models to simplify things: I can't
seem to find a good way to represent even standard moving averages in Spark
Streaming. Perhaps it's my ignorance of the micro-batched style of the
DStreams API.
On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet
The mysql command line doesn't use JDBC to talk to MySQL server, so
this doesn't verify anything.
I think this Hive metastore installation guide from Cloudera may be
helpful. Although this document is for CDH4, the general steps are the
same, and should help you to figure out the
It worked, thank you.
On 30.03.2015 11:58, Sean Owen wrote:
The behavior is the same. I am not sure it's a problem as much as a
design decision. It does not require everything to stay in memory, but only
the values for one key at a time. Have a look at how the preceding
shuffle works.
Consider
Hi,
I am new to Spark/Streaming, and tried to run a modified
FlumeEventCount.scala example to display all events by adding the call:
stream.map(e => "Event: header: " + e.event.get(0).toString + " body: " +
new String(e.event.getBody.array)).print()
The spark-submit runs fine with --master local[4],
bq. In /etc/security/limits.conf I set the following values:
Have you done the above modification on all the machines in your Spark
cluster ?
If you use Ubuntu, be sure that the /etc/pam.d/common-session file contains
the following line:
session required pam_limits.so
On Mon, Mar 30, 2015 at 5:08
Hi, I changed my process flow.
Now I am processing one file per hour, instead of processing at the end of the
day.
This decreased the memory consumption.
Regards
Eduardo
On Thu, Mar 26, 2015 at 3:16 PM, Davies Liu dav...@databricks.com wrote:
Could you narrow down to a step which cause the
Hello Lian
Can you share the URL ?
On Mon, Mar 30, 2015 at 6:12 PM, Cheng Lian lian.cs@gmail.com wrote:
The mysql command line doesn't use JDBC to talk to MySQL server, so
this doesn't verify anything.
I think this Hive metastore installation guide from Cloudera may be
helpful.
I'm executing my application in local mode (with --master local[*]).
I'm using Ubuntu and I've put 'session required pam_limits.so' into
/etc/pam.d/common-session,
but it doesn't work.
On Mon, Mar 30, 2015 at 4:08 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. In /etc/security/limits.conf I set the following
The “best” solution to spark-shell’s problem is creating a file
$SPARK_HOME/conf/java-opts
with “-Dhdp.version=2.2.0.0-2014”
Cheers,
Doug
On Mar 28, 2015, at 1:25 PM, Michael Stone mst...@mathom.us wrote:
I've also been having trouble running 1.3.0 on HDP. The
Thanks Sean!
Do you know if there is a way (even manually) to delete these intermediate
shuffle results? I just want to test the expected behaviour. I know
that re-caching might be a positive action most of the time, but I want to
try without it.
Renato M.
2015-03-30 12:15 GMT+02:00 Sean
Hi
I have a problem with temp data in Spark. I have fixed
spark.shuffle.manager to SORT. In /etc/security/limits.conf I set the following
values:
* soft nofile 100
* hard nofile 100
In spark-env.sh I set ulimit -n 100
I've restarted the spark service and it
Hi Alessandro
Could you specify which query you were able to run successfully?
1. sqlContext.sql("SELECT * FROM Logs as l where l.timestamp = '2012-10-08
16:10:36'").collect
OR
2. sqlContext.sql("SELECT * FROM Logs as l where cast(l.timestamp as string)
= '2012-10-08 16:10:36.0'").collect
I am
Thanks Saisai. I will try your solution, but I still don't understand why the
filesystem should be used when there is plenty of memory available!
On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao sai.sai.s...@gmail.com
wrote:
Shuffle write will finally spill the data into file system as a bunch of
Thanks for your answer!
I don't call .collect to trigger the execution; I call it
because I need the RDD on the driver. This is not a huge RDD, and it's not
larger than the one returned with 50GB of input data.
The end of the stack trace:
The two IPs are the two worker nodes, I think.
Mostly, you will have to restart the machines (or re-login) for the ulimit
change to take effect. What operation are you doing? Are you doing too many
repartitions?
Thanks
Best Regards
On Mon, Mar 30, 2015 at 4:52 PM, Masf masfwo...@gmail.com wrote:
Hi
I have a problem with temp data in Spark. I have
Hi.
I've re-logged in; in fact, I ran 'ulimit -n' and it returns 100, but it
still crashes.
I'm doing reduceByKey and Spark SQL mixed over 17 files (250MB-500MB/file).
Regards.
Miguel Angel.
On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Mostly, you will have to restart
Just the same as Spark was disrupting the Hadoop ecosystem by changing the
assumption that you can't rely on memory in distributed analytics... now
maybe we are challenging the assumption that big data analytics need to
be distributed?
I've been asking the same question lately and seen similarly that
Thanks. That is what I have tried. JavaSparkContext does not extend
SparkContext, so it cannot be used here.
Does anyone else know whether we can use HiveContext with JavaSparkContext? From the
API documents, it seems this is not supported. Thanks.
On Sun, Mar 29, 2015 at 9:24 AM, Cheng Lian
Hi,
I'd like to have an online real-time recommendation system. I have an ALS model,
but I want to add new data in real time.
Is it possible? Any guidelines?
Hi,
do you have any updates
I think the jar file has to be local; HDFS is not supported yet in Spark.
See this answer:
http://stackoverflow.com/questions/28739729/spark-submit-not-working-when-application-jar-is-in-hdfs
Date: Sun, 29 Mar 2015 22:34:46 -0700
From: n.e.trav...@gmail.com
To: user@spark.apache.org
Hi,
DStream.print() only prints the first 10 elements contained in the Stream. You
can call DStream.print(x) to print the first x elements but if you don’t know
the exact count you can call DStream.foreachRDD and apply a function to display
the content of every RDD.
For example:
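A minimal sketch of that (assuming `stream` is a DStream of strings; note that collect() pulls each batch to the driver, so this is only for small streams):

stream.foreachRDD { rdd =>
  rdd.collect().foreach(println)   // print every element of the batch
}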
Hi,
I am interested in this opportunity. I am working as a Research Engineer at
Impetus Technologies, Bangalore, India. In fact, we implemented Distributed
Deep Learning on Spark. I will share my CV if you are interested.
Please visit the below link:
Dear all,
I'm still struggling to make a pre-trained Caffe model transformer for
DataFrames work. The main problem is that creating a Caffe model inside the
UDF is very slow and consumes a lot of memory.
Some of you suggested broadcasting the model. The problem with broadcasting
is that I use a JNI
Thank you for your reply.
I'm sorry for the slow confirmation.
I'll try tuning 'spark.yarn.executor.memoryOverhead'.
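I assume it can be set through SparkConf (or spark-defaults.conf), something like this, where the value in MB is only an example:

import org.apache.spark.SparkConf

// raise the per-executor off-heap overhead that YARN accounts for
val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")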
Thanks,
Yuichiro Sakamoto
On 2015/03/25 0:56, Sandy Ryza wrote:
Hi Yuichiro,
The way to avoid this is to boost spark.yarn.executor.memoryOverhead until the
executors have
Hi there,
I'm using Spark Streaming 1.2.1 with actorStreams. Initially, all goes well.
15/03/30 15:37:00 INFO spark.storage.MemoryStore: Block broadcast_1 stored as
values in memory (estimated size 3.2 KB, free 1589.8 MB)
15/03/30 15:37:00 INFO spark.storage.BlockManagerInfo: Added
Maybe you should mail him directly on j.bo...@ucl.ac.uk
Thanks
Best Regards
On Mon, Mar 30, 2015 at 8:47 PM, Chitturi Padma
learnings.chitt...@gmail.com wrote:
Hi,
I am interested in this opportunity. I am working as Research Engineer in
Impetus Technologies, Bangalore, India. In fact we
Hi Ankur
If you are using standalone mode, your config is wrong. You should use export
SPARK_DAEMON_MEMORY=xxx in conf/spark-env.sh. At least it works on my
Spark 1.3.0 standalone mode machine.
BTW, SPARK_DRIVER_MEMORY is used in YARN mode, and it looks like
standalone mode doesn't use this
Ah, sorry, my bad...
http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html
On 3/30/15 10:24 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
Hello Lian
Can you share the URL ?
On Mon, Mar 30, 2015 at 6:12 PM, Cheng Lian lian.cs@gmail.com
Sean
Yes, I know that I can use persist() to persist to disk, but it is still a big
extra cost to persist a huge RDD to disk. I hope that I can do one pass to get
the count as well as rdd.saveAsObjectFile(file2), but I don't know how.
Maybe use an accumulator to count the total?
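Something like this is what I have in mind (an untested sketch; assumes rdd holds the documents):

// count the elements while saving, in a single pass
val total = sc.accumulator(0L)
val counted = rdd.map { x => total += 1; x }
counted.saveAsObjectFile("file2")   // the save action triggers the single pass
println(s"count = ${total.value}")
// note: accumulator updates inside a transformation can be re-applied if tasks are retried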
Ningjun
From:
I think you must have downloaded the Spark source code gz file.
It is a little confusing. You have to select the Hadoop version as well, and
the actual tgz file will have the Spark version and Hadoop version in its name.
-R
On Mon, Mar 30, 2015 at 10:34 AM, vance46 wang2...@purdue.edu wrote:
Hi all,
I'm a
Hello,
Thank you for your contribution.
We have tried to reproduce your error but we need more information:
- Which Spark version are you using? Stratio Spark-Mongodb connector
supports 1.2.x SparkSQL version.
- What jars are you adding while launching the Spark-shell?
Best regards,
Hi Xi,
Please create a JIRA if it takes longer to locate the issue. Did you
try a smaller k?
Best,
Xiangrui
On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen davidshe...@gmail.com wrote:
Hi Burak,
After I added .repartition(sc.defaultParallelism), I can see from the log
the partition number is set
Nicolas:
See if there was an occurrence of the following exception in the log:
errs => throw new SparkException(
  s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " +
    errs.mkString("\n")),
Cheers
On Mon, Mar 30, 2015 at 9:40 AM, Cody Koeninger
Hi all,
I'm a newbie trying to set up Spark for my research project on a RedHat system.
I've downloaded spark-1.3.0.tgz and untarred it, and installed Python, Java
and Scala. I've set JAVA_HOME and SCALA_HOME and then tried to use sudo
sbt/sbt assembly according to
Did you try this example?
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala
I think you need to create a topic set with the number of partitions to consume.
Thanks
Best Regards
On Mon, Mar 30, 2015 at 9:35 PM, Nicolas
If you are only interested in getting hands-on with Spark and not in
building it with a specific version of Hadoop, use one of the bundle providers
like Cloudera.
It will give you a very easy way to install and monitor your services. (I
find installing via Cloudera Manager
On 30 Mar 2015, at 13:27, jay vyas
jayunit100.apa...@gmail.com wrote:
Just the same as Spark was disrupting the Hadoop ecosystem by changing the
assumption that you can't rely on memory in distributed analytics... now maybe
we are challenging the assumption
One workaround could be to convert the DataFrame into an RDD inside the
transform function, use mapPartitions/broadcast to work with the
JNI calls, and then convert back.
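Roughly along these lines (a sketch only; loadCaffeModel and model.transform are hypothetical stand-ins for the JNI calls, and modelBytes is the serialized pre-trained model):

// broadcast the serialized model once
val modelBc = sc.broadcast(modelBytes)

val transformed = df.rdd.mapPartitions { rows =>
  // create the JNI-backed model once per partition, not once per row
  val model = loadCaffeModel(modelBc.value)
  rows.map(row => model.transform(row))
}
// ...then convert `transformed` back as needed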
Thanks
Shivaram
On Mon, Mar 30, 2015 at 8:37 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Dear all,
I'm
This line
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(
KafkaRDD.scala:158)
is the attempt to close the underlying kafka simple consumer.
We can add a null pointer check, but the underlying issue of the consumer
being null probably indicates a problem earlier. Do you see
Okay, I didn't realize that I changed the behavior of lambda in 1.3
to make it scale-invariant, but it is worth discussing whether this
is a good change. In 1.2, we multiply lambda by the number of ratings in
each sub-problem. This makes it scale-invariant for explicit
feedback. However, in implicit
Hi Folks,
Just to summarize, to run Spark on the HDP distribution:
1. The Spark version has to be 1.3.0 or above if you are using the upstream
distribution. This configuration is mainly for HDP rolling-upgrade purposes,
and the patch only went into Spark upstream from 1.3.0.
2. In
Hi,
Is it possible to put the log4j.properties in the application jar such that
the driver and the executors use this log4j file. Do I need to specify
anything while submitting my app so that this file is used?
Thanks,
Udit
I'm trying to use OpenJDK 7 with Spark 1.3.0 and noticed that the
compute-classpath.sh script is not adding the datanucleus jars to the classpath
because compute-classpath.sh assumes it will find the jar command in
$JAVA_HOME/bin/jar, which does not exist for OpenJDK. Is this an issue anybody
So, I am trying to build Spark 1.3.0 (standalone mode) on Windows 7 using
Maven, but I'm getting a build failure.
java -version
java version 1.8.0_31
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
Here is the command I am
I have the same problem, i.e. exception with the same call stack when I start
either pyspark or spark-shell. I use spark-1.3.0-bin-hadoop2.4 on ubuntu
14.10.
bin/pyspark
A bunch of INFO messages, then an ActorInitializationException.
Shell starts, I can do this:
rd = sc.parallelize([1,2])
Hi All,
I am waiting for Spark 1.3.1 to fix the bug with the S3 file system.
Does anyone know the release date for 1.3.1? I can't downgrade to 1.2.1 because
there is a jar compatibility issue with the AWS SDK.
Regards,
Shuai
Are you referring to SPARK-6330 (https://issues.apache.org/jira/browse/SPARK-6330)?
If you are able to build Spark from source yourself, I believe you should just
need to cherry-pick the following commits in order to backport the fix:
67fa6d1f830dee37244b5a30684d797093c7c134 [SPARK-6330] Fix
Ah, never mind, I found the jar command in the java-1.7.0-openjdk-devel
package. I only had java-1.7.0-openjdk installed. Looks like I just need to
install java-1.7.0-openjdk-devel then set JAVA_HOME to /usr/lib/jvm/java
instead of /usr/lib/jvm/jre.
~ Jonathan Kelly
From: Kelly, Jonathan
When I run my code in Eclipse with the following parameters,
VM Args: -Xmx4g
OS: Windows
Time: 4.4 minutes
It is faster than submitting to a cluster with these parameters:
SPARK_EXECUTOR_MEMORY=4G
OS: Ubuntu
Time: 5.2 minutes
They are equivalent options, are they not? Both environments run on
This PR updated the k-means|| initialization:
https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
which was included in 1.3.0. It should fix kmeans|| initialization with
large k. Please create a JIRA for this issue and send me the code and the
dataset to produce this
For the same amount of data, if I set the k=500, the job finished in about
3 hrs. I wonder if I set k=5000, the job could finish in 30 hrs...the
longest time I waited was 12 hrs...
If I use kmeans-random, same amount of data, k=5000, the job finished in
less than 2 hrs.
I think current kmeans||
Client mode would not support HDFS jar extraction.
I tried this:
sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi
--deploy-mode cluster --master yarn
hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10
And it worked.
Hey Xi,
Have you tried Spark 1.3.0? The initialization happens on the driver node
and we fixed an issue with the initialization in 1.3.0. Again, please start
with a smaller k and increase it gradually. Let us know at what k the
problem happens.
Best,
Xiangrui
On Sat, Mar 28, 2015 at 3:11 AM,
I have set Kryo Serializer as default serializer in SparkConf and Spark UI
confirms it too, but in the Spark logs I'm getting this exception,
java.io.OptionalDataException
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1370)
at
Updated spark-defaults and spark-env:
Log directory /home/hduser/spark/spark-events does not exist.
(Also, in the default /tmp/spark-events it also did not work)
On 30 March 2015 at 18:03, Marcelo Vanzin van...@cloudera.com wrote:
Are those config values in spark-defaults.conf? I don't think
You can extend Gradient, e.g.,
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala#L266,
and use it in GradientDescent:
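For reference, a rough sketch of the shape of such a subclass (a plain squared-error loss, purely illustrative; check the exact Gradient API against your Spark version):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.Gradient

class MyGradient extends Gradient {

  override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
    val xs = data.toArray
    val ws = weights.toArray
    val diff = xs.zip(ws).map { case (x, w) => x * w }.sum - label
    (Vectors.dense(xs.map(_ * diff)), 0.5 * diff * diff)
  }

  override def compute(
      data: Vector,
      label: Double,
      weights: Vector,
      cumGradient: Vector): Double = {
    val (grad, loss) = compute(data, label, weights)
    // accumulate grad into cumGradient; for a DenseVector, toArray exposes the backing
    // array, so this updates it in place (an assumption worth verifying)
    val cum = cumGradient.toArray
    val gs = grad.toArray
    var i = 0
    while (i < gs.length) { cum(i) += gs(i); i += 1 }
    loss
  }
}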
Are those config values in spark-defaults.conf? I don't think you can
use ~ there - IIRC it does not do any kind of variable expansion.
On Mon, Mar 30, 2015 at 3:50 PM, Tom thubregt...@gmail.com wrote:
I have set
spark.eventLog.enabled true
as I try to preserve log files. When I run, I get
We test large feature dimension but not very large k
(https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525).
Again, please create a JIRA and post your test code and a link to your
test dataset, we can work on it. It is hard to track the issue with
multiple threads in
One issue is that 'big' becomes 'not so big' reasonably quickly. A couple
of TeraBytes is not that challenging (depending on the algorithm) these
days, whereas 5 years ago it was a big challenge. We have a bit over a
PetaByte (not using Spark) and using a distributed system is the only
viable way
I have set
spark.eventLog.enabled true
as I try to preserve log files. When I run, I get
Log directory /tmp/spark-events does not exist.
I set
spark.local.dir ~/spark
spark.eventLog.dir ~/spark/spark-events
and
SPARK_LOCAL_DIRS=~/spark
Now I get:
Log directory ~/spark/spark-events does not
Try running it like this:
sudo -u hdfs spark-submit --class org.apache.spark.examples.SparkPi
--deploy-mode cluster --master yarn
hdfs:///user/spark/spark-examples-1.2.0-cdh5.3.2-hadoop2.5.0-cdh5.3.2.jar 10
Caveats:
1) Make sure the permissions of /user/nick is 775 or 777.
2) No need for
I am trying to register classes with KryoSerializer. I get the following
error message:
How do I find out what class is being referred to by OpenHashMap$mcI$sp?
*com.esotericsoftware.kryo.KryoException:
java.lang.IllegalArgumentException: Class is not registered:
I re-ran my application through Eclipse on Ubuntu and received slower-than-expected
results of 6.1 minutes. So the question now is: why would there be
such a difference in run times between Windows 7 and Ubuntu 14.04?
Hi Wisely,
I am running Spark 1.2.1. I have checked the process heap, and it is
running with all the heap that I am assigning. As I mentioned earlier, I
get OOM on the workers, not the driver or master.
Thanks
Ankur
On Mon, Mar 30, 2015 at 9:24 AM, giive chen thegi...@gmail.com wrote:
Hi Ankur
This sounds like SPARK-6532.
On Mon, Mar 30, 2015 at 1:34 PM, ARose ashley.r...@telarix.com wrote:
So, I am trying to build Spark 1.3.0 (standalone mode) on Windows 7 using
Maven, but I'm getting a build failure.
java -version
java version 1.8.0_31
Java(TM) SE Runtime Environment (build
Are you running Spark in cluster mode by any chance?
(It always helps to show the command line you're actually running, and
if there's an exception, the first few frames of the stack trace.)
On Mon, Mar 30, 2015 at 4:11 PM, Tom Hubregtsen thubregt...@gmail.com wrote:
Updated spark-defaults and
I’m looking at various HA scenarios with Spark streaming. We’re currently
running a Spark streaming job that is intended to be long-lived, 24/7. We see
that if we kill node managers that are hosting Spark workers, new node managers
assume execution of the jobs that were running on the
I tried to file a bug in the git repo; however, I don't see a link to open
issues.
On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I checked the ports using netstat and don't see any connections
established on that port. Logs show only this:
15/03/27 13:50:48 INFO
Hi Vincent,
This may be a case that you're missing a semi-colon after your CREATE
TEMPORARY TABLE statement. I ran your original statement (missing the
semi-colon) and got the same error as you did. As soon as I added it in, I
was good to go again:
CREATE TEMPORARY TABLE jsonTable
USING
Thank you for updating the files Holden! I actually was using that
same text in my files located on HDFS. Could the files being located
on HDFS be the reason why the example gets stuck? I c/p the code
provided on github, the only things I changed were:
a) file paths to: val spam =
I ran a line like the following:
tb2.groupBy("city", "state").avg("price").show
I got result:
city state AVG(price)
Charlestown New South Wales 1200.0
Newton ... MA 1200.0
Coral Gables ... FL 1200.0
Castricum
I'm hoping to cut an RC this week. We are just waiting for a few other
critical fixes.
On Mon, Mar 30, 2015 at 12:54 PM, Kelly, Jonathan jonat...@amazon.com
wrote:
Are you referring to SPARK-6330
https://issues.apache.org/jira/browse/SPARK-6330?
If you are able to build Spark from source
You'll need to use the longer form for aggregation:
tb2.groupBy("city", "state").agg(avg("price").as("newName")).show
Depending on the language, you'll need to import:
scala: import org.apache.spark.sql.functions._
python: from pyspark.sql.functions import *
On Mon, Mar 30, 2015 at 5:49 PM, Neal Yin
I run Spark in local mode.
Command line (added some debug info):
hduser@hadoop7:~/spark-terasort$ ./bin/run-example SparkPi 10
Jar:
/home/hduser/spark-terasort/examples/target/scala-2.10/spark-examples-1.3.0-SNAPSHOT-hadoop2.4.0.jar
/home/hduser/spark-terasort/bin/spark-submit --master local[*]
Hi,
I used CombineTextInputFormat to read many small files.
The Java code is as follows (I've written it as a utility function):
public static JavaRDD<String> combineTextFile(JavaSparkContext sc,
String path, long maxSplitSize, boolean recursive)
{
Configuration conf =
I am having the same problems. Did you find a fix?
The stack trace for the first scenario and for your suggested improvement is
similar, with the only difference being the first line (sorry for not including
this):
Log directory /home/hduser/spark/spark-events does not exist.
To verify your premises, I cd'ed into the directory by copy-pasting the
path
Hey all, I am trying to figure out if I can use Spark for building loosely coupled
distributed data pipelines. This is part of a pitch that I am trying to come
up with. I'd really appreciate it if someone could comment on whether this is possible or not.
Many Thanks