Did you know that the simple demo program for reading characters from a file
didn't work?
Who wrote that simple hello-world-type little program?
jane thorpe
janethor...@aol.com
-----Original Message-----
From: jane thorpe
To: somplasticllc ; user
Sent: Fri, 3 Apr 2020 2:44
Subject: Re: HDFS file hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt

Thanks darling
...hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt MapPartitionsRDD[91] at
textFile at <console>:27
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at
reduceByKey at <console>:30
scala> :quit
jane thorpe
janethor...@aol.com
-Original Message-
From: Som Lima
CC: user
Sent: Tue, 31 Mar 2020
Hi Jane
Try this example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
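The heart of that example is roughly the following (a sketch from memory,
not the file verbatim; the directory to watch comes in as args(0)):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object HdfsWordCount {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("HdfsWordCount")
        // 2-second batches; each batch picks up files newly created in the directory
        val ssc = new StreamingContext(sparkConf, Seconds(2))
        val lines = ssc.textFileStream(args(0))
        val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }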
Som
On Tue, 31 Mar 2020, 21:34 jane thorpe wrote:
> hi,
>
> Are there setup instructions on the website for
>
From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Saturday, September 30, 2017 6:10 AM
To: JG Perrin <jper...@lumeris.com>
Cc: Alexander Czech <alexander.cz...@googlemail.com>; user@spark.apache.org
Subject: Re: HDFS or NFS as a cache?

On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.com> wrote:
> You will collect in the driver (often the master) and it will save the data,
> so for saving, you will not have to set up HDFS.
No, it doesn't work quite like that.
1. Workers generate their data and
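To illustrate the distinction (a minimal sketch; rdd and the path are
placeholders, not from the original mail):

    // Each executor writes its own partition files (part-00000, part-00001, ...)
    // directly to the target filesystem; nothing funnels through the driver.
    rdd.saveAsTextFile("hdfs://namenode:8020/output")

    // collect(), by contrast, pulls the whole dataset back to the driver first:
    // fine for small results, a bottleneck (or OOM) for large ones.
    val localCopy = rdd.collect()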
From: Alexander Czech [mailto:alexander.cz...@googlemail.com]
Sent: Friday, September 29, 2017 8:15 AM
To: user@spark.apache.org
Subject: HDFS or NFS as a cache?
I have
Yes, I have identified the rename as the problem; that is why I think the
extra bandwidth of the larger instances might not help. Also, there is a
consistency issue with S3 because of how the rename works, so I
probably lose data.
On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov wrote:
How many files do you produce? I believe it spends a lot of time on renaming
the files because of the output committer.
Also, instead of 5x c3.2xlarge, try using 2x c3.8xlarge instead, because they
have 10GbE and you can get good throughput to S3.
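One common mitigation, sketched below (assuming a DataFrame df; the bucket
path is a placeholder), is to cut the number of output files, which also
cuts the number of commit-time renames:

    // 16 output files instead of one per task
    df.coalesce(16).write.parquet("s3a://my-bucket/output")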
On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech <alexander.cz...@googlemail.com> wrote:
There is a mv command in GCS, but I am not quite sure (because of the limits
of the data I work on and of my budget) whether the mv command actually
copies and deletes, or just re-points the files to a new directory by
changing their metadata.
Yes the Data Quality checks are done after the
Thank you Gourav,
> Moving files from _temp folders to main folders is an additional overhead
> when you are working on S3 as there is no move operation.
Good catch. Is GCS the same?
> I generally have a set of Data Quality checks after each job to ascertain
> whether everything went
But you have to be careful: that is the default setting. There is a way you
can override it so that the writing to the _temp folder does not take place
and you write directly to the main folder.
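One such knob, assuming the standard Hadoop FileOutputCommitter (a sketch,
and mind the caveat above about partially written output on failure), is
commit algorithm version 2, which renames task output straight into the
destination at task commit instead of in a final job-level pass:

    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("direct-commit")
      // v2 skips the job-commit rename of everything under _temporary
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()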
It's out of the box in Spark. When you write data into HDFS or any storage,
it only creates the new Parquet folder properly if your Spark job was a
success; otherwise there is only a _temp folder inside, to mark that it's
still not a success (Spark was killed), or nothing inside (the Spark job
failed).
> On Aug 8, 2016,
Try setting spark.locality.wait to a higher number and see if things
change. You can read more about the configuration properties here:
http://spark.apache.org/docs/latest/configuration.html#scheduling
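For example (a sketch; 10s is just a value to experiment with, the default
being 3s):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("MyApp")
      // wait longer for a data-local executor slot before
      // falling back to a less local one
      .set("spark.locality.wait", "10s")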
Thanks
Best Regards
On Sat, Dec 12, 2015 at 12:16 AM, shahid ashraf wrote:
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Tuesday, September 15, 2015 7:47 PM
To: Adrian Bridgett
Cc: user
Subject: Re: hdfs-ha on mesos - odd bug
On Mon, Sep 14, 2015 at 6:55 AM, Adrian Bridgett <adr...@opensignal.com> wrote:
> 15/09/14 13:00:25 WARN TaskSetManager: Lost task 0.0 in stage
Sent: Friday, 2 October 2015 18:37:22
Subject: Re: HDFS small file generation problem
Ok thanks, but can I also update data instead of inserting data?
----- Original Message -----
From: "Brett Antonides" <banto...@gmail.com>
To: user@spark.apache.org
Sent: Friday, 2 October 2015 18:18:18
Subject: Re: HDFS small file generation problem
;Jörn Franke" <jornfra...@gmail.com>
À: nib...@free.fr, "Brett Antonides" <banto...@gmail.com>
Cc: user@spark.apache.org
Envoyé: Samedi 3 Octobre 2015 11:17:51
Objet: Re: HDFS small file generation problem
You can update data in hive if you use the orc format
Le
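To make that concrete (a sketch: the table and column names are
hypothetical, df is assumed to be a HiveContext-backed DataFrame, and
row-level updates additionally require a transactional ORC table and are
issued through Hive itself, not Spark):

    // From Spark, land the events in an ORC-backed Hive table:
    df.write.format("orc").saveAsTable("events")

    // Then, in Hive (not Spark), a transactional ORC table supports e.g.:
    //   UPDATE events SET payload = '...' WHERE id = 42;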
Thanks a lot!
> Nicolas
On Sat, 3 October 2015 at 10:42, < nib...@free.fr > wrote:
Hello,
Finally, Hive is not a solution, as I cannot update the data.
Thanks a lot. Why did you say "the most recent version"?
----- Original Message -----
From: "Jörn Franke" <jornfra...@gmail.com>
To: "nibiau" <nib...@free.fr>
Cc: banto...@gmail.com, user@spark.apache.org
Sent: Saturday, 3 October 2015 13:56:43
Subject: Re: RE : Re: HDFS small file generation problem
> Yes, the most recent version; or you can use Phoenix on top of HBase. I
> recommend trying out both and seeing which one is the most suitable.
: "Jörn Franke" <jornfra...@gmail.com>
À: nib...@free.fr, "user" <user@spark.apache.org>
Envoyé: Lundi 28 Septembre 2015 23:53:56
Objet: Re: HDFS small file generation problem
Use hadoop archive
Le dim. 27 sept. 2015 à 15:36, < nib...@free.fr > a écrit :
I had a very similar problem ...
Use Hadoop archive.
On Sun, 27 September 2015 at 15:36, < nib...@free.fr > wrote:
> Hello,
> I'm still investigating the small-file generation problem created by my
> Spark Streaming jobs.
> Indeed, my Spark Streaming jobs receive a lot of small events (avg
> 10 kB), and I have to store them
For some reason Spark isn't picking up your Hadoop confs. Did you download
Spark compiled with the Hadoop version that you have in the cluster?
Thanks
Best Regards
On Fri, Sep 25, 2015 at 7:43 PM, Angel Angel
wrote:
> hello,
> I am running the spark application.
>
Please post the question on the vendor's forum.
> On Sep 25, 2015, at 7:13 AM, Angel Angel wrote:
>
> Hello,
> I am running the Spark application.
>
> I have installed Cloudera Manager.
> It includes Spark version 1.2.0.
>
>
> But now I want to use Spark version
I would suggest not writing small files to HDFS. Rather, you can hold them
in memory, maybe off-heap, and then flush them to HDFS using another
job, similar to https://github.com/ptgoetz/storm-hdfs (not sure if Spark
already has something like it).
On Sun, Sep 27, 2015 at 11:36 PM,
You could try a couple of things:
a) Use Kafka for stream processing: store current incoming events and Spark
Streaming job output in Kafka rather than on HDFS, and dual-write to HDFS too
(in a micro-batched mode), so every x minutes. Kafka is more suited to
processing lots of small events (see the sketch after this list).
b)
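A sketch of the Spark-side dual write in option (a); the broker address,
topic name, the use of the kafka-clients producer, and a DStream[String]
called dstream are all assumptions, not from the original mail:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        // One producer per partition per batch; a pooled producer
        // would be better in practice.
        val props = new Properties()
        props.put("bootstrap.servers", "broker:9092")
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        events.foreach(e => producer.send(new ProducerRecord[String, String]("events", e)))
        producer.close()
      }
    }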
On Mon, Sep 14, 2015 at 6:55 AM, Adrian Bridgett wrote:
> 15/09/14 13:00:25 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> 10.1.200.245): java.lang.IllegalArgumentException:
> java.net.UnknownHostException: nameservice1
> at
>
Hi Sam, in short, no, it's a traditional install, as we plan to use spot
instances and didn't want price spikes to kill off HDFS.
We're actually doing a bit of a hybrid, using spot instances for the
Mesos slaves and on-demand for the Mesos masters. So for the time being,
putting HDFS on the
I've seen similar traces, but couldn't track down the failure completely.
You are using Kerberos for your HDFS cluster, right? AFAIK Kerberos isn't
supported in Mesos deployments.
Can you resolve that host name (nameservice1) from the driver machine (ping
nameservice1)? Can it be resolved from
Thanks Steve - we are already taking the safe route - putting NN and
datanodes on the central mesos-masters which are on demand. Later (much
later!) we _may_ put some datanodes on spot instances (and using several
spot instance types as the spikes seem to only affect one type - worst
case we
I don't know about the broken URL. But are you running HDFS as a Mesos
framework? If so, is it using mesos-dns?
Then you should resolve the namenode via hdfs:///
On Mon, Sep 14, 2015 at 3:55 PM, Adrian Bridgett wrote:
> I'm hitting an odd issue with running spark on
I will try a fresh setup very soon.
Actually, I tried to compile spark by myself, against hadoop 2.5.2, but I
had the issue that I mentioned in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Master-doesn-t-start-no-logs-td23651.html
I was wondering if maybe
You could consider using Zeppelin and Spark on YARN as an alternative.
http://zeppelin.incubator.apache.org/
Simon
On 16 Jun 2015, at 17:58, Sanjay Subramanian
sanjaysubraman...@yahoo.com.INVALID wrote:
hey guys
After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS
It says your namenode is down (connection refused on 8020). You can restart
HDFS by going into the Hadoop directory and running sbin/stop-dfs.sh and
then sbin/start-dfs.sh.
Thanks
Best Regards
On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote:
Hello All,
A bit scared I did
Ahh, this did the trick. I had to get the namenode out of safe mode,
however, before it fully worked.
Thanks!
On Tue, Jun 2, 2015 at 12:09 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
Thanks Akhil!
1) I had to do sudo -u hdfs hdfs dfsadmin -safemode leave
a) I had created a user called hdfs with superuser privileges in Hue, hence
the double hdfs.
2) Lastly, I know this is getting a bit off topic, but this is my /etc/hosts
file:
127.0.0.1 localhost.localdomain
Hello Sean and Akhil,
I shut down the services on Cloudera Manager. I shut them down in the
appropriate order and then stopped all services of CM. I then shut down my
instances. I then turned my instances back on, but I am getting the same
error.
1) I tried hadoop fs -safemode leave and it said
The command would be:
hadoop dfsadmin -safemode leave
If you are not able to ping your instances, it can be because you are
blocking all ICMP requests. I'm not quite sure why you are not able to
ping google.com from your instances. Make sure the internal IP (ifconfig)
is proper in the
If you are using CDH, you would be shutting down services with
Cloudera Manager. I believe you can do it manually using Linux
'services' if you do the steps correctly across your whole cluster.
I'm not sure if the stock stop-all.sh script is supposed to work.
Certainly, if you are using CM, by far
Thanks Akhil and Sean for the responses.
I will try shutting down spark, then storage and then the instances.
Initially, when hdfs was in safe mode, I waited for 1 hour and the problem
still persisted. I will try this new method.
Thanks!
On Sat, Jan 17, 2015 at 2:03 AM, Sean Owen
The safest way would be to first shut down HDFS and then shut down Spark
(calling stop-all.sh would do), and then shut down the machines.
You can execute the following command to disable safe mode:
*hadoop dfsadmin -safemode leave*
Thanks
Best Regards
On Sat, Jan 17, 2015 at 8:31 AM, Su She
You would not want to turn off storage underneath Spark. Shut down
Spark first, then storage, then shut down the instances. Reverse the
order when restarting.
HDFS will be in safe mode for a short time after being started before
it becomes writeable. I would first check that it's not just that.
Try
(hdfs:///localhost:8020/user/data/*)
With 3 /.
Thx
tri
-Original Message-
From: Benjamin Cuthbert [mailto:cuthbert@gmail.com]
Sent: Monday, December 01, 2014 4:41 PM
To: user@spark.apache.org
Subject: hdfs streaming context
All,
Is it possible to stream on HDFS directory
Have you tried just passing a path to ssc.textFileStream()? It
monitors the path for new files by looking at mtime/atime; all
new/touched files in the time window appear as an RDD in the DStream.
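In code form (a minimal sketch; host, port, path, and batch interval are
placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30))
    // Each 30-second batch becomes an RDD of the lines of files that
    // appeared in the directory during that window.
    val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")
    lines.print()
    ssc.start()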
On 1 December 2014 at 14:41, Benjamin Cuthbert cuthbert@gmail.com wrote:
All,
Is it possible
Yes, in fact, that's the only way it works. You need
hdfs://localhost:8020/user/data, I believe.
(No, it's not correct to write hdfs:///...)
On Mon, Dec 1, 2014 at 10:41 PM, Benjamin Cuthbert
cuthbert@gmail.com wrote:
All,
Is it possible to stream on HDFS directory and listen for multiple
Thanks Sean,
That worked: just removing the /* and leaving it as /user/data.
Seems to be streaming in.
On 1 Dec 2014, at 22:50, Sean Owen so...@cloudera.com wrote:
Yes, but you can't follow three slashes with host:port. No host
probably defaults to whatever is found in your HDFS config.
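The two accepted shapes side by side (a sketch; host and port are
placeholders):

    // Two slashes, then an explicit namenode authority:
    sc.textFile("hdfs://localhost:8020/user/data")

    // Three slashes, no authority: the namenode is taken from
    // fs.defaultFS in your HDFS config.
    sc.textFile("hdfs:///user/data")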
On Mon, Dec 1, 2014 at 11:02 PM, Bui, Tri tri@verizonwireless.com wrote:
For the streaming example I am working on, it's accepted (hdfs:///user/data)
without the
You can use sc.objectFile
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
to read it. It will be of type RDD[Student].
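For example (a sketch; the path is a placeholder, and Student must be the
same serializable class the file was written with):

    val students = sc.objectFile[Student]("hdfs:///user/data/students")
    students.take(5).foreach(println)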
Thanks
Best Regards
On Mon, Nov 17, 2014 at 4:03 PM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
Hi,
JavaRDD<Instrument>
Hello Naveen,
I think you should first override the toString method of your
sample.spark.test.Student class.
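Something along these lines (a sketch; the fields are hypothetical):

    class Student(val id: Int, val name: String) extends Serializable {
      // Without this override, println(student) shows something like
      // sample.spark.test.Student@1a2b3c4d
      override def toString: String = s"Student($id, $name)"
    }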
--
Regards,
Hlib Mykhailenko
PhD student at INRIA Sophia-Antipolis Méditerranée
2004 Route des Lucioles BP93
06902 SOPHIA ANTIPOLIS cedex
- Original Message -
From:
I got some time to look into it. It appears that Spark (latest git)
is doing this operation much more often compared to the Aug 1 version. Here
is the log from the operation I am referring to:
14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not
found, computing it
14/08/19 12:37:26 INFO
In general it would be nice to be able to configure replication on a
per-job basis. Is there a way to do that without changing the config
values in the Hadoop conf/ directory between jobs? Maybe by modifying
OutputFormats or the JobConf?
On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia wrote:
Andrew, there are overloaded versions of saveAsHadoopFile or
saveAsNewAPIHadoopFile that allow you to pass in a per-job Hadoop conf.
saveAsTextFile is just a convenience wrapper on top of saveAsHadoopFile.
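A sketch of the per-job route (pairRdd, the path, and the Text/Text types
are illustrative; this assumes an RDD[(Text, Text)]):

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dfs.replication", "2") // replication for this job's output only
    pairRdd.saveAsHadoopFile("hdfs:///out", classOf[Text], classOf[Text],
      classOf[TextOutputFormat[Text, Text]], jobConf)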
On Mon, Jul 14, 2014 at 11:22 PM, Andrew Ash and...@andrewash.com wrote:
In general it
Eager to know about this issue too; does anyone know how?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/hdfs-replication-on-saving-RDD-tp289p9700.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
You can change this setting through SparkContext.hadoopConfiguration, or put
the conf/ directory of your Hadoop installation on the CLASSPATH when you
launch your app so that it reads the config values from there.
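Or globally, for everything the context writes afterwards (a sketch):

    // Applies to all subsequent Hadoop-based writes from this SparkContext
    sc.hadoopConfiguration.set("dfs.replication", "2")
    rdd.saveAsTextFile("hdfs:///out")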
Matei
On Jul 14, 2014, at 8:06 PM, valgrind_girl 124411...@qq.com wrote:
eager
Commit: 5f48721, github.com/apache/spark/pull/586
From: alee...@hotmail.com
To: user@spark.apache.org
Subject: RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn
mode
Date: Wed, 18 Jun 2014 11:24:36 -0700
Forgot to mention that I am using spark-submit to submit jobs, and a
verbose-mode printout looks like this with the SparkPi examples. The
.sparkStaging directory won't be deleted. My thought is that this should be
part of the staging and should be cleaned up as well when the SparkContext
gets terminated.
Hi,
The problem was due to a pre-built/binary Tachyon-0.4.1 jar in the
SPARK_CLASSPATH; that Tachyon jar had been built against Hadoop-1.0.4.
Building Tachyon against Hadoop-2.0.0 resolved the issue.
Thanks
On Wed, Jun 11, 2014 at 11:34 PM, Marcelo Vanzin van...@cloudera.com
wrote:
Any suggestions from anyone?
Thanks
Bijoy
On Tue, Jun 10, 2014 at 11:46 PM, bijoy deb bijoy.comput...@gmail.com
wrote:
Hi all,
I have built Shark-0.9.1 with sbt using the below command:
*SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.6.0 sbt/sbt assembly*
My Hadoop cluster also has version
The error is saying that your client libraries are older than what
your server is using (2.0.0-mr1-cdh4.6.0 is IPC version 7).
Try double-checking that your build is actually using that version
(e.g., by looking at the hadoop jar files in lib_managed/jars).
On Wed, Jun 11, 2014 at 2:07 AM, bijoy