Hi,
I'm using pyspark (1.6.2) to do a little bit of ETL and have noticed a very
odd situation. I have two dataframes, base and updated. The "updated"
dataframe contains a constrained subset of the data from "base" that I wish
to exclude. Something like this:
updated = base.where(base.X =
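The filter expression is cut off above, but the operation described (dropping the "updated" rows from "base") amounts to a left anti-join; in Spark 1.6 this can be written with `DataFrame.subtract` or a left outer join followed by an `isNull` filter. A minimal sketch of the anti-join semantics in plain Python (toy data; the key name "X" is hypothetical):

```python
# Left anti-join semantics: keep base rows whose key does NOT appear
# in updated. (Toy data; "X" is a hypothetical key column.)
def anti_join(base, updated, key):
    excluded = {row[key] for row in updated}
    return [row for row in base if row[key] not in excluded]

base = [{"X": 1}, {"X": 2}, {"X": 3}]
updated = [{"X": 2}]
remaining = anti_join(base, updated, "X")  # rows with X=1 and X=3
```

In DataFrame terms the same result would come from `base.subtract(updated)` when the schemas match exactly.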
Hi,
I have a cluster with 15 nodes of which 5 are HDFS nodes. I kick off a job
that creates some 120 stages. Eventually, the active and pending stages
reduce down to a small bottleneck, and it never fails: the tasks
associated with the 10 (or so) running stages are always allocated to the
same
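The message cuts off above, but when the remaining tasks keep landing on the same few (HDFS) nodes, one scheduler knob worth checking is the locality wait, which controls how long Spark holds a task waiting for a data-local slot before falling back to other executors. A hypothetical invocation (the value is illustrative, not a recommendation):

```shell
# Shorten how long the scheduler waits for a data-local executor
# before running the task elsewhere (the default is 3s).
spark-submit \
  --conf spark.locality.wait=1s \
  your_job.py
```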
I'm using Spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's
in S3. I've done this previously with Spark 1.5 with no issue. Attempting
to load and count a single file as follows:
dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
dataFrame.count()
But when it
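The failure itself is truncated above; for reference, one setup difference that commonly bites when moving between builds is how the s3a filesystem is wired up: the hadoop-aws jar (and its aws-java-sdk dependency) must be on the classpath, and credentials set via Hadoop properties. A configuration sketch (credentials elided):

```python
# Configuration sketch: s3a credentials via the Hadoop configuration.
# Assumes hadoop-aws and aws-java-sdk are already on the classpath.
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "<access-key>")
hadoopConf.set("fs.s3a.secret.key", "<secret-key>")

dataFrame = sqlContext.read.text("s3a://bucket/path-to-file.csv")
dataFrame.count()
```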
>
> Pozdrawiam,
> Jacek Laskowski
>
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Mar 24, 2016 at 3:24 PM, Aaron Jackson <ajack...@pob
Well, that's unfortunate; it just means I have to scrape the web UI for that
information. As to why: I have a cluster that is being increased in size
to accommodate the processing requirements of a large set of jobs. It's
useful to know when the new workers have joined the Spark cluster. In my
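Rather than scraping the HTML UI, the standalone master also serves a JSON view of cluster state on its web port under `/json`, which includes a `workers` array; the exact field names can vary by version, so treat the shape assumed below as an assumption. A sketch that counts ALIVE workers from such a payload (the sample payload is hypothetical):

```python
import json

def alive_workers(master_json_text):
    """Count workers reported as ALIVE in the master's /json payload.
    Assumes a top-level "workers" list whose entries carry a "state" field."""
    state = json.loads(master_json_text)
    return sum(1 for w in state.get("workers", []) if w.get("state") == "ALIVE")

# In practice the payload would come from something like:
#   urllib2.urlopen("http://master-host:8080/json").read()
sample = '{"workers": [{"state": "ALIVE"}, {"state": "ALIVE"}, {"state": "DEAD"}]}'
```

Polling this endpoint as the cluster grows gives a worker count without parsing HTML.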
Yeah, that's what I thought.
In this specific case, I'm porting over some scripts from an existing RDBMS
platform. I had been porting them (slowly) to in-code notation with Python
or Scala. However, to expedite my efforts (and presumably theirs, since I'm
not doing this forever), I went down the
Greetings,
I am processing a "batch" of files and have structured an iterative process
around them. Each batch is processed by first loading the data with
spark-csv, performing some minor transformations, and then writing back out
as Parquet. Absolutely no caching or shuffle should occur with
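The per-batch flow described above (load with spark-csv, transform, write out as Parquet, with nothing carried between iterations) can be sketched as a generic loop; the helper names here are hypothetical stand-ins for the actual load/transform/save calls:

```python
def process_batches(batches, load, transform, save):
    """Run each batch independently through load -> transform -> save.
    Nothing is cached or reused between iterations, so no shuffle or
    cached state should accumulate across batches."""
    done = []
    for batch in batches:
        df = load(batch)      # e.g. sqlContext.read.format('com.databricks.spark.csv').load(batch)
        df = transform(df)    # minor column-level transformations
        save(df, batch)       # e.g. df.write.parquet(output_path)
        done.append(batch)
    return done
```

Keeping each iteration self-contained like this is what makes the "no caching or shuffle" expectation reasonable: each batch is a straight narrow pipeline from source file to Parquet.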