If you load data using ORC or Parquet, the RDD will have one partition per
file, so your DataFrame will not directly match the partitioning of the
table.
If you want to process by partition and guarantee that partitioning is
preserved, then mapPartitions etc. will be useful.
Note that if you perform any
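As a minimal sketch of the mapPartitions approach (here `df` is assumed to be a DataFrame loaded from ORC/Parquet; the per-row transformation is illustrative):

```scala
import org.apache.spark.sql.Row

// mapPartitions processes each partition's rows through one iterator,
// without triggering a shuffle, so the existing partitioning is kept.
val processed = df.rdd.mapPartitions { rows: Iterator[Row] =>
  // Illustrative per-row work: upper-case the first (string) column
  rows.map(row => row.getString(0).toUpperCase)
}
```

Because the transformation is iterator-to-iterator, rows stay in their original partitions end to end.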
If YARN has capacity to run both simultaneously, it will. You should ensure
you are not allocating too many executors to the first app, leaving some room
for the second.
You may want to run the applications on different YARN queues to control
resource allocation. If you run as a different user
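A sketch of submitting each app to its own queue (queue names and resource sizes here are hypothetical, not from the original thread):

```shell
# Submit the first app to a dedicated queue, capping its executors
spark-submit --master yarn-cluster \
  --queue streaming \
  --num-executors 4 --executor-memory 2g \
  streaming-app.jar

# The second app goes to a separate queue with its own share
spark-submit --master yarn-cluster \
  --queue batch \
  batch-app.jar
```

The `--queue` flag maps each submission onto a YARN queue, so the scheduler, not the apps, arbitrates the split.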
You don't connect to Spark directly. The Spark client (running on your remote
machine) submits jobs to the YARN cluster running on HDP. What you probably
need is yarn-cluster or yarn-client mode, with the YARN client configs
downloaded from the Ambari actions menu.
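A sketch of what that submission looks like from the remote machine (the config path and jar name are placeholders; point `HADOOP_CONF_DIR` at wherever you unpacked the Ambari client configs):

```shell
# Tell the Spark client where the downloaded YARN client configs live
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Driver runs on your remote machine, executors in the cluster
spark-submit --master yarn-client my-app.jar

# Or run the driver inside the cluster too
spark-submit --master yarn-cluster my-app.jar
```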
Simon
> On 10 Aug 2015, at 12:44
You might also want to consider broadcasting the models to ensure you get one
instance shared across the cores on each machine; otherwise the model will be
serialised into each task and you'll get a copy per task (roughly one per core
in this instance).
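A sketch of the broadcast pattern (here `sc`, `model`, `data`, and the `predict` method are all placeholders for whatever model and RDD the original thread was using):

```scala
// Broadcast ships the model once per executor JVM, not once per task
val modelBc = sc.broadcast(model)

val scored = data.mapPartitions { rows =>
  // .value returns the single deserialised copy on this executor,
  // shared by every core/task running in that JVM
  val m = modelBc.value
  rows.map(r => m.predict(r))
}
```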
Simon
Sent from my iPhone
> On 30 Jul 2015, at 10
You could consider using Zeppelin and spark on yarn as an alternative.
http://zeppelin.incubator.apache.org/
Simon
> On 16 Jun 2015, at 17:58, Sanjay Subramanian
> wrote:
>
> hey guys
>
> After day one at the spark-summit SFO, I realized sadly that (indeed) HDFS is
> not supported by Databr
You mean toDF(), not toRD(). It stands for data frame, if that makes it
easier to remember.
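For reference, a sketch of the inferred-schema pattern from that guide in Spark 1.x (the case class, file name, and `sqlContext` are the guide's conventions, not taken from this thread):

```scala
// Spark 1.x: the implicits bring toDF() into scope
import sqlContext.implicits._

case class Person(name: String, age: Int)

// Infer the schema from the case class, then convert to a DataFrame
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

people.registerTempTable("people")
```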
Simon
> On 18 May 2015, at 01:07, Rajdeep Dua wrote:
>
> Hi All,
> Was trying the Inferred Schema spart example
> http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
>
> I am getting
You won’t be able to use YARN labels on 2.2.0. However, you only need the
labels if you want to map containers onto specific hardware. In your scenario,
the capacity scheduler in YARN might be the best bet. You can set up separate
queues for the streaming and other jobs to protect a percentage of c
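The queue setup described here might look like the following in capacity-scheduler.xml (queue names and percentages are illustrative):

```xml
<!-- Define two queues under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>streaming,batch</value>
</property>

<!-- Guarantee the streaming queue 40% of cluster capacity -->
<property>
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>40</value>
</property>

<!-- The batch queue gets the remaining 60% -->
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>60</value>
</property>
```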
You shouldn’t have any issues with differing nodes on the latest Ambari and
Hortonworks stack. It works fine for mixed hardware and Spark on YARN.
Simon
> On Jan 26, 2015, at 4:34 PM, Michael Segel wrote:
>
> If you’re running YARN, then you should be able to mix and max where YARN is
> managing t
You can use the same build commands, but it's well worth setting up a Zinc
server if you're doing a lot of builds. That will allow incremental Scala
compilation, which speeds up the process significantly.
SPARK-4501 might be of interest too.
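A sketch of the two ways to get Zinc into the loop (assuming `zinc` is on your PATH for the manual route):

```shell
# Manual route: start a long-running Zinc server once, then reuse it
zinc -start
mvn -DskipTests clean package

# Or use the build/mvn wrapper, which downloads and launches
# Zinc automatically before invoking Maven
./build/mvn -DskipTests clean package
```

Subsequent builds against the running server only recompile changed Scala sources.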
Simon
> On 3 Jan 2015, at 17:27, Manoj Kumar wrote:
>
>