You can repartition your DataFrame into a single partition, and all the data 
will land in one partition. However, doing this is perilous: you end up with 
all your data on one node, and if there is too much of it you will run out of 
memory. In fact, any time you are thinking about putting data in a single 
file, you should ask yourself “Does this data fit into memory?”
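
If it does, a minimal sketch looks like this (the paths and the SparkSession 
value named spark are assumptions for illustration, not from the thread):

   // coalesce(1) avoids a full shuffle but funnels everything through a
   // single task; repartition(1) shuffles first, with the same single-node
   // memory risk either way.
   val df = spark.read.parquet("hdfs:///some/partitioned/input")
   df.coalesce(1)
     .write
     .mode("overwrite")
     .parquet("hdfs:///some/single-file/output")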

The reason Spark is geared towards reading and writing data in a partitioned 
manner is that, fundamentally, partitioning data is how you scale your 
applications. Partitioned data allows Spark (or really any application 
designed to scale on a cluster) to read data in parallel, process it, and 
write it out, without any bottleneck. Humans prefer all their data in a 
single file/table, because humans have a limited ability to keep track of a 
multitude of files. Grid-enabled software hates single files, simply because 
there is no good way for two nodes to read a large file without some sort of 
bottleneck.
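
As a sketch of what partitioning buys you (the paths and the partition column 
named date here are hypothetical):

   import org.apache.spark.sql.functions.col

   // Each of the 200 tasks reads and writes its own slice in parallel;
   // since the data is repartitioned by date, each date=... directory
   // ends up with a single part-file.
   val events = spark.read.parquet("hdfs:///events/input")
   events.repartition(200, col("date"))
     .write
     .partitionBy("date")
     .parquet("hdfs:///events/output")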

Imagine a data processing pipeline that starts with some sort of ingestion 
and transformation at one end and feeds into several analytical processes. 
Usually there are humans at the end who are looking at the results of the 
analytics. These humans love to get their analytics in a dashboard that gives 
them a high-level view of the data. However, all the data processing systems 
between input and analytics prefer their data cut up into bite-sized chunks.

From: Christopher Piggott <cpigg...@gmail.com>
Date: Saturday, December 30, 2017 at 3:45 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Converting binary files

I have been searching for examples, but not finding exactly what I need.

I am looking for the paradigm for using Spark 2.2 to convert a bunch of binary 
files into a bunch of different binary files.  I'm starting with:

   val files = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input")

then convert them:

   val converted = files.map { case (filename, content) =>
     (filename, convert(content))
   }

but I don't really want to save by 'partition'; I want to save each file under 
its original name but in a different directory, e.g. "converted/*".

I'm not quite sure how I'm supposed to do this within the framework of what's 
available to me in SparkContext.  Do I need to do it myself using the HDFS api?
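
One workable approach is indeed to write each output file yourself with the 
Hadoop FileSystem API from inside a Spark action. A minimal sketch, assuming 
convert returns an Array[Byte] and the output lives under the hdfs:// URI 
shown (both assumptions, not from the thread):

   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}

   val files = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input")

   files.foreach { case (filename, content) =>
     val bytes = convert(content)  // the asker's conversion function
     // Keep the original file name, but write into converted/.
     val outPath =
       new Path("hdfs://1.2.3.4/converted/" + new Path(filename).getName)
     // Configuration is created on the executor because it is not
     // serializable; each task opens its own FileSystem handle.
     val fs = FileSystem.get(outPath.toUri, new Configuration())
     val out = fs.create(outPath)  // one output file per input file
     try out.write(bytes) finally out.close()
   }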

It would seem like this would be a pretty normal thing to do.  Imagine for 
instance I were saying take a bunch of binary files and compress them, and save 
the compressed output to a different directory.  I feel like I'm missing 
something fundamental here.

--C


