If the file is already in HDFS, you can have Spark read it using an input format appropriate to the file type; the input format takes care of splitting it across the workers.
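For example, something like the following rough, untested sketch (the HDFS paths and the partition count are placeholders, not from your setup):

// Read a large text file already sitting in HDFS. Spark delegates to the
// underlying Hadoop InputFormat (TextInputFormat for plain text), which
// splits the file by HDFS block, so no manual pre-splitting is needed.
import org.apache.spark.sql.SparkSession

object SplitBigFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("split-big-file")
      .getOrCreate()

    // Each HDFS block (128 MB by default) becomes one partition.
    val lines = spark.sparkContext.textFile("hdfs:///data/big_file.txt")
    println(s"partitions: ${lines.getNumPartitions}")

    // Optionally write the data back out as many smaller files,
    // one file per partition (2000 here is an arbitrary example).
    lines.repartition(2000).saveAsTextFile("hdfs:///data/big_file_split")

    spark.stop()
  }
}

So the 2 TB file would already be read in parallel, and writing it back out produces the smaller files without a single-threaded pass.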
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html

On Sat, Apr 22, 2017 at 4:36 AM, Paul Tremblay <paulhtremb...@gmail.com> wrote:
> We are tasked with loading a big file (possibly 2TB) into a data
> warehouse. In order to do this efficiently, we need to split the file into
> smaller files.
>
> I don't believe there is a way to do this with Spark, because in order for
> Spark to distribute the file to the worker nodes, it first has to be split
> up, right?
>
> We ended up using a single machine with a single thread to do the
> splitting. I just want to make sure I am not missing something obvious.
>
> Thanks!
>
> --
> Paul Henry Tremblay
> Attunix