Here is how you can list everything (files and directories) directly under a given HDFS path.
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfsConn = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://<Your NN Hostname>:8020"), hadoopConf)
val c = hdfsConn.listStatus(new org.apache.hadoop.fs.Path("/user/csingh/"))
c.foreach(x => println(x.getPath))
Output:
hdfs://<NN hostname>/user/csingh/.Trash
hdfs://<NN hostname>/user/csingh/.sparkStaging
hdfs://<NN hostname>/user/csingh/.staging
hdfs://<NN hostname>/user/csingh/test1
hdfs://<NN hostname>/user/csingh/test2
hdfs://<NN hostname>/user/csingh/tmp
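
If you only want the sub-directories, collected into a Scala List that you can iterate over later, something along these lines should work (just a sketch reusing the connection above; the NameNode hostname and the path are placeholders):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs://<Your NN Hostname>:8020"), new Configuration())

// listStatus returns files as well as directories, so keep only the directories
val subDirs: List[Path] = fs
  .listStatus(new Path("/user/csingh/"))
  .filter(_.isDirectory)
  .map(_.getPath)
  .toList

subDirs.foreach(println)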
> On Feb 20, 2016, at 2:37 PM, Divya Gehlot <[email protected]> wrote:
>
> Hi,
> @Umesh: Your understanding is partially correct as per my requirement.
> The idea I am trying to implement is as follows.
> These are the steps I am trying to follow
> (not sure how feasible it is; I am a newbie to Spark and Scala):
> 1. List all the sub-directories under the parent directory
> hdfs:///TestDirectory/
> as a list,
> for example: val listsubdirs = (subdir1, subdir2, ..., subdirN)
> 2. Iterate through this list:
> for (subdir <- listsubdirs) {
>   val df = "df" + subdir
>   // read it using the spark-csv package with a custom schema
> }
>
> This will give me as many DataFrames as there are subdirectories.
>
> Now I am stuck at the first step itself:
> how do I list the directories and put them in a list?
>
> I hope my issue is clearer now.
> Thanks,
> Divya
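
Once the sub-directory paths are in a list (see the snippet above), the loop you sketch can look roughly like this. This is only a sketch assuming Spark 1.x with the spark-csv package on the classpath and an sqlContext in scope; schemaFor is a hypothetical helper that returns the right StructType for a given sub-directory name (one possible version is sketched further down the thread):

import org.apache.spark.sql.DataFrame

// subDirs is the List[Path] collected earlier with FileSystem.listStatus
val dataFrames: Map[String, DataFrame] = subDirs.map { dir =>
  val df = sqlContext.read
    .format("com.databricks.spark.csv")   // spark-csv package
    .option("header", "false")            // adjust if your part files have a header row
    .schema(schemaFor(dir.getName))       // custom schema per sub-directory
    .load(dir.toString)
  dir.getName -> df
}.toMap

Afterwards you can look up each DataFrame by its sub-directory name, e.g. dataFrames("spark1").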
> On Feb 19, 2016 6:54 PM, "UMESH CHAUDHARY" <[email protected]
> <mailto:[email protected]>> wrote:
> If I understood correctly, you can have many sub-dirs under
> hdfs:///TestDirectory and you need to attach a schema to all part files
> in each sub-dir.
>
> 1) Assuming that you know the sub-directory names:
>
> For that, you need to list all sub-directories inside hdfs:///TestDirectory
> using Scala and iterate over them:
> for each sub-directory in the list,
> read the part files, then identify and attach the schema corresponding
> to that sub-directory.
>
> 2) If you don't know the sub-directory names:
> you need to store the schema somewhere inside each sub-directory and read it
> during the iteration.
>
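
For option 2, one possible convention (purely a sketch; the "_schema.json" file name and its contents are an assumption, not something spark-csv provides) is to write each sub-directory's schema out as the JSON produced by StructType.json and read it back before loading the part files:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.types.{DataType, StructType}

// Assumed convention: each sub-directory holds a "_schema.json" file containing
// the output of StructType.json for that sub-directory's data.
def readStoredSchema(fs: FileSystem, subDir: Path): StructType = {
  val in = fs.open(new Path(subDir, "_schema.json"))
  val json = try scala.io.Source.fromInputStream(in).mkString finally in.close()
  DataType.fromJson(json).asInstanceOf[StructType]
}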
> On Fri, Feb 19, 2016 at 3:44 PM, Divya Gehlot <[email protected]
> <mailto:[email protected]>> wrote:
> Hi,
> I have a use case where I have one parent directory.
>
> The file structure looks like:
> hdfs:///TestDirectory/spark1/  part files (created by some Spark job)
> hdfs:///TestDirectory/spark2/  part files (created by some Spark job)
>
> spark1 and spark2 have different schemas,
>
> e.g. the spark1 part files have the schema
> carname, model, year
>
> and the spark2 part files have the schema
> carowner, city, carcost
>
>
> As the spark1 and spark2 directories get created dynamically,
> there can also be a spark3 directory with yet another schema.
>
> My requirement is to read the parent directory, list the subdirectories,
> and create a DataFrame for each subdirectory.
>
> I am not able to work out how to list the subdirectories under the parent
> directory and dynamically create the DataFrames.
>
> Thanks,
> Divya
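
To make the per-directory schemas concrete with the columns mentioned above, they could be declared like this (just an illustration; the column types are guesses, since the thread only gives the column names):

import org.apache.spark.sql.types._

val spark1Schema = StructType(Seq(
  StructField("carname", StringType),
  StructField("model", StringType),
  StructField("year", IntegerType)
))

val spark2Schema = StructType(Seq(
  StructField("carowner", StringType),
  StructField("city", StringType),
  StructField("carcost", DoubleType)
))

// One possible shape for the hypothetical schemaFor helper used earlier
def schemaFor(subDirName: String): StructType = subDirName match {
  case "spark1" => spark1Schema
  case "spark2" => spark2Schema
  case other    => sys.error("No schema registered for sub-directory: " + other)
}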