Here is how you can list everything (files and directories) directly under a given HDFS path.
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfsConn = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("hdfs://<Your NN Hostname>:8020"), hadoopConf)
val c = hdfsConn.listStatus(new org.apache.hadoop.fs.Path("/user/csingh/"))
c.foreach(x => println(x.getPath))
Output:
hdfs://<NN hostname>/user/csingh/.Trash
hdfs://<NN hostname>/user/csingh/.sparkStaging
hdfs://<NN hostname>/user/csingh/.staging
hdfs://<NN hostname>/user/csingh/test1
hdfs://<NN hostname>/user/csingh/test2
hdfs://<NN hostname>/user/csingh/tmp
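
If you only want the sub-directories, collected into a Scala List that you can iterate over later, something along these lines should work (just a sketch reusing the connection above; the NameNode hostname and the path are placeholders):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("hdfs://<Your NN Hostname>:8020"), new Configuration())

// listStatus returns files as well as directories, so keep only the directories
val subDirs: List[Path] = fs
  .listStatus(new Path("/user/csingh/"))
  .filter(_.isDirectory)
  .map(_.getPath)
  .toList

subDirs.foreach(println)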
> On Feb 20, 2016, at 2:37 PM, Divya Gehlot <[email protected]> wrote:
>
> Hi,
> @Umesh: Your understanding is partially correct as per my requirement.
> The idea I am trying to implement is as follows.
> These are the steps I am trying to follow
> (not sure how feasible it is; I am a newbie to Spark and Scala):
> 1. List all the sub-directories under the parent directory
> hdfs:///TestDirectory/
> as a list,
> for example: val listsubdirs = (subdir1, subdir2, ..., subdirN)
> 2. Iterate through this list:
> for (subdir <- listsubdirs) {
>   val df = "df" + subdir
>   // read it using the spark-csv package with a custom schema
> }
>
> This will give me as many DataFrames as there are subdirectories.
>
> Now I am stuck at the first step itself:
> how do I list the directories and put them in a list?
>
> I hope my issue is clearer now.
> Thanks,
> Divya
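
Once the sub-directory paths are in a list (see the snippet above), the loop you sketch can look roughly like this. This is only a sketch assuming Spark 1.x with the spark-csv package on the classpath and an sqlContext in scope; schemaFor is a hypothetical helper that returns the right StructType for a given sub-directory name (one possible version is sketched further down the thread):

import org.apache.spark.sql.DataFrame

// subDirs is the List[Path] collected earlier with FileSystem.listStatus
val dataFrames: Map[String, DataFrame] = subDirs.map { dir =>
  val df = sqlContext.read
    .format("com.databricks.spark.csv")   // spark-csv package
    .option("header", "false")            // adjust if your part files have a header row
    .schema(schemaFor(dir.getName))       // custom schema per sub-directory
    .load(dir.toString)
  dir.getName -> df
}.toMap

Afterwards you can look up each DataFrame by its sub-directory name, e.g. dataFrames("spark1").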
> On Feb 19, 2016 6:54 PM, "UMESH CHAUDHARY" <[email protected]
> <mailto:[email protected]>> wrote:
> If I understood correctly, you can have many sub-dirs under
> hdfs:///TestDirectory and you need to attach a schema to all part files
> in each sub-dir.
>
> 1) Assuming that you know the sub-directory names:
>
> For that, you need to list all sub-directories inside hdfs:///TestDirectory
> using Scala and iterate over them:
> for each sub-directory in the list,
> read the part files, then identify and attach the schema corresponding
> to that sub-directory.
>
> 2) If you don't know the sub-directory names:
> you need to store the schema somewhere inside each sub-directory and read it
> during the iteration.
>
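
For option 2, one possible convention (purely a sketch; the "_schema.json" file name and its contents are an assumption, not something spark-csv provides) is to write each sub-directory's schema out as the JSON produced by StructType.json and read it back before loading the part files:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.types.{DataType, StructType}

// Assumed convention: each sub-directory holds a "_schema.json" file containing
// the output of StructType.json for that sub-directory's data.
def readStoredSchema(fs: FileSystem, subDir: Path): StructType = {
  val in = fs.open(new Path(subDir, "_schema.json"))
  val json = try scala.io.Source.fromInputStream(in).mkString finally in.close()
  DataType.fromJson(json).asInstanceOf[StructType]
}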
> On Fri, Feb 19, 2016 at 3:44 PM, Divya Gehlot <[email protected]
> <mailto:[email protected]>> wrote:
> Hi,
> I have a use case where I have one parent directory.
>
> The file structure looks like:
> hdfs:///TestDirectory/spark1/  part files (created by some Spark job)
> hdfs:///TestDirectory/spark2/  part files (created by some Spark job)
>
> spark1 and spark2 have different schemas,
>
> e.g. the spark1 part files have the schema
> carname, model, year
>
> and the spark2 part files have the schema
> carowner, city, carcost
>
>
> As the spark1 and spark2 directories get created dynamically,
> there can also be a spark3 directory with yet another schema.
>
> My requirement is to read the parent directory, list the subdirectories,
> and create a DataFrame for each subdirectory.
>
> I am not able to work out how to list the subdirectories under the parent
> directory and dynamically create the DataFrames.
>
> Thanks,
> Divya
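
To make the per-directory schemas concrete with the columns mentioned above, they could be declared like this (just an illustration; the column types are guesses, since the thread only gives the column names):

import org.apache.spark.sql.types._

val spark1Schema = StructType(Seq(
  StructField("carname", StringType),
  StructField("model", StringType),
  StructField("year", IntegerType)
))

val spark2Schema = StructType(Seq(
  StructField("carowner", StringType),
  StructField("city", StringType),
  StructField("carcost", DoubleType)
))

// One possible shape for the hypothetical schemaFor helper used earlier
def schemaFor(subDirName: String): StructType = subDirName match {
  case "spark1" => spark1Schema
  case "spark2" => spark2Schema
  case other    => sys.error("No schema registered for sub-directory: " + other)
}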