Pretty straight forward
#!/bin/ksh
DIR="hdfs://<hostname>:9000/data/stg/accounts/nw/xxxxxxxxx"
#
## Remove the blank header line from the spreadsheets and compress them
#
echo `date` " ""======= Started Removing blank header line and Compressing
all csv FILEs"
for FILE in `ls *.csv`
do
sed '1d' ${FILE} > ${FILE}.tmp
mv -f ${FILE}.tmp ${FILE}
/usr/bin/bzip2 ${FILE}
done
#
## Clear out hdfs staging directory
#
echo `date` " ""======= Started deleting old files from hdfs staging
directory ${DIR}"
hdfs dfs -rm -r ${DIR}/*.bz2
echo `date` " ""======= Started Putting bz2 fileS to hdfs staging
directory ${DIR}"
for FILE in `ls *.bz2`
do
hdfs dfs -copyFromLocal ${FILE} ${DIR}
done
echo `date` " ""======= Checking that all files are moved to hdfs staging
directory"
hdfs dfs -ls ${DIR}
exit 0
HTH
Dr Mich Talebzadeh
LinkedIn *
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
http://talebzadehmich.wordpress.com
On 28 March 2016 at 22:24, Ashok Kumar <[email protected]> wrote:
> Hello Mich
>
> If you accommodate can you please share your approach to steps 1-3 above.
>
> Best regards
>
>
> On Sunday, 27 March 2016, 14:53, Mich Talebzadeh <
> [email protected]> wrote:
>
>
> Pretty simple as usual it is a combination of ETL and ELT.
>
> Basically csv files are loaded into staging directory on host, compressed
> before pushing into hdfs
>
>
> 1. ETL --> Get rid of the header blank line on the csv files
> 2. ETL --> Compress the csv files
> 3. ETL --> Put the compressed CVF files into hdfs staging directory
> 4. ELT --> Use databricks to load the csv files
> 5. ELT --> Spark FP to prcess the csv data
> 6. ELT --> register it as a temporary table
> 7. ELT --> Create an ORC table in a named database in compressed zlib2
> format in Hive database
> 8. ELT --> Insert/select from temporary table to Hive table
>
>
> So the data is stored in an ORC table and one can do whatever analysis
> using Spark, Hive etc
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
> http://talebzadehmich.wordpress.com
>
>
> On 27 March 2016 at 03:05, Koert Kuipers <[email protected]> wrote:
>
> To me this is expected behavior that I would not want fixed, but if you
> look at the recent commits for spark-csv it has one that deals this...
> On Mar 26, 2016 21:25, "Mich Talebzadeh" <[email protected]>
> wrote:
>
>
> Hi,
>
> I have a standard csv file (saved as csv in HDFS) that has first line of
> blank at the header
> as follows
>
> [blank line]
> Date, Type, Description, Value, Balance, Account Name, Account Number
> [blank line]
> 22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN
> AE","'638585-60125663",
>
> When I read this file using the following standard
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header",
> "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>
> it crashes.
>
> java.util.NoSuchElementException
> at java.util.ArrayList$Itr.next(ArrayList.java:794)
>
> If I go and manually delete the first blank line it works OK
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header",
> "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>
> df: org.apache.spark.sql.DataFrame = [Date: string, Type: string,
> Description: string, Value: double, Balance: double, Account Name:
> string, Account Number: string]
>
> I can easily write a shell script to get rid of blank line. I was
> wondering if databricks does have a flag to get rid of the first blank line
> in csv file format?
>
> P.S. If the file is stored as DOS text file, this problem goes away.
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn *
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
>