We do a similar process with our log files in Hive. We only handle 30 to 60 files (of similar structure) at a time, but it sounds like it would fit your model…
We create an external table, then do hdfs puts to add the files to the table:

CREATE EXTERNAL TABLE log_import (
  date STRING,
  time STRING,
  url  STRING,
  args STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/import';

dfs -put /data/clients/processed/20120616.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120617.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120618.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120619.txt.gz /user/hive/warehouse/import;

I don't know the internals, but Hive/Hadoop seem to treat the separate files in the table as separate input splits, and we do see a good level of parallel tasks when we run queries against it…

Thanks,
Bob

Robert Gause
Senior Systems Engineer
ZyQuest, Inc.
bob.ga...@zyquest.com
920.617.7613

On Jul 30, 2012, at 11:11 PM, Techy Teck wrote:

I have around 100 files, and each file is about 1 GB in size. I need to find a string in all these 100 files, and also which files contain that particular string. I am working with the Hadoop File System, and all 100 files are in HDFS.

All 100 files are under the "real" folder, so if I run the command below, I get all 100 files listed. I need to find which files under the real folder contain a particular string, "hello".

bash-3.00$ hadoop fs -ls /technology/dps/real

And this is my data structure in HDFS:

row format delimited
  fields terminated by '\29'
  collection items terminated by ','
  map keys terminated by ':'
stored as textfile

How can I write a MapReduce job for this problem, so that I can find which files contain a particular string? Any simple example would be of great help to me.
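For the "which files contain the string" part, you may not even need to hand-write a MapReduce job. Recent Hive versions (0.8+) expose a virtual column, INPUT__FILE__NAME, which holds the HDFS path of the file each row came from, and Hive compiles the query into MapReduce for you. Here's a minimal sketch, not something from our setup: it assumes your files are plain text, it treats each whole line as a single STRING column (which works as long as your data doesn't contain Hive's default field delimiter, \001), and the table name grep_import is just an example:

CREATE EXTERNAL TABLE grep_import (line STRING)
STORED AS TEXTFILE
LOCATION '/technology/dps/real';

-- list each file that contains 'hello' at least once
SELECT DISTINCT INPUT__FILE__NAME
FROM grep_import
WHERE line LIKE '%hello%';

Because the table only declares one column, your '\29' field delimiters don't matter for a simple grep like this; the LIKE test sees the whole line. The files still get scanned in parallel by map tasks, same as with our log_import table.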