We do something similar with our log files in Hive. We only handle 30 to 60 
files (of similar structure) at a time, but it sounds like that would fit your 
model…

We create an external table, then use HDFS puts to add the files to the table's location:

CREATE EXTERNAL TABLE log_import(
  `date` STRING,
  `time` STRING,
  url STRING,
  args STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/import';

dfs -put /data/clients/processed/20120616.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120617.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120618.txt.gz /user/hive/warehouse/import;
dfs -put /data/clients/processed/20120619.txt.gz /user/hive/warehouse/import;
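With a hundred files you would probably script the puts rather than typing them 
out; from an ordinary shell, a glob covers a whole batch (a sketch assuming the 
same directory layout as above, with the shell expanding the pattern):

hadoop fs -put /data/clients/processed/*.txt.gz /user/hive/warehouse/import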

I can't say for certain, but Hive/Hadoop seem to treat the separate files in the 
table much like buckets, with each file feeding its own input split. We do see a 
good level of parallelism in the tasks when we run queries against it…
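
One thing that bears directly on your question: Hive 0.8.0 and later expose an 
INPUT__FILE__NAME virtual column, so once the files are in a table like the one 
above you can ask Hive which file a match came from. A minimal sketch against 
our table (the string hello and the columns searched are just placeholders for 
your case):

SELECT DISTINCT INPUT__FILE__NAME
FROM log_import
WHERE url LIKE '%hello%'
   OR args LIKE '%hello%';

Each row that comes back is the full HDFS path of a file containing at least 
one match.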

Thanks,
Bob

Robert Gause
Senior Systems Engineer
ZyQuest, Inc.
bob.ga...@zyquest.com
920.617.7613

On Jul 30, 2012, at 11:11 PM, Techy Teck wrote:

I have around 100 files, each about 1 GB in size. I need to find a string in 
all 100 files and also determine which files contain that particular string. I 
am working with the Hadoop File System, and all 100 files live there.

All 100 files are under the real folder, so if I run the command below I can 
list all 100 of them. I need to find which files under the real folder contain 
a particular string, hello.

bash-3.00$ hadoop fs -ls /technology/dps/real

And this is how my data is laid out in HDFS:

row format delimited
fields terminated by '\29'
collection items terminated by ','
map keys terminated by ':'
stored as textfile


How can I write a MapReduce job for this problem, so that I can find which 
files contain a particular string? Any simple example would be of great help 
to me.
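
A minimal sketch of such a job (the class names and the grep.needle 
configuration key are made up for illustration): the mapper checks each line 
for the search string and emits the name of the file its split came from, and 
a small reducer deduplicates the file names.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrepFiles {

  public static class GrepMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {

    private String needle;

    @Override
    protected void setup(Context context) {
      // The search string is passed in through the job configuration.
      needle = context.getConfiguration().get("grep.needle", "hello");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      if (value.toString().contains(needle)) {
        // Ask the framework which file this split belongs to.
        String file =
            ((FileSplit) context.getInputSplit()).getPath().toString();
        context.write(new Text(file), NullWritable.get());
      }
    }
  }

  public static class DedupReducer
      extends Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text file, Iterable<NullWritable> ignored,
        Context context) throws IOException, InterruptedException {
      // reduce() runs once per distinct file name, so writing the key once
      // lists each matching file exactly once.
      context.write(file, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("grep.needle", args[2]);  // e.g. "hello"

    Job job = new Job(conf, "grep files");
    job.setJarByClass(GrepFiles.class);
    job.setMapperClass(GrepMapper.class);
    job.setCombinerClass(DedupReducer.class);  // dedupe early, map side
    job.setReducerClass(DedupReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input folder
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with something like hadoop jar grepfiles.jar GrepFiles 
/technology/dps/real /tmp/grep_out hello — the output directory then holds one 
line per file that has at least one matching line.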
