Hi, I am new to Pig scripting and need help or suggestions to resolve the problem below.
I have 1000 XML files in a folder. My Pig script has to take them one by one, parse each for some values, and store those values in a single file. I tried the script below, but it is not working as expected.

    register piggybank.jar;

    A = load 'XML/NCT{00000611,00000768}.xml' using org.apache.pig.piggybank.storage.XMLLoader('org_study_id') as (x: chararray);
    A2 = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<org_study_id>(.*)</org_study_id>')) as (org_study_id : chararray);
    A3 = foreach A2 GENERATE CONCAT('#$',CONCAT(org_study_id,'$'));
    STORE A3 into 'piglab/result1';
    data = load 'piglab/result1' USING PigStorage('$') as (a1: chararray,a2: chararray);

    C = load 'XML/NCT{00000611,00000768}.xml' using org.apache.pig.piggybank.storage.XMLLoader('nct_id') as (x1: chararray);
    C2 = foreach C GENERATE FLATTEN(REGEX_EXTRACT_ALL(x1,'<nct_id>(.*)</nct_id>')) as (nct_id : chararray);
    C3 = foreach C2 GENERATE CONCAT('#$',CONCAT(nct_id,'$'));
    STORE C3 into 'piglab/result11';
    data11 = load 'piglab/result11' USING PigStorage('$') as (c1: chararray,c2: chararray);

    I = load 'piglab/NCT{00000611,00000768}.xml' using org.apache.pig.piggybank.storage.XMLLoader('minimum_age') as (x5: chararray);
    I2 = foreach I GENERATE FLATTEN(REGEX_EXTRACT_ALL(x5,'<minimum_age>(.*)</minimum_age>')) as (minimum_age: chararray);
    I3 = foreach I2 GENERATE CONCAT('#$',CONCAT(minimum_age,'$'));
    STORE I3 into 'piglab/result9';
    data8 = load 'piglab/result9' USING PigStorage('$') as (i1: chararray,i2: chararray);

    result3 = JOIN data by a1,data11 by c1,data8 by i1;
    Store result3 into 'piglab/result';

The XML looks like this. Each XML file has a different clinical_study rank, such as <clinical_study rank="687">.

    <?xml version="1.0" encoding="UTF-8"?>
    <clinical_study rank="687">
      <!-- This xml conforms to an XML Schema at:
        http://clinicaltrials.gov/ct2/html/images/info/public.xsd
      and an XML DTD at:
        http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
      <required_header>
        <download_date>ClinicalTrials.gov processed this data on November 07, 2013</download_date>
        <link_text>Link to the current ClinicalTrials.gov record.</link_text>
        <url>http://clinicaltrials.gov/show/NCT00000611</url>
      </required_header>
      <id_info>
        <org_study_id>114</org_study_id>
        <nct_id>NCT00000611</nct_id>
      </id_info>
      <brief_title>Women's Health Initiative (WHI)</brief_title>
      <sponsors>
        <lead_sponsor>
          <agency>National Heart, Lung, and Blood Institute (NHLBI)</agency>
          <agency_class>NIH</agency_class>
        </lead_sponsor>
        <collaborator>
          <agency>National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS)</agency>
          <agency_class>NIH</agency_class>
        </collaborator>
        <collaborator>
          <agency>National Cancer Institute (NCI)</agency>
          <agency_class>NIH</agency_class>
        </collaborator>
        <collaborator>
          <agency>National Institute on Aging (NIA)</agency>
          <agency_class>NIH</agency_class>
        </collaborator>
      </sponsors>
    </clinical_study>

Any help on this would be highly appreciated. Thanks.
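
P.S. To make the expected result concrete, here is a rough sketch in Python (not Pig) of the output I am after: one row per XML file, holding org_study_id, nct_id and minimum_age. The function names, the tab-separated layout and the empty string for a missing tag are only my assumptions for illustration, and the sample is a cut-down copy of the file above.

```python
import re

# The three tags the Pig script tries to extract.
FIELDS = ("org_study_id", "nct_id", "minimum_age")


def extract_row(xml_text, fields=FIELDS):
    """Return one value per tag; an empty string when the tag is absent."""
    row = []
    for tag in fields:
        # Non-greedy match so only the text inside this one element is taken.
        match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), xml_text, re.DOTALL)
        row.append(match.group(1).strip() if match else "")
    return row


def rows_to_text(xml_docs):
    """Join the rows of all documents into one tab-separated block of text."""
    return "\n".join("\t".join(extract_row(doc)) for doc in xml_docs)


# Cut-down copy of the posted file; it has no minimum_age element.
sample = """<clinical_study rank="687">
  <id_info>
    <org_study_id>114</org_study_id>
    <nct_id>NCT00000611</nct_id>
  </id_info>
</clinical_study>"""

print(rows_to_text([sample]))
```

So for every input file I would like one line in the final output, with the three values side by side, instead of three separate result folders that I then have to join.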