So for each directory, create one RDD and then union them all.

On Fri, Jun 26, 2015 at 10:05 AM, Bahubali Jain <bahub...@gmail.com> wrote:
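The one-RDD-per-directory approach can be sketched as follows. This is a minimal sketch, not code from the thread: the `dirs` list, and the assumption that a `SparkContext` named `sc` is already in scope, are illustrative; the Avro type parameters mirror the `newAPIHadoopFile` call quoted later in the thread (Spark 1.x-era API).

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.rdd.RDD

// Hypothetical list of input directories; in the thread these are
// per-day paths under .../exptsession/.
val dirs = Seq(
  "/user/dvasthimal/epdatasets_small/exptsession/2015/04/06",
  "/user/dvasthimal/epdatasets_small/exptsession/2015/04/07"
)

// One RDD per directory (each path passed separately, never
// comma-joined), then union them all into a single RDD.
val perDir: Seq[RDD[(AvroKey[GenericRecord], NullWritable)]] =
  dirs.map { dir =>
    sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]](dir)
  }

val inputRecords = perDir.reduce(_ union _)
```

For many directories, `sc.union(perDir)` is the usual alternative to the pairwise `reduce(_ union _)`, since it builds a single flat `UnionRDD` instead of a nested chain.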
> Oh, my use case is not very straightforward. The input can have multiple
> directories.
>
> On Fri, Jun 26, 2015 at 9:30 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>
>> Yes, the only working solution was to use union:
>>
>>     val inputRecordsDay1 = getInputRecords(startDate, endDate, inputDir, sc, inputType)
>>
>>     val inputRecordsDay2 = getInputRecords(DateUtil.addDaysToDate(startDate, 1),
>>       DateUtil.addDaysToDate(endDate, 1), inputDir, sc, inputType)
>>
>>     val inputRecords = inputRecordsDay1.union(inputRecordsDay2)
>>
>> On Fri, Jun 26, 2015 at 7:52 AM, Bahubali Jain <bahub...@gmail.com> wrote:
>>
>>> Hi Deepak,
>>> Were you able to find a solution to this?
>>>
>>> Thanks,
>>> Baahu
>>>
>>> On Wed, Apr 8, 2015 at 9:50 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>
>>>> Spark version: 1.3
>>>> Command:
>>>>
>>>>     ./bin/spark-submit -v --master yarn-cluster \
>>>>       --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/hdfs/hadoop-hdfs-2.4.1-company-2.jar \
>>>>       --num-executors 100 --driver-memory 4g \
>>>>       --driver-java-options "-XX:MaxPermSize=4G" \
>>>>       --executor-memory 8g --executor-cores 1 --queue hdmi-express \
>>>>       --class com.company.ep.poc.spark.reporting.SparkApp \
>>>>       /home/dvasthimal/spark1.3/spark_reporting-1.0-SNAPSHOT.jar \
>>>>       startDate=2015-04-6 endDate=2015-04-7 \
>>>>       input=/user/dvasthimal/epdatasets_small/exptsession subcommand=viewItem \
>>>>       output=/user/dvasthimal/epdatasets/viewItem
>>>>
>>>> On Wed, Apr 8, 2015 at 9:49 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>> I have two HDFS directories, each containing multiple Avro files. I want
>>>>> to specify both directories as input. In the Hadoop world, one can specify
>>>>> a comma-separated list of directories. In Spark, that does not work.
>>>>>
>>>>> Logs
>>>>> ====
>>>>> 15/04/07 21:10:11 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
>>>>> 15/04/07 21:10:11 INFO spark.SparkContext: Created broadcast 2 from sequenceFile at DataUtil.scala:120
>>>>> 15/04/07 21:10:11 ERROR yarn.ApplicationMaster: User class threw exception: Input path does not exist: hdfs://namenode_host_name:8020/user/dvasthimal/epdatasets_small/exptsession/2015/04/06,/user/dvasthimal/epdatasets_small/exptsession/2015/04/07
>>>>> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://namenode_host_name:8020/user/dvasthimal/epdatasets_small/exptsession/2015/04/06,/user/dvasthimal/epdatasets_small/exptsession/2015/04/07
>>>>>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:320)
>>>>>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
>>>>> ====
>>>>>
>>>>> Input code:
>>>>>
>>>>>     sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
>>>>>       AvroKeyInputFormat[GenericRecord]](path)
>>>>>
>>>>> Here path is:
>>>>>
>>>>> /user/dvasthimal/epdatasets_small/exptsession/2015/04/06,/user/dvasthimal/epdatasets_small/exptsession/2015/04/07
>>>>>
>>>>> --
>>>>> Deepak
>>>
>>> --
>>> Twitter: http://twitter.com/Baahu

--
Deepak