I think that must be it, Dave, but I can't for the life of me figure out where in the code that's happening. Will take another look tonight.
J

On Wed, May 23, 2018 at 7:00 AM, David Ortiz <dpo5...@gmail.com> wrote:

> Josh,
>
> Is there any chance that somehow the output path to
> AvroParquetPathPerKeyTarget is getting twisted up when it goes through
> the compilation step? Watching it while it runs, the output in the
> /tmp/crunch/p<stage> directory basically looks like what I would expect
> to see in the output directory. AvroPathPerKeyTarget also showed similar
> behavior when I was messing around to see if that would work.
>
> Thanks,
> Dave
>
> On Thu, May 17, 2018 at 6:03 PM Josh Wills <josh.wi...@gmail.com> wrote:
>
>> Hrm, got it. Now at least I know where to look (although I'm surprised
>> that overriding finalize() didn't fix it, as I ran into similar
>> problems with my own cluster and created a SlackPipeline class that
>> overrides that method).
>>
>> J
>>
>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <dpo5...@gmail.com> wrote:
>>
>>> Josh,
>>>
>>> Those adjustments did not appear to do anything to stop the tmp
>>> directory from being removed at the end of the job execution
>>> (overriding finalize with an empty block when creating the
>>> SparkPipeline, and running with pipeline.run() instead of done()).
>>> However, I can confirm that I see the stage output for the two output
>>> directories, complete with Parquet files partitioned by key. Neither
>>> they, nor anything else, ever makes it to the output directory, which
>>> is not even created.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <dpo5...@gmail.com> wrote:
>>>
>>>> Hey Josh,
>>>>
>>>> Thanks for taking a look. I can definitely play with that on Monday
>>>> when I'm back at work.
>>>>
>>>> Thanks,
>>>> Dave
>>>>
>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <josh.wi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey David,
>>>>>
>>>>> Looking at the code, the problem isn't obvious to me, but there are
>>>>> only two places things could be going wrong: writing the data out of
>>>>> Spark into the temp directory where intermediate outputs get stored
>>>>> (i.e., Spark isn't writing the data out for some reason), or moving
>>>>> the data from the temp directory to the final location. The temp
>>>>> data is usually deleted at the end of a Crunch run, but you can
>>>>> disable this by a) not calling Pipeline.cleanup or Pipeline.done at
>>>>> the end of the run and b) subclassing SparkPipeline with dummy code
>>>>> that overrides the finalize() method (which is implemented in the
>>>>> top-level DistributedPipeline abstract base class) to be a no-op. Is
>>>>> that easy to try out, so we can isolate the source of the error?
>>>>> Otherwise I can play with this a bit tomorrow on my own cluster.
>>>>>
>>>>> J
>>>>>
>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <dpo5...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Awesome. Thanks for taking a look!
>>>>>>
>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <josh.wi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hrm, that sounds like something is wrong with the commit operation
>>>>>>> on the Spark side; let me take a look at it this evening!
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <dpo5...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> Are there any known issues with the AvroParquetPathPerKeyTarget
>>>>>>>> when running a Spark pipeline? When I run my pipeline with
>>>>>>>> mapreduce, I get output; when I run with Spark, the step before,
>>>>>>>> where I list my partition keys out (because we use them to add
>>>>>>>> partitions to Hive), shows that data is present, but the output
>>>>>>>> directory remains empty. This behavior occurs when targeting
>>>>>>>> both HDFS and S3 directly.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dave
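
For reference, a minimal sketch of the workaround discussed in this thread: a SparkPipeline subclass whose finalize() is a no-op, used together with run() instead of done() so the intermediate output under /tmp/crunch survives for inspection. The class name follows Josh's SlackPipeline, but the constructor form is an assumption; this is not code from the thread.

    import org.apache.crunch.impl.spark.SparkPipeline;

    // Sketch only: keep Crunch's temp output around by making the finalizer
    // (implemented in DistributedPipeline, per the discussion above) a no-op.
    public class SlackPipeline extends SparkPipeline {

      // Assumed constructor form: Spark master/connect string plus an app name.
      public SlackPipeline(String sparkConnect, String appName) {
        super(sparkConnect, appName);
      }

      @Override
      protected void finalize() {
        // Intentionally empty: skip the temp-directory cleanup that would
        // otherwise run when the pipeline object is finalized.
      }
    }

Hypothetical usage, matching Dave's description of calling run() rather than done(); the master string and app name are placeholders:

    Pipeline pipeline = new SlackPipeline("yarn", "path-per-key-debug");
    // ... build PCollections and declare writes here ...
    pipeline.run();  // not done(), so cleanup of /tmp/crunch/p<stage> never fires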
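
And a stripped-down sketch of the kind of write under discussion, assuming a PTable keyed by the Hive partition value with Avro records as values; the class, method, and parameter names are placeholders rather than details from Dave's job. Per the thread, this populates the output directory with one subdirectory per key under MRPipeline, but leaves it empty (and uncreated) under SparkPipeline even though the data appears under /tmp/crunch/p<stage>.

    import org.apache.avro.generic.GenericData;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.io.parquet.AvroParquetPathPerKeyTarget;
    import org.apache.hadoop.fs.Path;

    public class PathPerKeyWrite {

      // byPartition is assumed to already satisfy the target's requirements:
      // String keys naming the partition, Avro-compatible record values.
      public static void writePartitioned(Pipeline pipeline,
                                          PTable<String, GenericData.Record> byPartition,
                                          String outputDir) {
        // Expectation: one subdirectory per key under outputDir, holding Parquet files.
        byPartition.write(new AvroParquetPathPerKeyTarget(new Path(outputDir)));

        // run() rather than done() while debugging, so the intermediate
        // /tmp/crunch output is not cleaned up at the end of the job.
        pipeline.run();
      }
    }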