Hrm, got it -- now at least I know where to look (although I'm surprised that overriding finalize() didn't fix it, since I ran into similar problems with my own cluster and created a SlackPipeline class that overrides that method).
J

On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]> wrote:

> Josh,
>
> Those adjustments did not appear to do anything to stop the tmp
> directory from being removed at the end of the job execution (overriding
> finalize() with an empty block when creating the SparkPipeline, and
> running with pipeline.run() instead of done()). I can confirm that I see
> the stage output for the two output directories, complete with Parquet
> files partitioned by key. However, neither they nor anything else ever
> makes it to the output directory, which is never even created.
>
> Thanks,
> Dave
>
> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:
>
>> Hey Josh,
>>
>> Thanks for taking a look. I can definitely play with that on Monday
>> when I'm back at work.
>>
>> Thanks,
>> Dave
>>
>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
>>
>>> Hey David,
>>>
>>> Looking at the code, the problem isn't obvious to me, but there are
>>> only two places things could be going wrong: writing the data out of
>>> Spark into the temp directory where intermediate outputs get stored
>>> (i.e., Spark isn't writing the data out for some reason), or moving the
>>> data from the temp directory to the final location. The temp data is
>>> usually deleted at the end of a Crunch run, but you can disable this by
>>> a) not calling Pipeline.cleanup or Pipeline.done at the end of the run,
>>> and b) subclassing SparkPipeline with dummy code that overrides the
>>> finalize() method (which is implemented in the top-level
>>> DistributedPipeline abstract base class) to be a no-op. Is that easy to
>>> try out, to see if we can isolate the source of the error? Otherwise I
>>> can play with this a bit tomorrow on my own cluster.
>>>
>>> J
>>>
>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>>>
>>>> Awesome. Thanks for taking a look!
>>>>
>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]> wrote:
>>>>
>>>>> Hrm, that sounds like something is wrong with the commit operation
>>>>> on the Spark side; let me take a look at it this evening!
>>>>>
>>>>> J
>>>>>
>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Are there any known issues with the AvroParquetPathPerKeyTarget
>>>>>> when running a Spark pipeline? When I run my pipeline with MapReduce,
>>>>>> I get output, but when I run with Spark, the preceding step where I
>>>>>> list out my partition keys (we use them to add partitions to Hive)
>>>>>> shows that data is present, yet the output directory remains empty.
>>>>>> This behavior occurs when targeting both HDFS and S3 directly.
>>>>>>
>>>>>> Thanks,
>>>>>> Dave
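For reference, a minimal sketch of the workaround Josh describes above: subclass SparkPipeline so that finalize() (implemented in the DistributedPipeline base class) becomes a no-op, and call run() rather than done() so cleanup is never triggered. The class name, paths, and record type here are illustrative, and the SparkPipeline/AvroParquetPathPerKeyTarget signatures are recalled from the Crunch API rather than taken from this thread, so treat it as a sketch:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.crunch.PTable;
    import org.apache.crunch.impl.spark.SparkPipeline;
    import org.apache.crunch.io.parquet.AvroParquetPathPerKeyTarget;
    import org.apache.hadoop.fs.Path;

    // Hypothetical subclass that keeps Crunch's temp output around for
    // debugging by turning the finalize() cleanup into a no-op.
    public class NoCleanupSparkPipeline extends SparkPipeline {

      public NoCleanupSparkPipeline(String sparkConnect, String appName) {
        super(sparkConnect, appName);
      }

      // DistributedPipeline implements finalize() to delete the temp
      // directory holding intermediate outputs; an empty override leaves
      // that data in place so it can be inspected after the run.
      @Override
      protected void finalize() {
        // intentionally empty: skip temp-directory cleanup
      }

      // Usage sketch: 'records' stands in for whatever PTable<String, V>
      // the real job produces (the Avro type and output path are made up).
      public static void runJob(NoCleanupSparkPipeline pipeline,
                                PTable<String, GenericRecord> records) {
        pipeline.write(records,
            new AvroParquetPathPerKeyTarget(new Path("/tmp/out/by_key")));
        pipeline.run();  // run(), not done(): done() would call cleanup()
      }
    }

With this in place, the per-key Parquet output under the temp directory should survive the run, which makes it possible to tell whether Spark never wrote the data or whether the move to the final location is what fails.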
