Hey Josh, I'm still experimenting with it a bit, but I seem to be getting the same behavior even after rebuilding with the patch.
Thanks,
Dave

On Thu, May 24, 2018 at 1:50 AM Josh Wills <josh.wi...@gmail.com> wrote:

> David,
>
> Take a look at CRUNCH-670; I think that patch fixes the problem in the
> most minimal way I can think of.
>
> https://issues.apache.org/jira/browse/CRUNCH-670
>
> J
>
> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <josh.wi...@gmail.com> wrote:
>
>> I think that must be it, Dave, but I can't for the life of me figure out
>> where in the code that's happening. Will take another look tonight.
>>
>> J
>>
>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <dpo5...@gmail.com> wrote:
>>
>>> Josh,
>>>
>>> Is there any chance that the output path for AvroParquetPathPerKey is
>>> somehow getting twisted up when it goes through the compilation step?
>>> Watching the job while it runs, the output in the /tmp/crunch/p<stage>
>>> directory looks basically like what I would expect to see in the output
>>> directory. AvroPathPerKeyTarget also seemed to show similar behavior
>>> when I experimented to see whether it would work.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <josh.wi...@gmail.com> wrote:
>>>
>>>> Hrm, got it -- now at least I know where to look (although I'm
>>>> surprised that overriding finalize() didn't fix it, since I ran into
>>>> similar problems on my own cluster and created a SlackPipeline class
>>>> that overrides that method).
>>>>
>>>> J
>>>>
>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <dpo5...@gmail.com>
>>>> wrote:
>>>>
>>>>> Josh,
>>>>>
>>>>> Those adjustments (overriding finalize() with an empty block when
>>>>> creating the SparkPipeline, and running via pipeline.run() instead of
>>>>> done()) did not appear to stop the tmp directory from being removed
>>>>> at the end of the job execution. However, I can confirm that I see
>>>>> the stage output for the two output directories, complete with
>>>>> Parquet files partitioned by key. But neither those files nor
>>>>> anything else ever makes it to the output directory, which is never
>>>>> even created.
>>>>>
>>>>> Thanks,
>>>>> Dave
>>>>>
>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <dpo5...@gmail.com> wrote:
>>>>>
>>>>>> Hey Josh,
>>>>>>
>>>>>> Thanks for taking a look. I can definitely play with that on Monday
>>>>>> when I'm back at work.
>>>>>>
>>>>>> Thanks,
>>>>>> Dave
>>>>>>
>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <josh.wi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey David,
>>>>>>>
>>>>>>> Looking at the code, the problem isn't obvious to me, but there are
>>>>>>> only two places things could be going wrong: writing the data out
>>>>>>> of Spark into the temp directory where intermediate outputs get
>>>>>>> stored (i.e., Spark isn't writing the data out for some reason), or
>>>>>>> moving the data from the temp directory to the final location. The
>>>>>>> temp data is usually deleted at the end of a Crunch run, but you
>>>>>>> can disable this by a) not calling Pipeline.cleanup or
>>>>>>> Pipeline.done at the end of the run, and b) subclassing
>>>>>>> SparkPipeline with dummy code that overrides the finalize() method
>>>>>>> (which is implemented in the top-level DistributedPipeline abstract
>>>>>>> base class) to be a no-op. Is that easy to try, to see if we can
>>>>>>> isolate the source of the error? Otherwise I can play with this a
>>>>>>> bit tomorrow on my own cluster.
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <dpo5...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Awesome. Thanks for taking a look!
>>>>>>>>
>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <josh.wi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hrm, that sounds like something is wrong with the commit
>>>>>>>>> operation on the Spark side; let me take a look at it this
>>>>>>>>> evening!
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <dpo5...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Are there any known issues with AvroParquetPathPerKeyTarget when
>>>>>>>>>> running a Spark pipeline? When I run my pipeline with MapReduce,
>>>>>>>>>> I get output; when I run with Spark, the preceding step where I
>>>>>>>>>> list my partition keys out (because we use them to add
>>>>>>>>>> partitions to Hive) shows data being present, but the output
>>>>>>>>>> directory remains empty. This behavior occurs when targeting
>>>>>>>>>> both HDFS and S3 directly.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Dave
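[Editor's note for archive readers: the debugging trick Josh describes -- subclassing the pipeline and overriding its cleanup hook as a no-op so the temp directory survives the run -- boils down to the pattern sketched below. These classes are simplified, hypothetical stand-ins written purely for illustration; the real types are Crunch's DistributedPipeline/SparkPipeline, and the real hook discussed above is finalize().]

```java
// Hypothetical stand-in for a pipeline whose cleanup hook removes the
// /tmp/crunch/p<stage> intermediate output at the end of a run. In real
// Crunch this role is played by DistributedPipeline; here we just flip a
// flag so the effect is observable.
class StandInPipeline {
    boolean tempDirDeleted = false;

    // The cleanup hook that a normal run invokes after the jobs finish.
    protected void deleteTempDir() {
        tempDirDeleted = true;
    }

    public void run() {
        // ... plan and execute the jobs here ...
        deleteTempDir();
    }
}

// Debugging subclass: override the hook with an empty body so the
// intermediate outputs stay on disk for inspection after the run.
class KeepTempPipeline extends StandInPipeline {
    @Override
    protected void deleteTempDir() {
        // intentionally a no-op -- leave the temp directory in place
    }
}

public class Main {
    public static void main(String[] args) {
        StandInPipeline normal = new StandInPipeline();
        normal.run();
        System.out.println("normal run deleted temp dir: " + normal.tempDirDeleted);

        StandInPipeline debug = new KeepTempPipeline();
        debug.run();
        System.out.println("debug run deleted temp dir: " + debug.tempDirDeleted);
    }
}
```

As the thread notes, keeping the temp data also requires skipping Pipeline.done()/cleanup() (using run() instead), since those trigger the deletion path directly.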