Yay!! Thanks Dave!

On Mon, Sep 23, 2019 at 10:40 AM David Ortiz <[email protected]> wrote:
> Josh,
>
> To circle back to this after forever, I finally was able to get permission
> to hand this off. I attached it to CRUNCH-670.
>
> Thanks,
> Dave
>
> On May 25, 2018, at 7:21 PM, David Ortiz <[email protected]> wrote:
>
> The answer seems to be no from my employer on uploading a patch. Double
> checking on that, though.
>
> On Fri, May 25, 2018, 4:40 PM Josh Wills <[email protected]> wrote:
>
>> CRUNCH-670 is the issue, FWIW.
>>
>> On Fri, May 25, 2018 at 1:39 PM, Josh Wills <[email protected]> wrote:
>>
>>> Ah, of course-- nice detective work! Can you send me the code so I can
>>> patch it in?
>>>
>>> On Fri, May 25, 2018 at 1:33 PM, David Ortiz <[email protected]> wrote:
>>>
>>>> Josh,
>>>>
>>>> When I dug into the code a little more, I saw that both
>>>> AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use
>>>> "part" as the default when creating the basePath if there is no value
>>>> for "mapreduce.output.basename". My guess is that when running via a
>>>> SparkPipeline that value is not set. I changed my local copy to use
>>>> "out0" as the defaultValue instead of "part", and the job was able to
>>>> write output successfully.
>>>>
>>>> Thanks,
>>>> Dave
>>>>
>>>> On Fri, May 25, 2018 at 2:20 PM David Ortiz <[email protected]> wrote:
>>>>
>>>>> Josh,
>>>>>
>>>>> After cleaning up the logs a little bit I noticed this:
>>>>>
>>>>> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p12/out0
>>>>> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p13/out0
>>>>>
>>>>> When I look in those tmp directories while the job runs, they are
>>>>> actually writing out to the subdirectory "part" rather than "out0", so
>>>>> that would be another reason why it's having issues. Any thoughts on
>>>>> where that output path is coming from? If you point me in the right
>>>>> direction I can try to figure it out.
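The mismatch Dave describes can be sketched with a self-contained toy: a plain Map stands in for the Hadoop Configuration, and the partition key is a made-up example. This is not the actual Crunch source, just an illustration of why an unset "mapreduce.output.basename" under Spark sends records to a subdirectory the target never checks.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the basePath defaulting that Dave describes. In the real
// output formats the lookup is against the Hadoop Configuration; a Map
// stands in here so the sketch is self-contained.
public class BasePathSketch {

  // AvroPathPerKeyTarget copies files from ".../<key>/out0". Before the
  // fix the fallback here was "part", and under a SparkPipeline the
  // property is never set, so the copy step found nothing.
  static String basePath(Map<String, String> conf, String outputDir, String key) {
    String baseName = conf.getOrDefault("mapreduce.output.basename", "out0"); // was "part"
    return outputDir + "/" + key + "/" + baseName;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    // Property unset, as under a SparkPipeline; "some-key" is hypothetical.
    System.out.println(basePath(conf, "/tmp/crunch-1037479188/p12", "some-key"));
    // -> /tmp/crunch-1037479188/p12/some-key/out0
  }
}
```

With the old "part" fallback, the same call would return a path ending in `/part`, which is exactly the subdirectory Dave saw the data land in while the target looked in `/out0`.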
>>>>>
>>>>> Thanks,
>>>>> Dave
>>>>>
>>>>> On Fri, May 25, 2018 at 2:01 PM David Ortiz <[email protected]> wrote:
>>>>>
>>>>>> Hey Josh,
>>>>>>
>>>>>> I am still messing around with it a little bit, but I still seem to
>>>>>> be getting the same behavior even after rebuilding with the patch.
>>>>>>
>>>>>> Thanks,
>>>>>> Dave
>>>>>>
>>>>>> On Thu, May 24, 2018 at 1:50 AM Josh Wills <[email protected]> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> Take a look at CRUNCH-670; I think that patch fixes the problem in
>>>>>>> the most minimal way I can think of.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-670
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think that must be it, Dave, but I can't for the life of me figure
>>>>>>>> out where in the code that's happening. Will take another look
>>>>>>>> tonight.
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Josh,
>>>>>>>>>
>>>>>>>>> Is there any chance that somehow the output path to
>>>>>>>>> AvroParquetPathPerKey is getting twisted up when it goes through
>>>>>>>>> the compilation step? Watching it while it runs, the output in the
>>>>>>>>> /tmp/crunch/p<stage> directory basically looks like what I would
>>>>>>>>> expect it to do in the output directory. It seems that
>>>>>>>>> AvroPathPerKeyTarget also was showing similar behavior when I was
>>>>>>>>> messing around to see if that would work.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hrm, got it-- now at least I know where to look (although I'm
>>>>>>>>>> surprised that overriding the finalize() didn't fix it, as I ran
>>>>>>>>>> into similar problems with my own cluster and created a
>>>>>>>>>> SlackPipeline class that overrides that method.)
>>>>>>>>>>
>>>>>>>>>> J
>>>>>>>>>>
>>>>>>>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Josh,
>>>>>>>>>>>
>>>>>>>>>>> Those adjustments did not appear to do anything to stop the tmp
>>>>>>>>>>> directory from being removed at the end of the job execution
>>>>>>>>>>> (overriding finalize with an empty block when creating the
>>>>>>>>>>> SparkPipeline and running with pipeline.run() instead of done()).
>>>>>>>>>>> However, I can confirm that I see the stage output for the two
>>>>>>>>>>> output directories, complete with parquet files partitioned by
>>>>>>>>>>> key. But neither they, nor anything else, ever make it to the
>>>>>>>>>>> output directory, which is not even created.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Josh,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking a look. I can definitely play with that on
>>>>>>>>>>>> Monday when I'm back at work.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Dave
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking at the code, the problem isn't obvious to me, but there
>>>>>>>>>>>>> are only two places things could be going wrong: writing the
>>>>>>>>>>>>> data out of Spark into the temp directory where intermediate
>>>>>>>>>>>>> outputs get stored (i.e., Spark isn't writing the data out for
>>>>>>>>>>>>> some reason), or moving the data from the temp directory to the
>>>>>>>>>>>>> final location. The temp data is usually deleted at the end of
>>>>>>>>>>>>> a Crunch run, but you can disable this by a) not calling
>>>>>>>>>>>>> Pipeline.cleanup or Pipeline.done at the end of the run and
>>>>>>>>>>>>> b) subclassing SparkPipeline with dummy code that overrides the
>>>>>>>>>>>>> finalize() method (which is implemented in the top-level
>>>>>>>>>>>>> DistributedPipeline abstract base class) to be a no-op. Is that
>>>>>>>>>>>>> easy to try out to see if we can isolate the source of the
>>>>>>>>>>>>> error? Otherwise I can play with this a bit tomorrow on my own
>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> J
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Awesome. Thanks for taking a look!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hrm, that sounds like something is wrong with the commit
>>>>>>>>>>>>>>> operation on the Spark side; let me take a look at it this
>>>>>>>>>>>>>>> evening!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there any known issues with the
>>>>>>>>>>>>>>>> AvroParquetPathPerKeyTarget when running a Spark pipeline?
>>>>>>>>>>>>>>>> When I run my pipeline with mapreduce, I get output; when I
>>>>>>>>>>>>>>>> run with spark, the step before, where I list my partition
>>>>>>>>>>>>>>>> keys out (because we use them to add partitions to hive),
>>>>>>>>>>>>>>>> shows data being present, but the output directory remains
>>>>>>>>>>>>>>>> empty. This behavior occurs when targeting both HDFS and S3
>>>>>>>>>>>>>>>> directly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Dave
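The debugging trick discussed earlier in the thread (subclassing SparkPipeline so the finalize() cleanup inherited from DistributedPipeline becomes a no-op, preserving the temp directories for inspection) can be modeled with a self-contained toy. The class names below are stand-ins, not the real Crunch API; a real version would extend SparkPipeline and override its actual cleanup method.

```java
// Toy model of the temp-directory-preserving subclass Josh suggests.
// ToyPipeline stands in for DistributedPipeline/SparkPipeline; the
// tempDeleted flag stands in for the /tmp/crunch-* directories.
class ToyPipeline {
  boolean tempDeleted = false;

  // Executes the plan; leaves intermediate output in place.
  void run() { }

  // Runs the plan and then performs end-of-run cleanup, like done().
  void done() {
    run();
    finalizeRun();
  }

  // Stands in for the cleanup hook that deletes the temp directories.
  void finalizeRun() {
    tempDeleted = true;
  }
}

class KeepTempPipeline extends ToyPipeline {
  @Override
  void finalizeRun() {
    // Deliberately empty: intermediate per-key output stays on disk
    // so each stage's files can be inspected after the job finishes.
  }
}

public class KeepTempDemo {
  public static void main(String[] args) {
    ToyPipeline plain = new ToyPipeline();
    plain.done();
    System.out.println("plain tempDeleted = " + plain.tempDeleted); // true

    ToyPipeline keep = new KeepTempPipeline();
    keep.done();
    System.out.println("keep  tempDeleted = " + keep.tempDeleted);  // false
  }
}
```

Calling run() instead of done(), as Dave tried, skips the cleanup hook in this model too; the override matters when some code path still triggers cleanup at the end of the job.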
