The answer from my employer on uploading a patch seems to be no. I'm double-checking on that, though.
On Fri, May 25, 2018, 4:40 PM Josh Wills <[email protected]> wrote:

> CRUNCH-670 is the issue, FWIW
>
> On Fri, May 25, 2018 at 1:39 PM, Josh Wills <[email protected]> wrote:
>
>> Ah, of course-- nice detective work! Can you send me the code so I can
>> patch it in?
>>
>> On Fri, May 25, 2018 at 1:33 PM, David Ortiz <[email protected]> wrote:
>>
>>> Josh,
>>>
>>> When I dug into the code a little more, I saw that both
>>> AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use
>>> "part" as the default when building the basePath if there is no value
>>> set for "mapreduce.output.basename". My guess is that when running via
>>> a SparkPipeline that value is not set. I changed my local copy to use
>>> "out0" as the default value instead of "part", and the job was able to
>>> write output successfully.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On Fri, May 25, 2018 at 2:20 PM David Ortiz <[email protected]> wrote:
>>>
>>>> Josh,
>>>>
>>>> After cleaning up the logs a little bit, I noticed this:
>>>>
>>>> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p12/out0
>>>> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p13/out0
>>>>
>>>> When I look in those tmp directories while the job runs, they are
>>>> actually writing out to the subdirectory "part" rather than "out0",
>>>> so that would be another reason why it's having issues. Any thoughts
>>>> on where that output path is coming from? If you point me in the
>>>> right direction I can try to figure it out.
>>>>
>>>> Thanks,
>>>> Dave
>>>>
>>>> On Fri, May 25, 2018 at 2:01 PM David Ortiz <[email protected]> wrote:
>>>>
>>>>> Hey Josh,
>>>>>
>>>>> I am still messing around with it a little bit, but I still seem to
>>>>> be getting the same behavior even after rebuilding with the patch.
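The default-basename mismatch Dave describes can be sketched roughly like this (a minimal stand-in: a plain Map plays the role of Hadoop's Configuration, and deriveBasePath is a hypothetical helper, not the actual Crunch code; only the property name "mapreduce.output.basename" and the "part"/"out0" values come from the thread):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the behavior described above: the output format derives its
// base path from "mapreduce.output.basename", falling back to a hard-coded
// default when the property is unset. A plain Map stands in for Hadoop's
// Configuration; deriveBasePath is a hypothetical helper.
public class BasenameSketch {

    static String deriveBasePath(Map<String, String> conf, String defaultValue) {
        // MapReduce sets "mapreduce.output.basename"; per the thread, a
        // SparkPipeline apparently does not, so the fallback gets used.
        return conf.getOrDefault("mapreduce.output.basename", defaultValue);
    }

    public static void main(String[] args) {
        Map<String, String> sparkLikeConf = new HashMap<>(); // property unset
        Map<String, String> mrLikeConf = new HashMap<>();
        mrLikeConf.put("mapreduce.output.basename", "out0"); // set by MapReduce

        // With "part" as the default, the writer ends up under .../part while
        // the target later looks under .../out0 -- hence "Nothing to copy".
        System.out.println(deriveBasePath(sparkLikeConf, "part")); // part
        System.out.println(deriveBasePath(mrLikeConf, "part"));    // out0
        // Dave's local fix: make the default match what the target expects.
        System.out.println(deriveBasePath(sparkLikeConf, "out0")); // out0
    }
}
```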
>>>>>
>>>>> Thanks,
>>>>> Dave
>>>>>
>>>>> On Thu, May 24, 2018 at 1:50 AM Josh Wills <[email protected]> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> Take a look at CRUNCH-670; I think that patch fixes the problem in
>>>>>> the most minimal way I can think of.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CRUNCH-670
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <[email protected]> wrote:
>>>>>>
>>>>>>> I think that must be it, Dave, but I can't for the life of me
>>>>>>> figure out where in the code that's happening. Will take another
>>>>>>> look tonight.
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <[email protected]> wrote:
>>>>>>>
>>>>>>>> Josh,
>>>>>>>>
>>>>>>>> Is there any chance that somehow the output path to
>>>>>>>> AvroParquetPathPerKey is getting twisted up when it goes through
>>>>>>>> the compilation step? Watching it while it runs, the output in the
>>>>>>>> /tmp/crunch/p<stage> directory basically looks like what I would
>>>>>>>> expect to see in the output directory. AvroPathPerKeyTarget also
>>>>>>>> showed similar behavior when I was messing around to see if that
>>>>>>>> would work.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hrm, got it-- now at least I know where to look (although I'm
>>>>>>>>> surprised that overriding finalize() didn't fix it, as I ran into
>>>>>>>>> similar problems with my own cluster and created a SlackPipeline
>>>>>>>>> class that overrides that method).
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Josh,
>>>>>>>>>>
>>>>>>>>>> Those adjustments (overriding finalize() with an empty block
>>>>>>>>>> when creating the SparkPipeline, and running with
>>>>>>>>>> pipeline.run() instead of done()) did not appear to stop the
>>>>>>>>>> tmp directory from being removed at the end of the job
>>>>>>>>>> execution. However, I can confirm that I see the stage output
>>>>>>>>>> for the two output directories, complete with Parquet files
>>>>>>>>>> partitioned by key. But neither they, nor anything else, ever
>>>>>>>>>> make it to the output directory, which is not even created.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Dave
>>>>>>>>>>
>>>>>>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Josh,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for taking a look. I can definitely play with that on
>>>>>>>>>>> Monday when I'm back at work.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey David,
>>>>>>>>>>>>
>>>>>>>>>>>> Looking at the code, the problem isn't obvious to me, but
>>>>>>>>>>>> there are only two places things could be going wrong:
>>>>>>>>>>>> writing the data out of Spark into the temp directory where
>>>>>>>>>>>> intermediate outputs get stored (i.e., Spark isn't writing
>>>>>>>>>>>> the data out for some reason), or moving the data from the
>>>>>>>>>>>> temp directory to the final location.
>>>>>>>>>>>> The temp data is usually deleted at the end of a Crunch run,
>>>>>>>>>>>> but you can disable this by a) not calling Pipeline.cleanup
>>>>>>>>>>>> or Pipeline.done at the end of the run, and b) subclassing
>>>>>>>>>>>> SparkPipeline with dummy code that overrides the finalize()
>>>>>>>>>>>> method (which is implemented in the top-level
>>>>>>>>>>>> DistributedPipeline abstract base class) to be a no-op. Is
>>>>>>>>>>>> that easy to try out, to see if we can isolate the source of
>>>>>>>>>>>> the error? Otherwise I can play with this a bit tomorrow on
>>>>>>>>>>>> my own cluster.
>>>>>>>>>>>>
>>>>>>>>>>>> J
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Awesome. Thanks for taking a look!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hrm, that sounds like something is wrong with the commit
>>>>>>>>>>>>>> operation on the Spark side; let me take a look at it this
>>>>>>>>>>>>>> evening!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> J
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there any known issues with AvroParquetPathPerKeyTarget
>>>>>>>>>>>>>>> when running a Spark pipeline? When I run my pipeline with
>>>>>>>>>>>>>>> MapReduce, I get output; when I run with Spark, the step
>>>>>>>>>>>>>>> before, where I list my partition keys out (because we use
>>>>>>>>>>>>>>> them to add partitions to Hive), shows data being present,
>>>>>>>>>>>>>>> but the output directory remains empty. This behavior
>>>>>>>>>>>>>>> occurs when targeting both HDFS and S3 directly.
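The debugging trick Josh describes above (keeping intermediate output around by making the cleanup hook a no-op) looks roughly like this; BasePipeline and DebugPipeline here are illustrative stand-ins for Crunch's DistributedPipeline and SparkPipeline, not the real API:

```java
// Sketch of the approach described above: subclass the pipeline and override
// its cleanup hook with a no-op so the temp directory survives the run for
// inspection. These classes are stand-ins for the real Crunch classes.
public class NoOpCleanupSketch {

    static class BasePipeline {
        boolean tempDeleted = false;

        // Normally removes /tmp/crunch-<id>/ when the pipeline finishes.
        protected void cleanUpTemp() {
            tempDeleted = true;
        }

        public void done() {
            // ... run the jobs, then clean up intermediate output
            cleanUpTemp();
        }
    }

    static class DebugPipeline extends BasePipeline {
        @Override
        protected void cleanUpTemp() {
            // no-op: leave /tmp/crunch-<id>/ in place so the per-stage
            // subdirectories (e.g. p12/out0 vs. p12/part) can be inspected
        }
    }

    public static void main(String[] args) {
        BasePipeline normal = new BasePipeline();
        normal.done();
        System.out.println("normal tempDeleted = " + normal.tempDeleted); // true

        BasePipeline debug = new DebugPipeline();
        debug.done();
        System.out.println("debug tempDeleted = " + debug.tempDeleted);   // false
    }
}
```

As the thread notes, this alone did not explain the missing final output here, but it makes the copy-from-temp step observable.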
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Dave
