Ah, of course-- nice detective work! Can you send me the code so I can patch it in?
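The default-basename behavior described in the thread below can be sketched with a stdlib-only stand-in (a plain Map in place of Hadoop's Configuration; the helper method is illustrative, not Crunch's actual code — only the key "mapreduce.output.basename" and the defaults "part"/"out0" come from the thread):

```java
import java.util.HashMap;
import java.util.Map;

public class BasenameDefaultSketch {
    // Stand-in for Configuration.get(key, defaultValue), which the thread
    // says AvroPathPerKeyOutputFormat uses when building its basePath.
    static String outputBasename(Map<String, String> conf) {
        // Falls back to "part" when "mapreduce.output.basename" is unset,
        // which is apparently the case when running via a SparkPipeline.
        return conf.getOrDefault("mapreduce.output.basename", "part");
    }

    public static void main(String[] args) {
        Map<String, String> mrJob = new HashMap<>();
        mrJob.put("mapreduce.output.basename", "out0"); // MapReduce sets this
        System.out.println(outputBasename(mrJob));      // out0

        Map<String, String> sparkJob = new HashMap<>(); // key never set
        System.out.println(outputBasename(sparkJob));   // part -- so files land
        // under .../part while the target copies from .../out0
    }
}
```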
On Fri, May 25, 2018 at 1:33 PM, David Ortiz <dpo5...@gmail.com> wrote:

> Josh,
>
> When I dug into the code a little more, I saw that both AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use "part" as the default when creating the basePath when there is no value for "mapreduce.output.basename". My guess is that when running via a SparkPipeline that value is not set. I changed my local copy to use out0 as the defaultValue instead of part, and the job was able to write output successfully.
>
> Thanks,
> Dave
>
> On Fri, May 25, 2018 at 2:20 PM David Ortiz <dpo5...@gmail.com> wrote:
>
>> Josh,
>>
>> After cleaning up the logs a little bit, I noticed this:
>>
>> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p12/out0
>> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p13/out0
>>
>> When I look in those tmp directories while the job runs, they are actually writing out to the subdirectory part rather than out0, so that would be another reason why it's having issues. Any thoughts on where that output path is coming from? If you point me in the right direction, I can try to figure it out.
>>
>> Thanks,
>> Dave
>>
>> On Fri, May 25, 2018 at 2:01 PM David Ortiz <dpo5...@gmail.com> wrote:
>>
>>> Hey Josh,
>>>
>>> I am still messing around with it a little bit, but I seem to be getting the same behavior even after rebuilding with the patch.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On Thu, May 24, 2018 at 1:50 AM Josh Wills <josh.wi...@gmail.com> wrote:
>>>
>>>> David,
>>>>
>>>> Take a look at CRUNCH-670; I think that patch fixes the problem in the most minimal way I can think of.
>>>>
>>>> https://issues.apache.org/jira/browse/CRUNCH-670
>>>>
>>>> J
>>>>
>>>> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <josh.wi...@gmail.com> wrote:
>>>>
>>>>> I think that must be it, Dave, but I can't for the life of me figure out where in the code that's happening. Will take another look tonight.
>>>>>
>>>>> J
>>>>>
>>>>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <dpo5...@gmail.com> wrote:
>>>>>
>>>>>> Josh,
>>>>>>
>>>>>> Is there any chance the output path to AvroParquetPathPerKey is getting twisted up when it goes through the compilation step? Watching it while it runs, the output in the /tmp/crunch/p<stage> directory looks basically like what I would expect to see in the output directory. AvroPathPerKeyTarget also showed similar behavior when I was experimenting to see if that would work.
>>>>>>
>>>>>> Thanks,
>>>>>> Dave
>>>>>>
>>>>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <josh.wi...@gmail.com> wrote:
>>>>>>
>>>>>>> Hrm, got it-- now at least I know where to look (although I'm surprised that overriding finalize() didn't fix it, as I ran into similar problems on my own cluster and created a SlackPipeline class that overrides that method).
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <dpo5...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Josh,
>>>>>>>>
>>>>>>>> Those adjustments (overriding finalize with an empty block when creating the SparkPipeline, and running using pipeline.run() instead of done()) did not appear to do anything to stop the tmp directory from being removed at the end of the job execution. However, I can confirm that I see the stage output for the two output directories, complete with parquet files partitioned by key.
>>>>>>>> However, neither they, nor anything else, ever make it to the output directory, which is not even created.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <dpo5...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Josh,
>>>>>>>>>
>>>>>>>>> Thanks for taking a look. I can definitely play with that on Monday when I'm back at work.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <josh.wi...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey David,
>>>>>>>>>>
>>>>>>>>>> Looking at the code, the problem isn't obvious to me, but there are only two places things could be going wrong: writing the data out of Spark into the temp directory where intermediate outputs get stored (i.e., Spark isn't writing the data out for some reason), or moving the data from the temp directory to the final location. The temp data is usually deleted at the end of a Crunch run, but you can disable this by a) not calling Pipeline.cleanup or Pipeline.done at the end of the run, and b) subclassing SparkPipeline with dummy code that overrides the finalize() method (which is implemented in the top-level DistributedPipeline abstract base class) to be a no-op. Is that easy to try out, to see if we can isolate the source of the error? Otherwise I can play with this a bit tomorrow on my own cluster.
>>>>>>>>>>
>>>>>>>>>> J
>>>>>>>>>>
>>>>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <dpo5...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Awesome. Thanks for taking a look!
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <josh.wi...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hrm, that sounds like something is wrong with the commit operation on the Spark side; let me take a look at it this evening!
>>>>>>>>>>>>
>>>>>>>>>>>> J
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <dpo5...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are there any known issues with the AvroParquetPathPerKeyTarget when running a Spark pipeline? When I run my pipeline with MapReduce, I get output. When I run with Spark, the preceding step where I list my partition keys out (because we use them to add partitions to Hive) shows data being present, but the output directory remains empty. This behavior occurs when targeting both HDFS and S3 directly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Dave
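Taken together, the thread describes a path mismatch: Spark writes intermediate output under the default basename "part" while the target copies from "out0", producing the "Nothing to copy" warnings. A self-contained java.nio sketch of that mismatch (the directory and file names echo the thread; this is a filesystem stand-in, not Crunch code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NothingToCopySketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for a Crunch stage temp dir like /tmp/crunch-1037479188/p12
        Path stage = Files.createTempDirectory("crunch-demo-p12");

        // Spark writes under the default basename "part"...
        Path partDir = Files.createDirectories(stage.resolve("part"));
        Files.writeString(partDir.resolve("part-r-00000.parquet"), "data");

        // ...while the target looks under "out0" and finds nothing, so the
        // final output directory is never populated (or even created).
        Path out0 = stage.resolve("out0");
        if (Files.notExists(out0)) {
            System.out.println("Nothing to copy from " + out0);
        }
    }
}
```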