Re: AvroParquetPathPerKeyTarget with Spark

David Ortiz Mon, 23 Sep 2019 10:40:58 -0700

Josh,

     To circle back to this after forever, I finally was able to get permission 
to hand this off.  I attached it to CRUNCH-670.


Thanks,
     Dave

> On May 25, 2018, at 7:21 PM, David Ortiz <[email protected]> wrote:
> 
> The answer seems to be no from my employer on uploading a patch. Double 
> checking on that though. 
> 
> On Fri, May 25, 2018, 4:40 PM Josh Wills <[email protected] 
> <mailto:[email protected]>> wrote:
> CRUNCH-670 is the issue, FWIW
> 
> On Fri, May 25, 2018 at 1:39 PM, Josh Wills <[email protected] 
> <mailto:[email protected]>> wrote:
> Ah, of course-- nice detective work! Can you send me the code so I can patch 
> it in?
> 
> On Fri, May 25, 2018 at 1:33 PM, David Ortiz <[email protected] 
> <mailto:[email protected]>> wrote:
> Josh,
> 
>      When I dug into the code a little more, I saw that both 
> AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use "part" 
> as a default when creating the basePath when there is not a value for 
> "mapreduce.output.basename".  My guess is that when running via a 
> SparkPipeline that value is not set.  I changed my local copy to use out0 as 
> the defaultValue instead of part, and the job was able to write output 
> successfully.
> 
> Thanks,
>     Dave
> 
> On Fri, May 25, 2018 at 2:20 PM David Ortiz <[email protected] 
> <mailto:[email protected]>> wrote:
> Josh,
> 
>      After cleaning up the logs a little bit I noticed this.
> 
> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from 
> /tmp/crunch-1037479188/p12/out0
> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from 
> /tmp/crunch-1037479188/p13/out0
> 
> When I look in those tmp directories while the job runs, they are actually 
> writing out to the subdirectory part rather than out0, so that would be 
> another reason why it's having issues.  Any thoughts on where that output 
> path is coming from?  If you point me in the right direction I can try to 
> figure it out.
> 
> Thanks,
>      Dave
> 
> On Fri, May 25, 2018 at 2:01 PM David Ortiz <[email protected] 
> <mailto:[email protected]>> wrote:
> Hey Josh,
> 
>      I am still messing around with it a little bit, but I still seem to be 
> getting the same behavior even after rebuilding with the patch.
> 
> Thanks,
>      Dave
> 
> On Thu, May 24, 2018 at 1:50 AM Josh Wills <[email protected] 
> <mailto:[email protected]>> wrote:
> David,
> 
> Take a look at CRUNCH-670; I think that patch fixes the problem in the most 
> minimal way I can think of.
> 
> https://issues.apache.org/jira/browse/CRUNCH-670 
> <https://issues.apache.org/jira/browse/CRUNCH-670>
> 
> J
> 
> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <[email protected] 
> <mailto:[email protected]>> wrote:
> I think that must be it Dave, but I can't for the life of me figure out where 
> in the code that's happening. Will take another look tonight.
> 
> J
> 
> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <[email protected] 
> <mailto:[email protected]>> wrote:
> Josh,
> 
>      Is there any chance that somehow the output path to 
> AvroParquetPathPerKey is getting twisted up when it goes through the 
> compilation step?  Watching it while it runs, the output in the 
> /tmp/crunch/p<stage> directory basically looks like what I would expect it to 
> do in the output directory.  It seems that AvroPathPerKeyTarget also was 
> showing similar behavior when I was messing around to see if that would work.
> 
> Thanks,
>      Dave
> 
> On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected] 
> <mailto:[email protected]>> wrote:
> Hrm, got it-- now at least I know where to look (although surprised that 
> overriding the finalize() didn't fix it, as I ran into similar problems with 
> my own cluster and created a SlackPipeline class that overrides that method.)
> 
> 
> J
> 
> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected] 
> <mailto:[email protected]>> wrote:
> Josh,
> 
>      Those adjustments did not appear to do anything to stop the tmp 
> directory from being removed at the end of the job execution (override 
> finalize with an empty block when creating SparkPipeline and run using 
> pipeline.run() instead of done()).  However, I can confirm that I see the 
> stage output for the two output directories complete with parquet files 
> partitioned by key.  However, neither they, nor anything else ever make it to 
> the output directory, which is not even created.
> 
> Thanks,
>      Dave
> 
> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected] 
> <mailto:[email protected]>> wrote:
> Hey Josh,
> 
>      Thanks for taking a look.  I can definitely play with that on Monday 
> when I'm back at work.
> 
> Thanks,
>      Dave
> 
> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected] 
> <mailto:[email protected]>> wrote:
> Hey David,
> 
> Looking at the code, the problem isn't obvious to me, but there are only two 
> places things could be going wrong: writing the data out of Spark into the 
> temp directory where intermediate outputs get stored (i.e., Spark isn't 
> writing the data out for some reason) or moving the data from the temp 
> directory to the final location. The temp data is usually deleted at the end 
> of a Crunch run, but you can disable this by a) not calling Pipeline.cleanup 
> or Pipeline.done at the end of the run and b) subclassing SparkPipeline with 
> dummy code that overrides the finalize() method (which is implemented in the 
> top-level DistributedPipeline abstract base class) to be a no-op. Is that 
> easy to try out to see if we can isolate the source of the error? Otherwise I 
> can play with this a bit tomorrow on my own cluster.
> 
> J
> 
> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected] 
> <mailto:[email protected]>> wrote:
> Awesome.  Thanks for taking a look!
> 
> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected] 
> <mailto:[email protected]>> wrote:
> hrm, that sounds like something is wrong with the commit operation on the 
> Spark side; let me take a look at it this evening!
> 
> J
> 
> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected] 
> <mailto:[email protected]>> wrote:
> Hello,
> 
>      Are there any known issues with the AvroParquetPathPerKeyTarget when 
> running a Spark pipeline?  When I run my pipeline with mapreduce, I get 
> output, and when I run with spark, the step before where I list my partition 
> keys out (because we use them to add partitions to hive) lists data being 
> present, but the output directory remains empty.  This behavior is occurring 
> targeting both HDFS and S3 directly.
> 
> Thanks,
>      Dave
> 
> 
> 
> 
> 
> 
>

Re: AvroParquetPathPerKeyTarget with Spark

Reply via email to