Josh,

     When I dug into the code a little more, I saw that both
AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use "part"
as a default when creating the basePath when there is no value for
"mapreduce.output.basename".  My guess is that when running via a
SparkPipeline that value is not set.  I changed my local copy to use "out0"
as the defaultValue instead of "part", and the job was able to write output
successfully.
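For reference, the default-basename behavior I'm describing looks roughly
like the sketch below. This is a simplified illustration using a plain Map
in place of a Hadoop Configuration; the class and method names are
hypothetical, not the actual Crunch source:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the basename lookup in the PathPerKey output
// formats: when "mapreduce.output.basename" is unset, a hard-coded
// default is used to build the basePath.
public class BasenameSketch {
    static final String BASENAME_KEY = "mapreduce.output.basename";

    // Mirrors conf.get(key, defaultValue) on a plain Map for illustration.
    static String basePath(Map<String, String> conf, String defaultValue) {
        return conf.getOrDefault(BASENAME_KEY, defaultValue);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // With "part" as the default, a Spark-run job that never sets the
        // property writes to .../part, while AvroPathPerKeyTarget later
        // looks for .../out0 and finds nothing to copy.
        System.out.println(basePath(conf, "part"));  // part
        // My local change: default to "out0" instead.
        System.out.println(basePath(conf, "out0")); // out0
    }
}
```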

Thanks,
    Dave

On Fri, May 25, 2018 at 2:20 PM David Ortiz <[email protected]> wrote:

> Josh,
>
>      After cleaning up the logs a little bit I noticed this.
>
> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from
> /tmp/crunch-1037479188/p12/out0
> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from
> /tmp/crunch-1037479188/p13/out0
>
> When I look in those tmp directories while the job runs, the output is
> actually being written to the subdirectory "part" rather than "out0", so
> that would be another reason why it's having issues.  Any thoughts on
> where that output path is coming from?  If you point me in the right
> direction I can try to figure it out.
>
> Thanks,
>      Dave
>
> On Fri, May 25, 2018 at 2:01 PM David Ortiz <[email protected]> wrote:
>
>> Hey Josh,
>>
>>      I am still messing around with it a little bit, but I seem to be
>> getting the same behavior even after rebuilding with the patch.
>>
>> Thanks,
>>      Dave
>>
>> On Thu, May 24, 2018 at 1:50 AM Josh Wills <[email protected]> wrote:
>>
>>> David,
>>>
>>> Take a look at CRUNCH-670; I think that patch fixes the problem in the
>>> most minimal way I can think of.
>>>
>>> https://issues.apache.org/jira/browse/CRUNCH-670
>>>
>>> J
>>>
>>> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <[email protected]>
>>> wrote:
>>>
>>>> I think that must be it, Dave, but I can't for the life of me figure
>>>> out where in the code that's happening. Will take another look tonight.
>>>>
>>>> J
>>>>
>>>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <[email protected]> wrote:
>>>>
>>>>> Josh,
>>>>>
>>>>>      Is there any chance that somehow the output path to
>>>>> AvroParquetPathPerKey is getting twisted up when it goes through the
>>>>> compilation step?  Watching it while it runs, the output in the
>>>>> /tmp/crunch/p<stage> directory basically looks like what I would expect
>>>>> to see in the output directory.  AvroPathPerKeyTarget also showed
>>>>> similar behavior when I was messing around to see if that would work.
>>>>>
>>>>> Thanks,
>>>>>      Dave
>>>>>
>>>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hrm, got it -- now at least I know where to look (although I'm
>>>>>> surprised that overriding finalize() didn't fix it, as I ran into
>>>>>> similar problems on my own cluster and created a SlackPipeline class
>>>>>> that overrides that method).
>>>>>>
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Josh,
>>>>>>>
>>>>>>>      Those adjustments did not appear to do anything to stop the tmp
>>>>>>> directory from being removed at the end of the job execution (I
>>>>>>> overrode finalize with an empty block when creating the SparkPipeline
>>>>>>> and ran using pipeline.run() instead of done()).  I can confirm that
>>>>>>> I see the stage output for the two output directories, complete with
>>>>>>> parquet files partitioned by key.  However, none of that output ever
>>>>>>> makes it to the final output directory, which is never even created.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>      Dave
>>>>>>>
>>>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Josh,
>>>>>>>>
>>>>>>>>      Thanks for taking a look.  I can definitely play with that on
>>>>>>>> Monday when I'm back at work.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>      Dave
>>>>>>>>
>>>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey David,
>>>>>>>>>
>>>>>>>>> Looking at the code, the problem isn't obvious to me, but there
>>>>>>>>> are only two places things could be going wrong: writing the data
>>>>>>>>> out of Spark into the temp directory where intermediate outputs get
>>>>>>>>> stored (i.e., Spark isn't writing the data out for some reason), or
>>>>>>>>> moving the data from the temp directory to the final location. The
>>>>>>>>> temp data is usually deleted at the end of a Crunch run, but you can
>>>>>>>>> disable this by a) not calling Pipeline.cleanup or Pipeline.done at
>>>>>>>>> the end of the run and b) subclassing SparkPipeline with dummy code
>>>>>>>>> that overrides the finalize() method (which is implemented in the
>>>>>>>>> top-level DistributedPipeline abstract base class) to be a no-op.
>>>>>>>>> Is that easy to try out to see if we can isolate the source of the
>>>>>>>>> error? Otherwise I can play with this a bit tomorrow on my own
>>>>>>>>> cluster.
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Awesome.  Thanks for taking a look!
>>>>>>>>>>
>>>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hrm, that sounds like something is wrong with the commit
>>>>>>>>>>> operation on the Spark side; let me take a look at it this evening!
>>>>>>>>>>>
>>>>>>>>>>> J
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>>      Are there any known issues with the
>>>>>>>>>>>> AvroParquetPathPerKeyTarget when running a Spark pipeline?  When
>>>>>>>>>>>> I run my pipeline with MapReduce, I get output.  When I run with
>>>>>>>>>>>> Spark, the step beforehand, where I list my partition keys out
>>>>>>>>>>>> (because we use them to add partitions to Hive), shows data
>>>>>>>>>>>> being present, but the output directory remains empty.  This
>>>>>>>>>>>> behavior occurs when targeting both HDFS and S3 directly.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>      Dave
>>>>>>>>>>>>
