Yay!! Thanks Dave!

On Mon, Sep 23, 2019 at 10:40 AM David Ortiz <[email protected]> wrote:
> Josh,
>
> To circle back to this after forever, I finally was able to get permission
> to hand this off. I attached it to CRUNCH-670.
>
> Thanks,
> Dave
>
> On May 25, 2018, at 7:21 PM, David Ortiz <[email protected]> wrote:
>
> The answer seems to be no from my employer on uploading a patch. Double
> checking on that, though.
>
> On Fri, May 25, 2018, 4:40 PM Josh Wills <[email protected]> wrote:
>
>> CRUNCH-670 is the issue, FWIW.
>>
>> On Fri, May 25, 2018 at 1:39 PM, Josh Wills <[email protected]> wrote:
>>
>>> Ah, of course-- nice detective work! Can you send me the code so I can
>>> patch it in?
>>>
>>> On Fri, May 25, 2018 at 1:33 PM, David Ortiz <[email protected]> wrote:
>>>
>>>> Josh,
>>>>
>>>> When I dug into the code a little more, I saw that both
>>>> AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use
>>>> "part" as the default when creating the basePath if there is no value
>>>> for "mapreduce.output.basename". My guess is that when running via a
>>>> SparkPipeline that value is not set. I changed my local copy to use
>>>> "out0" as the defaultValue instead of "part", and the job was able to
>>>> write output successfully.
>>>>
>>>> Thanks,
>>>> Dave
>>>>
>>>> On Fri, May 25, 2018 at 2:20 PM David Ortiz <[email protected]> wrote:
>>>>
>>>>> Josh,
>>>>>
>>>>> After cleaning up the logs a little bit I noticed this:
>>>>>
>>>>> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p12/out0
>>>>> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from /tmp/crunch-1037479188/p13/out0
>>>>>
>>>>> When I look in those tmp directories while the job runs, they are
>>>>> actually writing out to the subdirectory "part" rather than "out0", so
>>>>> that would be another reason why it's having issues. Any thoughts on
>>>>> where that output path is coming from? If you point me in the right
>>>>> direction I can try to figure it out.
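The mismatch Dave describes can be sketched with a self-contained toy: a plain Map stands in for the Hadoop Configuration, and the partition key is a made-up example. This is not the actual Crunch source, just an illustration of why an unset "mapreduce.output.basename" under Spark sends records to a subdirectory the target never checks.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the basePath defaulting that Dave describes. In the real
// output formats the lookup is against the Hadoop Configuration; a Map
// stands in here so the sketch is self-contained.
public class BasePathSketch {

  // AvroPathPerKeyTarget copies files from ".../<key>/out0". Before the
  // fix the fallback here was "part", and under a SparkPipeline the
  // property is never set, so the copy step found nothing.
  static String basePath(Map<String, String> conf, String outputDir, String key) {
    String baseName = conf.getOrDefault("mapreduce.output.basename", "out0"); // was "part"
    return outputDir + "/" + key + "/" + baseName;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    // Property unset, as under a SparkPipeline; "some-key" is hypothetical.
    System.out.println(basePath(conf, "/tmp/crunch-1037479188/p12", "some-key"));
    // -> /tmp/crunch-1037479188/p12/some-key/out0
  }
}
```

With the old "part" fallback, the same call would return a path ending in `/part`, which is exactly the subdirectory Dave saw the data land in while the target looked in `/out0`.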
>>>>>
>>>>> Thanks,
>>>>> Dave
>>>>>
>>>>> On Fri, May 25, 2018 at 2:01 PM David Ortiz <[email protected]> wrote:
>>>>>
>>>>>> Hey Josh,
>>>>>>
>>>>>> I am still messing around with it a little bit, but I still seem to
>>>>>> be getting the same behavior even after rebuilding with the patch.
>>>>>>
>>>>>> Thanks,
>>>>>> Dave
>>>>>>
>>>>>> On Thu, May 24, 2018 at 1:50 AM Josh Wills <[email protected]> wrote:
>>>>>>
>>>>>>> David,
>>>>>>>
>>>>>>> Take a look at CRUNCH-670; I think that patch fixes the problem in
>>>>>>> the most minimal way I can think of.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/CRUNCH-670
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think that must be it, Dave, but I can't for the life of me figure
>>>>>>>> out where in the code that's happening. Will take another look
>>>>>>>> tonight.
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Josh,
>>>>>>>>>
>>>>>>>>> Is there any chance that somehow the output path to
>>>>>>>>> AvroParquetPathPerKey is getting twisted up when it goes through
>>>>>>>>> the compilation step? Watching it while it runs, the output in the
>>>>>>>>> /tmp/crunch/p<stage> directory basically looks like what I would
>>>>>>>>> expect it to do in the output directory. It seems that
>>>>>>>>> AvroPathPerKeyTarget also was showing similar behavior when I was
>>>>>>>>> messing around to see if that would work.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hrm, got it-- now at least I know where to look (although I'm
>>>>>>>>>> surprised that overriding the finalize() didn't fix it, as I ran
>>>>>>>>>> into similar problems with my own cluster and created a
>>>>>>>>>> SlackPipeline class that overrides that method.)
>>>>>>>>>>
>>>>>>>>>> J
>>>>>>>>>>
>>>>>>>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Josh,
>>>>>>>>>>>
>>>>>>>>>>> Those adjustments did not appear to do anything to stop the tmp
>>>>>>>>>>> directory from being removed at the end of the job execution
>>>>>>>>>>> (overriding finalize with an empty block when creating the
>>>>>>>>>>> SparkPipeline and running with pipeline.run() instead of done()).
>>>>>>>>>>> However, I can confirm that I see the stage output for the two
>>>>>>>>>>> output directories, complete with parquet files partitioned by
>>>>>>>>>>> key. But neither they, nor anything else, ever make it to the
>>>>>>>>>>> output directory, which is not even created.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Dave
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Josh,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking a look. I can definitely play with that on
>>>>>>>>>>>> Monday when I'm back at work.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Dave
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking at the code, the problem isn't obvious to me, but there
>>>>>>>>>>>>> are only two places things could be going wrong: writing the
>>>>>>>>>>>>> data out of Spark into the temp directory where intermediate
>>>>>>>>>>>>> outputs get stored (i.e., Spark isn't writing the data out for
>>>>>>>>>>>>> some reason), or moving the data from the temp directory to the
>>>>>>>>>>>>> final location. The temp data is usually deleted at the end of
>>>>>>>>>>>>> a Crunch run, but you can disable this by a) not calling
>>>>>>>>>>>>> Pipeline.cleanup or Pipeline.done at the end of the run and
>>>>>>>>>>>>> b) subclassing SparkPipeline with dummy code that overrides the
>>>>>>>>>>>>> finalize() method (which is implemented in the top-level
>>>>>>>>>>>>> DistributedPipeline abstract base class) to be a no-op. Is that
>>>>>>>>>>>>> easy to try out to see if we can isolate the source of the
>>>>>>>>>>>>> error? Otherwise I can play with this a bit tomorrow on my own
>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> J
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Awesome. Thanks for taking a look!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hrm, that sounds like something is wrong with the commit
>>>>>>>>>>>>>>> operation on the Spark side; let me take a look at it this
>>>>>>>>>>>>>>> evening!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> J
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there any known issues with the
>>>>>>>>>>>>>>>> AvroParquetPathPerKeyTarget when running a Spark pipeline?
>>>>>>>>>>>>>>>> When I run my pipeline with mapreduce, I get output; when I
>>>>>>>>>>>>>>>> run with spark, the step before, where I list my partition
>>>>>>>>>>>>>>>> keys out (because we use them to add partitions to hive),
>>>>>>>>>>>>>>>> shows data being present, but the output directory remains
>>>>>>>>>>>>>>>> empty. This behavior occurs when targeting both HDFS and S3
>>>>>>>>>>>>>>>> directly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Dave
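The debugging trick discussed earlier in the thread (subclassing SparkPipeline so the finalize() cleanup inherited from DistributedPipeline becomes a no-op, preserving the temp directories for inspection) can be modeled with a self-contained toy. The class names below are stand-ins, not the real Crunch API; a real version would extend SparkPipeline and override its actual cleanup method.

```java
// Toy model of the temp-directory-preserving subclass Josh suggests.
// ToyPipeline stands in for DistributedPipeline/SparkPipeline; the
// tempDeleted flag stands in for the /tmp/crunch-* directories.
class ToyPipeline {
  boolean tempDeleted = false;

  // Executes the plan; leaves intermediate output in place.
  void run() { }

  // Runs the plan and then performs end-of-run cleanup, like done().
  void done() {
    run();
    finalizeRun();
  }

  // Stands in for the cleanup hook that deletes the temp directories.
  void finalizeRun() {
    tempDeleted = true;
  }
}

class KeepTempPipeline extends ToyPipeline {
  @Override
  void finalizeRun() {
    // Deliberately empty: intermediate per-key output stays on disk
    // so each stage's files can be inspected after the job finishes.
  }
}

public class KeepTempDemo {
  public static void main(String[] args) {
    ToyPipeline plain = new ToyPipeline();
    plain.done();
    System.out.println("plain tempDeleted = " + plain.tempDeleted); // true

    ToyPipeline keep = new KeepTempPipeline();
    keep.done();
    System.out.println("keep  tempDeleted = " + keep.tempDeleted);  // false
  }
}
```

Calling run() instead of done(), as Dave tried, skips the cleanup hook in this model too; the override matters when some code path still triggers cleanup at the end of the job.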
