The answer from my employer on uploading a patch seems to be no. I'm
double-checking on that, though.

On Fri, May 25, 2018, 4:40 PM Josh Wills <[email protected]> wrote:

> CRUNCH-670 is the issue, FWIW
>
> On Fri, May 25, 2018 at 1:39 PM, Josh Wills <[email protected]> wrote:
>
>> Ah, of course-- nice detective work! Can you send me the code so I can
>> patch it in?
>>
>> On Fri, May 25, 2018 at 1:33 PM, David Ortiz <[email protected]> wrote:
>>
>>> Josh,
>>>
>>>      When I dug into the code a little more, I saw that both
>>> AvroPathPerKeyOutputFormat and AvroParquetPathPerKeyOutputFormat use "part"
>>> as the default when building the basePath if there is no value set for
>>> "mapreduce.output.basename".  My guess is that when running via a
>>> SparkPipeline that value is not set.  I changed my local copy to use "out0"
>>> as the defaultValue instead of "part", and the job was able to write output
>>> successfully.
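>>>
>>>      For reference, here is a rough paraphrase of that fallback, not the
>>> actual Crunch source; the config key and Hadoop's "part" default are real,
>>> but the surrounding names are illustrative only:
>>>
>>>     // simplified sketch of the default-basename fallback
>>>     String baseName = conf.get("mapreduce.output.basename", "part");
>>>     Path basePath = new Path(outputRoot, baseName);  // "outputRoot" is a stand-in
>>>     // local workaround: fall back to "out0" instead, which is what the
>>>     // Spark-side copy step expects
>>>     String patchedBaseName = conf.get("mapreduce.output.basename", "out0");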
>>>
>>> Thanks,
>>>     Dave
>>>
>>> On Fri, May 25, 2018 at 2:20 PM David Ortiz <[email protected]> wrote:
>>>
>>>> Josh,
>>>>
>>>>      After cleaning up the logs a little bit I noticed this.
>>>>
>>>> 18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from
>>>> /tmp/crunch-1037479188/p12/out0
>>>> 18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from
>>>> /tmp/crunch-1037479188/p13/out0
>>>>
>>>> When I look in those tmp directories while the job runs, the output is
>>>> actually being written to the subdirectory "part" rather than "out0", so
>>>> that would be another reason it's having issues.  Any thoughts on where
>>>> that output path is coming from?  If you point me in the right direction I
>>>> can try to figure it out.
>>>>
>>>> Thanks,
>>>>      Dave
>>>>
>>>> On Fri, May 25, 2018 at 2:01 PM David Ortiz <[email protected]> wrote:
>>>>
>>>>> Hey Josh,
>>>>>
>>>>>      I am still messing around with it a little bit, but I seem to be
>>>>> getting the same behavior even after rebuilding with the patch.
>>>>>
>>>>> Thanks,
>>>>>      Dave
>>>>>
>>>>> On Thu, May 24, 2018 at 1:50 AM Josh Wills <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> Take a look at CRUNCH-670; I think that patch fixes the problem in
>>>>>> the most minimal way I can think of.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CRUNCH-670
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I think that must be it, Dave, but I can't for the life of me figure
>>>>>>> out where in the code that's happening. Will take another look tonight.
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Josh,
>>>>>>>>
>>>>>>>>      Is there any chance that somehow the output path to
>>>>>>>> AvroParquetPathPerKey is getting twisted up when it goes through the
>>>>>>>> compilation step?  Watching it while it runs, the output in the
>>>>>>>> /tmp/crunch/p<stage> directory basically looks like what I would expect
>>>>>>>> to see in the output directory.  AvroPathPerKeyTarget also seemed to
>>>>>>>> show similar behavior when I was messing around to see if that would
>>>>>>>> work.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>      Dave
>>>>>>>>
>>>>>>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hrm, got it-- now at least I know where to look (although I'm
>>>>>>>>> surprised that overriding finalize() didn't fix it, as I ran into
>>>>>>>>> similar problems on my own cluster and created a SlackPipeline class
>>>>>>>>> that overrides that method).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Josh,
>>>>>>>>>>
>>>>>>>>>>      Those adjustments (overriding finalize() with an empty block
>>>>>>>>>> when creating the SparkPipeline and running with pipeline.run()
>>>>>>>>>> instead of done()) did not appear to do anything to stop the tmp
>>>>>>>>>> directory from being removed at the end of the job execution.  I can
>>>>>>>>>> confirm that I see the stage output for the two output directories,
>>>>>>>>>> complete with Parquet files partitioned by key.  However, neither
>>>>>>>>>> they nor anything else ever makes it to the output directory, which
>>>>>>>>>> is not even created.
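>>>>>>>>>>
>>>>>>>>>>      Concretely, the adjustment I tried looks roughly like the sketch
>>>>>>>>>> below; the constructor arguments and the exact finalize() signature
>>>>>>>>>> are from memory, so treat them as illustrative:
>>>>>>>>>>
>>>>>>>>>>     // anonymous subclass so DistributedPipeline's finalize() cleanup
>>>>>>>>>>     // becomes a no-op and /tmp/crunch-* can be inspected after the run
>>>>>>>>>>     Pipeline pipeline = new SparkPipeline("yarn", "debug-run") {
>>>>>>>>>>       @Override
>>>>>>>>>>       protected void finalize() {
>>>>>>>>>>         // intentionally empty
>>>>>>>>>>       }
>>>>>>>>>>     };
>>>>>>>>>>     // ... build the pipeline as usual ...
>>>>>>>>>>     pipeline.run();  // run() instead of done() so cleanup is not triggered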
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>      Dave
>>>>>>>>>>
>>>>>>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Josh,
>>>>>>>>>>>
>>>>>>>>>>>      Thanks for taking a look.  I can definitely play with that
>>>>>>>>>>> on Monday when I'm back at work.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>      Dave
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey David,
>>>>>>>>>>>>
>>>>>>>>>>>> Looking at the code, the problem isn't obvious to me, but there
>>>>>>>>>>>> are only two places things could be going wrong: writing the data 
>>>>>>>>>>>> out of
>>>>>>>>>>>> Spark into the temp directory where intermediate outputs get 
>>>>>>>>>>>> stored (i.e.,
>>>>>>>>>>>> Spark isn't writing the data out for some reason) or moving the 
>>>>>>>>>>>> data from
>>>>>>>>>>>> the temp directory to the final location. The temp data is usually 
>>>>>>>>>>>> deleted
>>>>>>>>>>>> at the end of a Crunch run, but you can disable this by a) not 
>>>>>>>>>>>> calling
>>>>>>>>>>>> Pipeline.cleanup or Pipeline.done at the end of the run and b) 
>>>>>>>>>>>> subclassing
>>>>>>>>>>>> SparkPipeline with dummy code that overrides the finalize() method 
>>>>>>>>>>>> (which
>>>>>>>>>>>> is implemented in the top-level DistributedPipeline abstract base 
>>>>>>>>>>>> class) to
>>>>>>>>>>>> be a no-op. Is that easy to try out to see if we can isolate the 
>>>>>>>>>>>> source of
>>>>>>>>>>>> the error? Otherwise I can play with this a bit tomorrow on my own 
>>>>>>>>>>>> cluster.
>>>>>>>>>>>>
>>>>>>>>>>>> J
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]
>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Awesome.  Thanks for taking a look!
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> hrm, that sounds like something is wrong with the commit
>>>>>>>>>>>>>> operation on the Spark side; let me take a look at it this 
>>>>>>>>>>>>>> evening!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> J
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      Are there any known issues with the
>>>>>>>>>>>>>>> AvroParquetPathPerKeyTarget when running a Spark pipeline?  When
>>>>>>>>>>>>>>> I run my pipeline with MapReduce I get output.  When I run with
>>>>>>>>>>>>>>> Spark, the preceding step where I list out my partition keys
>>>>>>>>>>>>>>> (because we use them to add partitions to Hive) shows that data
>>>>>>>>>>>>>>> is present, but the output directory remains empty.  This
>>>>>>>>>>>>>>> behavior occurs when targeting both HDFS and S3 directly.
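>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>      For reference, the write in question looks roughly like the
>>>>>>>>>>>>>>> sketch below; it is simplified, keyByPartition stands in for our
>>>>>>>>>>>>>>> keying logic, and the Path-based constructor is from memory:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     // byKey's String key becomes the per-key output subdirectory
>>>>>>>>>>>>>>>     PTable<String, GenericRecord> byKey = keyByPartition(records);
>>>>>>>>>>>>>>>     byKey.write(new AvroParquetPathPerKeyTarget(new Path(outputDir)));
>>>>>>>>>>>>>>>     pipeline.done();  // output appears under MRPipeline, but not under SparkPipeline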
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>      Dave
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>
>
