Josh,

     After cleaning up the logs a little bit, I noticed this:

18/05/25 18:10:39 WARN AvroPathPerKeyTarget: Nothing to copy from
/tmp/crunch-1037479188/p12/out0
18/05/25 18:11:38 WARN AvroPathPerKeyTarget: Nothing to copy from
/tmp/crunch-1037479188/p13/out0

When I look in those tmp directories while the job runs, the output is
actually being written to a subdirectory named part rather than out0, so that
would be another reason it's having issues.  Any thoughts on where that output
path is coming from?  If you point me in the right direction, I can try to
figure it out.
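
A quick way to see this while the job is up (just a rough sketch; the temp
path is the one from the warnings above and changes per run, and the class
name is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListCrunchTmp {
      public static void main(String[] args) throws Exception {
        // Uses whatever fs.defaultFS points at (HDFS in our case).
        FileSystem fs = FileSystem.get(new Configuration());
        // Stage directories like p12 and p13 from the warnings above.
        for (FileStatus stage : fs.listStatus(new Path("/tmp/crunch-1037479188"))) {
          // Expecting an out0 child here, but seeing a "part" subdirectory instead.
          for (FileStatus child : fs.listStatus(stage.getPath())) {
            System.out.println(child.getPath());
          }
        }
      }
    }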

Thanks,
     Dave

On Fri, May 25, 2018 at 2:01 PM David Ortiz <dpo5...@gmail.com> wrote:

> Hey Josh,
>
>      I am still messing around with it a little bit, but I seem to be
> getting the same behavior even after rebuilding with the patch.
>
> Thanks,
>      Dave
>
> On Thu, May 24, 2018 at 1:50 AM Josh Wills <josh.wi...@gmail.com> wrote:
>
>> David,
>>
>> Take a look at CRUNCH-670; I think that patch fixes the problem in the
>> most minimal way I can think of.
>>
>> https://issues.apache.org/jira/browse/CRUNCH-670
>>
>> J
>>
>> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <josh.wi...@gmail.com> wrote:
>>
>>> I think that must be it, Dave, but I can't for the life of me figure out
>>> where in the code that's happening. Will take another look tonight.
>>>
>>> J
>>>
>>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <dpo5...@gmail.com> wrote:
>>>
>>>> Josh,
>>>>
>>>>      Is there any chance that the output path for the
>>>> AvroParquetPathPerKeyTarget is somehow getting twisted up when it goes
>>>> through the compilation step?  Watching the job while it runs, the output
>>>> in the /tmp/crunch/p<stage> directory basically looks like what I would
>>>> expect to see in the output directory.  It seems that AvroPathPerKeyTarget
>>>> was also showing similar behavior when I was messing around to see if that
>>>> would work.
>>>>
>>>> Thanks,
>>>>      Dave
>>>>
>>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <josh.wi...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hrm, got it; now at least I know where to look (although I'm surprised
>>>>> that overriding finalize() didn't fix it, as I ran into similar problems
>>>>> with my own cluster and created a SlackPipeline class that overrides that
>>>>> method).
>>>>>
>>>>>
>>>>> J
>>>>>
>>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <dpo5...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Josh,
>>>>>>
>>>>>>      Those adjustments (overriding finalize() with an empty block when
>>>>>> creating the SparkPipeline, and running with pipeline.run() instead of
>>>>>> done()) did not appear to stop the tmp directory from being removed at
>>>>>> the end of the job execution.  However, I can confirm that I see the
>>>>>> stage output for the two output directories, complete with Parquet files
>>>>>> partitioned by key.  Neither they nor anything else ever makes it to the
>>>>>> output directory, which is not even created.
>>>>>>
>>>>>> Thanks,
>>>>>>      Dave
>>>>>>
>>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <dpo5...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Josh,
>>>>>>>
>>>>>>>      Thanks for taking a look.  I can definitely play with that on
>>>>>>> Monday when I'm back at work.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>      Dave
>>>>>>>
>>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <josh.wi...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey David,
>>>>>>>>
>>>>>>>> Looking at the code, the problem isn't obvious to me, but there are
>>>>>>>> only two places things could be going wrong: writing the data out of
>>>>>>>> Spark into the temp directory where intermediate outputs get stored
>>>>>>>> (i.e., Spark isn't writing the data out for some reason) or moving the
>>>>>>>> data from the temp directory to the final location. The temp data is
>>>>>>>> usually deleted at the end of a Crunch run, but you can disable this by
>>>>>>>> a) not calling Pipeline.cleanup or Pipeline.done at the end of the run
>>>>>>>> and b) subclassing SparkPipeline with dummy code that overrides the
>>>>>>>> finalize() method (which is implemented in the top-level
>>>>>>>> DistributedPipeline abstract base class) to be a no-op. Is that easy to
>>>>>>>> try out to see if we can isolate the source of the error? Otherwise I
>>>>>>>> can play with this a bit tomorrow on my own cluster.
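>>>>>>>>
>>>>>>>> Something along these lines (untested sketch; the constructor args are
>>>>>>>> just an example, and this assumes finalize() is the cleanup hook
>>>>>>>> described above):
>>>>>>>>
>>>>>>>>     import org.apache.crunch.impl.spark.SparkPipeline;
>>>>>>>>
>>>>>>>>     // Untested sketch: a SparkPipeline whose finalize() is a no-op, so
>>>>>>>>     // the /tmp/crunch-* intermediate output is left in place after the
>>>>>>>>     // run for inspection.
>>>>>>>>     public class NoCleanupSparkPipeline extends SparkPipeline {
>>>>>>>>       public NoCleanupSparkPipeline(String sparkConnect, String appName) {
>>>>>>>>         super(sparkConnect, appName);
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       @Override
>>>>>>>>       protected void finalize() {
>>>>>>>>         // Intentionally empty: skip temp-directory cleanup.
>>>>>>>>       }
>>>>>>>>     }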
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <dpo5...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Awesome.  Thanks for taking a look!
>>>>>>>>>
>>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <josh.wi...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> hrm, that sounds like something is wrong with the commit
>>>>>>>>>> operation on the Spark side; let me take a look at it this evening!
>>>>>>>>>>
>>>>>>>>>> J
>>>>>>>>>>
>>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <dpo5...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>>      Are there any known issues with the
>>>>>>>>>>> AvroParquetPathPerKeyTarget when running a Spark pipeline?  When I
>>>>>>>>>>> run my pipeline with MapReduce, I get output.  When I run with
>>>>>>>>>>> Spark, the preceding step where I list out my partition keys
>>>>>>>>>>> (because we use them to add partitions to Hive) shows that data is
>>>>>>>>>>> present, but the output directory remains empty.  This behavior
>>>>>>>>>>> occurs when targeting both HDFS and S3 directly.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>      Dave
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>
>>>
>>
