Re: AvroParquetPathPerKeyTarget with Spark

David Ortiz Wed, 23 May 2018 07:02:12 -0700

Josh,

     Is there any chance that somehow the output path to
AvroParquetPathPerKey is getting twisted up when it goes through the
compilation step?  Watching it while it runs, the output in the
/tmp/crunch/p<stage> directory basically looks like what I would expect it
to do in the output directory.  It seems that AvroPathPerKeyTarget also was
showing similar behavior when I was messing around to see if that would
work.


Thanks,
     Dave

On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected]> wrote:

> Hrm, got it-- now at least I know where to look (although surprised that
> overriding the finalize() didn't fix it, as I ran into similar problems
> with my own cluster and created a SlackPipeline class that overrides that
> method.)
>
>
> J
>
> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]> wrote:
>
>> Josh,
>>
>>      Those adjustments did not appear to do anything to stop the tmp
>> directory from being removed at the end of the job execution (override
>> finalize with an empty block when creating SparkPipeline and run using
>> pipeline.run() instead of done()).  However, I can confirm that I see the
>> stage output for the two output directories complete with parquet files
>> partitioned by key.  However, neither they, nor anything else ever make it
>> to the output directory, which is not even created.
>>
>> Thanks,
>>      Dave
>>
>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:
>>
>>> Hey Josh,
>>>
>>>      Thanks for taking a look.  I can definitely play with that on
>>> Monday when I'm back at work.
>>>
>>> Thanks,
>>>      Dave
>>>
>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
>>>
>>>> Hey David,
>>>>
>>>> Looking at the code, the problem isn't obvious to me, but there are
>>>> only two places things could be going wrong: writing the data out of Spark
>>>> into the temp directory where intermediate outputs get stored (i.e., Spark
>>>> isn't writing the data out for some reason) or moving the data from the
>>>> temp directory to the final location. The temp data is usually deleted at
>>>> the end of a Crunch run, but you can disable this by a) not calling
>>>> Pipeline.cleanup or Pipeline.done at the end of the run and b) subclassing
>>>> SparkPipeline with dummy code that overrides the finalize() method (which
>>>> is implemented in the top-level DistributedPipeline abstract base class) to
>>>> be a no-op. Is that easy to try out to see if we can isolate the source of
>>>> the error? Otherwise I can play with this a bit tomorrow on my own cluster.
>>>>
>>>> J
>>>>
>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>>>>
>>>>> Awesome.  Thanks for taking a look!
>>>>>
>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> hrm, that sounds like something is wrong with the commit operation on
>>>>>> the Spark side; let me take a look at it this evening!
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>>      Are there any known issues with the AvroParquetPathPerKeyTarget
>>>>>>> when running a Spark pipeline?  When I run my pipeline with mapreduce, I
>>>>>>> get output, and when I run with spark, the step before where I list my
>>>>>>> partition keys out (because we use them to add partitions to hive) lists
>>>>>>> data being present, but the output directory remains empty.  This 
>>>>>>> behavior
>>>>>>> is occurring targeting both HDFS and S3 directly.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>      Dave
>>>>>>>
>>>>>>
>>>>>>
>>>>
>

Re: AvroParquetPathPerKeyTarget with Spark

Reply via email to