Hrm, got it -- now at least I know where to look (although I'm surprised that overriding finalize() didn't fix it, since I ran into similar problems with my own cluster and created a SlackPipeline class that overrides that method).
J

On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]> wrote:

> Josh,
>
> Those adjustments did not appear to do anything to stop the tmp
> directory from being removed at the end of the job execution (overriding
> finalize() with an empty block when creating the SparkPipeline, and
> running with pipeline.run() instead of done()). I can confirm that I see
> the stage output for the two output directories, complete with Parquet
> files partitioned by key. However, neither they nor anything else ever
> makes it to the output directory, which is never even created.
>
> Thanks,
> Dave
>
> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:
>
>> Hey Josh,
>>
>> Thanks for taking a look. I can definitely play with that on Monday
>> when I'm back at work.
>>
>> Thanks,
>> Dave
>>
>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
>>
>>> Hey David,
>>>
>>> Looking at the code, the problem isn't obvious to me, but there are
>>> only two places things could be going wrong: writing the data out of
>>> Spark into the temp directory where intermediate outputs get stored
>>> (i.e., Spark isn't writing the data out for some reason), or moving the
>>> data from the temp directory to the final location. The temp data is
>>> usually deleted at the end of a Crunch run, but you can disable this by
>>> a) not calling Pipeline.cleanup or Pipeline.done at the end of the run,
>>> and b) subclassing SparkPipeline with dummy code that overrides the
>>> finalize() method (which is implemented in the top-level
>>> DistributedPipeline abstract base class) to be a no-op. Is that easy to
>>> try out, to see if we can isolate the source of the error? Otherwise I
>>> can play with this a bit tomorrow on my own cluster.
>>>
>>> J
>>>
>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>>>
>>>> Awesome. Thanks for taking a look!
>>>>
>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]> wrote:
>>>>
>>>>> Hrm, that sounds like something is wrong with the commit operation
>>>>> on the Spark side; let me take a look at it this evening!
>>>>>
>>>>> J
>>>>>
>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Are there any known issues with the AvroParquetPathPerKeyTarget
>>>>>> when running a Spark pipeline? When I run my pipeline with MapReduce,
>>>>>> I get output, but when I run with Spark, the preceding step where I
>>>>>> list out my partition keys (we use them to add partitions to Hive)
>>>>>> shows that data is present, yet the output directory remains empty.
>>>>>> This behavior occurs when targeting both HDFS and S3 directly.
>>>>>>
>>>>>> Thanks,
>>>>>> Dave
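For reference, a minimal sketch of the workaround Josh describes above: subclass SparkPipeline so that finalize() (implemented in the DistributedPipeline base class) becomes a no-op, and call run() rather than done() so cleanup is never triggered. The class name, paths, and record type here are illustrative, and the SparkPipeline/AvroParquetPathPerKeyTarget signatures are recalled from the Crunch API rather than taken from this thread, so treat it as a sketch:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.crunch.PTable;
    import org.apache.crunch.impl.spark.SparkPipeline;
    import org.apache.crunch.io.parquet.AvroParquetPathPerKeyTarget;
    import org.apache.hadoop.fs.Path;

    // Hypothetical subclass that keeps Crunch's temp output around for
    // debugging by turning the finalize() cleanup into a no-op.
    public class NoCleanupSparkPipeline extends SparkPipeline {

      public NoCleanupSparkPipeline(String sparkConnect, String appName) {
        super(sparkConnect, appName);
      }

      // DistributedPipeline implements finalize() to delete the temp
      // directory holding intermediate outputs; an empty override leaves
      // that data in place so it can be inspected after the run.
      @Override
      protected void finalize() {
        // intentionally empty: skip temp-directory cleanup
      }

      // Usage sketch: 'records' stands in for whatever PTable<String, V>
      // the real job produces (the Avro type and output path are made up).
      public static void runJob(NoCleanupSparkPipeline pipeline,
                                PTable<String, GenericRecord> records) {
        pipeline.write(records,
            new AvroParquetPathPerKeyTarget(new Path("/tmp/out/by_key")));
        pipeline.run();  // run(), not done(): done() would call cleanup()
      }
    }

With this in place, the per-key Parquet output under the temp directory should survive the run, which makes it possible to tell whether Spark never wrote the data or whether the move to the final location is what fails.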
