Josh,

     Those adjustments (overriding finalize with an empty block when
creating the SparkPipeline, and running with pipeline.run() instead of
done()) did not appear to stop the tmp directory from being removed at the
end of the job execution.  However, I can confirm that I see the stage
output for the two output directories, complete with parquet files
partitioned by key.  Neither they, nor anything else, ever makes it to the
output directory, which is not even created.
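In case it helps anyone following along, here is a self-contained model of the no-op finalize pattern Josh suggested. These are stand-in classes written for illustration only, not the real Crunch SparkPipeline/DistributedPipeline API; the actual change is the analogous override on a SparkPipeline subclass.

```java
// Self-contained model of the pattern (NOT the real Crunch classes):
// a base pipeline whose cleanup hook deletes the temp output, and a
// subclass that overrides that hook with a no-op so the temp directory
// survives the run for inspection.
abstract class ModelDistributedPipeline {
  boolean tempDirDeleted = false;

  void run() {
    // ... job execution would happen here ...
    finalizeRun();
  }

  // stand-in for the cleanup hook the base class runs after a job
  void finalizeRun() {
    tempDirDeleted = true;
  }
}

class ModelSparkPipeline extends ModelDistributedPipeline {
  @Override
  void finalizeRun() {
    // no-op: leave the temp directory in place
  }
}

public class FinalizeOverrideDemo {
  public static void main(String[] args) {
    ModelDistributedPipeline p = new ModelSparkPipeline();
    p.run();
    System.out.println("tempDirDeleted=" + p.tempDirDeleted);
  }
}
```

With the override in place, the cleanup hook never fires, which is what should leave the intermediate parquet output on disk for debugging.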

Thanks,
     Dave

On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:

> Hey Josh,
>
>      Thanks for taking a look.  I can definitely play with that on Monday
> when I'm back at work.
>
> Thanks,
>      Dave
>
> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
>
>> Hey David,
>>
>> Looking at the code, the problem isn't obvious to me, but there are only
>> two places things could be going wrong: writing the data out of Spark into
>> the temp directory where intermediate outputs get stored (i.e., Spark isn't
>> writing the data out for some reason) or moving the data from the temp
>> directory to the final location. The temp data is usually deleted at the
>> end of a Crunch run, but you can disable this by a) not calling
>> Pipeline.cleanup or Pipeline.done at the end of the run and b) subclassing
>> SparkPipeline with dummy code that overrides the finalize() method (which
>> is implemented in the top-level DistributedPipeline abstract base class) to
>> be a no-op. Is that easy to try out to see if we can isolate the source of
>> the error? Otherwise I can play with this a bit tomorrow on my own cluster.
>>
>> J
>>
>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>>
>>> Awesome.  Thanks for taking a look!
>>>
>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]> wrote:
>>>
>>>> hrm, that sounds like something is wrong with the commit operation on
>>>> the Spark side; let me take a look at it this evening!
>>>>
>>>> J
>>>>
>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>>      Are there any known issues with the AvroParquetPathPerKeyTarget
>>>>> when running a Spark pipeline?  When I run my pipeline with mapreduce, I
>>>>> get output, but when I run with spark, the step just before output,
>>>>> where I list my partition keys (because we use them to add partitions to
>>>>> hive), shows data being present, yet the output directory remains empty.
>>>>> This behavior occurs when targeting both HDFS and S3 directly.
>>>>>
>>>>> Thanks,
>>>>>      Dave
>>>>>
>>>>
>>>>
>>