Josh,
Is there any chance that somehow the output path to
AvroParquetPathPerKey is getting twisted up when it goes through the
compilation step? Watching it while it runs, the output in the
/tmp/crunch/p<stage> directory basically looks like what I would expect it
to do in the output directory. It seems that AvroPathPerKeyTarget also was
showing similar behavior when I was messing around to see if that would
work.
Thanks,
Dave
On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected]> wrote:
> Hrm, got it-- now at least I know where to look (although surprised that
> overriding the finalize() didn't fix it, as I ran into similar problems
> with my own cluster and created a SlackPipeline class that overrides that
> method.)
>
>
> J
>
> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]> wrote:
>
>> Josh,
>>
>> Those adjustments did not appear to do anything to stop the tmp
>> directory from being removed at the end of the job execution (override
>> finalize with an empty block when creating SparkPipeline and run using
>> pipeline.run() instead of done()). However, I can confirm that I see the
>> stage output for the two output directories complete with parquet files
>> partitioned by key. However, neither they, nor anything else ever make it
>> to the output directory, which is not even created.
>>
>> Thanks,
>> Dave
>>
>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:
>>
>>> Hey Josh,
>>>
>>> Thanks for taking a look. I can definitely play with that on
>>> Monday when I'm back at work.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
>>>
>>>> Hey David,
>>>>
>>>> Looking at the code, the problem isn't obvious to me, but there are
>>>> only two places things could be going wrong: writing the data out of Spark
>>>> into the temp directory where intermediate outputs get stored (i.e., Spark
>>>> isn't writing the data out for some reason) or moving the data from the
>>>> temp directory to the final location. The temp data is usually deleted at
>>>> the end of a Crunch run, but you can disable this by a) not calling
>>>> Pipeline.cleanup or Pipeline.done at the end of the run and b) subclassing
>>>> SparkPipeline with dummy code that overrides the finalize() method (which
>>>> is implemented in the top-level DistributedPipeline abstract base class) to
>>>> be a no-op. Is that easy to try out to see if we can isolate the source of
>>>> the error? Otherwise I can play with this a bit tomorrow on my own cluster.
>>>>
>>>> J
>>>>
>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>>>>
>>>>> Awesome. Thanks for taking a look!
>>>>>
>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> hrm, that sounds like something is wrong with the commit operation on
>>>>>> the Spark side; let me take a look at it this evening!
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Are there any known issues with the AvroParquetPathPerKeyTarget
>>>>>>> when running a Spark pipeline? When I run my pipeline with mapreduce, I
>>>>>>> get output, and when I run with spark, the step before where I list my
>>>>>>> partition keys out (because we use them to add partitions to hive) lists
>>>>>>> data being present, but the output directory remains empty. This
>>>>>>> behavior
>>>>>>> is occurring targeting both HDFS and S3 directly.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dave
>>>>>>>
>>>>>>
>>>>>>
>>>>
>