Hey Josh,
Thanks for taking a look. I can definitely play with that on Monday
when I'm back at work.
Thanks,
Dave
On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]> wrote:
> Hey David,
>
> Looking at the code, the problem isn't obvious to me, but there are only
> two places things could be going wrong: writing the data out of Spark into
> the temp directory where intermediate outputs get stored (i.e., Spark isn't
> writing the data out for some reason) or moving the data from the temp
> directory to the final location. The temp data is usually deleted at the
> end of a Crunch run, but you can disable this by a) not calling
> Pipeline.cleanup or Pipeline.done at the end of the run and b) subclassing
> SparkPipeline with dummy code that overrides the finalize() method (which
> is implemented in the top-level DistributedPipeline abstract base class) to
> be a no-op. Is that easy to try out to see if we can isolate the source of
> the error? Otherwise I can play with this a bit tomorrow on my own cluster.
>
> J
>
> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:
>
>> Awesome. Thanks for taking a look!
>>
>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]> wrote:
>>
>>> hrm, that sounds like something is wrong with the commit operation on
>>> the Spark side; let me take a look at it this evening!
>>>
>>> J
>>>
>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> Are there any known issues with the AvroParquetPathPerKeyTarget
>>>> when running a Spark pipeline? When I run my pipeline with MapReduce, I
>>>> get output. When I run with Spark, the step just before the write, where
>>>> I list out my partition keys (we use them to add partitions to Hive),
>>>> shows that data is present, but the output directory remains empty. This
>>>> happens when targeting both HDFS and S3 directly.
>>>>
>>>> Thanks,
>>>> Dave
>>>>
>>>
>>>
>