Hey David,

Looking at the code, the problem isn't obvious to me, but there are only
two places things could be going wrong: either Spark isn't writing the
data into the temp directory where intermediate outputs get stored, or
the move of the data from the temp directory to the final location is
failing. The temp data is usually deleted at the end of a Crunch run,
but you can keep it around by a) not calling Pipeline.cleanup or
Pipeline.done at the end of the run and b) subclassing SparkPipeline and
overriding the finalize() method (which is implemented in the top-level
DistributedPipeline abstract base class) to be a no-op. Is that easy to
try, to see if we can isolate the source of the error? Otherwise I can
play with this a bit tomorrow on my own cluster.
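
In case it helps, here's roughly what I mean for option (b), as a
standalone sketch. The class names below are stand-ins so the snippet
compiles on its own; in your code you'd extend SparkPipeline instead,
and the finalize() you're overriding is the one Crunch implements in
DistributedPipeline to delete the temp outputs:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Crunch's DistributedPipeline, which implements the
// finalize() hook that deletes temp outputs at the end of a run.
abstract class FakePipeline {
    // Stand-in for the intermediate outputs written to the temp dir.
    final List<String> tempOutputs = new ArrayList<>(List.of("part-00000"));

    // In real Crunch, this is where the temp directory gets deleted.
    protected void finalize() {
        tempOutputs.clear();
    }

    void done() {
        finalize(); // cleanup normally runs at the end of the pipeline
    }
}

// The subclass with a dummy no-op override, so the temp data survives
// the run and can be inspected afterwards.
class KeepTempPipeline extends FakePipeline {
    @Override
    protected void finalize() {
        // no-op: leave the temp outputs in place
    }
}

public class Demo {
    public static void main(String[] args) {
        FakePipeline p = new KeepTempPipeline();
        p.done();
        System.out.println(p.tempOutputs.isEmpty() ? "temp deleted" : "temp kept");
    }
}
```

With the override in place you should see the intermediate files left
behind after the run, which tells us which of the two stages is the
culprit.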

J

On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]> wrote:

> Awesome.  Thanks for taking a look!
>
> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]> wrote:
>
>> hrm, that sounds like something is wrong with the commit operation on the
>> Spark side; let me take a look at it this evening!
>>
>> J
>>
>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]> wrote:
>>
>>> Hello,
>>>
>>>      Are there any known issues with the AvroParquetPathPerKeyTarget
>>> when running a Spark pipeline?  When I run my pipeline with MapReduce, I
>>> get output; when I run with Spark, the step just before, where I list my
>>> partition keys (we use them to add partitions to Hive), shows data being
>>> present, but the output directory remains empty.  This happens when
>>> targeting both HDFS and S3 directly.
>>>
>>> Thanks,
>>>      Dave
>>>
>>
>>
