Re: AvroParquetPathPerKeyTarget with Spark

Josh Wills Wed, 23 May 2018 22:50:56 -0700

David,

Take a look at CRUNCH-670; I think that patch fixes the problem in the most
minimal way I can think of.


https://issues.apache.org/jira/browse/CRUNCH-670

J

On Wed, May 23, 2018 at 3:54 PM, Josh Wills <[email protected]> wrote:

> I think that must be it Dave, but I can't for the life of me figure out
> where in the code that's happening. Will take another look tonight.
>
> J
>
> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <[email protected]> wrote:
>
>> Josh,
>>
>>      Is there any chance that somehow the output path to
>> AvroParquetPathPerKey is getting twisted up when it goes through the
>> compilation step?  Watching it while it runs, the output in the
>> /tmp/crunch/p<stage> directory basically looks like what I would expect it
>> to do in the output directory.  It seems that AvroPathPerKeyTarget also was
>> showing similar behavior when I was messing around to see if that would
>> work.
>>
>> Thanks,
>>      Dave
>>
>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <[email protected]> wrote:
>>
>>> Hrm, got it-- now at least I know where to look (although surprised that
>>> overriding the finalize() didn't fix it, as I ran into similar problems
>>> with my own cluster and created a SlackPipeline class that overrides that
>>> method.)
>>>
>>>
>>> J
>>>
>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <[email protected]> wrote:
>>>
>>>> Josh,
>>>>
>>>>      Those adjustments did not appear to do anything to stop the tmp
>>>> directory from being removed at the end of the job execution (override
>>>> finalize with an empty block when creating SparkPipeline and run using
>>>> pipeline.run() instead of done()).  However, I can confirm that I see the
>>>> stage output for the two output directories complete with parquet files
>>>> partitioned by key.  However, neither they, nor anything else ever make it
>>>> to the output directory, which is not even created.
>>>>
>>>> Thanks,
>>>>      Dave
>>>>
>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <[email protected]> wrote:
>>>>
>>>>> Hey Josh,
>>>>>
>>>>>      Thanks for taking a look.  I can definitely play with that on
>>>>> Monday when I'm back at work.
>>>>>
>>>>> Thanks,
>>>>>      Dave
>>>>>
>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey David,
>>>>>>
>>>>>> Looking at the code, the problem isn't obvious to me, but there are
>>>>>> only two places things could be going wrong: writing the data out of 
>>>>>> Spark
>>>>>> into the temp directory where intermediate outputs get stored (i.e., 
>>>>>> Spark
>>>>>> isn't writing the data out for some reason) or moving the data from the
>>>>>> temp directory to the final location. The temp data is usually deleted at
>>>>>> the end of a Crunch run, but you can disable this by a) not calling
>>>>>> Pipeline.cleanup or Pipeline.done at the end of the run and b) 
>>>>>> subclassing
>>>>>> SparkPipeline with dummy code that overrides the finalize() method (which
>>>>>> is implemented in the top-level DistributedPipeline abstract base class) 
>>>>>> to
>>>>>> be a no-op. Is that easy to try out to see if we can isolate the source 
>>>>>> of
>>>>>> the error? Otherwise I can play with this a bit tomorrow on my own 
>>>>>> cluster.
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Awesome.  Thanks for taking a look!
>>>>>>>
>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hrm, that sounds like something is wrong with the commit operation
>>>>>>>> on the Spark side; let me take a look at it this evening!
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>>      Are there any known issues with the
>>>>>>>>> AvroParquetPathPerKeyTarget when running a Spark pipeline?  When I 
>>>>>>>>> run my
>>>>>>>>> pipeline with mapreduce, I get output, and when I run with spark, the 
>>>>>>>>> step
>>>>>>>>> before where I list my partition keys out (because we use them to add
>>>>>>>>> partitions to hive) lists data being present, but the output directory
>>>>>>>>> remains empty.  This behavior is occurring targeting both HDFS and S3
>>>>>>>>> directly.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>      Dave
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>
>

Re: AvroParquetPathPerKeyTarget with Spark

Reply via email to