Hey Josh, I'm still experimenting with it a bit, but I seem to be getting the same behavior even after rebuilding with the patch.
Thanks,
Dave

On Thu, May 24, 2018 at 1:50 AM Josh Wills <josh.wi...@gmail.com> wrote:

> David,
>
> Take a look at CRUNCH-670; I think that patch fixes the problem in the
> most minimal way I can think of.
>
> https://issues.apache.org/jira/browse/CRUNCH-670
>
> J
>
> On Wed, May 23, 2018 at 3:54 PM, Josh Wills <josh.wi...@gmail.com> wrote:
>
>> I think that must be it, Dave, but I can't for the life of me figure out
>> where in the code that's happening. Will take another look tonight.
>>
>> J
>>
>> On Wed, May 23, 2018 at 7:00 AM, David Ortiz <dpo5...@gmail.com> wrote:
>>
>>> Josh,
>>>
>>> Is there any chance that the output path for AvroParquetPathPerKey is
>>> somehow getting twisted up when it goes through the compilation step?
>>> Watching the job while it runs, the output in the /tmp/crunch/p<stage>
>>> directory looks basically like what I would expect to see in the output
>>> directory. AvroPathPerKeyTarget also seemed to show similar behavior
>>> when I experimented to see whether it would work.
>>>
>>> Thanks,
>>> Dave
>>>
>>> On Thu, May 17, 2018 at 6:03 PM Josh Wills <josh.wi...@gmail.com> wrote:
>>>
>>>> Hrm, got it -- now at least I know where to look (although I'm
>>>> surprised that overriding finalize() didn't fix it, since I ran into
>>>> similar problems on my own cluster and created a SlackPipeline class
>>>> that overrides that method).
>>>>
>>>> J
>>>>
>>>> On Thu, May 17, 2018 at 12:22 PM, David Ortiz <dpo5...@gmail.com>
>>>> wrote:
>>>>
>>>>> Josh,
>>>>>
>>>>> Those adjustments (overriding finalize() with an empty block when
>>>>> creating the SparkPipeline, and running via pipeline.run() instead of
>>>>> done()) did not appear to stop the tmp directory from being removed
>>>>> at the end of the job execution. However, I can confirm that I see
>>>>> the stage output for the two output directories, complete with
>>>>> Parquet files partitioned by key. But neither those files nor
>>>>> anything else ever makes it to the output directory, which is never
>>>>> even created.
>>>>>
>>>>> Thanks,
>>>>> Dave
>>>>>
>>>>> On Fri, May 11, 2018 at 8:24 AM David Ortiz <dpo5...@gmail.com> wrote:
>>>>>
>>>>>> Hey Josh,
>>>>>>
>>>>>> Thanks for taking a look. I can definitely play with that on Monday
>>>>>> when I'm back at work.
>>>>>>
>>>>>> Thanks,
>>>>>> Dave
>>>>>>
>>>>>> On Fri, May 11, 2018 at 1:46 AM Josh Wills <josh.wi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey David,
>>>>>>>
>>>>>>> Looking at the code, the problem isn't obvious to me, but there are
>>>>>>> only two places things could be going wrong: writing the data out
>>>>>>> of Spark into the temp directory where intermediate outputs get
>>>>>>> stored (i.e., Spark isn't writing the data out for some reason), or
>>>>>>> moving the data from the temp directory to the final location. The
>>>>>>> temp data is usually deleted at the end of a Crunch run, but you
>>>>>>> can disable this by a) not calling Pipeline.cleanup or
>>>>>>> Pipeline.done at the end of the run, and b) subclassing
>>>>>>> SparkPipeline with dummy code that overrides the finalize() method
>>>>>>> (which is implemented in the top-level DistributedPipeline abstract
>>>>>>> base class) to be a no-op. Is that easy to try, to see if we can
>>>>>>> isolate the source of the error? Otherwise I can play with this a
>>>>>>> bit tomorrow on my own cluster.
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Thu, May 10, 2018 at 2:20 PM, David Ortiz <dpo5...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Awesome. Thanks for taking a look!
>>>>>>>>
>>>>>>>> On Thu, May 10, 2018 at 5:18 PM Josh Wills <josh.wi...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hrm, that sounds like something is wrong with the commit
>>>>>>>>> operation on the Spark side; let me take a look at it this
>>>>>>>>> evening!
>>>>>>>>>
>>>>>>>>> J
>>>>>>>>>
>>>>>>>>> On Thu, May 10, 2018 at 8:56 AM, David Ortiz <dpo5...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Are there any known issues with AvroParquetPathPerKeyTarget when
>>>>>>>>>> running a Spark pipeline? When I run my pipeline with MapReduce,
>>>>>>>>>> I get output; when I run with Spark, the preceding step where I
>>>>>>>>>> list my partition keys out (because we use them to add
>>>>>>>>>> partitions to Hive) shows data being present, but the output
>>>>>>>>>> directory remains empty. This behavior occurs when targeting
>>>>>>>>>> both HDFS and S3 directly.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Dave
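[Editor's note for archive readers: the debugging trick Josh describes -- subclassing the pipeline and overriding its cleanup hook as a no-op so the temp directory survives the run -- boils down to the pattern sketched below. These classes are simplified, hypothetical stand-ins written purely for illustration; the real types are Crunch's DistributedPipeline/SparkPipeline, and the real hook discussed above is finalize().]

```java
// Hypothetical stand-in for a pipeline whose cleanup hook removes the
// /tmp/crunch/p<stage> intermediate output at the end of a run. In real
// Crunch this role is played by DistributedPipeline; here we just flip a
// flag so the effect is observable.
class StandInPipeline {
    boolean tempDirDeleted = false;

    // The cleanup hook that a normal run invokes after the jobs finish.
    protected void deleteTempDir() {
        tempDirDeleted = true;
    }

    public void run() {
        // ... plan and execute the jobs here ...
        deleteTempDir();
    }
}

// Debugging subclass: override the hook with an empty body so the
// intermediate outputs stay on disk for inspection after the run.
class KeepTempPipeline extends StandInPipeline {
    @Override
    protected void deleteTempDir() {
        // intentionally a no-op -- leave the temp directory in place
    }
}

public class Main {
    public static void main(String[] args) {
        StandInPipeline normal = new StandInPipeline();
        normal.run();
        System.out.println("normal run deleted temp dir: " + normal.tempDirDeleted);

        StandInPipeline debug = new KeepTempPipeline();
        debug.run();
        System.out.println("debug run deleted temp dir: " + debug.tempDirDeleted);
    }
}
```

As the thread notes, keeping the temp data also requires skipping Pipeline.done()/cleanup() (using run() instead), since those trigger the deletion path directly.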