I am having trouble adding logging to the class that does serialization and deserialization. Where is the code for org.apache.spark.Logging located, and is it serializable?
On Mon, May 12, 2014 at 10:02 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Ah, yes, that is correct. You need a serializable object one way or the
> other.
>
> An alternate suggestion would be to use a combination of RDD.sample()
> <http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#sample>
> and collect() to take a look at some small amount of data and just log it
> from the driver. That's pretty awkward as well, but it will spare you having
> to make some kind of serializable logger function.
>
>
> On Wed, May 7, 2014 at 9:32 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
>> foreach vs. map isn't the issue. Both require serializing the called
>> function, so the pickle error would still apply, yes?
>>
>> And at the moment, I'm just testing. I definitely wouldn't want to log
>> something for each element, but I may want to detect something and log for
>> SOME elements.
>>
>> So my question is: how are other people doing logging from distributed
>> tasks, given the serialization issues?
>>
>> The same issue actually exists in Scala, too. I could work around it by
>> creating a small serializable object that provides a logger, but that seems
>> kind of kludgy to me, so I'm wondering whether other people are logging from
>> tasks, and if so, how?
>>
>> Diana
>>
>>
>> On Tue, May 6, 2014 at 6:24 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> I think you're looking for RDD.foreach()
>>> <http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach>.
>>>
>>> According to the programming guide
>>> <http://spark.apache.org/docs/latest/scala-programming-guide.html>:
>>>
>>>> Run a function func on each element of the dataset. This is usually done
>>>> for side effects such as updating an accumulator variable (see below) or
>>>> interacting with external storage systems.
>>>
>>> Do you really want to log something for each element of your RDD?
>>>
>>> Nick
>>>
>>>
>>> On Tue, May 6, 2014 at 3:31 PM, Diana Carroll <dcarr...@cloudera.com> wrote:
>>>
>>>> What should I do if I want to log something as part of a task?
>>>>
>>>> This is what I tried. To set up a logger, I followed the advice here:
>>>> http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
>>>>
>>>> logger = logging.getLogger("py4j")
>>>> logger.setLevel(logging.INFO)
>>>> logger.addHandler(logging.StreamHandler())
>>>>
>>>> This works fine when I call it from my driver (i.e. pyspark):
>>>>
>>>> logger.info("this works fine")
>>>>
>>>> But I want to try logging within a distributed task, so I did this:
>>>>
>>>> def logTestMap(a):
>>>>     logger.info("test")
>>>>     return a
>>>>
>>>> myrdd.map(logTestMap).count()
>>>>
>>>> and got:
>>>>
>>>> PicklingError: Can't pickle 'lock' object
>>>>
>>>> So it's trying to serialize my function and can't because of a lock
>>>> object used in the logger, presumably for thread safety. But then... how
>>>> would I do it? Or is this just a really bad idea?
>>>>
>>>> Thanks,
>>>> Diana

--
Software Engineer
Analytics Engineering Team @ Box
Mountain View, CA
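
For reference, a minimal PySpark sketch of the kind of workaround Diana alludes to: look up (or lazily configure) the logger inside the task function instead of capturing a driver-side logger in the closure, so only the logging module reference, not a Logger object and its lock, has to be pickled. The names logTestMap, myrdd, and the "py4j" logger are taken from the thread; everything else is an assumption, and note that the messages end up in the executor/worker logs rather than on the driver console.

    import logging

    def logTestMap(a):
        # Get the logger inside the task itself, so the pickled closure
        # never contains a Logger object (or its thread lock).
        log = logging.getLogger("py4j")
        if not log.handlers:
            # Configure once per worker process; avoid stacking handlers.
            log.setLevel(logging.INFO)
            log.addHandler(logging.StreamHandler())
        log.info("test: %s", a)
        return a

    myrdd.map(logTestMap).count()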
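And a sketch of Nicholas's alternate suggestion: sample a small amount of data, collect() it to the driver, and log it there with the ordinary driver-side logger, avoiding task-side serialization altogether. The 1% fraction is only an illustrative value.

    # Bring a small sample back to the driver and log it from there.
    for record in myrdd.sample(withReplacement=False, fraction=0.01).collect():
        logger.info("sampled record: %s", record)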