Ah, yes, that is correct. You need a serializable object one way or the
other.

An alternate suggestion would be to use a combination of
RDD.sample() <http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#sample> and
collect() to take a look at some small amount of data and just log it
from the driver. That's pretty awkward as well, but it will spare you from
having to write some kind of serializable logger function.
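
For illustration, a rough sketch of that sample-and-collect approach might
look like the following (the RDD name myrdd comes from your example; the
logger name, sample fraction, and seed are just placeholders):

import logging

logging.basicConfig(level=logging.INFO)
driver_log = logging.getLogger("driver")

# Sample a tiny fraction of the RDD, bring it back to the driver with
# collect(), and log it there -- nothing logging-related ever has to be
# serialized out to the executors.
for record in myrdd.sample(False, 0.001, 42).collect():
    driver_log.info("sampled record: %s", record)

Since collect() brings the sampled records back to the driver, the normal
driver-side logger works unchanged.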


On Wed, May 7, 2014 at 9:32 AM, Diana Carroll <dcarr...@cloudera.com> wrote:

> foreach vs. map isn't the issue.  Both require serializing the called
> function, so the pickle error would still apply, yes?
>
> And at the moment, I'm just testing.  Definitely wouldn't want to log
> something for each element, but may want to detect something and log for
> SOME elements.
>
> So my question is: how are other people doing logging from distributed
> tasks, given the serialization issues?
>
> The same issue actually exists in Scala, too.  I could work around it by
> creating a small serializable object that provides a logger, but it seems
> kind of kludgy to me, so I'm wondering if other people are logging from
> tasks, and if so, how?
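
(Purely as an illustration of that idea in PySpark -- TaskLogger and every
name here is hypothetical -- a wrapper that carries only the logger's name,
so nothing holding a lock is ever pickled, might look roughly like this:)

import logging

class TaskLogger(object):
    # Picklable stand-in for a logger: only the name is serialized; the real
    # logger (and its lock) is looked up lazily wherever the call happens.
    def __init__(self, name):
        self.name = name

    def info(self, msg, *args):
        logging.getLogger(self.name).info(msg, *args)

task_log = TaskLogger("mytask")

def logTestMap(a):
    task_log.info("test")
    return a

Only a plain string rides along with the closure, so the pickle error goes
away; the log output still lands wherever the worker's logging is configured.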
>
> Diana
>
>
> On Tue, May 6, 2014 at 6:24 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I think you're looking for RDD.foreach()
>> <http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach>.
>>
>> According to the programming guide
>> <http://spark.apache.org/docs/latest/scala-programming-guide.html>:
>>
>>> Run a function func on each element of the dataset. This is usually done
>>> for side effects such as updating an accumulator variable (see below) or
>>> interacting with external storage systems.
>>
>>
>> Do you really want to log something for each element of your RDD?
>>
>> Nick
>>
>>
>> On Tue, May 6, 2014 at 3:31 PM, Diana Carroll <dcarr...@cloudera.com> wrote:
>>
>>> What should I do if I want to log something as part of a task?
>>>
>>> This is what I tried.  To set up a logger, I followed the advice here:
>>> http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
>>>
>>> import logging
>>>
>>> logger = logging.getLogger("py4j")
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler())
>>>
>>> This works fine when I call it from my driver (i.e. pyspark):
>>> logger.info("this works fine")
>>>
>>> But I want to try logging within a distributed task so I did this:
>>>
>>> def logTestMap(a):
>>>     logger.info("test")
>>>     return a
>>>
>>> myrdd.map(logTestMap).count()
>>>
>>> and got:
>>> PicklingError: Can't pickle 'lock' object
>>>
>>> So it's trying to serialize my function and can't, because of a lock
>>> object used in the logger, presumably for thread safety.  But then...how
>>> would I do it?  Or is this just a really bad idea?
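
(One common workaround, sketched here only as an illustration: don't
reference the driver-side logger from the function at all, and instead
configure and fetch a logger inside the function, on the worker. The closure
then references only the logging module, which pickles fine; the output ends
up in the executor logs rather than the driver console. The logger name is
arbitrary, and myrdd is the RDD from your example.)

import logging

def logTestMap(a):
    # Configure and look up the logger on the worker itself, so no
    # lock-holding logger object has to be captured in the task closure.
    logging.basicConfig(level=logging.INFO)   # no-op if already configured
    logging.getLogger("logTestMap").info("test: %s", a)
    return a

myrdd.map(logTestMap).count()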
>>>
>>> Thanks
>>> Diana
>>>
>>
>>
>
