I am having trouble adding logging to the class that does serialization and deserialization. Where is the code for org.apache.spark.Logging located, and is it serializable?
On Mon, May 12, 2014 at 10:02 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Ah, yes, that is correct. You need a serializable object one way or the
> other.
>
> An alternate suggestion would be to use a combination of RDD.sample()
> <http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#sample>
> and collect() to take a look at some small amount of data and just log it
> from the driver. That's pretty awkward as well, but it will spare you having
> to make some kind of serializable logger function.
>
>
> On Wed, May 7, 2014 at 9:32 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>
>> foreach vs. map isn't the issue. Both require serializing the called
>> function, so the pickle error would still apply, yes?
>>
>> And at the moment, I'm just testing. I definitely wouldn't want to log
>> something for each element, but I may want to detect something and log for
>> SOME elements.
>>
>> So my question is: how are other people doing logging from distributed
>> tasks, given the serialization issues?
>>
>> The same issue actually exists in Scala, too. I could work around it by
>> creating a small serializable object that provides a logger, but that seems
>> kind of kludgy to me, so I'm wondering whether other people are logging from
>> tasks, and if so, how?
>>
>> Diana
>>
>>
>> On Tue, May 6, 2014 at 6:24 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> I think you're looking for RDD.foreach()
>>> <http://spark.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#foreach>.
>>>
>>> According to the programming guide
>>> <http://spark.apache.org/docs/latest/scala-programming-guide.html>:
>>>
>>>> Run a function func on each element of the dataset. This is usually done
>>>> for side effects such as updating an accumulator variable (see below) or
>>>> interacting with external storage systems.
>>>
>>> Do you really want to log something for each element of your RDD?
>>>
>>> Nick
>>>
>>>
>>> On Tue, May 6, 2014 at 3:31 PM, Diana Carroll <dcarr...@cloudera.com> wrote:
>>>
>>>> What should I do if I want to log something as part of a task?
>>>>
>>>> This is what I tried. To set up a logger, I followed the advice here:
>>>> http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
>>>>
>>>> logger = logging.getLogger("py4j")
>>>> logger.setLevel(logging.INFO)
>>>> logger.addHandler(logging.StreamHandler())
>>>>
>>>> This works fine when I call it from my driver (i.e. pyspark):
>>>>
>>>> logger.info("this works fine")
>>>>
>>>> But I want to try logging within a distributed task, so I did this:
>>>>
>>>> def logTestMap(a):
>>>>     logger.info("test")
>>>>     return a
>>>>
>>>> myrdd.map(logTestMap).count()
>>>>
>>>> and got:
>>>>
>>>> PicklingError: Can't pickle 'lock' object
>>>>
>>>> So it's trying to serialize my function and can't because of a lock
>>>> object used in the logger, presumably for thread safety. But then... how
>>>> would I do it? Or is this just a really bad idea?
>>>>
>>>> Thanks,
>>>> Diana

--
Software Engineer
Analytics Engineering Team @ Box
Mountain View, CA
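
For reference, a minimal PySpark sketch of the kind of workaround Diana alludes to: look up (or lazily configure) the logger inside the task function instead of capturing a driver-side logger in the closure, so only the logging module reference, not a Logger object and its lock, has to be pickled. The names logTestMap, myrdd, and the "py4j" logger are taken from the thread; everything else is an assumption, and note that the messages end up in the executor/worker logs rather than on the driver console.

    import logging

    def logTestMap(a):
        # Get the logger inside the task itself, so the pickled closure
        # never contains a Logger object (or its thread lock).
        log = logging.getLogger("py4j")
        if not log.handlers:
            # Configure once per worker process; avoid stacking handlers.
            log.setLevel(logging.INFO)
            log.addHandler(logging.StreamHandler())
        log.info("test: %s", a)
        return a

    myrdd.map(logTestMap).count()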
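And a sketch of Nicholas's alternate suggestion: sample a small amount of data, collect() it to the driver, and log it there with the ordinary driver-side logger, avoiding task-side serialization altogether. The 1% fraction is only an illustrative value.

    # Bring a small sample back to the driver and log it from there.
    for record in myrdd.sample(withReplacement=False, fraction=0.01).collect():
        logger.info("sampled record: %s", record)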