Thanks everyone for helping me out. I figured it was one of those logical
errors which lead to infinite loops. Actually, the indexOf operation wasn't
returning -1 where I expected it to, which was sending this into an infinite
loop (I should have thought about this): e.g. indexOf('[', 187) would
return 187 and the loop would continue forever.
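
For anyone who finds this thread later, here is a minimal sketch of a loop that cannot spin, assuming the goal is still to collect every "[...]" span from the line. The class and method names are made up for illustration and this is not the actual UDF. It relies on two documented behaviours of String.indexOf(ch, fromIndex): the search includes fromIndex itself, and a negative fromIndex is treated as 0. The second one means that reusing a -1 result as the next start position sends the search back to the beginning of the string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the UDF from this thread.
final class BracketCategories {

    // Returns every "[...]" span found in someLog, brackets included.
    static List<String> extract(String someLog) {
        List<String> cats = new ArrayList<String>();
        int startInd = 0;
        // indexOf returns -1 when nothing is found, so test ">= 0".
        // Testing "> 0" would also skip a '[' at position 0.
        while ((startInd = someLog.indexOf('[', startInd)) >= 0) {
            int endInd = someLog.indexOf(']', startInd + 1);
            if (endInd < 0) {
                // Unmatched '[': stop here instead of reusing -1 as the
                // next start index, which would restart the search at 0.
                break;
            }
            cats.add(someLog.substring(startInd, endInd + 1));
            // Always move strictly past the match so the loop makes progress.
            startInd = endInd + 1;
        }
        return cats;
    }
}
```

The point is only that every iteration either advances startInd or exits, so the list cannot grow without bound on a malformed line.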
Thanks again,
Aniket

On Thu, February 24, 2011 7:47 pm, Aniket Mokashi wrote:
> This is a map side udf.
> pig script loads a log file and grabs contents inside angle brackets.
> a = load; b = foreach a generate F(a); dump b;
>
> I see following on tasktrackers-
> 2011-02-23 18:01:25,992 INFO
> org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
> - Collection threshold init = 5439488(5312K) used = 409337824(399743K)
> committed = 534118400(521600K) max = 715849728(699072K)
> 2011-02-23 18:01:26,102 INFO
> org.apache.pig.impl.util.SpillableMemoryManager: first memory handler
> call- Usage threshold init = 5439488(5312K) used = 546751088(533936K)
> committed = 671547392(655808K) max = 715849728(699072K)
>
> I am trying out some changes in udf to see if they work.
>
> Thanks,
> Aniket
>
> On Thu, February 24, 2011 7:25 pm, Daniel Dai wrote:
>
>> Hi, Aniket,
>> What is your Pig script? Is the UDF in map side or reduce side?
>>
>> Daniel
>>
>> Dmitriy Ryaboy wrote:
>>
>>> That's a max of 3.3K single-character strings. Even with the java
>>> overhead that shouldn't be more than a meg right? none of these should
>>> make it out of young gen assuming the list "cats" doesn't stick
>>> around outside the udf.
>>>
>>> On Thu, Feb 24, 2011 at 3:49 PM, Aniket Mokashi
>>> <[email protected]>wrote:
>>>
>>>> Hi Jai,
>>>>
>>>> Thanks for your email. I suspect that its the Strings in tight loop
>>>> reason as you have suggested. I have a loop in my udf that does
>>>> the following.
>>>>
>>>> while((startInd = someLog.indexOf('[',startInd)) > 0) {
>>>>     endInd = someLog.indexOf(']', startInd);
>>>>     if(endInd > 0) {
>>>>         category = someLog.substring(startInd, endInd+1);
>>>>         cats.add(category);
>>>>     }
>>>>     startInd = endInd;
>>>> }
>>>>
>>>> My jobs are failing in both local and mr mode. UDF works fine for a
>>>> smaller input (a few lines). Also, I checked that sizeof someLog
>>>> doesnt exceed a 10000.
>>>>
>>>> Thanks,
>>>> Aniket
>>>>
>>>> On Thu, February 24, 2011 3:58 am, Jai Krishna wrote:
>>>>
>>>>> Sharing the code would be useful as mentioned. Also of help would
>>>>> the heap settings that the JVM had.
>>>>>
>>>>> However, off the top of my head, one common situation (esp. in
>>>>> text processing/tokenizing) is instantiating Strings in a tight
>>>>> loop.
>>>>>
>>>>> Besides you could also exercise your UDF in a local JVM and take
>>>>> a heap dump / profile it. If your heap is less than 512M, you
>>>>> could use basic profiling via hprof/hat (see
>>>>> http://java.sun.com/developer/technicalArticles/Programming/HPROF.html).
>>>>>
>>>>> Thanks,
>>>>> Jai
>>>>>
>>>>> On 2/24/11 9:26 AM, "Dmitriy Ryaboy" <[email protected]> wrote:
>>>>>
>>>>> Aniket, share the code?
>>>>> It really depends on how you create them.
>>>>>
>>>>> -D
>>>>>
>>>>> On Wed, Feb 23, 2011 at 7:49 PM, Aniket Mokashi
>>>>> <[email protected]>wrote:
>>>>>
>>>>>> I ve written a simple UDF that parses a chararray (which looks
>>>>>> like ...[a].....[b]...[a]...) to capture stuff inside brackets
>>>>>> and return them as String a=2;b=1; and so on. The input
>>>>>> chararray are rarely more than 1000 characters and are not more
>>>>>> than 100000 (I ve added log.warn in my udf to ensure this). But,
>>>>>> I still see java
>>>>>> heap error while running this udf (even in local mode, the job
>>>>>> simply fails). My assumption is maps and lists that I use
>>>>>> locally will be recollected by gc. Am I missing something?
>>>>>>
>>>>>> Thanks,
>>>>>> Aniket
