There are several different small lookup files, so I need to process each lookup file individually. From your example it could be done this way:
a = LOAD 'small1'; -- for example, taking source_id=1 --> then find source_name
d = LOAD 'small2'; -- for example, taking campaign_id=2 --> then find campaign_name
e = LOAD 'small3'; -- for example, taking offer_id=3 --> then find offer_name
B = LOAD 'big';
C = JOIN B BY 1, a BY 1 USING 'replicated';
f = JOIN C BY 1, d BY 1 USING 'replicated';
g = JOIN f BY 1, e BY 1 USING 'replicated';
dump g;

small1, small2 and small3 are different files, so they store different rows. At the end of the process I need to attach them to all rows in my big file. I know HDFS doesn't perform well with small files, but the data originally lives in a different environment; I pull it from there and load it into HDFS. Anyway, because of our architecture I can't change that right now.

Thanks
Best regards...


On Tue, Nov 5, 2013 at 7:43 PM, Pradeep Gollakota <pradeep...@gmail.com> wrote:

> CROSS is grossly expensive to compute, so I'm not surprised that the
> performance isn't good enough. Are you repeating your LOAD and FILTER ops
> for every one of your small files? At the end of the day, what is it that
> you're trying to accomplish? Find the 1 row you're after and attach it to
> all rows in your big file?
>
> In terms of using DistributedCache, if you're computing the cross product
> of two (and no more than two) relations, AND one of the relations is small
> enough to fit in memory, you can use a replicated JOIN instead, which would
> be much more performant.
>
> A = LOAD 'small';
> B = LOAD 'big';
> C = JOIN B BY 1, A BY 1 USING 'replicated';
> dump C;
>
> Note that the smaller relation that will be loaded into memory needs to be
> specified second in the JOIN statement.
>
> Also keep in mind that HDFS doesn't perform well with lots of small files.
> If your design has (lots of) small files, you might benefit from loading
> that data into some database (e.g. HBase).
>
>
> On Tue, Nov 5, 2013 at 7:29 AM, burakkk <burak.isi...@gmail.com> wrote:
>
> > Hi,
> > I'm using Pig 0.8.1-cdh3u5. Is there any method to use the distributed
> > cache inside Pig?
> >
> > My problem is this: I have lots of small files in HDFS, let's say 10
> > files. Each file contains more than one row, but I need only one row,
> > and there isn't any relationship between the files. So I filter them
> > down to what I need and then join them without any relationship
> > (cross join). This is my workaround solution:
> >
> > a = LOAD 'smallFile1'; -- ex: row count: 1000
> > b = FILTER a BY myrow == 'filter by exp1';
> > c = LOAD 'smallFile2'; -- ex: row count: 30000
> > d = FILTER c BY myrow2 == 'filter by exp2';
> > e = CROSS b, d;
> > ...
> > f = LOAD 'bigFile'; -- ex: row count: 50mio
> > g = CROSS e, f;
> >
> > But its performance isn't good enough. So if I could use the distributed
> > cache inside a Pig script, I could keep the files that I first read and
> > filter in memory and look them up there. What is your suggestion? Is
> > there any other performance-efficient way to do it?
> >
> > Thanks
> > Best regards...
> >
> >
> > --
> >
> > *BURAK ISIKLI* | *http://burakisikli.wordpress.com
> > <http://burakisikli.wordpress.com>*
>
>

--
*BURAK ISIKLI* | *http://burakisikli.wordpress.com <http://burakisikli.wordpress.com>*
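
For reference, a minimal sketch of what the three chained replicated joins could look like once the actual key columns are used instead of the constant-key trick above. The schemas and field names (source_id, campaign_id, offer_id, amount) are assumptions for illustration, not taken from the thread; adjust them to the real files.

-- Lookup files: each row maps an id to a name (assumed schemas).
a = LOAD 'small1' AS (source_id:int, source_name:chararray);
d = LOAD 'small2' AS (campaign_id:int, campaign_name:chararray);
e = LOAD 'small3' AS (offer_id:int, offer_name:chararray);

-- Big fact file, assumed to carry the three foreign keys plus a measure.
B = LOAD 'big' AS (source_id:int, campaign_id:int, offer_id:int, amount:double);

-- Chain replicated joins: the big side goes first, the small (in-memory) side second.
C = JOIN B BY source_id, a BY source_id USING 'replicated';
F = JOIN C BY campaign_id, d BY campaign_id USING 'replicated';
G = JOIN F BY offer_id, e BY offer_id USING 'replicated';

DUMP G;

Each of a, d and e still has to fit in memory for the replicated join to apply, and filtering them down to the needed rows before the join (as in the original workaround) keeps that footprint small.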