Hi there-

Because your topic is web crawling, you might want to read the BigTable
paper: its running example is a table of crawled web pages.

You can find a link to it, along with other background material, in the
RefGuide:

http://hbase.apache.org/book.html#other.info.papers
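
One idea from that paper that may bear on your question: its example table
keys each page by its URL with the hostname components reversed, so that
pages from the same site sort next to each other. Here is a rough sketch of
that key scheme in Python. The function name and the exact key layout are
my own assumptions for illustration, not anything defined by HBase or by
the paper, so adapt it to your own table design.

```python
from urllib.parse import urlsplit


def reversed_host_row_key(url: str) -> str:
    """Build a row key with the hostname components reversed.

    Hypothetical helper: the layout (reversed host + path + query) is an
    assumption for illustration, not an HBase API.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""          # urlsplit lowercases the hostname
    reversed_host = ".".join(reversed(host.split(".")))
    path = parts.path or "/"             # treat a bare host as the root page
    if parts.query:
        path += "?" + parts.query
    return reversed_host + path


if __name__ == "__main__":
    urls = [
        "http://maps.google.com/index.html",
        "http://www.cnn.com",
        "http://www.google.com/about",
    ]
    # Sorted keys group pages of the same domain together.
    for key in sorted(reversed_host_row_key(u) for u in urls):
        print(key)
```

Since rows are stored sorted by key, a layout like this would let a job
scan one site's rows as a contiguous range, which might give you the
per-site control you were after in your option 1.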






On 4/18/12 2:08 PM, "petri koski" <[email protected]> wrote:

>Hello,
>
>I am quite new to HBase, and here comes my question:
>
>I have a table. In my Hadoop job I download web pages in the map phase,
>extract the URLs found, and save them in the reduce phase. I read from one
>table and save (put) the URLs to the same table to avoid duplicates etc.
>
>I will get millions of unique rows. Quite often the timestamps are reset
>because duplicates are found.
>
>Question is:
>
>Which way should I keep running those M/R jobs:
>
>1. Somehow save the last map's row position and pass that info to the next
>map to start from, so that I wouldn't have to process already-processed
>rows. Of course I have to spider the sites all over again after they are
>finished, but this option would give me some control over when a site is
>finished.
>2. Every time, start from row 0, proceed to the last one, then start all
>over again and go a little bit deeper into the site being "spidered".
>
>Option number 2 is good because many sites put their newest info on their
>first pages, so that way I could keep my own data updated from those sites,
>but the flip side is that I don't know when a site has been crawled.
>
>Option number 1 seemed wise, but there is something un-HBase-like and
>un-Hadoop-like about it: they are meant to take everything in at once and
>process it at once, and in case you need more, you chain M/R jobs. So my
>option number 2 is more the Hadoop/HBase way. And like I said before, I
>will not just spider one site once and forget it; I will do it again after
>I have finished doing it once, etc.
>
>Which one is better?
>
>Yours,
>
>Peter

