Hi Lewis,

Thank you for your answer, and for your patience in responding so thoroughly to my questions! I am currently trying a more lightweight tool, crawler4j. If I don't manage to get something useful out of it, I will return to Nutch.
Vlad

On Mon, Jun 25, 2012 at 11:42 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Vlad,
>
> On Fri, Jun 22, 2012 at 10:10 AM, Vlad Paunescu <[email protected]> wrote:
> > The problem is that I can't keep these synchronized (reading the content
> > for a URL, and reading the parse_data for the same URL).
>
> Why can't you keep this in sync? What have you tried so far? Why isn't
> it working?
>
> > I would also want to know if parse_data processes the outlinks inside
> > the CSS, for example (because in CSS we can have background URLs, for
> > instance, and they can be absolute too).
>
> Do you have an example of this CSS?
> Out of the box, Nutch doesn't process outlinks; instead it extracts them
> from fetched pages and stores them for future fetching. Only when an
> actual URL (outlink) is fetched will it be processed. In this case these
> URLs would be treated exactly the same as the ones you fetch and process
> above.
>
> > - Third, and this isn't done yet, is to create a directory structure on
> > the disk where I write every website page that was crawled. Nutch
> > doesn't provide this by default, as far as I know. Am I right?
>
> Nutch has a number of tools for obtaining specific data about URLs
> from the crawldb, linkdb, etc. Are you wanting to store a complete copy
> of every page locally? If this is the case, then no, Nutch doesn't do
> this.
>
> > Another way of doing this is to have some way to be notified in our
> > program when a page is fetched by the fetcher, and to do something with
> > it. I would need to attach a listener to the fetcher, but I am not sure
> > this can be done without modifying the Nutch source code.
>
> A modification to a src distribution is most certainly required
> here... This will require you to ascertain exactly when a FetchItem
> (@see Fetcher.java) is queued and subsequently fetched. Once this has
> been successful you will wish to obtain notification.
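
For the on-disk copy of every crawled page discussed above, the URL-to-file mapping could look roughly like the sketch below. It is plain JDK code and does not depend on Nutch or crawler4j. The class name MirrorPaths, the methods toLocalPath and save, the "mirror" root directory and the "index.html" default are all made up for illustration, and the sketch assumes that query strings can be ignored.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch or crawler4j.
final class MirrorPaths {

    // Maps a page URL to <root>/<host>/<path>; directory-like URLs get "index.html".
    // Assumption: the query string is ignored, so URLs that differ only in their
    // query map to the same file.
    static Path toLocalPath(Path root, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html";
        }
        Path result = root.resolve(host);
        for (String segment : path.split("/")) {
            // Skip empty and dot segments so the result stays under root.
            if (segment.isEmpty() || segment.equals(".") || segment.equals("..")) {
                continue;
            }
            result = result.resolve(segment);
        }
        return result;
    }

    // Writes the fetched bytes to the mapped location, creating parent directories.
    static Path save(Path root, String url, byte[] content) throws IOException {
        Path target = toLocalPath(root, url);
        Files.createDirectories(target.getParent());
        Files.write(target, content);
        return target;
    }

    public static void main(String[] args) {
        Path root = Paths.get("mirror");
        System.out.println(toLocalPath(root, "http://example.com/docs/"));
        System.out.println(toLocalPath(root, "http://example.com/css/site.css"));
    }
}
```

Whatever ends up delivering the fetched bytes (a patched fetcher, or a callback in another crawler) would only need to call save(root, url, content). One known gap in this sketch: a site serving both /docs and /docs/ would need the same name as a file and as a directory, which would have to be handled separately.
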
>
> > I also run Nutch in local mode, where Hadoop is writing to my local
> > disk. Do you think that the steps I need to implement in order to write
> > the file structure on disk should be designed as a MapReduce Hadoop
> > program, or is a normal approach better?
>
> Initially I would say that this depends on what scale you wish to run
> the application at?
> If you get it working w/o running over Hadoop then fine; if you then
> need to scale it up, then experiment with running MR...
>
> > So, finally, do you think that Nutch is the appropriate tool for what we
> > need, or do I have to choose another tool?
>
> I think in order to achieve what you are trying to do, there may be a
> number of places within Nutch where you'll need to hack. I'm kinda still
> uncertain of the complete aim of the project you're working on, as
> it seems like you're trying to use Nutch in a slightly different manner
> than users usually do, which undoubtedly may/will require some code
> alterations.
>
> hth
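
On the earlier question about URLs inside CSS (background images and the like, relative or absolute), here is a minimal sketch of how such references could be pulled out of a fetched stylesheet and resolved against the stylesheet's own URL. It is plain JDK code and makes no claim about what Nutch's parsers do. The class name CssUrlExtractor and the method extractUrls are hypothetical, the regular expression only handles the url(...) form (not @import "file.css" without url()), and data: URIs are skipped on the assumption that they are not worth fetching.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or crawler4j.
final class CssUrlExtractor {

    // Matches url(...) with optional matching single or double quotes.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns every url(...) target in the stylesheet as an absolute URL,
    // resolved against the URL the stylesheet itself was fetched from.
    static List<String> extractUrls(String stylesheetUrl, String css) {
        URI base = URI.create(stylesheetUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            String target = m.group(2).trim();
            // Assumption: inline data: URIs and empty targets are not outlinks.
            if (target.isEmpty() || target.startsWith("data:")) {
                continue;
            }
            try {
                urls.add(base.resolve(target).toString());
            } catch (IllegalArgumentException e) {
                // Skip targets that are not valid URI references.
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String css = "body { background: url(\"../img/bg.png\"); }";
        System.out.println(extractUrls("http://example.com/css/site.css", css));
    }
}
```

The resolved URLs could then be queued for fetching like any other outlink, so that relative and absolute references end up being treated the same way.
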

