Hi Bayu,

On Sat, Jan 12, 2013 at 9:15 PM, Bayu Widyasanyata <[email protected]> wrote:

> Hi Tejas,
> Sorry if my questions are confusing :)
>

It's ok :)

>
> I have read your post on StackOverflow, and made some clarity for me.
>
> What makes me still didn't understand is how nutch will know when he will
> not parsed a segment (as appear on "segment already parsed")?
>

When Nutch parses a segment, it creates the parse_text, parse_data and
crawl_parse sub-directories inside that segment's directory. These store the
output of the parse command. If you run the parse command on the same segment
again, it finds that these sub-directories are already present and reports
that the segment was already parsed (the "Segment already parsed!" exception
in your log).

> Some times I should do more two times to make document (a URL) and its
> outlinks fetched and parsed by nutch (get more depth).

I didn't get what you wanted to convey here.

>
> Back to my question.
> As a simple example is the front page of newspaper online website.
> If they add 1 (one) news on frontpage, does nutch will create new segment
> inside crawl/segments directory (e.g. YYYYMMDDMMSSSS format)?
>

A segment is created for each crawl round, not for each individual URL.

>
> Hence, if nutch cannot identify if a page is actually being updated (for
> above example is frontpage of newspaper online add 1 news / 1 outlink),
> then should we force nutch to re-fetch the URL? Is it correct?
> Or we will add -addays option periodically to ensure that we have updated
> database?
>

Well, if you know that the front page is updated frequently, set
"db.fetch.interval.default" to a lower value so that URLs become eligible for
re-fetch sooner. By default, once a URL has been fetched successfully, it
becomes eligible for re-fetching after 30 days.

>
> Thanks.-
>
> On Sat, Jan 12, 2013 at 1:09 PM, Tejas Patil <[email protected]> wrote:
>
> > Hi Bayu,
> >
> > I did not understand your question properly but I will try to address your
> > questions as far as I can.
> >
> > Generate phase creates a segment which will just have the fetch list (this
> > is inside the "crawl_generate" directory inside segments). If there are no
> > urls in the crawldb which are eligible for fetching at that point, then it
> > will end up creating an empty directory.
> >
> > It is during Fetch and Parse phases, the actual data is populated inside
> > the segments. ([0] is a shameless plug of my answer on StackOverlfow which
> > has description about the subdirectories inside the segments dir). During
> > generate or fetch, Nutch cannot identify if a page is actually being
> > updated at the content owners' end. It will have to re-fetch the
> > corresponding url.
> >
> > Does that answer what you wanted ?
> >
> > [0] :
> > http://stackoverflow.com/questions/10225239/what-the-outputs-exactly-are-when-integrating-nutch1-4-and-solr/10262243
> >
> > Thanks,
> > Tejas Patil
> >
> > On Fri, Jan 11, 2013 at 5:35 PM, Bayu Widyasanyata
> > <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > When "nutch generate" is executed the new segments will create and somehow
> > > they would'nt?
> > > It's when "segment already parsed" generated, in example:
> > >
> > > ParseSegment: segment: crawl/segments/20130106091814
> > > Exception in thread "main" java.io.IOException: Segment already parsed!
> > >
> > > My question is how the new segments is created or how nutch know that the
> > > page is updated?
> > > Does it handle by fetching process which know when a page is updated?
> > >
> > > Does my analyzing above is correct?
> > >
> > > Now, I do "trick" to force the generating of segments by put adddays
> > > command of nutch.
> > >
> > > Thanks,
> > >
> > > --
> > > wassalam,
> > > [bayu]
> > >
> >
>
> --
> wassalam,
> [bayu]
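
For illustration only, the two ideas discussed in this thread can be sketched
in a few lines of Python. This is not Nutch code. The function names
(is_segment_parsed, unparsed_segments, is_due_for_refetch) are hypothetical,
the "any of the three directories exists" test is an assumption about how the
check could be approximated from outside Nutch, and the re-fetch rule is a
simplified model of "eligible again after the fetch interval", with add_days
standing in for the effect of shifting the current time forward.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Sub-directories written by the parse step, as named in the reply above.
PARSE_DIRS = ("parse_text", "parse_data", "crawl_parse")

# Default re-fetch interval mentioned in the reply above (30 days).
DEFAULT_INTERVAL = timedelta(days=30)


def is_segment_parsed(segment_dir):
    """Return True if the segment already contains any parse output directory.

    Assumption: the presence of any one of PARSE_DIRS is treated as "parsed".
    """
    segment = Path(segment_dir)
    return any((segment / name).is_dir() for name in PARSE_DIRS)


def unparsed_segments(segments_root):
    """Return the sorted names of segments under segments_root not yet parsed."""
    root = Path(segments_root)
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and not is_segment_parsed(p)
    )


def is_due_for_refetch(last_fetch, now, interval=DEFAULT_INTERVAL, add_days=0):
    """Simplified model: a URL is due once `interval` has passed since last_fetch.

    add_days shifts "now" forward, so a URL can be treated as due earlier.
    """
    return now + timedelta(days=add_days) >= last_fetch + interval
```

With this model, a page fetched on 2013-01-01 is not due on 2013-01-12 under
the 30-day default, but it is due if the interval is lowered to one day or if
"now" is shifted forward by 30 days. Calling unparsed_segments("crawl/segments")
would list the segments that can still be parsed without hitting the
"Segment already parsed!" error.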

