Lewis has a point. We have seen such a case before. Depending on the editor you used to create or modify the seeds file, a backup file may have been generated alongside it. You can check for one by running 'ls -a' inside the seeds directory, and remove it if it is there.
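A minimal sketch of that check, in case it helps. The function name clean_seed_backups is hypothetical, the default directory name urls is taken from Lewis's question, and the file patterns (seed.txt~, #seed.txt#, .seed.txt.swp) are only assumptions about what common editors leave behind, so adjust them to your setup:

```shell
# Hypothetical helper: list a seeds directory and delete editor backup files.
# Assumption: the seeds directory is passed as $1 and defaults to "urls".
clean_seed_backups() {
  seed_dir="${1:-urls}"

  # Show everything in the directory, including hidden files.
  ls -a "$seed_dir"

  # Print, then delete, files matching common backup/swap patterns.
  # The patterns are assumptions: name~ , #name# , .name.swp
  find "$seed_dir" -maxdepth 1 -type f \
    \( -name '*~' -o -name '#*#' -o -name '.*.sw?' \) -print -delete
}
```

Running clean_seed_backups urls before the inject step should leave only the real seed file in place. If you would rather look first, drop -delete and the command only prints the matches.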
From my Android phone on T-Mobile. The first nationwide 4G network.

-------- Original message --------
From: Lewis John Mcgibbney <[email protected]>
Date: 07/01/2013 11:17 AM (GMT-08:00)
To: [email protected]
Subject: Re: Questions/issues with nutch

Is there a temporary file within the urls directory. something like seed.txt~ ?

On Monday, July 1, 2013, h b <[email protected]> wrote:
> Hi,
> I started to inspect the content of the crawled html.
> I have 2 urls in my seed.txt. So I should just have 2 documents in my solr
> response, right? I dropped 'webpage' database and recreated it by running
> just a single iteration of inject,generate,fetch,parse,solr
> However, I am seeing 8 different documents and the url key in these does
> not even match the url from my seed. What could be going wrong here. It
> almost feels like the crawl crawled entirely different page than what I
> requested. I verified the urls from my seed.txt and they do not redirect.
>
>
> On Sun, Jun 30, 2013 at 8:40 AM, h b <[email protected]> wrote:
>
>> Because we have a separate non Java legacy process that would take care of
>> the parsing, and it requires raw html. It's more of a process reasoning
>> than anything else.
>> On Jun 30, 2013 8:06 AM, "Tejas Patil" <[email protected]> wrote:
>>
>>> I am curious to know why do needed the raw html content instead of parsed
>>> stuff. Search engines are meant to index parsed text. The data to be stored
>>> and indexed reduces after parsing.
>>>
>>>
>>> On Sat, Jun 29, 2013 at 9:20 PM, h b <[email protected]> wrote:
>>>
>>> > Thanks Tejas,
>>> > I have just 2 urls in my seed file, and the second run of fetch ran for a
>>> > few hours. I will verify if I got what I wanted.
>>> >
>>> > Regarding the raw html, its a ugly hack, so I did not really create a
>>> > patch. But this is what I did
>>> >
>>> >
>>> > In
>>> > src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
>>> > getParse method,
>>> >
>>> >       //text = sb.toString();
>>> >       text = new String(page.getContent().array());
>>> >
>>> > Would be nice to make this as a configuration in the plugin xml.
>>> >
>>> > Other thing I will try soon is to extract the content only for a specific
>>> > depth.
>>> >
>>> >
>>> >
>>> > On Sat, Jun 29, 2013 at 12:49 AM, Tejas Patil <[email protected]> wrote:
>>> >
>>> > > Yes. Nutch would parse the HTML and extract the content out of it. Tweaking
>>> > > around the code surrounding the parser would have made that happen. If you
>>> > > did something else, would you mind sharing it ?
>>> > >
>>> > > The "depth" is used by the Crawl class in 1.x which is deprecated in 2.x.
>>> > > Use bin/crawl instead.
>>> > > While running the "bin/crawl" script, the "<numberOfRounds>" option is
>>> > > nothing but the depth till which you want the crawling to be performed.
>>> > >
>>> > > If you want to use the individual commands instead, run generate -> fetch
>>> > > -> parse -> update multiple times. The crawl script internally does the
>>> > > same thing.
>>> > > eg. If you want to fetch till depth 3, this is how you could do:
>>> > > inject -> (generate -> fetch -> parse -> update)
>>> > >        -> (generate -> fetch -> parse -> update)
>>> > >        -> (generate -> fetch -> parse -> update)
>>> > >        -> solrindex
>>> > >
>>> > > On Fri, Jun 28, 2013 at 7:24 PM, h b <[email protected]> wrote:
>>> > >
>>> > > > Ok, I tweaked the code a bit to extract the html as is from the parser, to
>>> > > > realize that it is too much of a text and too much depth of crawling. So I
>>> > > > am looking to see if I can somehow limit the depth. Nutch 1.x docs mention
>>> > > > about the -depth parameter. However, I do not see this in the
>>> > > > nutch-default.xml under Nutch 2.x. The -topN is used for number of links
>>> > > > per depth. So for Nutch 2.x where/how do I set the depth?
>>> > > >
>>> > > >
>>> > > > On Fri, Jun 28, 2013 at 11:32 AM, h b <[email protected]> wrote:
>>> > > >
>>>

--
*Lewis*

