Re: Questions/issues with nutch

tejas.patil.cs Mon, 01 Jul 2013 16:30:45 -0700

Lewis has a point. We have seen such cases before. Depending on the editor you 
used for creating/modifying the seeds file, there can be a backup file 
generated. You can check for one by running 'ls -a' inside the seeds directory 
and remove it if it is there.
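
A minimal sketch of that check, assuming a POSIX shell and a find that
supports -maxdepth (GNU and BSD find both do). The function name
list_seed_backups and the filename patterns are illustrative assumptions, not
anything Nutch defines; they cover the usual leftovers from Emacs/gedit
(name~), Vim (.name.swp) and Emacs autosave (#name#).

```shell
# Print likely editor backup/swap files sitting directly in a seed directory.
# "urls" is only a default guess; pass the directory you give to inject.
list_seed_backups() {
  dir="${1:-urls}"
  find "$dir" -maxdepth 1 -type f \
    \( -name '*~' -o -name '.*.sw?' -o -name '#*#' \)
}

# Example call (uncomment after adjusting the path):
# list_seed_backups urls
```

An empty result only means none of these patterns matched, so a plain
'ls -a' is still worth a look before re-running inject.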




-------- Original message --------
From: Lewis John Mcgibbney <[email protected]> 
Date: 07/01/2013  11:17 AM  (GMT-08:00) 
To: [email protected] 
Subject: Re: Questions/issues with nutch 
 
Is there a temporary file within the urls directory,
something like seed.txt~ ?

On Monday, July 1, 2013, h b <[email protected]> wrote:
> Hi,
> I started to inspect the content of the crawled html.
> I have 2 urls in my seed.txt. So I should just have 2 documents in my solr
> response, right? I dropped 'webpage' database and recreated it by running
> just a single iteration of inject,generate,fetch,parse,solr
> However, I am seeing 8 different documents and the url key in these does
> not even match the url from my seed. What could be going wrong here? It
> almost feels like the crawl fetched an entirely different page than what I
> requested. I verified the urls from my seed.txt and they do not redirect.
>
>
> On Sun, Jun 30, 2013 at 8:40 AM, h b <[email protected]> wrote:
>
>> Because we have a separate non Java legacy process that would take care of
>> the parsing, and it requires raw html. It's more of a process reasoning
>> than anything else.
>> On Jun 30, 2013 8:06 AM, "Tejas Patil" <[email protected]> wrote:
>>
>>> I am curious to know why you needed the raw html content instead of the
>>> parsed stuff. Search engines are meant to index parsed text. The data to
>>> be stored and indexed reduces after parsing.
>>>
>>>
>>> On Sat, Jun 29, 2013 at 9:20 PM, h b <[email protected]> wrote:
>>>
>>> > Thanks Tejas,
>>> > I have just 2 urls in my seed file, and the second run of fetch ran for
>>> > a few hours. I will verify if I got what I wanted.
>>> >
>>> > Regarding the raw html, it's an ugly hack, so I did not really create a
>>> > patch. But this is what I did:
>>> >
>>> >
>>> > In
>>> > src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
>>> > getParse method,
>>> >
>>> >       //text = sb.toString();
>>> >       text = new String(page.getContent().array());
>>> >
>>> > Would be nice to make this a configuration option in the plugin xml.
>>> >
>>> > Other thing I will try soon is to extract the content only for a
>>> > specific depth.
>>> >
>>> >
>>> >
>>> > On Sat, Jun 29, 2013 at 12:49 AM, Tejas Patil <[email protected]> wrote:
>>> >
>>> > > Yes. Nutch would parse the HTML and extract the content out of it.
>>> > > Tweaking around the code surrounding the parser would have made that
>>> > > happen. If you did something else, would you mind sharing it?
>>> > >
>>> > > The "depth" is used by the Crawl class in 1.x which is deprecated in
>>> > > 2.x. Use bin/crawl instead.
>>> > > While running the "bin/crawl" script, the "<numberOfRounds>" option is
>>> > > nothing but the depth till which you want the crawling to be
>>> > > performed.
>>> > >
>>> > > If you want to use the individual commands instead, run generate ->
>>> > > fetch -> parse -> update multiple times. The crawl script internally
>>> > > does the same thing.
>>> > > eg. If you want to fetch till depth 3, this is how you could do:
>>> > > inject -> (generate -> fetch -> parse -> update)
>>> > >           -> (generate -> fetch -> parse -> update)
>>> > >           -> (generate -> fetch -> parse -> update)
>>> > >                -> solrindex
>>> > >
>>> > > On Fri, Jun 28, 2013 at 7:24 PM, h b <[email protected]> wrote:
>>> > >
>>> > > > Ok, I tweaked the code a bit to extract the html as is from the
>>> > > > parser, only to realize that it is too much text and too much depth
>>> > > > of crawling. So I am looking to see if I can somehow limit the
>>> > > > depth. Nutch 1.x docs mention the -depth parameter. However, I do
>>> > > > not see this in the nutch-default.xml under Nutch 2.x. The -topN is
>>> > > > used for number of links per depth. So for Nutch 2.x where/how do I
>>> > > > set the depth?
>>> > > >
>>> > > >
>>> > > > On Fri, Jun 28, 2013 at 11:32 AM, h b <[email protected]> wrote:
>>> > > >
>>>

-- 
*Lewis*
