Re: Recursively searching through web dirs

Markus Jelsma Mon, 29 Aug 2011 17:34:15 -0700

If the URLs are not linked from anywhere (i.e. they have no inlinks), you
cannot discover them. The only workaround is to inject them manually.
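
As a rough sketch of what injecting manually could look like: add the unlinked
URL to the seed list yourself before crawling. The file name `urls` and the
/kml/temp/ address are taken from the original question further down; the
variable names are only for illustration.

```shell
# Seed file and URL as used in the original question (quoted below).
SEED_FILE="urls"
HIDDEN_URL="http://www.geoglobaldomination.org/kml/temp/"

# Append the unlinked URL to the seed list, unless it is already listed.
touch "$SEED_FILE"
grep -qxF "$HIDDEN_URL" "$SEED_FILE" || echo "$HIDDEN_URL" >> "$SEED_FILE"

# Show what will be injected on the next crawl.
cat "$SEED_FILE"
```

Re-running the same crawl command against this seed list should then pick up
the directory, provided the server returns something parseable for it (for
example a directory listing).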


If they are linked from somewhere, and that page is reachable from one of
your injected URLs, then Nutch should find them unless there are too many
outlinks on a single page. By default, Nutch only processes 100 outlinks per
page.
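
To check whether a page runs into that limit, something like the following
could help. It is only a rough sketch: it counts anchor tags with grep rather
than parsing the HTML the way Nutch does, and the function names are made up
for this example. As far as I know the limit is set by the
db.max.outlinks.per.page property (nutch-default.xml, overridable in
nutch-site.xml), but verify that against your version.

```shell
# Assumption: 100 is the default limit mentioned above; adjust to your config.
MAX_OUTLINKS=100

# Rough count of <a ... href=...> tags in a saved HTML file (not a real parser).
count_outlinks() {
  grep -oi '<a[[:space:]][^>]*href=' "$1" | wc -l | tr -d ' '
}

# Report whether a saved page has more links than the limit.
check_page() {
  n=$(count_outlinks "$1")
  if [ "$n" -gt "$MAX_OUTLINKS" ]; then
    echo "$1: $n links, exceeds limit of $MAX_OUTLINKS"
  else
    echo "$1: $n links, within limit of $MAX_OUTLINKS"
  fi
}
```

Save the page first (for example with curl -s -o index.html and the page URL),
then run check_page index.html.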

Check your pages and configuration.

> Hi...the files are available on the server but are not necessarily
> hyperlinked in the html from the main page. In fact, I was just using that
> directory for storage. Now, I want to be able to discover files like them
> on other servers. This is what I am wondering is possible or not.
> 
> Thanks,
> Adam
> 
> On Wed, Aug 24, 2011 at 4:58 PM, lewis john mcgibbney <[email protected]> wrote:
> > Hi Adam,
> > 
> > My initial thoughts are that you are correct. It is very unusual for your
> > files to be located at a URL in the same domain which is not referenced
> > by the top level or a subsequent level URL within the domain.
> > 
> > What I would suggest is that you have a look through your hadoop.log as
> > well as use some of the commands which enable you to investigate your
> > crawldb, segment(s) and linkdb if you've created one.
> > 
> > Have a look at the wiki under command line options.
> > 
> > On Wed, Aug 24, 2011 at 9:03 PM, Adam Estrada <[email protected]> wrote:
> > > 
> > > All,
> > > 
> > > I have a root domain and a couple directories deep I have some files
> > > that I want to index. The problem is that they are not referenced on
> > > the main page using a hyperlink or anything like that.
> > > 
> > > http://www.geoglobaldomination.org/kml/temp/
> > > 
> > > I want to be able to crawl down in to /kml/temp/ without knowing that
> > > it's even there. Is there a way to do this in Nutch?
> > > 
> > > echo http://www.geoglobaldomination.org > urls
> > > 
> > > ./nutch crawl urls -threads 10 -depth 10 -topN 20 -solr http://172.16.2.107:8983/solr
> > > 
> > > Nothing and I suspect that it's because there is not a hyperlink on the
> > > main page.
> > > 
> > > Thoughts?
> > > Adam
> > 
> > --
> > *Lewis*
