Hi...the files are available on the server but are not necessarily hyperlinked in the HTML from the main page. In fact, I was just using that directory for storage. Now I want to be able to discover files like them on other servers, and I'm wondering whether that is possible.
Thanks,
Adam

On Wed, Aug 24, 2011 at 4:58 PM, lewis john mcgibbney <[email protected]> wrote:

> Hi Adam,
>
> My initial thoughts are that you are correct. It is very unusual for your
> files to be located at a URL in the same domain which is not referenced by
> the top-level or a subsequent-level URL within the domain.
>
> What I would suggest is that you have a look through your hadoop.log, as
> well as use some of the commands which enable you to investigate your
> crawldb, segment(s) and linkdb if you've created one.
>
> Have a look at the wiki under command line options.
>
> On Wed, Aug 24, 2011 at 9:03 PM, Adam Estrada <[email protected]> wrote:
>
>> All,
>>
>> I have a root domain, and a couple of directories deep I have some files
>> that I want to index. The problem is that they are not referenced on the
>> main page using a hyperlink or anything like that.
>>
>> http://www.geoglobaldomination.org/kml/temp/
>>
>> I want to be able to crawl down into /kml/temp/ without knowing that it's
>> even there. Is there a way to do this in Nutch?
>>
>> echo http://www.geoglobaldomination.org > urls
>>
>> ./nutch crawl urls -threads 10 -depth 10 -topN 20 -solr http://172.16.2.107:8983/solr
>>
>> Nothing, and I suspect that it's because there is not a hyperlink on the
>> main page.
>>
>> Thoughts?
>> Adam
>
> --
> *Lewis*
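For anyone hitting this thread later: a crawler can only follow links it sees, so a directory that nothing links to won't be discovered unless the server exposes a directory listing page for it, or you seed the URL explicitly. A minimal sketch of the explicit-seed approach is below. The seed-file layout matches the thread's example; the Nutch commands themselves (crawl, readdb, readseg, readlinkdb) are standard Nutch 1.x CLI entry points, but they are shown commented out since they need a full Nutch installation and crawl output directories (paths like `crawl/crawldb` are assumptions, not from the thread):

```shell
# Seed the crawler with the unlinked deep directory directly,
# alongside the root domain.
mkdir -p urls
cat > urls/seed.txt <<'EOF'
http://www.geoglobaldomination.org/
http://www.geoglobaldomination.org/kml/temp/
EOF

# Then crawl as in the thread (requires a Nutch install):
# ./nutch crawl urls -threads 10 -depth 10 -topN 20 \
#   -solr http://172.16.2.107:8983/solr

# Afterwards, inspect what was actually fetched, per Lewis's suggestion
# (output paths assumed to be under ./crawl):
# ./nutch readdb crawl/crawldb -stats
# ./nutch readseg -list -dir crawl/segments
# ./nutch readlinkdb crawl/linkdb -dump linkdump
```

If `/kml/temp/` serves an auto-generated index page (Apache's autoindex, for instance), seeding just that directory URL lets the crawler pick up every file listed in it on the next fetch cycle.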

