Re: nutch crawling file system SOLVED

alessio crisantemi Mon, 12 Mar 2012 01:39:59 -0700

I added the path of my directory to regex-urlfilter, but Nutch also crawls
other directories...
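
For reference, a minimal sketch of how ordered +/- regex rules of the kind used in regex-urlfilter.txt decide whether a URL is kept. It assumes the first matching rule wins and that a URL matching no rule is dropped. The class UrlFilterSketch and the path file:///data/pdfs/ are hypothetical, made up for illustration; this is not Nutch's own filter class. It may help explain the symptom above: if a catch-all "+." rule is still present after the directory rule, everything outside the directory is accepted as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: mimics first-match-wins +/- rules; not Nutch's actual class. */
final class UrlFilterSketch {

    /** One rule: a sign ('+' accept, '-' reject) and a regular expression. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines such as "+^file:///data/pdfs/" or "-."; blanks and '#' comments are skipped. */
    UrlFilterSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', trimmed.substring(1)));
        }
    }

    /** Returns the URL if the first matching rule accepts it, otherwise null. */
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched: the URL is rejected
    }

    public static void main(String[] args) {
        // Hypothetical directory; "-." rejects everything that is not under it.
        UrlFilterSketch strict = new UrlFilterSketch("+^file:///data/pdfs/", "-.");
        // Same directory rule, but a trailing "+." still accepts everything else.
        UrlFilterSketch loose = new UrlFilterSketch("+^file:///data/pdfs/", "+.");

        System.out.println(strict.filter("file:///data/pdfs/report.pdf"));
        System.out.println(strict.filter("file:///data/"));
        System.out.println(loose.filter("file:///data/"));
    }
}
```

With the strict rule set only URLs under the hypothetical directory survive; with the loose one the parent directory passes the filter too, so it would still be fetched.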


One more thing: I followed your suggestions and indexed my root again, but I
still have an index with the names of my PDF files and not their
contents.

I don't understand.
alessio

On 12 March 2012 at 06:06, remi tassing <tassingr...@gmail.com> wrote:

> Using crawl-urlfilter (or regex-urlfilter depending on which one you're
> using), you should be able to solve this. Unless you're not clear on what
> folders to exclude...?
>
> On Sunday, March 11, 2012, alessio crisantemi <
> alessio.crisant...@gmail.com>
> wrote:
> > Thank you, Remi, for your precious help. I will try again and write to
> > you with the results.
> > But I have another little question: how can I limit the crawl
> > to only my selected root?
> >
> > Because every time, Nutch also crawls the parent directories. I read that
> > "The code that is responsible for this is in
> > org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File
> > f)."
> >
> > And someone suggested changing the following line:
> > this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
> >
> > to
> > this.content = list2html(f.listFiles(), path, false);
> >
> > and recompiling.
> >
> > But in my class file I have just this line... And that is not a simple approach.
> >
> > Is there another method?
> >
> > thank you
> >
> > alessio
> >
> >
> >
> > On 11 March 2012 at 18:32, Lewis John Mcgibbney <
> > lewis.mcgibb...@gmail.com> wrote:
> >
> >> Please see below
> >>
> >> On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi <
> >> alessio.crisant...@gmail.com> wrote:
> >>
> >> >
> >> > [1]
> >> http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
> >> >
> >>
> >> I've now updated this link, thanks for pointing this out.
> >>
> >>
> >> > And now I have another problem:
> >> > I crawled my local file system: a directory with a lot of PDF files.
> >> > Everything works, and Nutch indexes the results into Solr.
> >> >
> >>
> >> OK
> >>
> >>
> >> > But this is the problem: when I submit a query to Solr, I can see only
> >> > a list of files, and not the PDF contents.
> >> > Why, in your opinion?
> >> >
> >>
> >> Well, this might be to do with your file.content.limit in nutch-site.xml;
> >> maybe your documents are being truncated if they are too large.
> >> Additionally, your Solr mappings and/or schema configuration may need to
> >> be tweaked slightly to permit you to view snippets of the PDF content
> >> within your Solr search results. In your schema configuration for
> >> index-basic, try changing
> >>
> >> <field name="content" type="text" stored="false" indexed="true"/>
> >>
> >> to
> >>
> >> <field name="content" type="text" stored="true" indexed="true"/>
> >>
> >>
> >> You will need to reindex your content if you wish to see the results
> >> through Solr.
> >>
> >
>
