I added the path of my directory to regex-urlfilter, but Nutch still crawls other directories as well...
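For reference, a minimal sketch of how the rules in regex-urlfilter.txt might be arranged to keep the crawl inside one directory. The path /home/user/pdfs/ is purely hypothetical, and the exact form of the file URLs (file:/ versus file:///) is an assumption that may differ between setups, which is why the pattern accepts one or more slashes. The filter applies the first rule that matches a URL, and a URL matching no rule is ignored, so the order matters.

```
# Skip protocols we do not want. Note that "file" must NOT be in this list
# when crawling the local file system.
-^(http|https|ftp|mailto):

# Accept only URLs under the selected root (hypothetical path).
+^file:/+home/user/pdfs/

# Reject everything else, including parent directories.
-.
```

If the stock catch-all rule `+.` is still present above the final `-.`, it will match first and let every other directory through, which is one possible cause of the behaviour described here.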
Also: I followed your suggestions and indexed my root again, but I still have an index with the names of my PDF files and not their content. I don't understand...

alessio

On 12 March 2012 at 06:06, remi tassing <tassingr...@gmail.com> wrote:

> Using crawl-urlfilter (or regex-urlfilter, depending on which one you're
> using), you should be able to solve this. Unless you're not clear on what
> folders to exclude...?
>
> On Sunday, March 11, 2012, alessio crisantemi <alessio.crisant...@gmail.com>
> wrote:
> > Thank you, Remi, for your precious help. I will try again and write to you
> > with the results.
> > But I have another little question: how can I limit the crawling
> > to my selected root only?
> >
> > Because every time, Nutch also crawls the parent directories. I read that
> > "The code that is responsible for this is in
> > org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f)."
> >
> > And someone suggested changing the following line:
> > this.content = list2html(f.listFiles(), path, "/".equals(path) ? false : true);
> >
> > to
> > this.content = list2html(f.listFiles(), path, false);
> >
> > and recompiling.
> >
> > But in my class file I have just this line... and that is not a simple approach.
> >
> > There is another method, I suppose?
> >
> > Thank you
> >
> > alessio
> >
> > On 11 March 2012 at 18:32, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
> > wrote:
> >
> >> Please see below
> >>
> >> On Sun, Mar 11, 2012 at 5:10 PM, alessio crisantemi
> >> <alessio.crisant...@gmail.com> wrote:
> >>
> >> > [1] http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
> >>
> >> I've now updated this link, thanks for pointing this out.
> >>
> >> > And now I have another problem:
> >> > I crawled my local file system: a directory with a lot of PDF files.
> >> > Everything works, and Nutch indexes the results into Solr.
> >> >
> >> OK
> >>
> >> > But this is the problem: when I submit a query to Solr, I can see only a
> >> > list of files, and not the PDF contents.
> >> > Why, in your opinion?
> >>
> >> Well, this might be to do with your file.content.limit in nutch-site.xml;
> >> maybe your documents are being truncated if they are too large.
> >> Additionally, your Solr mappings and/or schema configuration may need to be
> >> tweaked slightly to permit you to view snippets of the PDF content within
> >> your Solr search results. In your schema configuration for index-basic, try
> >> changing
> >>
> >> <field name="content" type="text" stored="false" indexed="true"/>
> >>
> >> to
> >>
> >> <field name="content" type="text" stored="true" indexed="true"/>
> >>
> >> You will need to reindex your content if you wish to see the results
> >> through Solr.
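
The file.content.limit suggestion quoted above could be sketched as the following override in nutch-site.xml. This is only an illustration: the value -1 is an assumption about what suits this crawl. According to the property's description in nutch-default.xml, a negative value disables truncation, while a non-negative value is a byte limit beyond which content is cut off.

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml -->
<property>
  <name>file.content.limit</name>
  <!-- -1 = do not truncate; assumed here so that large PDFs are fetched whole -->
  <value>-1</value>
  <description>Length limit, in bytes, for content fetched through the
  file:// protocol.</description>
</property>
```

A truncated PDF generally cannot be parsed, so raising or removing the limit only helps if the files are crawled and indexed again afterwards.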