Re: Indexing static html files

2008-07-07 Thread Winton Davies
I meant that you could just do an http://external_url.com/y/z/ crawl. But yes, if you have pages from someone else's server locally, you will need to rewrite the BASE component of the URL in the search results. For that you could probably just hack search.jsp (but don't tell anyone I told you
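
The rewrite itself is just a prefix swap on each result URL. A sketch of the idea in shell form, with placeholder paths and hostnames (in search.jsp you would apply the same substitution wherever the hit URL is printed):

    # hypothetical example: map the local crawl prefix back to the live site
    echo 'file:///x/y/z/docs/foo.html' \
      | sed 's|^file:///x/y/z/|http://external_url.com/|'
    # prints: http://external_url.com/docs/foo.html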

Re: Indexing static html files

2008-07-07 Thread 宫照
Hi everybody, I set up nutch-0.9, and I can search txt and html files on my local system. Now I want to search pdf and msword files too; can you tell me how to do that? BR, mingkong
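
In nutch-0.9 the usual route is the parse-pdf and parse-msword plugins, enabled through a plugin.includes override in conf/nutch-site.xml. A sketch, assuming the stock 0.9 default plugin list - copy the plugin.includes value from your own nutch-default.xml and just add the two parsers to it:

    # paste this property inside the <configuration> element of conf/nutch-site.xml
    cat <<'EOF'
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    EOF

Re-crawl after the change so the new content types actually get fetched and parsed.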

Re: Indexing static html files

2008-07-06 Thread Ryan Smith
OK, so you merge your other crawls into the same search dir; that's understood, thanks. My other question concerns what happens when you do a search in nutch. Right now, it returns links to file:///x/y/z/.../foo.html and I was wondering if there was a simple way to change that link to be

Re: Indexing static html files

2008-07-05 Thread Ryan Smith
Hello, I tried what Winton said. I generated a file with all the file:///x/y/z urls, but nutch won't inject any into the crawldb. I even set crawl-urlfilter.txt to allow everything: +. It seems like ./bin/nutch crawl is reading the file, but it's finding 0 urls to fetch. I tested this on

Re: Indexing static html files

2008-07-05 Thread Winton Davies
Hi Ryan, I just used the regular intranet crawl, didn't try to do the inject. W. At 6:16 PM -0400 7/5/08, Ryan Smith wrote: Winton, I added the override property to nutch-site.xml (I saw the one in nutch-default.xml after your email), still no urls being added to the crawldb. Can you verify
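
One way to check whether the inject put anything in the db at all, assuming the crawl directory is named "crawl" (both readdb modes are in 0.9's bin/nutch):

    # summary counts - an empty crawldb shows 0 urls
    bin/nutch readdb crawl/crawldb -stats

    # or dump the full contents to a text directory for inspection
    bin/nutch readdb crawl/crawldb -dump crawldb-dump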

Re: Indexing static html files

2008-07-05 Thread Ryan Smith
Hi Winton, I found my problem. I was only editing crawl-urlfilter.txt and not regex-urlfilter.txt. Thanks for the help. I have 2 questions: After I crawl my files, they will be indexed with file:///x/y/z/... Is there any chance I can easily change the link prefix to http://somesite.com/ ?
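
For anyone hitting the same wall, a sketch of the fix (the path is a placeholder; by default the crawl command reads crawl-urlfilter.txt while the standalone tools read regex-urlfilter.txt, so keeping the rule in both is the safe move):

    for f in conf/crawl-urlfilter.txt conf/regex-urlfilter.txt; do
      # the first matching rule wins, so the accept line must come before the
      # stock "-^(file|ftp|mailto):" skip rule - prepend it rather than append
      printf '+^file:///x/y/z/\n' | cat - "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done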

Re: Indexing static html files

2008-07-05 Thread Winton Davies
Not without modifying the code. I don't think it respects BASE, for example, if you crawl it as file:///. Frankly, if you can, just serve it through DOCROOT - it will be less painful in the end! - Serving URL - You can change it if you know how to set up Tomcat. Winton Hi Winton, I found my
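
A sketch of the DOCROOT route, assuming a stock Tomcat install at $CATALINA_HOME on port 8080 (any web server works the same way):

    # serve the tree over http instead of crawling it as file:///
    cp -r /x/y/z "$CATALINA_HOME/webapps/ROOT/docs"

    # seed the crawl with the http url...
    echo 'http://localhost:8080/docs/' > urls/seeds.txt

    # ...swap the MY.DOMAIN.NAME accept line in conf/crawl-urlfilter.txt
    # for "+^http://localhost:8080/", then crawl as usual
    bin/nutch crawl urls -dir crawl -depth 3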

Re: Indexing static html files

2008-07-05 Thread Winton Davies
Oh sorry, I misunderstood the question - I think you can only serve from 1 directory (aka crawl, by default). Of course you can create multiple instances that serve from different crawls, but then you'd have to deal with joining them together. You can definitely MERGE multiple crawl
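
The merge tools ship in 0.9's bin/nutch; a rough sketch of joining two finished crawls into one searchable directory (run each command with no arguments to confirm the exact usage on your version):

    # merge the crawl databases and the segments
    bin/nutch mergedb   merged/crawldb  crawl1/crawldb crawl2/crawldb
    bin/nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*

    # rebuild the link database and the index over the merged data
    bin/nutch invertlinks merged/linkdb -dir merged/segments
    bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*

Point the webapp's searcher.dir property at the merged directory afterwards.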

Indexing static html files

2008-07-03 Thread Ryan Smith
Is there a simple way to have nutch index a folder full of other folders and html files? I was hoping to avoid having to run Apache to serve the html files and then have nutch crawl the site through Apache. Thank you, -Ryan

Re: Indexing static html files

2008-07-03 Thread Winton Davies
Ryan, You can generate a file of FILE urls, e.g. file:///x/y/z/file1.html file:///x/y/z/file2.html Use find and AWK accordingly to generate this. Put it in the url directory and just set depth to 1, and change crawl-urlfilter.txt to admit file:///x/y/z/ (note, if you don't head-qualify it,
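
A minimal sketch of that recipe, assuming the standard 0.9 layout, a seed directory named "urls", and paths without spaces:

    # build the seed list; find prints absolute paths starting with "/",
    # so prepending "file://" yields file:///x/y/z/... urls
    find /x/y/z -name '*.html' | awk '{ print "file://" $0 }' > urls/seeds.txt

    # admit only those urls - head-qualified with ^ so nothing else matches;
    # the rule must precede the default skip rules in crawl-urlfilter.txt:
    # +^file:///x/y/z/

    # depth 1 fetches exactly the injected pages and follows no links
    bin/nutch crawl urls -dir crawl -depth 1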