I meant that you could just crawl http://external_url.com/y/z/
directly. But yes, if you have pages from someone else's server locally,
you will need to rewrite the BASE component of the URL in the search
results.
For that you could probably just hack search.jsp (but don't tell
anyone I told you).
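For illustration, a minimal sketch of the kind of search.jsp hack meant
here, assuming the hit URL is read from HitDetails as in stock
nutch-0.9 (the file prefix and target host are made-up examples):

  <%
    // hypothetical rewrite of the base of each hit URL before display
    String url = detail.getValue("url");
    url = url.replaceFirst("^file:///x/y/z/", "http://external_url.com/y/z/");
  %>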
Hi everybody,
I set up nutch-0.9, and I can search txt and html files on my local
system. Now I want to search pdf and msword files; can you tell me how
to do that?
BR,
mingkong
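For what it's worth, the usual approach in nutch-0.9 is to override
plugin.includes in nutch-site.xml so the parse-pdf and parse-msword
plugins are active; a sketch, modeled on the default value in
nutch-default.xml:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

You then have to re-crawl so the new parsers actually run over the
documents.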
Ok, so you merge your other crawls into the same search dir, that's
understood, thanks.
My other question is about what happens when you do a search in nutch.
Right now it returns links to file:///x/y/z/.../foo.html, and I was
wondering if there was a simple way to change that link to be
Hello,
I tried what Winton said. I generated a file with all the file:///x/y/z
URLs, but nutch won't inject any of them into the crawldb.
I even set crawl-urlfilter.txt to allow everything:
+.
It seems like ./bin/nutch crawl is reading the file, but it's finding 0
URLs to fetch. I test this on
Hi Ryan,
I just used the regular intranet crawl, didn't try to do the inject.
W
At 6:16 PM -0400 7/5/08, Ryan Smith wrote:
Winton,
I added the override property to nutch-site.xml (I saw the one in
nutch-default.xml after your email), but still no URLs are being added
to the crawldb.
Can you verify
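A quick way to check what actually landed in the crawldb is the readdb
tool; a sketch, assuming the crawl output went into ./crawl:

  bin/nutch readdb crawl/crawldb -stats

If the total URL count is 0, the seeds never made it past the injector
(usually a url filter rule or a missing protocol plugin).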
Hi Winton,
I found my problem. I was only editing crawl-urlfilter.txt and not
regex-urlfilter.txt.
Thanks for the help.
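For reference, a sketch of the edit in question: both
conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt ship with a rule
that skips file: URLs, so it has to be relaxed in both (the /x/y/z path
is a placeholder):

  # comment out the default rule that skips file:, ftp:, and mailto: urls
  # -^(file|ftp|mailto):
  # then admit the local tree
  +^file:///x/y/z/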
I have 2 questions:
After I crawl my files, they will be indexed with file:///x/y/z/...
Is there any chance I can easily change the link prefix to
http://somesite.com/ ?
Not without modifying the code. I don't think it respects BASE, for
example, if you crawl it as file:///.
Frankly, if you can, just serve it through DOCROOT - it will be less
painful in the end!
- Serving URL - You can change it if you know how to set up Tomcat.
Winton
Oh sorry, I misunderstood the question - I think you can only serve
from one directory (the crawl dir, by default). Of course you can
create multiple instances that serve from different crawls, but then
you'd have to deal with joining them together.
You can definitely MERGE multiple crawl
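A sketch of what merging looks like with the nutch-0.9 command-line
tools (all paths here are made-up examples):

  bin/nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb
  bin/nutch mergesegs merged/segments -dir crawl1/segments -dir crawl2/segments
  bin/nutch invertlinks merged/linkdb -dir merged/segments
  bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*

Then point the search webapp (searcher.dir) at the merged/ directory.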
Is there a simple way to have nutch index a folder full of other
folders and html files?
I was hoping to avoid having to run Apache to serve the html files and
then have nutch crawl the site through Apache.
Thank you,
-Ryan
Ryan,
You can generate a file of file: URLs, e.g.:
file:///x/y/z/file1.html
file:///x/y/z/file2.html
Use find and awk accordingly to generate this. Put it in the url
directory, set depth to 1, and change crawl-urlfilter.txt to
admit file:///x/y/z/ (note: if you don't anchor the pattern at the
head with ^, it will match anywhere in the URL).
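A concrete sketch of those steps (assuming no spaces in the paths and
that the crawl output should land in ./crawl):

  mkdir urls
  find /x/y/z -name '*.html' | awk '{ print "file://" $0 }' > urls/seeds.txt
  bin/nutch crawl urls -dir crawl -depth 1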