URL containing "?", "&" and "="

2006-03-09 Thread Vertical Search
Okay, I have noticed that I cannot crawl URLs containing "?", "&" and "=". I have tried all combinations of modifying crawl-urlfilter.txt and the "# skip URLs containing certain characters as probable queries, etc." rule. [EMAIL PROTECTED] But in vain. I have hit a road block.. that is terrible.. :(
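[Editor's note: for reference, the usual culprit is this rule in conf/crawl-urlfilter.txt (and its twin in conf/regex-urlfilter.txt), which rejects any URL containing query characters. The lines below are the stock rule from a default 0.7-era install; verify against your own conf/ files:

    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

Dropping the query characters from the character class, e.g.

    -[*!@]

lets URLs with "?", "&" and "=" through, provided a later "+" rule actually accepts them.]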

org.apache.nutch.net.URLFilter not found.

2006-03-09 Thread Vertical Search
I am trying to crawl a site with query strings containing "?" and "=". So, I have modified the following line in crawl-urlfilter and regex-urlfilter as per the advice in one of the postings on the archive: # skip URLs containing certain characters as probable queries, etc. [EMAIL PROTECTED] But still

Re: Why does crawler skips some files and scan others of the same suffix?

2006-03-09 Thread Jérôme Charron
> I am guessing the links here also include HREF values for images? For the HTML parser, the outlinks are (if there is no rel="nofollow" attribute and no method="post" attribute)
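[Editor's note: a small illustration of the two rules named above (my own example, not from the thread):

    <a href="a.html">extracted as an outlink</a>
    <a href="b.html" rel="nofollow">skipped: rel="nofollow"</a>
    <form action="search.cgi" method="post">  <!-- action URL skipped: method="post" -->
]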

RE: Why does crawler skips some files and scan others of the same suffix?

2006-03-09 Thread Teruhiko Kurosaka
> There could be many reasons. > Have you checked these properties for instance: > http.content.limit > db.max.outlinks.per.page Bingo! It was set to 100, the default value. The page has fewer than 100 hyperlinks, but I doubled the number anyway, and then the files in question have been fetc
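[Editor's note: for anyone hitting the same wall, overriding these limits goes in conf/nutch-site.xml. The property names are the ones named in the thread; the values below are just examples (doubling the defaults):

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>200</value>
    </property>
    <property>
      <name>http.content.limit</name>
      <value>131072</value>
    </property>
]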

What are valid names and location(s) for segments

2006-03-09 Thread Bryan Woliner
I am using Nutch 0.7.1 and have a couple of questions about valid segment names and locations: I can get Nutch to work fine when I store my segments, with their original Nutch-assigned names, in the folder "/usr/local/nutch-0.7.1/live/segments/" and then start Tomcat from the "/usr/local/nutch-0.7.1/
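[Editor's note: in 0.7-era Nutch the search webapp locates segments via the searcher.dir property, so a layout like the one described can also be pointed to explicitly in the webapp's nutch-site.xml instead of relying on Tomcat's startup directory. The path below is the poster's own; verify the property against your install:

    <property>
      <name>searcher.dir</name>
      <value>/usr/local/nutch-0.7.1/live</value>
    </property>

Nutch then expects a segments/ subdirectory (and an index, if merged) under that path.]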

Re: Why does crawler skips some files and scan others of the same suffix?

2006-03-09 Thread Jérôme Charron
> So I suppose "depth" and other parameters won't play a role here, > do they? There could be many reasons. Have you checked these properties for instance: http.content.limit db.max.outlinks.per.page Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

RE: Why does crawler skips some files and scan others of the same suffix?

2006-03-09 Thread Teruhiko Kurosaka
I forgot to mention clearly that these files reside in the same directory. Links to these files appear in the file listing page generated by Tomcat. So I suppose "depth" and other parameters won't play a role here, do they? > -Original Message- > From: Teruhiko Kurosaka [mailto:[EMAIL PROT

Why does crawler skips some files and scan others of the same suffix?

2006-03-09 Thread Teruhiko Kurosaka
I placed a bunch of files in a directory in the Apache web server's htdocs directory and had Nutch crawl that directory. But according to the output from the "nutch crawl" command, some files were scanned while others were not. For example, these were scanned: jp5-fwroman_UTF8B.txt jp5_EUCJP.html jp5-UTF

Crawling accuracy

2006-03-09 Thread carmmello
I have an experimental site with about 450 seed sites, which results in about 500,000 pages indexed (using Nutch 0.7.1.x). Doing some searches, I noticed that the results showed somewhat low numbers for some specific terms. I went further and tried to index only the site "http://.nrc.gov/", us
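[Editor's note: for a single-site test like the one described, the usual approach from the Nutch tutorial is a crawl-urlfilter.txt that admits only the one domain. The hostname above is truncated, so the pattern below is a guess at the intended domain:

    # accept only URLs within nrc.gov
    +^http://([a-z0-9]*\.)*nrc.gov/
    # and reject everything else
    -.
]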

RE: writing a metadata content tag:use case example

2006-03-09 Thread Richard Braman
I am following this thread as I have a similar issue to deal with in my upcoming developments. Howie, thanks for your insights into this, as I think this may solve my problem. I am trying to index Title 26 of the US Code http://www.access.gpo.gov/uscode/title26/title26.html The problem is I don't

RE: Indexing a web site over HTTPS using username/passwd

2006-03-09 Thread Richard Braman
I don't know, Dan, but it's something on my list too. I kind of doubt that this is a feature in Nutch, because generally this is thought of as a specialized intelligent-agent (IA) capability rather than more general spidering/indexing technology. Certainly it is possible to do, but there are two probl

Re: Nutch and authorization

2006-03-09 Thread jay jiang
Just to clarify, for this to work Nutch as a special user should be able to access all the data during the crawl/indexing. jay jiang wrote: You can add metadata (e.g. group) via index filter that marks the grouping/category of a document during indexing and during the search map user's entitl

Re: Nutch and authorization

2006-03-09 Thread jay jiang
You can add metadata (e.g. group) via an index filter that marks the grouping/category of a document during indexing, and at search time map the user's entitlement to the corresponding documents' grouping via a query filter. Better still, I think Nutch/Lucene should have a plugin in the result filter wher
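[Editor's note: a minimal sketch of the indexing half of this scheme, stamping each Lucene document with a "group" field so the query side can filter by entitlement. Inside Nutch this would live in an IndexingFilter plugin (whose exact signature varies across versions); the field name and the lookup heuristic are my own illustration:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Stamp a document with an access group for later query-time filtering.
    public class GroupStamper {
      public static void stamp(Document doc, String url) {
        String group = lookupGroup(url);          // hypothetical mapping
        doc.add(Field.Keyword("group", group));   // stored, untokenized (Lucene 1.x API)
      }
      private static String lookupGroup(String url) {
        // Example heuristic: by URL prefix; a real filter would read this from config.
        return url.startsWith("http://intranet/hr/") ? "hr" : "public";
      }
    }

At search time a query filter would then AND a "group:" clause, built from the logged-in user's entitlements, onto every query.]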

Re: Nutch and authorization

2006-03-09 Thread Patrice Neff
On 08/03/2006, at 03:16 AM, Laurent Michenaud wrote: Do you know good strategies to manage authorization? I mean, a user should only see the Nutch results he has the rights on. Nutch will IMHO only index public pages, so any user is allowed to see any of the indexed pages. Patrice

Re: A possible error in the tutorial

2006-03-09 Thread Patrice Neff
On 07/03/2006, at 07:40 PM, fabrizio silvestri wrote: shouldn't the line bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls be bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls Yes, that's right. Cheers Patrice

Vertical Search

2006-03-09 Thread Sudhi Seshachala
Hello folks, I am working on adopting Nutch for a vertical. I have been able to get it up and running in pretty basic scenarios. I need some help in getting up to speed on crawling sites which have some weird encoding in their URLs. I am kind of lost as to how to go about it. If someone can share s

Indexing a web site over HTTPS using username/passwd

2006-03-09 Thread Dan Fundatureanu
Hi, Could you point me to where I can find some info about how I can use Nutch to crawl a website where access is provided only via HTTPS using a username/password? Are there any config settings that I have to set, or do I have to hack the code to change this? Thanks, Dan Fundatureanu
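[Editor's note: since, as Richard notes above, this is not an out-of-the-box Nutch feature, here is a minimal standalone sketch of the underlying mechanism, HTTP Basic authentication over HTTPS with plain java.net, which a custom protocol plugin would have to replicate. URL and credentials are placeholders:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class BasicAuthFetch {
      public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/protected/page.html");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Basic auth: base64("user:password") in the Authorization header.
        String cred = Base64.getEncoder()
            .encodeToString("user:password".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + cred);
        try (InputStream in = conn.getInputStream()) {
          System.out.println("HTTP " + conn.getResponseCode());
        }
      }
    }

Form-based logins (session cookies) are the harder case and need a stateful client, which is presumably one of the two problems alluded to above.]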

Re: Adaptive Refetching

2006-03-09 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Doug Cutting wrote: are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adju

the result page generator

2006-03-09 Thread Vinny
Hello, What's the name of the template/page/class that generates the results? I wanted to place some context ads on the results page. -- Ghetto Java: http://www.xaymaca.com/roller/page/gj
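[Editor's note: for a stock 0.7-era install the page in question is typically search.jsp at the root of the deployed webapp (the Nutch .war); an ad slot is then just ordinary markup added alongside the results loop, e.g. (illustrative only):

    <%-- context-ad slot in search.jsp, next to the results loop --%>
    <div id="context-ads"> ... ad-network snippet ... </div>
]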

Help with removing menu garbage from the results summaries

2006-03-09 Thread Stephen Ensor
Hi, I have crawled a whole bunch of sites and have increased the results summaries so there is more detail and more terms in context. These summaries remained a jumble of words, and I finally realised that it was the sites' actual navigation menus that were being indexed and showing up in the summ
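[Editor's note: one common, if blunt, approach, sketched below as a standalone helper rather than actual Nutch code: strip elements whose id or class suggests navigation before extracting the text that feeds the summaries. Everything here (names, the heuristic itself) is my own illustration:

    // Crude boilerplate stripper: removes <div>/<ul> blocks that look like
    // navigation before text extraction. Heuristic only; tune per site,
    // and note it does not handle nested blocks of the same tag.
    public class MenuStripper {
      private static final java.util.regex.Pattern NAV_BLOCK =
          java.util.regex.Pattern.compile(
              "<(div|ul)[^>]*(id|class)=\"[^\"]*(nav|menu)[^\"]*\"[^>]*>.*?</\\1>",
              java.util.regex.Pattern.CASE_INSENSITIVE
                  | java.util.regex.Pattern.DOTALL);

      public static String strip(String html) {
        return NAV_BLOCK.matcher(html).replaceAll(" ");
      }
    }

In Nutch itself the natural home for this kind of filtering is an HtmlParseFilter plugin, so the menu text never reaches the index or the summarizer.]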

Re: writing a metadata content tag

2006-03-09 Thread Raghavendra Prabhu
Hi Howie, That is what I am looking at. But, as you said, to generalize for all requirements including the intranet requirement, I am better off doing what you said. Rgds, Prabu On 3/9/06, Howie Wang <[EMAIL PROTECTED]> wrote: > > >What I want to do is add some header info in parse-filter which >
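[Editor's note: to round off the thread, a rough sketch of the mechanism under discussion: a parse filter stashes a value in the parse metadata, and an indexing filter later turns it into a searchable field. The method signature below only approximates the 0.7-era HtmlParseFilter API (it changed across versions), and "x-section" plus extractSection are made-up names, so treat this as pseudocode:

    // Parse-filter half: stash a per-page value for the indexer to pick up.
    // Signature approximates the 0.7-era HtmlParseFilter API.
    public class SectionParseFilter implements HtmlParseFilter {
      public Parse filter(String url, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        String section = extractSection(doc);   // hypothetical helper
        if (section != null) {
          parse.getData().getMetadata().put("x-section", section);
        }
        return parse;
      }
    }

An indexing filter (see the group-field sketch earlier in this digest) would then read "x-section" from the parse data and add it to the Lucene document.]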