Okay, I have noticed that I cannot crawl URLs containing "?", "&" and "=".
I have tried all combinations of modifying crawl-urlfilter.txt around
these lines:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
But in vain. I have hit a roadblock... that is terrible. :(
I am trying to crawl a site with query strings containing "?" and "=".
So, I have modified the following line in crawl-urlfilter and
regex-urlfilter, as per the advice in one of the postings on the archive:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
But still
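For anyone hitting the same wall, a sketch of the usual fix, assuming the
stock filter files: both conf/crawl-urlfilter.txt and
conf/regex-urlfilter.txt ship with a rule that rejects any URL containing
those characters, and relaxing it (rather than deleting it outright)
keeps the other protections:

  # default rule that skips probable query URLs:
  -[?*!@=]
  # relaxed version that still skips *, ! and @
  # but lets ?, & and = through:
  -[*!@]

Also note that which file applies depends on the tool: as far as I know,
the one-step "nutch crawl" command reads crawl-urlfilter.txt, while the
individual tools read regex-urlfilter.txt.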
> I am guessing the links here also include HREF values for images?
For the HTML parser, the outlinks are included (provided there is no
rel="nofollow" attribute and no method="post" attribute).
> There could be many reasons.
> Have you checked these properties for instance:
> http.content.limit
> db.max.outlinks.per.page
Bingo!
It was set to 100, the default value. I had not suspected this property,
because the page has fewer than 100 hyperlinks.
But I doubled the number anyway, and then the files in question were
fetched.
I am using Nutch 0.7.1 and have a couple of questions about valid segment
names and locations:
I can get Nutch to work fine when I store my segments, with their
original Nutch-assigned names, in the folder
"/usr/local/nutch-0.7.1/live/segments/"
and then start Tomcat from the "/usr/local/nutch-0.7.1/
> So I suppose "depth" and other parameters won't play a role here,
> will they?
There could be many reasons.
Have you checked these properties for instance:
http.content.limit
db.max.outlinks.per.page
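Both can be overridden in conf/nutch-site.xml; a sketch with illustrative
values (the defaults are, as far as I know, 65536 bytes and 100 outlinks):

  <property>
    <name>http.content.limit</name>
    <value>262144</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>200</value>
  </property>

Pages larger than http.content.limit are truncated before parsing, so
links near the bottom of a large page can be lost; outlinks beyond
db.max.outlinks.per.page are simply dropped.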
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
I forgot to mention clearly that these files reside in the same
directory.
Links to these files appear in the file listing page generated by
Tomcat.
So I suppose "depth" and other parameters won't play a role here,
will they?
> -Original Message-
> From: Teruhiko Kurosaka [mailto:[EMAIL PROT
I placed a bunch of files in a directory in Apache web server's
htdocs directory, and had Nutch crawl that directory.
But, according to the output from the "nutch crawl" command, some files
were scanned while others were not. For example, these were scanned:
jp5-fwroman_UTF8B.txt
jp5_EUCJP.html
jp5-UTF
I have an experimental site with about 450 seed sites, which results in
about 500,000 pages indexed (using Nutch 0.7.1.x). Doing some searches, I
noticed that the results showed somewhat low numbers for some specific
terms. I went further and tried to index only the site
"http://.nrc.gov/", us
I am following this thread as I have a similar issue to deal with in my
upcoming development work. Howie, thanks for your insights into this, as
I think they may solve my problem.
I am trying to index Title 26 of the US Code
http://www.access.gpo.gov/uscode/title26/title26.html
The problem is I don't
I don't know, Dan, but it's something on my list too. I kind of doubt
that this is a feature in Nutch, because generally this is thought of as
a specialized intelligent-agent (IA) capability rather than more general
spidering/indexing technology. Certainly it is possible to do, but
there are two probl
Just to clarify: for this to work, Nutch, as a special user, should be
able to access all the data during the crawl/indexing.
jay jiang wrote:
You can add metadata (e.g. group) via an index filter that marks the
grouping/category of a document during indexing, and during the search,
map the user's entitlement to the corresponding documents' grouping via
a query filter.
You can add metadata (e.g. group) via an index filter that marks the
grouping/category of a document during indexing, and during the search,
map the user's entitlement to the corresponding documents' grouping via
a query filter.
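A minimal sketch of the index-filter half of this, assuming the Nutch
0.7-era IndexingFilter interface and the Lucene 1.4-style Field API; the
"group" field name and the lookupGroup() helper are hypothetical:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.nutch.fetcher.FetcherOutput;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.parse.Parse;

  public class GroupIndexingFilter implements IndexingFilter {
    public Document filter(Document doc, Parse parse, FetcherOutput fo)
        throws IndexingException {
      // "url" is added to the document by the basic indexing filter.
      String url = doc.get("url");
      // Tag the document with its access group so a query filter can
      // later restrict results to the groups a user is entitled to.
      doc.add(Field.Keyword("group", lookupGroup(url)));
      return doc;
    }

    // Hypothetical mapping from URL to entitlement group.
    private String lookupGroup(String url) {
      return url != null && url.indexOf("/hr/") >= 0 ? "hr" : "public";
    }
  }

On the query side, a QueryFilter plugin would then translate the user's
entitlements into a required clause on the same "group" field.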
Better still, I think Nutch/Lucene should have a plugin in the result
filter wher
On 08/03/2006, at 03:16 AM, Laurent Michenaud wrote:
Do you know good strategies to manage authorization?
I mean, a user should only see the Nutch results he has the rights to.
Nutch will, IMHO, only index public pages, so any user is allowed to
see any of the indexed pages.
Patrice
On 07/03/2006, at 07:40 PM, fabrizio silvestri wrote:
the line
bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset
5000 > dmoz/urls
shouldn't it be
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset
5000 > dmoz/urls
Yes, that's right.
Cheers
Patrice
Hello folks,
I am working on adopting Nutch for a vertical.
I have been able to get it up and running in pretty basic scenarios.
I need some help getting up to speed on crawling sites which have
some weird encoding in the URLs.
I am kind of lost as to how to go about it. If someone can share s
Hi,
Could you point me to where I can find some info about how I can use
Nutch to crawl a website where access is provided only via HTTPS using a
username/password?
Are there any config settings that I have to set, or do I have to hack
the code to change this?
Thanks,
Dan Fundatureanu
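Not an authoritative answer, but as a sketch: HTTPS support generally
comes from the protocol-httpclient plugin rather than the default
protocol-http, so the first step is usually to swap it into
plugin.includes in conf/nutch-site.xml, e.g.:

  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>

How the username/password get supplied varies by version (and may require
touching the plugin's code), so treat this only as a starting point.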
Doug Cutting wrote:
Andrzej Bialecki wrote:
Doug Cutting wrote:
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseOutputFormat to not generate
STATUS_LINKED crawldata when a page has been refetched. That way
scores would only be adju
Hello,
What's the name of the template/page/class that generates the results?
I wanted to place some context ads on the results page.
--
Ghetto Java: http://www.xaymaca.com/roller/page/gj
Hi,
I have crawled a whole bunch of sites and have increased the result
summaries so there is more detail and more terms in context. These
summaries remained a jumble of words, and I finally realised that it was
the sites' actual navigation menus that were being indexed and showing up
in the summ
Hi Howie,
That is what I am looking at.
But, as you said, to generalize for all requirements, including the
intranet requirement, I am better off doing what you suggested.
Rgds
Prabu
On 3/9/06, Howie Wang <[EMAIL PROTECTED]> wrote:
>
> >What I want to do is add some header info in a parse-filter which
>