, it means that your crawler is sacrificing these links because they
have a very low rank, in which case you might want to increase the
'topN' value.
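For example, assuming the seed list lives in a 'urls' directory
(hypothetical paths), a larger topN can be passed on the command line:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5000
Here topN caps how many of the highest-scoring URLs are fetched in
each round of generation.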
Hope this helps you.
Regards,
Susam Pal
very deep are also important and you want to
crawl them, you might have to sacrifice low-ranking URLs by setting a
smaller topN value, say, 1000, or whatever works for you.
Regards,
Susam Pal
included it in CC.
This feature is not present in Nutch. We have recorded a summary of
some old discussions about it here:
http://wiki.apache.org/nutch/HttpPostAuthentication
However, it was never implemented.
Regards,
Susam Pal
On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
On 13/03/2010 22.55, Susam Pal wrote:
On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal <susam@gmail.com> wrote:
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
On 11/03
On Tue, Mar 16, 2010 at 12:55 AM, Susam Pal susam@gmail.com wrote:
On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
On 13/03/2010 22.55, Susam Pal wrote:
On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal <susam@gmail.com> wrote:
On Fri, Mar 12, 2010
On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal susam@gmail.com wrote:
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
On 11/03/2010 16.20, Susam Pal wrote:
On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
Hi
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
On 11/03/2010 16.20, Susam Pal wrote:
On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti
graziano.alibe...@eng.it wrote:
Hi everyone,
I'm trying to use Nutch ver. 1.0 on a system behind a Squid proxy
' property in 'conf/nutch-site.xml'?
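The property name above is cut off; this is a minimal sketch of the
proxy settings that usually go into 'conf/nutch-site.xml', assuming
they are http.proxy.host and http.proxy.port, with a hypothetical
Squid host and port:
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
</property>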
Regards,
Susam Pal
for. You can use it with the -r
option to recursively download pages and store them as separate files
on the hard disk, which is exactly what you need. You might want to
use the -np option too. It is available for Windows as well as Linux.
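A typical invocation, assuming http://www.example.com/docs/ is the
site to mirror (hypothetical URL):
wget -r -np http://www.example.com/docs/
Here -r recurses into links and -np ('no parent') keeps the download
below the starting directory.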
Regards,
Susam Pal
/httpclient-auth.xml
3. logs/hadoop.log
4. Output from telnet, netcat, etc.
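For point 4, a manual request of this sort can reveal the server's
authentication challenge (host and port are hypothetical):
telnet 192.168.1.10 80
GET / HTTP/1.0
Host: 192.168.1.10

A 401 response carries a WWW-Authenticate header that tells you which
scheme (Basic, Digest or NTLM) the server expects.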
Please go through the 'Need Help?' section of
http://wiki.apache.org/nutch/HttpAuthenticationSchemes to make sure
you haven't missed anything important.
Regards,
Susam Pal
,
Susam Pal
for not being
able to help you soon enough as I am on vacation in a small town
with poor internet connectivity.
Regards,
Susam Pal
that it mentions
that NTLM cannot be used to authenticate with both a proxy and the
server.
Regards,
Susam Pal
.
Regards,
Susam Pal
that it has something to do with the NTLM
version. However, I don't have any experience with errors of this
kind, so I can't really tell. This section might be of some help to
you:
http://hc.apache.org/httpclient-3.x/authentication.html#Known_limitations_and_problems
Regards,
Susam Pal
://hc.apache.org/httpclient-3.x/authentication.html
Regards,
Susam Pal
' property in hadoop-site.xml to
specify an alternate path for the temporary directory.
Example:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/tmp/</value>
  <description></description>
</property>
Regards,
Susam Pal
String(bean.getContent(hitDetails));
You can also see 'src/web/jsp/cached.jsp', or 'cached.jsp' in the
directory where the Nutch WAR file is deployed, to see how the
NutchBean object is used to get the page content.
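A rough sketch of that flow, paraphrased from cached.jsp and not
guaranteed to compile against every Nutch version (the query string
and hit index are hypothetical):
Configuration conf = NutchConfiguration.create();
NutchBean bean = new NutchBean(conf);
Query query = Query.parse("apache", conf);        // hypothetical query
Hits hits = bean.search(query, 10);               // top 10 hits
HitDetails details = bean.getDetails(hits.getHit(0));
String content = new String(bean.getContent(details));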
Regards,
Susam Pal
don't see why you
cannot run Hadoop in VMware virtual machines.
Regards,
Susam Pal
as the development progressed.
Regards,
Susam Pal
of these three cases.
Regards,
Susam Pal
On Tue, Mar 31, 2009 at 9:44 PM, Austin, David david.aus...@encana.com wrote:
Hi Susam,
Thanks for your quick response. I've gone through the Need Help section.
Modified a few things accordingly.
Turned on the debugging using
the logs for #1 as well as #2. It
would be interesting to see why the fetch fails for #1 but succeeds
for #2.
Regards,
Susam Pal
On Tue, Mar 31, 2009 at 11:01 PM, Austin, David david.aus...@encana.com wrote:
Hello again,
Did you set the 'http.agent.host' in 'conf/nutch-site.xml'?
I didn't have
I am not sure what exactly you mean by this. One can know whether a
page contains a certain keyword or not only after the page has been
fetched.
Regards,
Susam Pal
On Mon, Nov 17, 2008 at 11:26 AM, Miao [EMAIL PROTECTED] wrote:
Hi all,
I have a question about using Nutch. I only want
://wiki.apache.org/nutch/HttpPostAuthentication
Regards,
Susam Pal
Could you please let me know?
Best regards,
Biswajit.
Susam Pal wrote:
Hi Biswajit,
I don't find a single error caused by an authentication problem in
the 'new.txt' file you attached in an earlier mail. Most
time.
Regards,
Susam Pal
On Fri, Sep 19, 2008 at 5:38 AM, biswajit_rout
[EMAIL PROTECTED] wrote:
Hi Susam,
Please take a look at the attached file (new.txt) and suggest a solution
for this. This time I have crawled another site. I am able to crawl all the
public pages but password
You can use the 'hadoop.tmp.dir' property in hadoop-site.xml to
specify an alternate path for the temporary directory.
Example:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home2/tmp/</value>
  <description></description>
</property>
Regards,
Susam Pal
On Tue, Sep 16, 2008 at 10:50 AM, Srinivas Gokavarapu
,
Susam Pal
On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout
[EMAIL PROTECTED] wrote:
Hi Susam,
The IP 10.222.18.113 is just the IP address of my machine (localhost).
I also changed http://localhost:8080/ to http://10.222.18.113:8080.
However, there is no result; I mean I am still not able
to fetch a page but fails due to
authentication, then it is a problem with authentication.
In this case, it is not even attempting to fetch those pages. So, the
problem lies elsewhere. You need to first find out why it is fetching
only one page and not others.
Regards,
Susam Pal
On Tue, Sep 16, 2008
The logs show that it is fetching http://localhost:8080/ but you have
set credentials for 10.222.18.113:8080 which is never being fetched.
So, no authentication takes place.
Regards,
Susam Pal
On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout
[EMAIL PROTECTED] wrote:
Hi Susam,
In order to crawl
'.
To enable the DEBUG logs for a particular package, say, the httpclient
package, you can open 'conf/log4j.properties' and add the following
line:
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
Regards,
Susam Pal
On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann [EMAIL PROTECTED] wrote
Please see my reply inline.
On Thu, May 8, 2008 at 12:04 PM, POIRIER David
[EMAIL PROTECTED] wrote:
Yoav,
You are right. With the help of the protocol-httpclient plugin you
will be able to use cookies when crawling. There is one thing that you
need to watch out for, though (quoting Susam Pal
with .+
Regards,
Susam Pal
On Thu, May 1, 2008 at 2:39 PM, ili chimad [EMAIL PROTECTED] wrote:
Hi, I'm using Nutch 0.9 with Tomcat 6 / Windows Vista + Cygwin for 2 days only.
Before sending this mail I read many posts here, but I didn't find this
problem.
After finishing the crawl step
is causing the
problem. You could then try something like C:/nutch-0.9/crawl/ and see
if it works. By the way, did you try searching from the command prompt
using the bin/nutch crawl command? That will ensure that your index is
correct and provides results.
Regards,
Susam Pal
Please see my previous mail and tell us what you get when you run
those commands.
Regards,
Susam Pal
On 4/10/08, subrat mahanty [EMAIL PROTECTED] wrote:
Dear,
I am trying to use the NTLM proxy instead of HTTP, because HTTP fails with the
error:
org.apache.nutch.protocol.http.api.HttpException
first time and get 0 hits?
Regards,
Susam Pal
Susam Pal wrote:
Find my reply inline.
On Wed, Apr 2, 2008 at 5:04 PM, Vineet Garg [EMAIL PROTECTED] wrote:
Hi,
I am using Nutch to crawl the local file system. I am crawling with:
bin/nutch crawl urls -dir crawl -depth 5 -topN 500
are able to resolve the domain name into IP
address.
Regards,
Susam Pal
On Thu, Apr 3, 2008 at 3:38 PM, subrat mahanty
[EMAIL PROTECTED] wrote:
Dear,
I am new to Nutch and get a fetching error:
org.apache.nutch.protocol.http.api.HttpException:
java.net.UnknownHostException:
so
is included or ignored.
Hope this helps.
Regards,
Susam Pal
# skip image and other suffixes we can't yet parse
-\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
What could be the reason?
Regards,
Vineet
generation.
Regards,
Susam Pal
On Mon, Mar 31, 2008 at 7:14 PM, matt davies [EMAIL PROTECTED] wrote:
Hi Dennis
If you have a crawl depth of 3, then there should be only 3 segments/*
folders.
Thanks for that titbit, that makes a bit more sense now.
I have no idea where the other ones
manually to
your Nutch 0.9 source code directory.
Once you make the changes, just build your project again with ant and
you will be ready to recrawl.
Regards,
Susam Pal
On Tue, Mar 18, 2008 at 7:12 PM, Jean-Christophe Alleman
[EMAIL PROTECTED] wrote:
Hi, I'm interested in this patch but I
(indexes) should work since I can find such a
method (though it is deprecated now) in the latest Hadoop API.
Regards,
Susam pal
On Tue, Mar 18, 2008 at 9:09 PM, Jean-Christophe Alleman
[EMAIL PROTECTED] wrote:
Thanks for your reply, Susam Pal!
I have run ant and I get an error I can't
,
Susam Pal
On Fri, Mar 14, 2008 at 3:48 AM, Bradford Stephens
[EMAIL PROTECTED] wrote:
Greetings,
A coworker and I are experimenting with Nutch in anticipation of a
pretty large rollout at our company. However, we seem to be stuck on
something -- after the crawler is finished, we can't
this line:
log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
add this line:-
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
3. Save conf/log4j.properties and delete all files in 'logs' directory.
4. Do a new crawl and obtain the new log.
Regards,
Susam Pal
On Wed, Mar 12
server.
You can see the logs in 'logs/catalina.out' file of Tomcat.
Regards,
Susam Pal
On Fri, Mar 7, 2008 at 8:40 PM, vanderkerkoff [EMAIL PROTECTED] wrote:
Hello everyone
I started looking at Nutch today and have installed it on my Ubuntu box,
followed a lot of advice, and have run a crawl
inside the crawl directory. When you start Tomcat, NutchBean
will search for the 'crawl' directory in the directory from which you
start Tomcat.
Regards,
Susam Pal
On Fri, Mar 7, 2008 at 9:11 PM, matt davies [EMAIL PROTECTED] wrote:
Does the order of this and the places the commands are being run look
ok
://lucene.apache.org/nutch/tutorial8.html
Regards,
Susam Pal
On Fri, Mar 7, 2008 at 9:34 PM, matt davies [EMAIL PROTECTED] wrote:
Well that's worked a treat, thanks again Susam
I've now got to start adding other sites to the index.
Is it simply adding a line like this +^http://([a-z0-9
Do you put the URLs to all 35 documents in the text file?
If yes, you can check logs/hadoop.log to see if any fetch fails.
If not, maybe some of the documents are too deep, and increasing the
depth value while crawling might solve the problem.
Regards,
Susam Pal
On 3/3/08, Jean-Christophe
-urlfilter.txt
works.
Regards,
Susam Pal
You will also find a logs/hadoop.log file. Do you find any clue there?
Maybe, instead of trying to inject dmoz, you can try injecting a set of
4 to 10 URLs written in a file, then check the hadoop.log file to find
out what is going wrong.
Regards,
Susam Pal
On 2/20/08, Nick Duan [EMAIL PROTECTED
the top 1000 URLs for this
particular crawl. For the next crawl, the top 1000 URLs would be
generated again.
Regards,
Susam Pal
-Original Message-
From: Susam Pal [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 05, 2008 10:36 PM
To: nutch-user@lucene.apache.org
Subject: Re: Limiting Crawl
.
Regards,
Susam Pal
Susam Pal wrote:
(2) In Generator.java, the normalizers.normalize() statement is inside
the following 'if' block.
Generator.java (Line: 186)
if (maxPerHost > 0) {
I am curious to know why we should avoid URL normalization if
generate.max.per.host = -1 (which also happens
Yes, this should be a simple patch. I will upload one tomorrow.
Regards,
Susam Pal
On Feb 7, 2008 12:11 AM, Dennis Kubes [EMAIL PROTECTED] wrote:
Susam Pal wrote:
I am adding a few more observations.
On Feb 6, 2008 1:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote:
For the generator
issue for
this and submit a one-line fix?
Regards,
Susam Pal
You have not added nutch-default.xml and nutch-site.xml to your
Configuration object. Adding the following two lines to your code
should solve the problem:-
conf.addDefaultResource("nutch-default.xml");
conf.addDefaultResource("nutch-site.xml");
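As an aside, org.apache.nutch.util.NutchConfiguration wraps exactly
these two steps, so an equivalent (and tidier) alternative is:
Configuration conf = NutchConfiguration.create();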
Regards,
Susam Pal
On Feb 6, 2008 12:17 AM, devj
Did you try specifying a topN value? -depth 3 -topN 1000 should be
close to what you want.
On 2/6/08, Paul Stewart [EMAIL PROTECTED] wrote:
Hi folks...
What is the best way to say limit crawling to perhaps 3-4 hours per day?
Is there a way to do this?
Right now, I have a crawl depth of 6
.
at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
This patch doesn't affect the crawl without the -force option. Is
this going to be useful?
I have included the patch both as text (after the signature) and as an
attachment.
Regards,
Susam Pal
Index: src/java/org/apache/nutch/crawl
Try this command:-
bin/nutch readdb crawl/crawldb -stats
To get help, try:-
bin/nutch readdb
Regards,
Susam Pal
On Feb 1, 2008 8:21 AM, Paul Stewart [EMAIL PROTECTED] wrote:
Hi folks...
Is there a way to retrieve stats from Nutch - meaning how many webpages
are indexed, to be indexed
crawl-urlfilter.txt and regex-urlfilter.txt are used to block or allow
certain URLs. They do not allow you to extract one URL from
another. You might want to use conf/regex-normalize.xml to do this.
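For example, a rule of this shape in conf/regex-normalize.xml rewrites
one URL into another (the pattern and substitution below are only an
illustrative sketch):
<regex>
  <pattern>^http://example\.com/redirect\?url=(.*)$</pattern>
  <substitution>$1</substitution>
</regex>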
Regards,
Susam Pal
On Jan 31, 2008 1:43 AM, Vinci [EMAIL PROTECTED] wrote:
hi,
I
directory.
2. The command you used to run the crawl.
3. What changes you made in conf/crawl-urlfilter.txt
4. Does the site you are crawling have link to other pages?
Regards,
Susam Pal
On Jan 29, 2008 1:04 AM, Barry Haddow [EMAIL PROTECTED] wrote:
Hi
I'm trying to get the nutch/hadoop example from
You can try the crawl script: http://wiki.apache.org/nutch/Crawl
Regards,
Susam Pal
On Jan 13, 2008 8:36 AM, Manoj Bist [EMAIL PROTECTED] wrote:
Hi,
When I run crawl the second time, it always complains that 'crawled' already
exists. I always need to remove this directory using 'hadoop dfs
,
Susam Pal
On Jan 13, 2008 11:19 AM, Manoj Bist [EMAIL PROTECTED] wrote:
Thanks for the response.
I tried this with nutch-0.9. The script seems to be accessing non-existent
file/dirs.
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt
exist : /user/nutch/-threads
instead.
Regards,
Susam Pal
On Jan 10, 2008 6:34 PM,
[EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
Hi there,
I'm actually having weird problems with my recrawl procedure (Nutch 0.9).
The situation is the following:
First, I crawl a couple of domains. Then, I start a separate crawl with a
pages
be able to reach that page.
Regards,
Susam Pal
On Jan 10, 2008 3:56 AM, [EMAIL PROTECTED] wrote:
Hello all,
I am using Nutch 0.9 and when I fetch a couple of sites Nutch does not include
pages other than the main one.
For example, if I have mysite.com/cv.htm, Nutch fetches only mysite.com
that is
allowed by 'conf/crawl-urlfilter.txt'.
Regards,
Susam Pal
On Jan 7, 2008 8:56 AM, [EMAIL PROTECTED] wrote:
Why can I crawl http://game.search.com but can't crawl
http://www.search.com? conf/crawl-urlfilter is
# skip file:, ftp:, mailto: urls
-^(file|ftp|mailto):
# skip image and other
, the whole job of authentication can be
done within the protocol-httpclient plugin. However, in this approach,
some work has to be done in the fetcher, outside the plugin as well.
If I get some free time, I'll try to work on this.
Regards,
Susam Pal
On Jan 6, 2008 12:11 AM, Martin Kuen [EMAIL PROTECTED] wrote
.* properties.
Ideally, you should also set the http.agent.host property properly,
though I have never found this to cause a problem.)
Regards,
Susam Pal
On Jan 3, 2008 12:47 PM, Nidhi malik [EMAIL PROTECTED] wrote:
I am sending my Hadoop file and I also applied patch559V0.5.
At the time of fetching I
received:
2337
2008-01-02 21:55:32,900 DEBUG httpclient.Http - url:
https://mail.yahoo.com/; status code: 200; bytes received: 26291
If DEBUG lines are missing, it means you have either not enabled DEBUG
properly or you have not successfully patched and built Nutch.
Regards,
Susam Pal
On Jan 4, 2008 12
just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
Regards,
Susam Pal
On Jan 2, 2008 10:45 PM
Your configuration seems fine. Ideally http.agent.url should point to
a page where you describe your crawler, but that shouldn't cause an
error.
If you are facing any problem, please post the relevant logs from
logs/hadoop.log and describe your problem in detail.
Regards,
Susam Pal
On 1/1/08
probably because you do not have
permission on the log file, hadoop.log. Checking and setting the
proper permissions might work.
Regards,
Susam Pal
On Dec 28, 2007 4:58 PM, NIDHI MALIK [EMAIL PROTECTED] wrote to
[EMAIL PROTECTED]:
Hello,
I am facing problem in using
For point (1), isn't the bin/nutch freegen command enough for what you want?
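Its usage is roughly (paths are hypothetical):
bin/nutch freegen urls/ crawl/segments/
This generates a fetch list directly from the URLs in the input
directory, without consulting the crawl DB.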
Regards,
Susam Pal
On Dec 18, 2007 5:05 PM,
[EMAIL PROTECTED]
[EMAIL PROTECTED] wrote:
Hi there,
I have the following problem to solve:
I already crawled a couple of domains and can also recrawl them frequently
that is being discussed here:-
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10030.html
Regards,
Susam Pal
http://susam.in/
On Nov 28, 2007 6:20 PM, [EMAIL PROTECTED] wrote:
I have tried to use Susam Pal's patch (NUTCH-559) for NTLM, Basic and Digest
authentication schemes for web/proxy
I have just uploaded NUTCH-559v0.5.patch in JIRA
https://issues.apache.org/jira/browse/NUTCH-559. It works fine too
with Tomcat Basic authentication. I tested it with the same
configuration and commands that I mentioned in my previous mail.
Regards,
Susam Pal
On Nov 28, 2007 9:50 PM, Susam Pal
with about 20 GB free space and I never face a
problem.
Regards,
Susam Pal
On Nov 23, 2007 4:32 AM, Josh Attenberg [EMAIL PROTECTED] wrote:
I have added
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/tmp</value>
  <description>Base for Nutch Temporary Directories</description>
</property>
(with opt/tmp
a different directory for
writing the temporary files.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/tmp</value>
  <description>Base for Nutch Temporary Directories</description>
</property>
Regards,
Susam Pal
On Nov 21, 2007 8:54 AM, Josh Attenberg [EMAIL PROTECTED] wrote:
I had this error when
logs too when an error occurs.
Regards,
Susam Pal
On Nov 21, 2007 10:28 AM, Josh Attenberg [EMAIL PROTECTED] wrote:
I did as you said and moved the files to a new directory on a big drive, but
now I have some additional errors. Are there any other pointers I need to
update?
On Nov 20, 2007 11:33
work fine for Nutch 0.9 too.
We had a discussion on re-crawling for Nutch 1.0-dev here:-
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09514.html
Please try this script for re-crawling with Nutch-0.9 and let us know
how it goes.
Regards,
Susam Pal
On Nov 20, 2007 2:11 AM, Moore
file.
4. Logs.
Regards,
Susam Pal
On Nov 16, 2007 3:18 PM, crazy [EMAIL PROTECTED] wrote:
Hi,
Thanks for your answer, but I don't understand exactly what I should do.
This is my crawl-urlfilter.txt file:
# skip file:, ftp:, mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we
can ignore this part and instead
change the last line of this file from:-
-.
to:-
+.
Regards,
Susam Pal
On Nov 16, 2007 1:45 PM, crazy [EMAIL PROTECTED] wrote:
Hi,
I installed Nutch for the first time and I want to index Word and Excel
documents.
Even though I changed nutch-default.xml:
property
and this would help you
understand Nutch better.
Regards,
Susam Pal
On Nov 16, 2007 4:59 PM, crazy [EMAIL PROTECTED] wrote:
I changed my seed URLs file to this:
http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
and I got this result:
fetching http://www.frlii.org/IMG/doc
Please try mentioning the protocol in the seed URL file. For example:-
http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc
I guess it selects the protocol plugin according to the protocol
specified in the URL.
Regards,
Susam Pal
On Nov 16, 2007 4:07 PM, crazy [EMAIL PROTECTED
for failed, ERROR,
FATAL, etc.
Regards,
Susam Pal.
On Nov 14, 2007 12:29 AM, payo [EMAIL PROTECTED] wrote:
Hi,
I run the crawl this way:
./bin/nutch crawl urls -dir crawl -depth 3 -topN 500
my urls file
http://localhost/test/
my crawl-urlfilter
+^http://([a-z0-9]*\.)*localhost/
my nutch
What kind of authentication is required? Do you have to submit the
credentials by POST method or does it require Basic/Digest/NTLM
authentication?
Regards,
Susam Pal
On 10/22/07, sujithq [EMAIL PROTECTED] wrote:
Hi,
Recently I was able to crawl a few sites. But now I have to crawl a site
]
You need to comment out this line.
Regards,
Susam Pal
http://susam.in/
On 10/11/07, Rohit Trivedi [EMAIL PROTECTED] wrote:
Hi,
I have an archive page with a bunch of links in it like so:
<a href="/servlet/ShowContent?ResourceType=S&ServerLocation=1&ResourceId=1163280&qcs">Monthly</a>
but Nutch doesn't
to the crawler in some form.
Regards,
Susam Pal
http://susam.in/
On 10/1/07, Gareth Gale [EMAIL PROTECTED] wrote:
Well, that's a possibility I guess but I was hoping that nutch could be
configured to look at a directory and be told to index everything it
finds in there
Will Scheidegger wrote
URL fetched.
These lines would look like:-
2007-09-28 19:16:06,918 INFO fetcher.Fetcher - fetching
http://192.168.101.33/url
If you do not find any 'fetching' in the logs, it means something is
wrong. Most probably the crawl-urlfilter.txt may be wrong.
Regards,
Susam Pal
http://susam.in/
On 9
you were expecting.
Regards,
Susam Pal
http://susam.in/
On 9/28/07, Gareth Gale [EMAIL PROTECTED] wrote:
Hope someone can help. I'd like to index and search only a single
directory of my website. It doesn't work so far (both building the index
and subsequent searches). Here's my config :-
Url
If you have not set the agent properties, you must set them.
http.agent.name
http.agent.description
http.agent.url
http.agent.email
The significance of each property is explained within the
description tags. For the time being you can set some dummy values
and get started.
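For instance, dummy values of this sort are enough to get started
(all values hypothetical):
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>Test crawler for internal use</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/crawler.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler@example.com</value>
</property>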
Regards,
Susam Pal
http.auth.host
This should work fine. I'll be revising this patch as per the
suggestions of Doğacan in order to reduce the 'diff'.
Regards,
Susam Pal
http://susam.in/
On 9/26/07, Alexis Votta [EMAIL PROTECTED] wrote:
I tried the new properties but they don't work. I don't know where the
new properties come
is stored against Date whereas DublinCore
interface (which Metadata implements) defines DATE as:-
public static final String DATE = "date";
Regards,
Susam Pal
http://susam.in/
On 9/25/07, Sebastian Schick [EMAIL PROTECTED] wrote:
Hello,
we have the same problem. Accidentally I created a new thread
The properties you are trying were meant for the original
protocol-httpclient which doesn't work for NTLM authentication due to
a bug. The patch I have submitted uses these properties:-
http.auth.username
http.auth.password
http.auth.realm
http.auth.host
Please try these.
Regards,
Susam Pal
merged segment. So this is strictly new. So, while merging, we
are merging NEWindexes with the old indexes into 'crawl/index'.
Regards,
Susam Pal
http://susam.in/
On 9/20/07, Alexis Votta [EMAIL PROTECTED] wrote:
Hi Tomislav and Nutch users
I could not solve the problem with your instructions
See NUTCH-281. https://issues.apache.org/jira/browse/NUTCH-281
On 9/20/07, Joseph M. [EMAIL PROTECTED] wrote:
I am having a problem with cached pages: images are not showing in them. How
can I make images show in them?
I am new to Nutch and having difficulties. Please help me to show images
Did you replace the 'webapps/ROOT' with the new one by deploying the
.war file generated from the trunk?
Regards,
Susam Pal
http://susam.in/
On 9/17/07, Alexis Votta [EMAIL PROTECTED] wrote:
I was using Nutch-0.9 successfully for around one month. Today, I
downloaded the trunk, built
me know if this solves your problem.
Regards,
Susam Pal
http://susam.in/
On 9/18/07, Aryan Sahoo [EMAIL PROTECTED] wrote:
Hi Nutch user group,
I installed Nutch from the trunk. I wanted NTLM authentication. I
included protocol-httpclient in nutch-site.xml. Next I added the
properties
It seems you have not set the NTLM-related properties in nutch-site.xml.
These are the properties you need to set.
http.auth.ntlm.username
http.auth.ntlm.password
http.auth.ntlm.domain
http.auth.ntlm.host
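A sketch of the corresponding nutch-site.xml entries (all values
hypothetical):
<property>
  <name>http.auth.ntlm.username</name>
  <value>jdoe</value>
</property>
<property>
  <name>http.auth.ntlm.password</name>
  <value>secret</value>
</property>
<property>
  <name>http.auth.ntlm.domain</name>
  <value>CORP</value>
</property>
<property>
  <name>http.auth.ntlm.host</name>
  <value>intranet.example.com</value>
</property>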
Regards,
Susam Pal
http://susam.in/
On 9/13/07, Smith Norton [EMAIL PROTECTED] wrote:
I
,
Susam Pal
http://susam.in/
On 8/21/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Hi all,
I am new to Nutch. While trying to create indexes, I am getting the following
errors/exceptions:
.
.
.
fetching http://192.168.36.199/
fetch of http://192.168.36.199/ failed
go wrong too: the
crawl DB might be corrupt or incomplete, you might not have a 'crawl'
directory present, etc. But first try out different search strings and
see if it works fine.
Regards,
Susam Pal
http://susam.in/
On 8/21/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Ya Thanks
are the files for the Nutch web GUI located in the source?
I guess you are looking for the files in 'src/web/jsp'.
Regards,
Susam Pal
http://susam.in/