Re: Why Nutch is not crawling all links from web page

2010-04-05 Thread Susam Pal
, it means that your crawler is sacrificing these multiple links because they have a very low rank in which case you might want to increase the 'topN' value. Hope this helps you. Regards, Susam Pal

Re: Nutch segment merge is very slow

2010-04-05 Thread Susam Pal
very deep are also important and you want to crawl them, you might have to sacrifice low ranking URLs by setting a smaller topN value, say, 1000, or whatever works for you. Regards, Susam Pal

Re: Crawling authenticated websites !

2010-03-18 Thread Susam Pal
included it in CC. This feature is not present in Nutch. We have recorded the summary of some old discussions regarding this here: http://wiki.apache.org/nutch/HttpPostAuthentication But this was never implemented. Regards, Susam Pal

Re: Proxy Authentication

2010-03-15 Thread Susam Pal
On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti graziano.alibe...@eng.it wrote: On 13/03/2010 22.55, Susam Pal wrote: On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal susam@gmail.com wrote: On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti graziano.alibe...@eng.it wrote: On 11/03

Re: Proxy Authentication

2010-03-15 Thread Susam Pal
On Tue, Mar 16, 2010 at 12:55 AM, Susam Pal susam@gmail.com wrote: On Mon, Mar 15, 2010 at 2:32 PM, Graziano Aliberti graziano.alibe...@eng.it wrote: On 13/03/2010 22.55, Susam Pal wrote: On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal susam@gmail.com wrote: On Fri, Mar 12, 2010

Re: Proxy Authentication

2010-03-13 Thread Susam Pal
On Fri, Mar 12, 2010 at 3:17 PM, Susam Pal susam@gmail.com wrote: On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti graziano.alibe...@eng.it wrote: On 11/03/2010 16.20, Susam Pal wrote: On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti graziano.alibe...@eng.it  wrote: Hi

Re: Proxy Authentication

2010-03-12 Thread Susam Pal
On Fri, Mar 12, 2010 at 2:09 PM, Graziano Aliberti graziano.alibe...@eng.it wrote: On 11/03/2010 16.20, Susam Pal wrote: On Thu, Mar 11, 2010 at 8:24 PM, Graziano Aliberti graziano.alibe...@eng.it  wrote: Hi everyone, I'm trying to use nutch ver. 1.0 on a system under squid proxy

Re: Proxy Authentication

2010-03-11 Thread Susam Pal
' property in 'conf/nutch-site.xml'? Regards, Susam Pal

Re: R: Using Nutch for only retriving HTML

2009-09-29 Thread Susam Pal
for. You can use it with -r option to recursively download pages and store them as separate files on the hard disk, which is exactly what you need. You might want to use the -np option too. It is available for Windows as well as Linux. Regards, Susam Pal

Re: Authentication Not Occurring

2009-07-06 Thread Susam Pal
/httpclient-auth.xml 3. logs/hadoop.log 4. Output from telnet, netcat, etc. Please go through Need Help? section of http://wiki.apache.org/nutch/HttpAuthenticationSchemes to make sure you haven't missed anything important. Regards, Susam Pal

Re: NTLM Authentication Not Occurring...

2009-06-17 Thread Susam Pal
, Susam Pal

Re: NTLM Authentication Not Occurring...

2009-06-17 Thread Susam Pal
for not being able to help you soon enough as I am on a vacation at a small town with poor internet connectivity. Regards, Susam Pal

Re: NTLM authentication

2009-06-09 Thread Susam Pal
that it mentions that NTLM can not be used to authenticate with both a proxy and the server. Regards, Susam Pal

Re: After test - how to crawl WWW continously?

2009-06-09 Thread Susam Pal
. Regards, Susam Pal

Re: NTLM authentication

2009-06-09 Thread Susam Pal
that it has to do something with the NTLM version. However, I don't have any experience with errors of this kind. So, I can't really tell. This section might be of some help to you: http://hc.apache.org/httpclient-3.x/authentication.html#Known_limitations_and_problems Regards, Susam Pal

Re: Nutch not crawling windows authenticated sites.

2009-05-15 Thread Susam Pal
://hc.apache.org/httpclient-3.x/authentication.html Regards, Susam Pal

Re: Nutch1.0 hadoop dfs usage doesnt seem right . experience users please comment

2009-05-11 Thread Susam Pal
' property in hadoop-site.xml to specify an alternate path for temporary directory. Example: <property> <name>hadoop.tmp.dir</name> <value>/opt/tmp/</value> <description></description> </property> Regards, Susam Pal

Re: Content(source code) of web pages crawled by nutch

2009-05-11 Thread Susam Pal
String(bean.getContent(hitDetails)); You can also see 'src/web/jsp/cached.jsp' or 'cached.jsp' in the directory where Nutch WAR file is deployed to see how NutchBean object is used to get the page content. Regards, Susam Pal

Re: hi Kubes:the question about develop environment!

2009-04-23 Thread Susam Pal
don't see why you can not run Hadoop in VMware virtual machines. Regards, Susam Pal

Re: Nutch 1.0 - NTLM question

2009-03-31 Thread Susam Pal
as the development progressed. Regards, Susam Pal

Re: Nutch 1.0 - NTLM question

2009-03-31 Thread Susam Pal
of these three cases. Regards, Susam Pal On Tue, Mar 31, 2009 at 9:44 PM, Austin, David david.aus...@encana.com wrote: Hi Susam, Thanks for your quick response.  I've gone through the Need Help section.   Modified a few things accordingly. Turned on the debugging using

Re: Nutch 1.0 - NTLM question

2009-03-31 Thread Susam Pal
the logs for #1 as well as #2. It would be interesting to see why the fetch fails for #1 but succeeds for #2. Regards, Susam Pal On Tue, Mar 31, 2009 at 11:01 PM, Austin, David david.aus...@encana.com wrote: Hello again, Did you set the 'http.agent.host' in 'conf/nutch-site.xml' ? I didn't have

Re: keyword crawling

2008-11-16 Thread Susam Pal
I am not sure what exactly you mean by this. One can know whether a page contains a certain keyword or not only after the page has been fetched. Regards, Susam Pal On Mon, Nov 17, 2008 at 11:26 AM, Miao [EMAIL PROTECTED] wrote: Hi all, I have a question about using Nutch. I only want

Re: Not able to crawl password protected pages using NUTCH 0.9

2008-09-22 Thread Susam Pal
://wiki.apache.org/nutch/HttpPostAuthentication Regards, Susam Pal Could you please let me know??? Best regards, Biswajit. Susam Pal wrote: Hi Biswajit, I don't find a single error caused due to authentication problem in the 'new.txt' file you have attached in some mail before.. Most

Re: Not able to crawl password protected pages using NUTCH 0.9

2008-09-19 Thread Susam Pal
time. Regards, Susam Pal On Fri, Sep 19, 2008 at 5:38 AM, biswajit_rout [EMAIL PROTECTED] wrote: Hi Susam, Please give a look into the attached file (new.txt) and suggest a solution for this. This time i have crawled another site. I am able to crawl all the public pages but password

Re: Temporary storage during crawling

2008-09-16 Thread Susam Pal
You can use the 'hadoop.tmp.dir' property in hadoop-site.xml to specify an alternate path for temporary directory. Example: <property> <name>hadoop.tmp.dir</name> <value>/home2/tmp/</value> <description></description> </property> Regards, Susam Pal On Tue, Sep 16, 2008 at 10:50 AM, Srinivas Gokavarapu

Re: Not able to crawl password protected pages using NUTCH 0.9

2008-09-16 Thread Susam Pal
, Susam Pal On Tue, Sep 16, 2008 at 1:33 PM, biswajit_rout [EMAIL PROTECTED] wrote: Hi Susam, The ip 10.222.18.113 is nothing but the ip address of my machine(localhost). Now also i changed http://localhost:8080/ to http://10.222.18.113:8080. However no result, i mean to say still not able

Re: Not able to crawl password protected pages using NUTCH 0.9

2008-09-16 Thread Susam Pal
to fetch a page but fails due to authentication, then it is a problem with authentication. In this case, it is not even attempting to fetch those pages. So, the problem lies elsewhere. You need to first find out why it is fetching only one page and not others. Regards, Susam Pal On Tue, Sep 16, 2008

Re: Not able to crawl password protected pages using NUTCH 0.9

2008-09-15 Thread Susam Pal
The logs show that it is fetching http://localhost:8080/ but you have set credentials for 10.222.18.113:8080 which is never being fetched. So, no authentication takes place. Regards, Susam Pal On Mon, Sep 15, 2008 at 1:20 PM, biswajit_rout [EMAIL PROTECTED] wrote: Hi Susam, In order to crawl

Re: how does nutch connect to urls internally?

2008-06-16 Thread Susam Pal
'. To enable the DEBUG logs for a particular package, say, the httpclient package, you can open 'conf/log4j.properties' and add the following line: log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout Regards, Susam Pal On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann [EMAIL PROTECTED] wrote

Re: How to authenticate with cookies?

2008-05-08 Thread Susam Pal
Please see my reply inline. On Thu, May 8, 2008 at 12:04 PM, POIRIER David [EMAIL PROTECTED] wrote: Yoav, You are right. With the help of the protocol-httpclient plugin you will be able to use cookies when crawling. There is one thing that you need to watch out though (quoting Susam Pal

Re: nutch 0.9 no results ??

2008-05-01 Thread Susam Pal
with .+ Regards, Susam Pal On Thu, May 1, 2008 at 2:39 PM, ili chimad [EMAIL PROTECTED] wrote: Hi, i'm using nutch 0.9 with tomcat6 / Windows-Vista+cygwin for 2days only before sending this mail i read many posts here but i didn't find this problem, after finishing the crawl step

Re: nutch 0.9 no results ??

2008-05-01 Thread Susam Pal
is causing the problem. You could then try something like C:/nutch-0.9/crawl/ and see if it works. By the way, did you try searching from command prompt using the bin/nutch crawl command. That will ensure that your index is correct and provides results. Regards, Susam Pal

Re: fetching error

2008-04-10 Thread Susam Pal
Please see my previous mail and tell us what you get when you run those commands. Regards, Susam Pal On 4/10/08, subrat mahanty [EMAIL PROTECTED] wrote: dear i am try to use the NTLM proxy instade of http because the http due to the error : org.apache.nutch.protocol.http.api.HttpException

Re: Nutch fetching skipped files

2008-04-04 Thread Susam Pal
first time and get 0 hits? Regards, Susam Pal Susam Pal wrote: Find my reply inline. On Wed, Apr 2, 2008 at 5:04 PM, Vineet Garg [EMAIL PROTECTED] wrote: Hi, I am using Nutch to crawl local file system. I am crawling by bin/nutch crawl urls -dir crawl -depth 5 -topN 500

Re: fetching error

2008-04-03 Thread Susam Pal
are able to resolve the domain name into IP address. Regards, Susam Pal On Thu, Apr 3, 2008 at 3:38 PM, subrat mahanty [EMAIL PROTECTED] wrote: Dear i am new in nutch and get a fetching error as org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: so

Re: Nutch fetching skipped files

2008-04-03 Thread Susam Pal
is included or ignored. Hope this helps. Regards, Susam Pal # skip image and other suffixes we can't yet parse -\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$ What could be the reason?? Regards, Vineet
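The rule quoted above is a Java regular expression from conf/crawl-urlfilter.txt; the leading '-' means any URL matching it is rejected. A minimal Python sketch of how such a suffix filter behaves (the URLs are invented examples):

```python
import re

# The default suffix-skipping rule quoted above, without the leading '-'
# (in crawl-urlfilter.txt a '-' prefix means "reject URLs matching this regex").
SKIP_SUFFIXES = re.compile(
    r"\.(css|gif|GIF|jpg|JPG|png|PNG|ico|ICO|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"
)

def is_rejected(url: str) -> bool:
    """Return True if the URL would be dropped by this filter rule."""
    return SKIP_SUFFIXES.search(url) is not None

print(is_rejected("http://example.com/logo.png"))   # True: .png is skipped
print(is_rejected("http://example.com/page.html"))  # False: HTML passes through
```

Note the rule is case-sensitive, which is why the list spells out upper-case variants like JPG and PNG separately.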

Re: Crawl dies unexpectedly

2008-03-31 Thread Susam Pal
generation. Regards, Susam Pal On Mon, Mar 31, 2008 at 7:14 PM, matt davies [EMAIL PROTECTED] wrote: Hi Dennis If you have a crawl depth of 3 then there should be only 3 segments/* folder Thanks for that titbit, that makes a bit more sense now. I have no idea where the other ones

Re: Recrawling without deleting crawl directory

2008-03-18 Thread Susam Pal
manually to your Nutch 0.9 source code directory. Once you make the changes, just build your project again with ant and you would be ready for recrawl. Regards, Susam Pal On Tue, Mar 18, 2008 at 7:12 PM, Jean-Christophe Alleman [EMAIL PROTECTED] wrote: Hi, I'm interested by this patch but I

Re: Recrawling without deleting crawl directory

2008-03-18 Thread Susam Pal
(indexes) should work since I can find such a method (though it is deprecated now) in the latest Hadoop API. Regards, Susam Pal On Tue, Mar 18, 2008 at 9:09 PM, Jean-Christophe Alleman [EMAIL PROTECTED] wrote: Thank's for your reply Susam Pal ! I have run ant and I have an error I can't

Re: Recrawling without deleting crawl directory

2008-03-14 Thread Susam Pal
, Susam Pal On Fri, Mar 14, 2008 at 3:48 AM, Bradford Stephens [EMAIL PROTECTED] wrote: Greetings, A coworker and I are experimenting with Nutch in anticipation of a pretty large rollout at our company. However, we seem to be stuck on something -- after the crawler is finished, we can't

Re: Problem in running Nutch where proxy authentication is required.

2008-03-13 Thread Susam Pal
this line: log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout add this line:- log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout 3. Save conf/log4j.properties and delete all files in 'logs' directory. 4. Do a new crawl and obtain the new log. Regards, Susam Pal On Wed, Mar 12

Re: started today

2008-03-07 Thread Susam Pal
server. You can see the logs in 'logs/catalina.out' file of Tomcat. Regards, Susam Pal On Fri, Mar 7, 2008 at 8:40 PM, vanderkerkoff [EMAIL PROTECTED] wrote: Hello everyone i started looking at nutch today and have installed my ubuntu box, followed alot of advice and have run a crawl

Re: started today

2008-03-07 Thread Susam Pal
inside crawl directory. When you start Tomcat, NutchBean would search for 'crawl' directory in the directory you are starting Tomcat. Regards, Susam Pal On Fri, Mar 7, 2008 at 9:11 PM, matt davies [EMAIL PROTECTED] wrote: Does the order of this and the places the commands are being run look ok

Re: started today

2008-03-07 Thread Susam Pal
://lucene.apache.org/nutch/tutorial8.html Regards, Susam Pal On Fri, Mar 7, 2008 at 9:34 PM, matt davies [EMAIL PROTECTED] wrote: Well that's worked a treat, thanks again Susam I've now got to start adding other sites to the index. Is it simply adding a line like this +^http://([a-z0-9

Re: problem while indexing

2008-03-03 Thread Susam Pal
Do you put the URLs to all 35 documents in the text file? If yes, you can check logs/hadoop.log to see if any fetch fails. If not, may be some of the documents are too deep and increasing the depth value while crawling, might solve the problem. Regards, Susam Pal On 3/3/08, Jean-Christophe

Re: Help to understand the crawl filter

2008-02-19 Thread Susam Pal
-urlfilter.txt works. Regards, Susam Pal

Re: How to do nutch inject?

2008-02-19 Thread Susam Pal
You will also find a logs/hadoop.log file. Do you find any clue here? Maybe, instead of trying to inject dmoz you can try injecting a set of 4 to 10 URLs written in a file and see the hadoop.log file and find out what is going wrong. Regards, Susam Pal On 2/20/08, Nick Duan [EMAIL PROTECTED

Re: Limiting Crawl Time

2008-02-06 Thread Susam Pal
the top 1000 URLs for this particular crawl. For the next crawl, again top 1000 URLs would be generated. Regards, Susam Pal -Original Message- From: Susam Pal [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 05, 2008 10:36 PM To: nutch-user@lucene.apache.org Subject: Re: Limiting Crawl

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

2008-02-06 Thread Susam Pal
. Regards, Susam Pal

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

2008-02-06 Thread Susam Pal
Susam Pal wrote: (2) In Generator.java, the normalizers.normalize() statement is inside the following 'if' block. Generator.java (Line: 186) if (maxPerHost > 0) { I am curious to know why we should avoid URL normalization if generate.max.per.host = -1 (which also happens

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

2008-02-06 Thread Susam Pal
Yes, this should be a simple patch. I will upload one tomorrow. Regards, Susam Pal On Feb 7, 2008 12:11 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Susam Pal wrote: I am adding a few more observations. On Feb 6, 2008 1:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote: For the generator

Questions on normalizer and filter related code in Crawl, Injector and Generator

2008-02-05 Thread Susam Pal
issue for this and submit a one line fix? Regards, Susam Pal

Re: Urgent help reqd.....plz

2008-02-05 Thread Susam Pal
You have not added nutch-default.xml and nutch-site.xml to your Configuration object. Adding the following two lines to your code should solve the problem:- conf.addDefaultResource(nutch-default.xml); conf.addDefaultResource(nutch-site.xml); Regards, Susam Pal On Feb 6, 2008 12:17 AM, devj

Re: Limiting Crawl Time

2008-02-05 Thread Susam Pal
Did you try specifying a topN value? -depth 3 -topN 1000 should be close to what you want. On 2/6/08, Paul Stewart [EMAIL PROTECTED] wrote: Hi folks... What is the best way to say limit crawling to perhaps 3-4 hours per day? Is there a way to do this? Right now, I have a crawl depth of 6

Recrawl using org.apache.nutch.crawl.Crawl

2008-01-31 Thread Susam Pal
. at org.apache.nutch.crawl.Crawl.main(Crawl.java:89) This patch doesn't affect the crawl without the -force option. Is this going to be useful? I have included the patch both as text (after the signature) and as an attachment. Regards, Susam Pal Index: src/java/org/apache/nutch/crawl

Re: Stats?

2008-01-31 Thread Susam Pal
Try this command:- bin/nutch readdb crawl/crawldb -stats To get help, try:- bin/nutch readdb Regards, Susam Pal On Feb 1, 2008 8:21 AM, Paul Stewart [EMAIL PROTECTED] wrote: Hi folks... Is there a way to retrieve stats from Nutch - meaning how many webpages are indexed, to be indexed

Re: Can Nutch use part of the url found for the next crawling?

2008-01-30 Thread Susam Pal
crawl-urlfilter.txt and regex-urlfilter.txt are used to block or allow certain URLs to be called. It does not allow you to extract a URL from another. You might want to use conf/regex-normalize.xml to do this. Regards, Susam Pal On Jan 31, 2008 1:43 AM, Vinci [EMAIL PROTECTED] wrote: hi, I
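As the reply notes, conf/regex-normalize.xml rewrites URLs via pattern/substitution pairs. A sketch of one such rule; the redirect URL format and host are invented for illustration:

```xml
<regex-normalize>
  <!-- Hypothetical rule: rewrite http://example.com/redirect?url=TARGET
       to TARGET, i.e. extract one URL embedded inside another. -->
  <regex>
    <pattern>^http://example\.com/redirect\?url=(.*)$</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```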

Re: Simple crawl fails to find any URLs

2008-01-28 Thread Susam Pal
directory. 2. The command you used to run the crawl. 3. What changes you did in conf/crawl-urlfilter.txt 4. Does the site you are crawling have link to other pages? Regards, Susam Pal On Jan 29, 2008 1:04 AM, Barry Haddow [EMAIL PROTECTED] wrote: Hi I'm try to get the nutch/hadoop example from

Re: 'crawled already exists' - how do I recrawl?

2008-01-12 Thread Susam Pal
You can try the crawl script: http://wiki.apache.org/nutch/Crawl Regards, Susam Pal On Jan 13, 2008 8:36 AM, Manoj Bist [EMAIL PROTECTED] wrote: Hi, When I run crawl the second time, it always complains that 'crawled' already exists. I always need to remove this directory using 'hadoop dfs

Re: 'crawled already exists' - how do I recrawl?

2008-01-12 Thread Susam Pal
, Susam Pal On Jan 13, 2008 11:19 AM, Manoj Bist [EMAIL PROTECTED] wrote: Thanks for the response. I tried this with nutch-0.9. The script seems to be accessing non-existent file/dirs. Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /user/nutch/-threads

Re: Problem with recrawl

2008-01-10 Thread Susam Pal
instead. Regards, Susam Pal On Jan 10, 2008 6:34 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi there, I'm actually having weird problems with my recrawl procedure (nutch0.9). The situation is the following: First, I crawl a couple of domains. Then, I start a seperate crawl with a pages

Re: some crawl problems

2008-01-09 Thread Susam Pal
be able to reach that page. Regards, Susam Pal On Jan 10, 2008 3:56 AM, [EMAIL PROTECTED] wrote: Hello all, I am using nutch 9 and when I fetch a couple of sites nutch does not include pages other that the main one. For example, if I have mysite.com/cv.htm, nutch fetches only mysite.com

Re: nutch crawl problem

2008-01-07 Thread Susam Pal
that is allowed by 'conf/crawl-urlfilter.txt'. Regards, Susam Pal On Jan 7, 2008 8:56 AM, [EMAIL PROTECTED] wrote: why i can crawl http://game.search.com but i can't crawl http://www.search.com? conf/crawl-urlfilter is # skip file:, ftp:, mailto: urls -^(file|ftp|mailto): # skip image and other

Re: form-based authentication?

2008-01-05 Thread Susam Pal
, the whole job of authentication can be done within the protocol-httpclient plugin. However, in this, some job has to be done in the fetcher, outside the plugin also. If I get some free time, I'll try to work on this. Regards, Susam Pal On Jan 6, 2008 12:11 AM, Martin Kuen [EMAIL PROTECTED] wrote

Re: Http 407 error

2008-01-03 Thread Susam Pal
.* properties. Ideally, you should also set the http.agent.host property properly though I have never found this to cause a problem.) Regards, Susam Pal On Jan 3, 2008 12:47 PM, Nidhi malik [EMAIL PROTECTED] wrote: I am sending my Hadoop file and I applied also patch559V0.5 at the time of fetching I

Re: hadoop file and nutch-407 error

2008-01-03 Thread Susam Pal
received: 2337 2008-01-02 21:55:32,900 DEBUG httpclient.Http - url: https://mail.yahoo.com/; status code: 200; bytes received: 26291 If DEBUG lines are missing, it means you have either not enabled DEBUG properly or you have not successfully patched and built Nutch. Regards, Susam Pal On Jan 4, 2008 12

Re: System.out.println(parsetext.getText()) prints non readable chars - Please help

2008-01-02 Thread Susam Pal
just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. </description> </property> Regards, Susam Pal On Jan 2, 2008 10:45 PM

Re: Http-407 - authentication problem on Nutch -0.8

2008-01-01 Thread Susam Pal
Your configuration seems fine. Ideally http.agent.url should point to a page where you describe your crawler, but that shouldn't cause an error. If you are facing any problem, please post the relevant logs from logs/hadoop.log and describe your problem in detail. Regards, Susam Pal On 1/1/08

Re: nutch internet crawling help

2007-12-31 Thread Susam Pal
probably because you do not have permission over the log file, hadoop.log. Checking the permissions and setting the proper permissions might work. Regards, Susam Pal On Dec 28, 2007 4:58 PM, NIDHI MALIK [EMAIL PROTECTED] wrote to [EMAIL PROTECTED]: Hello, I am facing problem in using

Re: adding domain to recrawl

2007-12-18 Thread Susam Pal
For point (1), isn't bin/nutch freegen command enough for what you want? Regards, Susam Pal On Dec 18, 2007 5:05 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi there, I have the following problem to solve: I already crawled a couple of domains and can also recrawl them frequently

Re: Problems testing Authentication

2007-11-28 Thread Susam Pal
that is being discussed here:- http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10030.html Regards, Susam Pal http://susam.in/ On Nov 28, 2007 6:20 PM, [EMAIL PROTECTED] wrote: I have tried to use Susam Pal's patch (Nutch-559) NTLM, Basic and Digest Authentication schemes for web/proxy

Re: Problems testing Authentication

2007-11-28 Thread Susam Pal
I have just uploaded NUTCH-559v0.5.patch in JIRA https://issues.apache.org/jira/browse/NUTCH-559. It works fine too with Tomcat Basic authentication. I tested it with the same configuration and commands that I mentioned in my previous mail. Regards, Susam Pal On Nov 28, 2007 9:50 PM, Susam Pal

Re: No space left on device

2007-11-22 Thread Susam Pal
with about 20 GB free space and I never face a problem. Regards, Susam Pal On Nov 23, 2007 4:32 AM, Josh Attenberg [EMAIL PROTECTED] wrote: i have added <property> <name>hadoop.tmp.dir</name> <value>/opt/tmp</value> <description>Base for Nutch Temporary Directories</description> </property> (with opt/tmp

Re: No space left on device

2007-11-20 Thread Susam Pal
a different directory for writing the temporary files. <property> <name>hadoop.tmp.dir</name> <value>/opt/tmp</value> <description>Base for Nutch Temporary Directories</description> </property> Regards, Susam Pal On Nov 21, 2007 8:54 AM, Josh Attenberg [EMAIL PROTECTED] wrote: I had this error when

Re: No space left on device

2007-11-20 Thread Susam Pal
logs too when an error occurs. Regards, Susam Pal On Nov 21, 2007 10:28 AM, Josh Attenberg [EMAIL PROTECTED] wrote: i did as you say, and moved the files to a new directory on a big drive, but now have some additional errors. are there any other pointers i need to update? On Nov 20, 2007 11:33

Re: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09096.html

2007-11-19 Thread Susam Pal
work fine for Nutch 0.9 too. We had a discussion on re-crawling for Nutch 1.0-dev here:- http://www.mail-archive.com/nutch-user@lucene.apache.org/msg09514.html Please try this script for re-crawling with Nutch-0.9 and let us know how it goes. Regards, Susam Pal On Nov 20, 2007 2:11 AM, Moore

Re: indexing word file

2007-11-16 Thread Susam Pal
file. 4. Logs. Regards, Susam Pal On Nov 16, 2007 3:18 PM, crazy [EMAIL PROTECTED] wrote: hi, tks for your answer but i don't understand what i should do exactly this is my file crawl-urlfilter.txt: # skip file:, ftp:, mailto: urls -^(file|ftp|mailto): # skip image and other suffixes we

Re: indexing word file

2007-11-16 Thread Susam Pal
can ignore this part and instead change the last line of this file from:- -. to +. Regards, Susam Pal On Nov 16, 2007 1:45 PM, crazy [EMAIL PROTECTED] wrote: Hi, i install nutch for the first time and i want to index word and excel document even i change the nutch-default.xml : property
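The change described above flips the catch-all final rule of conf/crawl-urlfilter.txt from reject-everything-else to accept-everything-else:

```
# final rule of conf/crawl-urlfilter.txt
# before: reject anything not matched by an earlier rule
-.
# after: accept anything not matched by an earlier rule
+.
```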

Re: indexing word file

2007-11-16 Thread Susam Pal
and this would help you understand Nutch better. Regards, Susam Pal On Nov 16, 2007 4:59 PM, crazy [EMAIL PROTECTED] wrote: i change my seed urls file to this http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc and i have this like result: fetching http://www.frlii.org/IMG/doc

Re: indexing word file

2007-11-16 Thread Susam Pal
Please try mentioning the protocol in the seed URL file. For example:- http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc I guess, it selects the protocol plugin according to the protocol specified in the URL. Regards, Susam Pal On Nov 16, 2007 4:07 PM, crazy [EMAIL PROTECTED
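Since Nutch picks the protocol plugin from the URL's scheme, a seed URL without one is never fetched. A small Python sketch of this check (the scheme list and second seed URL are illustrative assumptions):

```python
from urllib.parse import urlparse

def has_protocol(seed: str) -> bool:
    """True if the seed URL names a scheme that can be mapped to a protocol plugin."""
    return urlparse(seed).scheme in ("http", "https", "ftp", "file")

seeds = [
    "http://www.frlii.org/IMG/doc/catalogue_a_portail_27-09-2004.doc",  # has a scheme
    "www.example.org/some.doc",                                          # missing scheme
]
for s in seeds:
    print(s, "->", "ok" if has_protocol(s) else "missing protocol")
```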

Re: run the crawl

2007-11-13 Thread Susam Pal
for failed, ERROR, FATAL, etc. Regards, Susam Pal. On Nov 14, 2007 12:29 AM, payo [EMAIL PROTECTED] wrote: hi i run the crawl this way ./bin/nutch crawl urls -dir crawl -depth 3 -topN 500 my urls file http://localhost/test/ my crawl-urlfilter +^http://([a-z0-9]*\.)*localhost/ my nutch

Re: Crawling sites (authentication required)

2007-10-22 Thread Susam Pal
What kind of of authentication is required? Do you have to submit the credentials by POST method or does it require Basic/Digest/NTLM authentication? Regards, Susam Pal On 10/22/07, sujithq [EMAIL PROTECTED] wrote: Hi, Recently I was able to crawl a few sites. But now I have to crawl a site

Re: nutch won't index urls to servlets

2007-10-11 Thread Susam Pal
] You need to comment this line. Regards, Susam Pal http://susam.in/ On 10/11/07, Rohit Trivedi [EMAIL PROTECTED] wrote: Hi, I have an archive page with a bunch of links in it like so: <a href=/servlet/ShowContent?ResourceType=SServerLocation=1ResourceId=1163280qcs>Monthly</a> but nutch doesn't

Re: Newbie query: problem indexing pdf files

2007-10-01 Thread Susam Pal
to the crawler in some form. Regards, Susam Pal http://susam.in/ On 10/1/07, Gareth Gale [EMAIL PROTECTED] wrote: Well, that's a possibility I guess but I was hoping that nutch could be configured to look at a directory and be told to index everything it finds in there Will Scheidegger wrote

Re: Newbie query: problem indexing pdf files

2007-09-28 Thread Susam Pal
URL fetched. These lines would look like:- 2007-09-28 19:16:06,918 INFO fetcher.Fetcher - fetching http://192.168.101.33/url If you do not find any 'fetching' in the logs, it means something is wrong. Most probably the crawl-urlfilter.txt may be wrong. Regards, Susam Pal http://susam.in/ On 9

Re: Newbie query: problem indexing pdf files

2007-09-28 Thread Susam Pal
you were expecting. Regards, Susam Pal http://susam.in/ On 9/28/07, Gareth Gale [EMAIL PROTECTED] wrote: Hope someone can help. I'd like to index and search only a single directory of my website. Doesn't work so far (both building the index and consequent searches). Here's my config :- Url

Re: Newbie query: problem indexing pdf files

2007-09-28 Thread Susam Pal
If you have not set the agent properties, you must set them. http.agent.name http.agent.description http.agent.url http.agent.email The significance of the properties are explained within the description tags. For the time being you can set some dummy values and get started. Regards, Susam Pal
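The four agent properties go in conf/nutch-site.xml. A sketch with dummy values, as the reply suggests (every value below is a placeholder to replace with your own):

```xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler for local indexing</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>crawler-admin@example.com</value>
  </property>
</configuration>
```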

Re: Does authentication work?

2007-09-26 Thread Susam Pal
http.auth.host This should work fine. I'll be revising this patch as per the suggestions of Doğacan in order to reduce the 'diff'. Regards, Susam Pal http://susam.in/ On 9/26/07, Alexis Votta [EMAIL PROTECTED] wrote: I tried the new properties but they don't work. I don't know where the new properties come

Re: Last-modified / creation date or time

2007-09-25 Thread Susam Pal
is stored against Date whereas DublinCore interface (which Metadata implements) defines DATE as:- public static final String DATE = "date"; Regards, Susam Pal http://susam.in/ On 9/25/07, Sebastian Schick [EMAIL PROTECTED] wrote: Hello, we have the same problem. Accidentally I created a new thread

Re: Does authentication work?

2007-09-25 Thread Susam Pal
The properties you are trying were meant for the original protocol-httpclient which doesn't work for NTLM authentication due to a bug. The patch I have submitted uses these properties:- http.auth.username http.auth.password http.auth.realm http.auth.host Please try these. Regards, Susam Pal

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

2007-09-20 Thread Susam Pal
merged segment. So this is strictly new. So, while merging, we are merging NEWindexes with the old indexes into 'crawl/index'. Regards, Susam Pal http://susam.in/ On 9/20/07, Alexis Votta [EMAIL PROTECTED] wrote: Hi Tomislav and Nutch users I could not solve the problem with your instructions

Re: cached page not showing images

2007-09-20 Thread Susam Pal
See NUTCH-281. https://issues.apache.org/jira/browse/NUTCH-281 On 9/20/07, Joseph M. [EMAIL PROTECTED] wrote: I am having a problem with cached pages. images are not showing in them. how can I make images show in them? I am new to Nutch and having difficulties. please help me to show images

Re: Unknown format version:- 3 with Nutch trunk

2007-09-18 Thread Susam Pal
Did you replace the 'webapps/ROOT' with the new one by deploying the .war file generated from the trunk? Regards, Susam Pal http://susam.in/ On 9/17/07, Alexis Votta [EMAIL PROTECTED] wrote: I was using Nutch-0.9 successfully for around one month. Today, I downloaded the trunk, built

Re: protocol-httpclient NTLM authentication fails

2007-09-17 Thread Susam Pal
me know if this solves your problem. Regards, Susam Pal http://susam.in/ On 9/18/07, Aryan Sahoo [EMAIL PROTECTED] wrote: Hi Nutch user group, I installed Nutch from the trunk. I wanted NTLM authentication. I included protocol-httpclient in nutch-site.xml. Next I added the properties

Re: NTLM authentication not working in protocol-httpclient

2007-09-14 Thread Susam Pal
It seems you have not set the NTLM related properties in nutch-site.xml. These are the properties you need to set. http.auth.ntlm.username http.auth.ntlm.password http.auth.ntlm.domain http.auth.ntlm.host Regards, Susam Pal http://susam.in/ On 9/13/07, Smith Norton [EMAIL PROTECTED] wrote: I
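A conf/nutch-site.xml fragment setting the four NTLM properties named above; the values are placeholders for your own credentials and Windows domain:

```xml
<configuration>
  <property>
    <name>http.auth.ntlm.username</name>
    <value>jdoe</value>
  </property>
  <property>
    <name>http.auth.ntlm.password</name>
    <value>secret</value>
  </property>
  <property>
    <name>http.auth.ntlm.domain</name>
    <value>CORPDOMAIN</value>
  </property>
  <property>
    <name>http.auth.ntlm.host</name>
    <value>crawler-host</value>
  </property>
</configuration>
```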

Re: Problem in creating Index

2007-08-21 Thread Susam Pal
, Susam Pal http://susam.in/ On 8/21/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi all, I am new to Nutch. While trying to create indexes, i am getting following errors/exceptions: . . . fetching http://192.168.36.199/ fetch of http://192.168.36.199/ failed

Re: Problem in creating Index

2007-08-21 Thread Susam Pal
go wrong too like the crawl DB might be corrupt or incomplete, you might not have a 'crawl' directory present, etc. but first try out different search strings and see if it works fine. Regards, Susam Pal http://susam.in/ On 8/21/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Ya Thanks

Re: Any patch for navigation of pages?

2007-08-21 Thread Susam Pal
are the files for the Nutch web gui located in the source? I guess you are looking for the files in 'src/web/jsp'. Regards, Susam Pal http://susam.in/
