Issues in recrawling

2010-04-27 Thread arpit khurdiya
Hi, I'm new to the world of Nutch. I am trying to crawl local file systems on a LAN using Nutch 1.0 and then search them with Solr. Documents are rarely modified, but they are frequently added and deleted, so the recrawl frequency is one day. I have a few queries regarding recrawling. 1. What

Re: Recrawling Nutch

2009-10-14 Thread Paul Tomblin
segment, get to the URL and decide whether to get the content based on whether the URL has been updated since? Shreekanth Prabhu

recrawling

2009-07-17 Thread Neeti Gupta

Re: recrawling

2009-07-14 Thread Neeti Gupta
-- Lucene - Solr - Nutch --- Original Message: From: Neeti Gupta neeti_gupt...@yahoo.com To: nutch-user@lucene.apache.org Sent: Wednesday, June 24, 2009 7:52:47 AM Subject: recrawling -- we had made a crawler that visits various sites, and I want the crawler to crawl sites as soon

Re: recrawling

2009-07-14 Thread Sjaiful Bahri
You have to detect changes in the web content. http://zipclue.com --- On Tue, 7/14/09, Neeti Gupta neeti_gupt...@yahoo.com wrote: From: Neeti Gupta neeti_gupt...@yahoo.com Subject: Re: recrawling To: nutch-user@lucene.apache.org Date: Tuesday, July 14, 2009, 6:50 AM -- But are there any rules

recrawling

2009-06-24 Thread Neeti Gupta
We had made a crawler that visits various sites, and I want the crawler to crawl sites as soon as they are updated. Can anyone help me figure out how I can know when a site has been updated and it is time to crawl again?

Re: recrawling

2009-06-24 Thread Otis Gospodnetic

recrawling

2009-05-06 Thread abdessalemDridi
Hi, I have a problem: I need to know how to supply the URLs of a site to my reindexing script. Thanks.

Recrawling updated pages

2008-12-31 Thread Rinesh1
have also tried floating values like 15f. Please give your inputs. Regards, Rinesh

Regarding recrawling in nutch

2008-11-13 Thread Rinesh Kumar
Hi, I wanted some input on recrawling. 1. What is the difference between crawling and recrawling? Is it getting new web pages on the site, updated pages, or both? How will it come to know about the new, updated, and already-crawled pages? 2. How can I test whether recrawling

recrawling nutch

2008-10-06 Thread abdessalem
So many links on the web, and no radical solution. Is there an efficient, definitive script for recrawling, without restarting the client web app, and not just doing a new crawl and replacing the old index (which takes so much time)? I have version 0.9 of Nutch; later versions are welcome also

Recrawling script

2008-09-17 Thread salah Elabidi
So many links on the web, and no radical solution. Is there an efficient, definitive script for recrawling, without restarting the client web app, and not just doing a new crawl and replacing the old index (which takes so much time)? I have version 0.9 of Nutch; later versions are welcome also

schedule recrawling in nutch

2008-08-25 Thread nalgonda
Hi, I found 3 scripts for scheduled recrawling, but my doubt is: do they run on the command prompt, or do we create a file? If we create a file, where do we put it? If anyone has an idea, please share it. #!/bin/sh sh /apps/Linux64/nutch/bin/nutch.sh > /dev/null sh /apps/Linux64/nutch/bin/nutch-merge.sh > /dev
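
The usual answer to the question above (a sketch; NUTCH_HOME is an assumption, the two script paths are taken from the message): save the commands to a file, make it executable, and let cron run it, rather than typing them at a prompt.

#!/bin/sh
# recrawl.sh -- nightly wrapper around the two Nutch scripts mentioned above
NUTCH_HOME=/apps/Linux64/nutch
$NUTCH_HOME/bin/nutch.sh > /dev/null 2>&1
$NUTCH_HOME/bin/nutch-merge.sh > /dev/null 2>&1

Make it executable with chmod +x recrawl.sh, then add a crontab entry such as 0 2 * * * /path/to/recrawl.sh to run it nightly at 2 a.m.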

RE: Recrawling without deleting crawl directory

2008-03-23 Thread Vinci

RE: Recrawling without deleting crawl directory

2008-03-19 Thread Jean-Christophe Alleman
Hi Susam Pal, and thanks for your help! The solution you gave me doesn't work... I still have an error with Hadoop... And if I download an older version of the API, will this patch work? I have Nutch 0.9, and I don't know whether the patch will work if I compile with an older Hadoop API. But

RE: Recrawling without deleting crawl directory

2008-03-19 Thread Jean-Christophe Alleman
Hi, forget what I said -- this works fine! It's morning and I'm still not awake :-D I just want to know: is it possible to re-index modified documents? Or re-index documents which are already in the database? Thanks in advance! Jisay

RE: Recrawling without deleting crawl directory

2008-03-18 Thread Jean-Christophe Alleman
Hi, I'm interested in this patch but I can't apply it. I have some problems when I try to patch... Here is what I do: debian:~/patch# patch -p0 < NUTCH-601v0.3.patch -- can't find file to patch at input line 5. Perhaps you used the wrong -p or --strip option? The text leading up to this was:

Re: Recrawling without deleting crawl directory

2008-03-18 Thread Susam Pal
The patch was generated for the Nutch 1.0 development version, which is currently in trunk, so it is unable to patch your older version cleanly. I also see that you are using NUTCH-601v0.3.patch; however, NUTCH-601v1.0.patch is the recommended patch. If this patch fails, you can make the
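
For reference, applying such a patch typically looks like this (a sketch; the -p level depends on how the paths inside the patch were generated, and it must be run from the directory the patch was made against):

cd nutch-trunk                    # top of the source checkout the patch targets
patch -p0 < NUTCH-601v1.0.patch   # apply; -p0 keeps the paths as-is
ant                               # rebuild Nutch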

RE: Recrawling without deleting crawl directory

2008-03-18 Thread Jean-Christophe Alleman
Thanks for your reply, Susam Pal! I have run ant and I have an error I can't resolve... Look at this: debian:~/nutch-0.9# ant Buildfile: build.xml init: [unjar] Expanding: /root/nutch-0.9/lib/hadoop-0.12.2-core.jar into /root/nutch-0.9/build/hadoop [untar] Expanding:

Re: Recrawling without deleting crawl directory

2008-03-18 Thread Susam Pal
I am not sure, but it seems that this is because of an older version of Hadoop. I don't have older versions of Nutch or Hadoop with me to confirm this. Just try omitting the second argument in fs.listPaths(indexes, HadoopFSUtil.getPassAllFilter()) and see if it compiles. I guess,

Re: Recrawling without deleting crawl directory

2008-03-14 Thread Susam Pal
The recrawl patch in https://issues.apache.org/jira/browse/NUTCH-601 got committed today. So if you check out the latest trunk, you can recrawl without deleting the crawl directory. However, if you are using an older version, you may use the script at: http://wiki.apache.org/nutch/Crawl Regards,
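
For reference, that wiki script boils down to running the step-by-step tools against the existing crawl directory instead of bin/nutch crawl. A minimal sketch (assuming a 0.9-era layout with crawldb and segments under crawl/; details vary by version):

#!/bin/bash
# re-fetch whatever is due, without deleting the crawl directory
depth=2
for i in `seq 1 $depth`; do
  bin/nutch generate crawl/crawldb crawl/segments   # select URLs due for fetching
  segment=`ls -d crawl/segments/* | tail -1`        # newest segment
  bin/nutch fetch $segment                          # fetch them
  bin/nutch updatedb crawl/crawldb $segment         # fold the results back into the crawldb
done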

Recrawling without deleting crawl directory

2008-03-13 Thread Bradford Stephens
Greetings, A coworker and I are experimenting with Nutch in anticipation of a pretty large rollout at our company. However, we seem to be stuck on something -- after the crawler is finished, we can't manually re-crawl into the same directory/index! It says "Directory already exists" when we try to

Re: About link analysis and filter usage, and Recrawling

2008-03-12 Thread Vinci

About link analysis and filter usage, and Recrawling

2008-03-11 Thread Vinci
? Is there any method other than bin/nutch readlinkdb -dump? 2. I want all of my pages crawled, not just the updated ones, but I know I will do the recrawling based on those crawled pages. Is there any method other than dumping the crawldb every time? 3. If I need to process the crawled pages in a more

Re: About link analysis and filter usage, and Recrawling

2008-03-11 Thread Enis Soztutar
will do the recrawling based on those crawled pages. Is there any method other than dumping the crawldb every time? Yes, you can process the crawldb as above, or use updatedb etc. 3. If I need to process the crawled pages in a more flexible way, is it better to dump the document

Re: About link analysis and filter usage, and Recrawling

2008-03-11 Thread Vinci
. So if I use another indexer like Solr, do I need to do additional processing on the page in order to keep the source link information (like adding the source link information)? Enis Soztutar wrote: 5. Is there any method to keep Nutch from recrawling a page in the recrawl script? (e.g. not to crawl a page

Recrawling with nutch-1.0-dev

2007-10-24 Thread Paolo Castagna
Hi, I am using nutch-1.0-dev to crawl an Intranet. My problem is recrawling. I found interesting pointers on a weblog post [1], but no solution on how to do a recrawl properly. There is a script [2] on the Nutch wiki but it does not work with the Hadoop Distributed File System
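
The wiki script fails on HDFS largely because it locates the newest segment with a local ls. A sketch of the usual workaround (hedged: the -ls output format differs between Hadoop versions, so the parsing below is an assumption to adapt):

# local filesystem:
segment=`ls -d crawl/segments/* | tail -1`
# rough HDFS equivalent -- list the segment paths and take the newest:
segment=`bin/hadoop dfs -ls crawl/segments | grep crawl/segments/ | awk '{print $1}' | sort | tail -1`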

Re: how to update CrawlDB instead of Recrawling???

2007-08-21 Thread John Mendenhall
.html -Ronny --- Original message: From: Ratnesh,V2Solutions India [mailto:[EMAIL PROTECTED]] Sent: 9 May 2007 15:30 To: nutch-user@lucene.apache.org Subject: how to update CrawlDB instead of Recrawling??? Hi, Ricardo, Greetings of the day, We are using Nutch

Re: how to update CrawlDB instead of Recrawling???

2007-08-20 Thread bikram
RECRAWLING.. Might be helpful for someone. Thanks, bikram. Naess, Ronny wrote: Take a look at this article: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html -Ronny --- Original message: From: Ratnesh,V2Solutions India [mailto:[EMAIL PROTECTED]] Sent

Re: how to update CrawlDB instead of Recrawling???

2007-08-13 Thread srampl

Re: how to update CrawlDB instead of Recrawling???

2007-08-13 Thread Brian Demers

Re: how to update CrawlDB instead of Recrawling???

2007-08-13 Thread Renaud Richardet
Not sure, but I think it's just to flush the cached index... Brian Demers wrote: Why does the web app need to be restarted? Are the index files on the classpath or something? It seems like this is a hack. On 8/13/07, srampl [EMAIL PROTECTED] wrote: Hi, Thanks for this valuable

Re: how to update CrawlDB instead of Recrawling???

2007-08-13 Thread Brian Demers
Does anyone know of a nicer way of doing this? On 8/13/07, Renaud Richardet [EMAIL PROTECTED] wrote: not sure, but I think it's just to flush the cached index...

Re: how to update CrawlDB instead of Recrawling???

2007-08-11 Thread Tomislav Poljak
Hi, if it helps: you don't need to restart Tomcat to load index changes; it is enough to reload the individual web application (without restarting the Tomcat service) by touching the application's web.xml file. This is faster than restarting Tomcat. Add: touch $tomcat_dir/WEB-INF/web.xml to the
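
In script form the tip looks like this (a sketch; $tomcat_dir and the webapp name are assumptions -- the Nutch webapp is often deployed as ROOT):

# after reindexing, reload only the search webapp instead of all of Tomcat:
touch $tomcat_dir/webapps/ROOT/WEB-INF/web.xml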

Re: how to update CrawlDB instead of Recrawling???

2007-08-10 Thread Ratnesh,V2Solutions India

Re: how to update CrawlDB instead of Recrawling???

2007-08-10 Thread srampl

Re: how to update CrawlDB instead of Recrawling???

2007-08-10 Thread srampl

Re: how to update CrawlDB instead of Recrawling???

2007-08-10 Thread Harmesh, V2solutions

Re: how to update CrawlDB instead of Recrawling???

2007-08-10 Thread srampl

Recrawling is not working in Nutch 0.9

2007-07-25 Thread Anuradha doppalapudi
Hi, I have installed Nutch 0.9 and crawled a news website, and I got hits. After that I recrawled the same site; at that time I didn't get hits for the new pages, but I saw updated URLs in the log file. For example: I crawled on the 17th and recrawled on the 23rd, and I saw the 23rd's URLs in the log file like:

Re: Recrawling and Merging

2007-07-14 Thread John Reidy
Kai_testing Middleton wrote: Anuradha brought this up on nutch-dev and I also have a lot of questions regarding recrawling and merging. Unfortunately, many of these questions are not even clearly formulated yet. I have been working on a new blog. I only have two posts on there so far

Recrawling and Merging

2007-07-13 Thread Kai_testing Middleton
Anuradha brought this up on nutch-dev and I also have a lot of questions regarding recrawling and merging. Unfortunately, many of these questions are not even clearly formulated yet. I have been working on a new blog. I only have two posts on there so far but this one: http

how to update CrawlDB instead of Recrawling???

2007-05-09 Thread Ratnesh,V2Solutions India
crawled and storing some useful information. It would be nice if I find any solutions from you or any of your colleagues. With Thanks & Regards, Ratnesh, V2Solutions India

Re: Recrawling (Tomi NA)

2006-09-08 Thread Tomi NA
On 9/7/06, David Wallace [EMAIL PROTECTED] wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index when it's crawled, regardless of whether it's already in there, with a

Re: Recrawling (Tomi NA)

2006-09-08 Thread Andrzej Bialecki
Tomi NA wrote: On 9/7/06, David Wallace [EMAIL PROTECTED] wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index when it's crawled, regardless of whether it's already

Re: Recrawling (Tomi NA)

2006-09-08 Thread Tomi NA
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Tomi NA wrote: On 9/7/06, David Wallace [EMAIL PROTECTED] wrote: Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index

Re: Recrawling

2006-09-07 Thread Tomi NA
On 9/6/06, Andrei Hajdukewycz [EMAIL PROTECTED] wrote: Another problem I've noticed is that it seems the db grows *rapidly* with each successive recrawl. Mine started at 379MB, and it seems to increase by roughly 350MB every time I run a recrawl, despite there not being anywhere near that
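
Much of that growth is usually old segments that have been superseded by re-fetches but never cleaned up. A hedged sketch of the usual containment steps with the 0.8-era tools (the paths are assumptions):

bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments     # merge all segments into one
rm -rf crawl/segments && mv crawl/MERGEDsegments crawl/segments  # replace the old segments
bin/nutch dedup crawl/indexes                                    # delete duplicate documents from the indexes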

Re: Recrawling (Tomi NA)

2006-09-07 Thread David Wallace
Just guessing, but could this be caused by session ids in the URL? Or some other unimportant piece of data? If this is the case, then every page would be added to the index when it's crawled, regardless of whether it's already in there, with a different session id. If this is what's causing

Re: Recrawling

2006-09-06 Thread Andrei Hajdukewycz
problem, honestly, obviously there's a lot of duplicated data in the segments. --- [EMAIL PROTECTED] wrote: From: Andrei Hajdukewycz [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Subject: Recrawling Date: Mon, 4 Sep 2006 13:42:42 -0700 Hi, I've crawled a site of roughly 30,000-40,000 pages

Recrawling

2006-09-04 Thread Andrei Hajdukewycz
Hi, I've crawled a site of roughly 30,000-40,000 pages using the bin/nutch crawl command, which went quite smoothly. Now, however, I'm trying to recrawl it using the script at http://wiki.apache.org/nutch/IntranetRecrawl?action=show . However, when I run the recrawl, generally I end up fetching

Recrawling until there's nothing left, or depth N, whichever comes first

2006-08-01 Thread Benjamin Higgins
Hello, I'm trying to write a crawl script for my intranet, and I've been looking over the one by Matthew Holt. I'd like to do something slightly different, with regard to the generate - fetch - updatedb process. I want to keep fetching until there is nothing left, or until depth N, whichever
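
A sketch of such a loop (under the assumption that generate creates no new segment when no URLs are due, which is how the 0.8-era Generator behaves):

#!/bin/bash
# fetch until nothing is left, or until depth N, whichever comes first
N=10
for i in `seq 1 $N`; do
  before=`ls -d crawl/segments/* 2>/dev/null | wc -l`
  bin/nutch generate crawl/crawldb crawl/segments
  after=`ls -d crawl/segments/* 2>/dev/null | wc -l`
  if [ "$before" -eq "$after" ]; then        # no new segment => nothing left to fetch
    echo "nothing left to fetch, stopping after $i rounds"
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done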

[Fwd: Recrawling... methodology?]

2006-07-31 Thread Matthew Holt
Can anyone offer any insight into this? If I am correct and the recrawl script is currently not working properly, I will update the script and make it available to the community. Thanks, Matt. --- Begin forwarded message: I need some help clarifying whether recrawling is doing exactly what I think

ERROR when recrawling... can ANYONE help?

2006-06-23 Thread Honda-Search Administrator
I'm hoping that my emails actually reach other people, as they've been ignored so far. I just ran a recrawl today to crawl a few injected URLs that I have. At the end of the recrawl I received the following error: 060623 122916 merging segment indexes to:

Re: ERROR when recrawling... can ANYONE help?

2006-06-23 Thread TDLN
Please specify what exact sequence of commands you are using. For incremental crawling it is best to follow the whole-web style process as outlined in the tutorial; the one-stop crawl command cannot be used effectively for that. HTH, Thomas

Re: ERROR when recrawling... can ANYONE help?

2006-06-23 Thread Honda-Search Administrator

Re: ERROR when recrawling... can ANYONE help?

2006-06-23 Thread TDLN

Re: ERROR when recrawling... can ANYONE help?

2006-06-23 Thread Honda-Search Administrator
Does /home/honda/nutch-0.7.2/crawl/segments/20060619230003/index exist at all? Can you confirm that all segments contain index

Re: Recrawling question

2006-06-09 Thread Chris Finne
I have the same problem with 0.7.2. My guess is that updatedb isn't actually adding more links into the webdb. I ran bin/nutch crawl with a depth of 1 and it grabbed the initial page and registered the to-links in the webdb. Then I ran the recrawl script:

Recrawling question

2006-06-06 Thread Matthew Holt
Hi all, I have already successfully indexed all the files on my domain only (as specified in the conf/crawl-urlfilter.txt file). Now when I use the below script (./recrawl crawl 10 31) to recrawl the domain, it begins indexing pages off of my domain (such as Wikipedia, etc.). How do I

Re: Recrawling question

2006-06-06 Thread Stefan Neufeind
Matthew Holt wrote: Hi all, I have already successfully indexed all the files on my domain only (as specified in the conf/crawl-urlfilter.txt file). Now when I use the below script (./recrawl crawl 10 31) to recrawl the domain, it begins indexing pages off of my domain (such as Wikipedia,
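
Stefan's reply is truncated above, but a common cause of exactly this symptom (hedged, since his actual advice is not preserved here): bin/nutch crawl applies conf/crawl-urlfilter.txt, while the step-by-step tools that recrawl scripts call apply conf/regex-urlfilter.txt, so a domain restriction in the former never reaches the recrawl. Mirroring the rule into conf/regex-urlfilter.txt fixes it (example.com is a placeholder):

+^http://([a-z0-9]*\.)*example.com/
-.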

Re: Recrawling question

2006-06-06 Thread Matthew Holt
Stefan, thanks a bunch! I see what you mean. matt

Re: Recrawling question

2006-06-06 Thread Matthew Holt
The recrawl worked this time, and I recrawled the entire db using the -adddays argument (in my case ./recrawl crawl 10 31). However, it didn't find a newly created page. If I delete the database and do the initial crawl over again, the new page is found. Any idea what I'm doing wrong or why

Re: Recrawling question

2006-06-06 Thread Matthew Holt
Just FYI: after I do the recrawl, I do stop and start Tomcat, and still the newly created page cannot be found.

Re: Recrawling question

2006-06-06 Thread Stefan Neufeind
You're missing the actual indexing of the pages :-) This is done inside the crawl command, which does everything in one. After you have fetched everything, use: nutch invertlinks ... nutch index ... Hope that helps. Otherwise let me know and I'll dig out the complete command lines for you. Regards, Stefan
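
The complete command lines Stefan refers to look roughly like this (a sketch for the 0.8-era tools; all paths are assumptions):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments                      # rebuild the link database
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*   # index the fetched segments
bin/nutch dedup crawl/indexes                                               # remove duplicate documents
bin/nutch merge crawl/index crawl/indexes                                   # merge into the index the webapp searches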

Re: Recrawling question

2006-06-06 Thread Matthew Holt
Sorry to be asking so many questions.. Below is the current script I'm using. It's indexing the segments.. so do I use invertlinks directly after the fetch? I'm kind of confused.. thanks. matt --- #!/bin/bash # A simple script to run a Nutch

Re: Recrawling question

2006-06-06 Thread Stefan Neufeind
Oh sorry, I didn't look up the script again from your earlier mail. Hmm, I guess you can live fine without the invertlinks (if I'm right). Are you sure that your indexing works fine? I think if an index exists nutch complains. See if there is any error with indexing. Also maybe try to delete your

Re: Recrawling question

2006-06-06 Thread Matthew Holt
It's writing the segments to a new directory and then, I believe, merging them and the index... or am I reading the script wrong?

Recrawling

2005-09-07 Thread Vanderdray, Jake
I want to apologize in advance for this very basic question, but my searches aren't turning up the answer so far. I've successfully run a crawl and I can search the results. I'd like to update my index by re-crawling my site, but when I try to use the same command I used the first time I

Re: Recrawling

2005-09-07 Thread gekkokid
the documents, just an idea -- I'm not that knowledgeable yet on Nutch/Lucene; hope it helps. --- Original Message: From: Sébastien LE CALLONNEC [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, September 07, 2005 10:13 PM Subject: RE: Recrawling -- Hi Jake, I presume

Re: Recrawling

2005-09-07 Thread carmmello
Your question is, really, very basic, but it seems that Nutch does not provide such a basic command. As far as I know, the only way is to recrawl everything again. It would be nice if the Nutch developers could provide users with such a recrawling tool. Carmmello --- Original Message

Re: Recrawling

2005-09-07 Thread Jack Tang
Hi Jake, Basic, but a pretty hard issue. Right now we re-crawl the website by running the crawl command and put the index into a temp dir. I think the core issue is how to swap the index on the fly. Some indexes may still be referenced by the NutchBean -- should we shut it down? Will MapReduce solve the problem? I mean, can we
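
One pattern that addresses the swap (a sketch under assumptions: the live index is reached through a symlink, and the webapp is reloaded with the touch trick mentioned earlier in this list):

# build the new index in its own directory, then repoint the symlink
ln -sfn /data/nutch/index-new /data/nutch/index
# have Tomcat reload just the search webapp so the NutchBean reopens the index
touch $tomcat_dir/webapps/ROOT/WEB-INF/web.xml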