Re: Crawl yahoo search result page

2010-03-31 Thread reinhard schwab
it is disallowed for robots. http://search.yahoo.com/robots.txt User-agent: * Disallow: /search Disallow: /bin Disallow: /myweb Disallow: /myresults Disallow: /language Kim Theng Chong wrote: Hi all, can Nutch crawl a Yahoo search result page? e.g.:

Re: problem crawling entire internal website

2010-03-26 Thread reinhard schwab
you are trying to refetch an already fetched segment. this can happen if, in your loop, bin/nutch generate crawl/crawldb crawl/segments -topN 1000 does not generate a new segment. you have to check whether this command has generated a new segment; check its exit status. there are some scripts in

Re: Cannot fetch urls with target=_blank

2010-03-24 Thread reinhard schwab
i guess it is filtered out by your url filter configuration. DOMContentUtils.java in the parse-html plugin extracts the links. Stefano Cherchi wrote: As in the subject: when I try to fetch a page whose link should open in a new window (with the attribute target=_new or _blank), Nutch fails. No errors or

Re: String menu

2010-03-01 Thread reinhard schwab
QueroVc wrote: But does crawl-urlfilter.txt accept only single characters rather than strings? If strings are accepted, how do I write the rule? # skip URLs containing certain characters as probable queries, etc. -[...@=] Could it be? # skip URLs containing certain characters as probable queries, etc. - [ menu]

Re: String menu

2010-02-22 Thread reinhard schwab
you can edit regex-urlfilter.txt to exclude those urls if you use the fetch command, or crawl-urlfilter.txt if you use the crawl command. QueroVc wrote: Please could someone tell me how to avoid crawling URLs that contain the word menu. Thanks

Re: SegmentFilter

2010-02-21 Thread reinhard schwab
Andrzej Bialecki wrote: On 2010-02-21 12:36, reinhard schwab wrote: Andrzej Bialecki wrote: On 2010-02-20 23:32, reinhard schwab wrote: Andrzej Bialecki wrote: On 2010-02-20 22:45, reinhard schwab wrote: the content of one page is stored as many as 7 times. http://www.cinema-paradiso.at

Re: SegmentFilter

2010-02-20 Thread reinhard schwab
:: reinhard schwab wrote: i am now implementing this tool by forking SegmentMerger. i have only added an additional filter in the map method and kept the segment name. i was then surprised that the reduce method logs the content of a crawl datum 4 times. why is that? i then logged the content
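a minimal sketch of what that extra filter in the map method can look like - the types and names here are only illustrative, not SegmentMerger's actual ones, and the keep-set would be built beforehand from a solr query restricted to this segment:

  import java.io.IOException;
  import java.util.Set;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // sketch: drop every record whose url is no longer indexed in solr
  void filterRecord(Text key, Writable value, Set<String> keepUrls,
                    OutputCollector<Text, Writable> output, Reporter reporter) throws IOException {
    if (!keepUrls.contains(key.toString())) {
      reporter.incrCounter("SegmentFilter", "filtered", 1); // count dropped records
      return;
    }
    output.collect(key, value); // keep the record (and, unlike SegmentMerger, the segment name)
  }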

Re: SegmentFilter

2010-02-19 Thread reinhard schwab
segment.SegmentFilter - org.apache.nutch.crawl.CrawlDatum 2010-02-19 13:25:54,794 INFO segment.SegmentFilter - reduce 348 regards reinhard schwab wrote: i would like to have a segment filter which filters out unneeded content. i only want to keep the content of pages which are still indexed in solr

Re: Aborting with 10 hung threads.

2010-02-19 Thread reinhard schwab
after adding a synchronized modifier to the addFetchItem method, i have not seen the fetcher hang again. reinhard schwab wrote: after studying the code and the analysis done by Steven Denny in jira, i think he is right. Note that the queue is created and then immediately reaped, and after
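for reference, the fix discussed here is essentially just the added keyword on the queue-insertion method of Fetcher's FetchItemQueues; a sketch along the lines of the Nutch 1.0 code (treat the exact body as an approximation):

  // Fetcher.FetchItemQueues - sketch, not the exact source
  public synchronized void addFetchItem(FetchItem it) {
    FetchItemQueue fiq = getFetchItemQueue(it.queueID); // looks up or lazily creates the per-host queue
    fiq.addFetchItem(it);
    totalSize.incrementAndGet();
    // synchronizing prevents getFetchItem() from reaping the freshly created, still-empty
    // queue before the first item lands in it, which is the race described above
  }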

SegmentFilter

2010-02-14 Thread reinhard schwab
i would like to have a segment filter which filters out unneeded content. i only want to keep the content of pages which are still indexed in solr and which belong to this segment when i query solr by this segment name. is there any existing tool available? SegmentMerger is a no-go for me. it

Re: error while crawling

2010-02-10 Thread reinhard schwab
nutch expects urls to be a directory. create a directory named urls, create a file with any name inside it, and add the urls you want to crawl to that file. Injector: urlDir: urls Input path doesn't exist : C:/cygwin/home/MouadSibel/nutch-0.9/urls Mouad wrote: Hello, i

Re: 'readdb' and 'readseg' commands shows wrong last-modified-date

2010-02-01 Thread reinhard schwab
paul tomblin has posted a diff for handling last modified. i don't know whether an issue has been opened in jira. http://www.mail-archive.com/nutch-user@lucene.apache.org/msg15056.html Rupesh Mankar wrote: Hi, I am using Nutch 1.0. I have successfully crawled our intranet site. But when I

Re: Aborting with 10 hung threads.

2010-01-31 Thread reinhard schwab
/browse/NUTCH-719. A solution has been proposed but I am not sure that it really fixes the problem. J. 2010/1/26 reinhard schwab reinhard.sch...@aon.at: sometimes i see -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1

Re: Aborting with 10 hung threads.

2010-01-31 Thread reinhard schwab
sorry, i had overlooked a method with the same name in FetchItemQueues. line number 394 in my code version, after expanding the import statements. i will test it. reinhard schwab wrote: i have now had the opportunity to test fetching again. it had looked good until now. again the same

Re: Aborting with 10 hung threads.

2010-01-31 Thread reinhard schwab
is not synchronized. if getFetchItem is called before addFetchItem has finished, the queue is reaped and addFetchItem later increments the counter. reinhard schwab wrote: sorry, i had overlooked a method with the same name in FetchItemQueues. line number 394 in my code version after

Re: IOException Error

2010-01-29 Thread reinhard schwab
you have to install some additional jars. read the nutch README; it says: Apache Nutch README Important note: Due to licensing issues we cannot provide two libraries that are normally provided with PDFBox (jai_core.jar, jai_codec.jar), the parser library we use for parsing PDF files. If you

Re: IOException Error

2010-01-29 Thread reinhard schwab
/loocia/nutch-1.0/build.xml:62: Specify at least one source--a file or resource collection. Total time: 0 seconds ant version: Apache Ant version 1.7.0 compiled on April 29 2008 any idea why it doesn't build? reinhard schwab wrote: you have to install some additional jars. read

Aborting with 10 hung threads.

2010-01-25 Thread reinhard schwab
sometimes i see -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1 Aborting with 10 hung threads. if i connect with jconsole, all fetcher threads are sleeping. is something wrong with fetchQueues totalSize? before that it had logged

Re: Remove URL below a certain score

2010-01-24 Thread reinhard schwab
the easiest way i can think of would be to modify CrawlDbReducer. it has a reduce method where it writes the crawl datums to the crawldb when updating the crawl db. there you can filter out the crawl datums with a low score and return before output.collect(key, result); then they are not written to the crawl db.
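a minimal sketch of that guard, to be placed in CrawlDbReducer.reduce() right before the datum is emitted - the threshold value and the variable names result/reporter are assumptions based on the description above:

  // sketch: skip low-scoring entries so they are not written back to the crawldb
  float minScore = 0.01f; // pick your own threshold, or read it from the job configuration
  if (result.getScore() < minScore) {
    reporter.incrCounter("CrawlDB filter", "dropped_low_score", 1);
    return; // returning before output.collect(key, result) removes the url from the new crawldb
  }
  output.collect(key, result);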

Re: repeat fetch of same page without error

2010-01-21 Thread reinhard schwab
using nutch readdb you can dump the entry for the page. i believe the fetch interval of this page is zero. Sunnyvale Fl wrote: Hi, I am using Nutch 0.9.1 and I am having this weird problem - it will repeatedly fetch the same page without error. So if I let it run to 10 levels deep, the

Re: repeat fetch of same page without error

2010-01-21 Thread reinhard schwab
:00 PST 1969 Retries since fetch: 0 Retry interval: 7.0 days Score: 0.0 Signature: 5ec8dc313a9ae4d61c6e8c9d9c18ea26 Metadata: _pst_:success(1), lastModified=0 On Thu, Jan 21, 2010 at 5:00 PM, reinhard schwab reinhard.sch...@aon.at wrote: using nutch readdb you can dump the entry

Re: repeat fetch of same page without error

2010-01-21 Thread reinhard schwab
fetch: 0 Retry interval: 0.0 days Score: 0.0 Signature: 09854146546e5e7fe5def1e1add23037 Metadata: _pst_:success(1), lastModified=0 On Thu, Jan 21, 2010 at 5:50 PM, reinhard schwab reinhard.sch...@aon.at wrote: yes, i mean that. in the java classes it is called fetch interval, see

Re: How do I crawl relative URLs not in href tags?

2010-01-17 Thread reinhard schwab
check the class DOMContentUtils.java in the parse-html plugin; you can modify it to meet your requirements. in general an option value does not contain links. you may apply a heuristic. Joshua J Pavel wrote: So, with HTML like this (from a dropdown box): option
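a hedged sketch of such a heuristic, written as a standalone helper rather than the plugin's real structure - the url-shaped test and the helper name are assumptions, only the <option value="..."> handling comes from the question:

  import java.net.URL;
  import java.util.List;
  import org.w3c.dom.Element;
  import org.w3c.dom.Node;

  // sketch: treat <option value="..."> as an outlink only when the value plausibly is a URL
  static void maybeAddOptionOutlink(URL base, Node node, List<URL> outlinks) throws Exception {
    if (!"option".equalsIgnoreCase(node.getNodeName()) || !(node instanceof Element)) return;
    String value = ((Element) node).getAttribute("value");
    boolean looksLikeUrl = value.startsWith("http://") || value.startsWith("https://")
        || (value.startsWith("/") && value.matches(".*\\.(html?|php|jsp|aspx?)$"));
    if (looksLikeUrl) {
      outlinks.add(new URL(base, value)); // resolve relative values against the page base url
    }
  }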

Re: regex-urlfilter.txt: only crawl .com tld

2010-01-09 Thread reinhard schwab
Ken Ken wrote: /nutch-1.0/conf/regex-urlfilter.txt Hello, I just want to fetch/crawl all .com domain names, so what should I put in the /nutch-1.0/conf/regex-urlfilter.txt file? e.g. +^http://([a-z0-9]*\.)*apache.org/ Correct me if I am wrong. I think the above only crawls/fetches

unicode 2029 paragraph separator

2009-12-21 Thread reinhard schwab
http://www.fileformat.info/info/unicode/char/2029/index.htm i have found that this unicode character breaks JSON deserialization when using SOLR and AJAX. it comes from pdf text. where should i filter out or replace this character? in the pdf parser/text extractor? in the solr indexer? regards reinhard
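one simple option is to normalize the extracted text before it is sent to solr, e.g. in a custom indexing filter or in whatever code feeds the documents; a minimal sketch:

  // sketch: U+2028 (line separator) and U+2029 (paragraph separator) are legal in Java
  // strings, but they break JavaScript eval()-style JSON deserializers in the browser,
  // so replace them with plain spaces before the text reaches solr
  static String stripJsonUnsafeSeparators(String text) {
    if (text == null) return null;
    return text.replace('\u2028', ' ').replace('\u2029', ' ');
  }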

Re: Accessing crawled data

2009-12-16 Thread reinhard schwab
if you don't want to refetch already fetched pages, i can think of 3 possibilities: a/ set a very high fetch interval b/ use a customized fetch schedule class instead of DefaultFetchSchedule and implement there a method public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) which returns
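option b/ could look roughly like this - a sketch only; the status check is one reasonable definition of "already fetched", and the class would then be configured as the fetch schedule (in Nutch 1.0 via the db.fetch.schedule.class property, if i recall the name correctly):

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.DefaultFetchSchedule;

  // sketch: never re-select pages that were already fetched successfully
  public class FetchOnceSchedule extends DefaultFetchSchedule {
    public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
      int status = datum.getStatus();
      if (status == CrawlDatum.STATUS_DB_FETCHED || status == CrawlDatum.STATUS_DB_NOTMODIFIED) {
        return false; // already have the page, skip it at generate time
      }
      return super.shouldFetch(url, datum, curTime);
    }
  }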

Re: How to force recrawl of everything

2009-12-04 Thread reinhard schwab
Peters, Vijaya wrote: I am using Nutch 1.0. I want to perform a 'clean' crawl. I see the force option in this patch: NUTCH-601v1.0.patch https://issues.apache.org/jira/secure/attachment/12375717/NUTCH-601v1.0.patch Do I have to make those code changes, or does Nutch 1.0 have

Re: crawl dates with fetch interval 0

2009-12-02 Thread reinhard schwab
datums have a 60 days retry interval. this crawl datum will be fetched again and again with a 0 days retry interval. i will open an issue in jira and attach a patch. regards reinhard reinhard schwab wrote: i'm observing crawl datums which have a fetch interval of 0. when i dump the segment

crawl dates with fetch interval 0

2009-12-01 Thread reinhard schwab
i'm observing crawl datums which have a fetch interval of 0. when i dump the segment, i see Recno:: 33 URL:: http://www.wachauclimbing.net/home/impressum-disclaimer/comment-page-1/ CrawlDatum:: Version: 7 Status: 65 (signature) Fetch time: Tue Dec 01 23:41:15 CET 2009 Modified time: Thu

Re: dedup dont delete duplicates !

2009-11-25 Thread reinhard schwab
Andrzej Bialecki wrote: BELLINI ADAM wrote: hi, my two urls point to the same page! Please, no need to shout ... If the MD5 signatures are different, then the binary content of these pages is different, period. Use the readseg -dump utility to retrieve the page content from the

Re: AbstractFetchSchedule

2009-11-22 Thread reinhard schwab
Andrzej Bialecki wrote: reinhard schwab wrote: there is a piece of code i don't understand public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) { // pages are never truly GONE - we have to check them from time to time. // pages with too long fetchInterval

AbstractFetchSchedule

2009-11-21 Thread reinhard schwab
there is a piece of code i don't understand public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) { // pages are never truly GONE - we have to check them from time to time. // pages with too long fetchInterval are adjusted so that they fit within // maximum
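paraphrased, the rest of the method does roughly the following (a sketch from memory, not the exact source; maxInterval is the configured maximum fetch interval in seconds):

  // if the next fetch lies further in the future than the maximum interval allows,
  // clamp the page's interval and pull its fetch time back to now - this is what the
  // comment means by pages never being truly GONE: they get re-checked eventually
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f);
    }
    datum.setFetchTime(curTime);
  }
  return datum.getFetchTime() <= curTime; // only fetch once the scheduled time has passed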

Re: How do I block/ban a specific domain name or a tld?

2009-11-11 Thread reinhard schwab
reinhard schwab wrote: opsec wrote: I've added this to my conf/crawl-urlfilter.txt and conf/regex-urlfilter.txt, yet when I start a crawl this domain is heavily spidered. I would like to remove it from my search results entirely and prevent it from being crawled in the future

Re: nutch refetch by db.fetch.interval.default not working

2009-11-04 Thread reinhard schwab
if you want to recrawl urls, you have to generate a new segment, fetch this segment and update the crawl db. example script: bin/nutch generate crawl/crawldb crawl/segments -topN $topN -adddays $adddays segment=`ls -d crawl/segments/* | tail -1` bin/nutch fetch $segment bin/nutch updatedb

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
is the MD5 hash of the content. another reason may be that you have some indexing filters. i don't believe that's the reason here. regards kevin chen wrote: I have had a similar experience. Reinhard schwab responded with a possible fix. See the mail in this group from Reinhard schwab at Sun, 25 Oct 2009 10

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
(db_fetched) So it was successfully fetched. But, according to the indexing log, it still was not sent to the indexer! reinhard schwab wrote: what is the db status of this url in your crawl db? if it is STATUS_DB_NOTMODIFIED, then that may be the reason. (you can check it if you dump your

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
hmm, i have no idea now. check the reduce method in IndexerMapReduce and add some debug statements there. recompile nutch and try it again. caezar wrote: Thanks, checked, it was parsed. Still no answer why it was not indexed. reinhard schwab wrote: yes, it's permanently redirected. you
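the kind of debug statement meant here, as a sketch - the variable names follow the snippet quoted further down in this thread, everything else is an assumption:

  // sketch: log why IndexerMapReduce.reduce() skips a url instead of silently returning
  if (dbDatum == null) {
    LOG.info("skipping " + key + ": no crawldb datum (only inlinks seen for this url)");
    return;
  }
  if (fetchDatum == null || parseText == null || parseData == null) {
    LOG.info("skipping " + key + ": missing fetch datum or parse data");
    return;
  }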

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
is your problem solved now? this can be ok. newly discovered urls will be added to a segment when fetched documents are parsed, if these urls pass the filters. they will not have a crawl datum from Generate because they are unknown until they are extracted. regards caezar wrote: I've compared

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread reinhard schwab
== null) { return; // only have inlinks } in the IndexerMapReduce code. For this page dbDatum is null, so it is not indexed! reinhard schwab wrote: is your problem solved now? this can be ok. newly discovered urls will be added to a segment when

Re: Missing pages from Index in NUTCH 1.0

2009-10-25 Thread reinhard schwab
paul tomblin sent a patch on 14.10.2009 to filter out not-modified pages. that makes sense to me if the index is built incrementally and these pages are already in the index being updated; lucene offers the option to update an index, but in my case i always build a new one. you may

Re: Plug-ins during Nutch Crawl

2009-10-21 Thread reinhard schwab
if you run bin/nutch without any arguments or options, it will show you: Usage: nutch [-core] COMMAND where COMMAND is one of: ... parse parse a segment's pages invertlinks create a linkdb from parsed segments index run the indexer on parsed segments and

Re: crawl always stops at depth=3

2009-10-21 Thread reinhard schwab
part (?). nutchcase wrote: Here is the output from that: TOTAL urls: 297 retry 0: 297 min score:0.0 avg score:0.023377104 max score:2.009 status 2 (db_fetched):295 status 5 (db_redir_perm): 2 reinhard schwab wrote: try bin/nutch readdb crawl

Re: crawl always stops at depth=3

2009-10-20 Thread reinhard schwab
try bin/nutch readdb crawl/crawldb -stats - are there any unfetched pages? nutchcase wrote: My crawl always stops at depth=3. It gets documents but does not continue any further. Here is my nutch-site.xml: <?xml version="1.0"?> <configuration> <property> <name>http.agent.name</name>

Re: LinkDB size difference

2009-09-01 Thread reinhard schwab
it in LinkDB in both cases, but if it has a URL like /img/img.jpg for an image, it's missing from LinkDB in case of execution using separate commands.) Any thoughts? TIA, --Hrishi -Original Message- From: reinhard schwab [mailto:reinhard.sch...@aon.at] Sent: Tuesday, September 01, 2009 3

Re: Regarding relative paths

2009-08-25 Thread reinhard schwab
there is a config option in nutch-default.xml: <property> <name>db.ignore.internal.links</name> <value>true</value> <description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest

Re: crawldb not updating

2009-08-22 Thread reinhard schwab
either you have no seed urls or your filter is too restrictive. also be aware that nutch crawl will use conf/crawl-urlfilter.txt by default, not conf/regex-urlfilter.txt! Aditya Sakhuja wrote: I am having issues getting the data injected into the crawldb. I have set the filter in the

Re: Nutch in C++

2009-08-04 Thread reinhard schwab
Iain Downs wrote: I think there is probably a subtext here (I'm putting words in Otis' mouth, for which my apologies). 'Yes, you could rewrite Nutch in C++ and have that use CLucene.' But you'd be mad to do so! I'm a bit out of date with Nutch, but it's large. And Java to C++ is not

Re: How fetcher works

2009-07-30 Thread reinhard schwab
Saurabh Suman wrote: Hi, I have some confusion regarding Fetcher.java. Does Fetcher fetch the HTML page, store it first and then parse it? Can I just store the html without parsing it? it can. it has a -noParsing option: bin/nutch fetch Usage: Fetcher <segment> [-threads n]

Re: Include/exclude lists

2009-07-29 Thread reinhard schwab
i would suggest that you implement an urlfilter plugin which does that, i.e. one which maps hosts to regexp rules. Paul Tomblin wrote: Is there any way other than the config files to specify the url filter parameters? I have a few dozen sites to crawl, and for each site I want to specify
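a minimal sketch of such a plugin class, assuming the standard URLFilter extension point; the hard-coded rule map and the host names are only for illustration - a real plugin would read them from a file named in its configuration and be enabled via plugin.includes:

  import java.net.URL;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.regex.Pattern;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // sketch: accept a url only if it matches the regexp registered for its host
  public class PerHostURLFilter implements URLFilter {
    private Configuration conf;
    private final Map<String, Pattern> rules = new HashMap<String, Pattern>();

    public PerHostURLFilter() {
      // illustration only - load these from a config file in a real plugin
      rules.put("www.example.com", Pattern.compile("^/docs/.*"));
      rules.put("blog.example.org", Pattern.compile("^/(?!tag/).*"));
    }

    public String filter(String urlString) {
      try {
        URL u = new URL(urlString);
        Pattern p = rules.get(u.getHost());
        if (p == null) return urlString;                            // no rule for this host: keep
        return p.matcher(u.getPath()).matches() ? urlString : null; // null rejects the url
      } catch (Exception e) {
        return null; // malformed urls are rejected
      }
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }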

Re: mergesegs disk space

2009-07-29 Thread reinhard schwab
Doğacan Güney wrote: On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak tpol...@gmail.com wrote: Hi, thanks for your answers, I've configured compression: mapred.output.compress = true mapred.compress.map.output = true mapred.output.compression.type = BLOCK (in xml format in

Re: mergesegs disk space

2009-07-29 Thread reinhard schwab
Doğacan Güney wrote: On Wed, Jul 29, 2009 at 13:11, reinhard schwab reinhard.sch...@aon.at wrote: Doğacan Güney wrote: On Tue, Jul 21, 2009 at 21:50, Tomislav Poljak tpol...@gmail.com wrote: Hi, thanks for your answers, I've configured compression:

Re: Dumping what I have?

2009-07-28 Thread reinhard schwab
yes, there are tools which you can use to dump the content of crawl db, link db and segments. dump=./crawl/dump bin/nutch readdb $crawl/crawldb -dump $dump/crawldb bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb bin/nutch readseg -dump $1 $dump/segments/$1 you will get more info if you

Re: crawl-tool.xml

2009-07-27 Thread reinhard schwab
url in the domain apache.org. * Until someone can explain this... When I use the file crawl-urlfilter.txt the filter doesn't work; instead, use the file conf/regex-urlfilter.txt and change the last line from +. to -. reinhard schwab wrote: i have tried the recrawl script of susam pal

Re: question

2009-07-27 Thread reinhard schwab
i believe it can. check your configuration files, nutch-site.xml and nutch-default.xml. you will find something like <property> <name>plugin.includes</name>

crawl-tool.xml

2009-07-26 Thread reinhard schwab
i have tried the recrawl script of susam pal and wondered why url filtering no longer works. http://wiki.apache.org/nutch/Crawl the mystery is that only Crawl.java adds crawl-tool.xml to the NutchConfiguration: Configuration conf = NutchConfiguration.create(); conf.addResource("crawl-tool.xml");

Re: Pages with Specific URLS.

2009-07-23 Thread reinhard schwab
because? you mean urls which contain a query part? they can be crawled. the default nutch configuration excludes them with this filter rule in conf/crawl-urlfilter.txt: # skip URLs containing certain characters as probable queries, etc. -[...@=] Zaihan wrote: Hi All, I'm sure I've read

Re: dump all outlinks

2009-07-19 Thread reinhard schwab
can dump segment info to a directory, let's say tmps: $NUTCH_HOME/bin/nutch readseg -dump $segment tmps -nocontent Then go to that directory; you should see a file named dump. grep outlink: dump | cut -f5 -d outlinks On Fri, 2009-07-17 at 18:43 +0200, reinhard schwab wrote: is any tool

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
http://www.google.se/robots.txt google disallows it. User-agent: * Allow: /searchhistory/ Disallow: /search Larsson85 wrote: Why isn't nutch able to handle links from google? I tried to start a crawl from the following url http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N And all

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
://www.google.com/terms_of_service.html if you set the user agent properties to a client such as firefox, google will serve your request. reinhard schwab wrote: http://www.google.se/robots.txt google disallows it. User-agent: * Allow: /searchhistory/ Disallow: /search Larsson85 wrote

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
you can check the response of google by dumping the segment: bin/nutch readseg -dump crawl/segments/... somedirectory reinhard schwab wrote: it seems that google is blocking the user agent. i get this reply with lwp-request: Your client does not have permission to get URL /search?q

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
identify nutch as a popular user agent such as firefox. Larsson85 wrote: Any workaround for this? Making nutch identify itself as something else, or something similar? reinhard schwab wrote: http://www.google.se/robots.txt google disallows it. User-agent: * Allow: /searchhistory/ Disallow

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
directives. Dennis reinhard schwab wrote: identify nutch as a popular user agent such as firefox. Larsson85 wrote: Any workaround for this? Making nutch identify itself as something else, or something similar? reinhard schwab wrote: http://www.google.se/robots.txt google disallows it. User

Re: Why cant I inject a google link to the database?

2009-07-17 Thread reinhard schwab
, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,* </description> </property> If I don't have the star at the end I get the same as earlier, No URLs to fetch. And if I do, I get 0 records selected for fetching, exiting. reinhard schwab wrote: identify nutch as a popular

Re: wrong outlinks

2009-07-17 Thread reinhard schwab
Doğacan Güney wrote: On Fri, Jul 17, 2009 at 22:48, reinhard schwab reinhard.sch...@aon.at wrote: when i crawl a domain such as http://www.weissenkirchen.at/ nutch extracts these outlinks. do they come from some heuristics? These are probably coming from the parse-js plugin.

Re: wrong outlinks

2009-07-17 Thread reinhard schwab
reinhard schwab wrote: Doğacan Güney wrote: On Fri, Jul 17, 2009 at 22:48, reinhard schwab reinhard.sch...@aon.at wrote: when i crawl a domain such as http://www.weissenkirchen.at/ nutch extracts these outlinks. do they come from some heuristics