from:"Andrzej Bialecki"

PDFBox (Re: Nutch Lockup/Freeze (Fetcher) - HELP!!)

2005-06-28 Thread Andrzej Bialecki

to create fetchlists based on a list of arbitrary URLs. This comes in handy if you want to test various parts of Nutch with arbitrary URLs, not coming from the DB. -- Best regards, Andrzej Bialecki

Re: PDFBox (Re: Nutch Lockup/Freeze (Fetcher) - HELP!!)

2005-06-28 Thread Andrzej Bialecki

of April. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration, http://www.sigram.com, Contact: info at sigram dot com

Re: [nutch 0.5] frames

2005-07-07 Thread Andrzej Bialecki

the frame contents. Please download the nightly snapshot and try it out. -- Best regards, Andrzej Bialecki

Re: Getting info about failed fetches (404, 500, HostNotFound, etc.)

2005-07-18 Thread Andrzej Bialecki

, translated error codes are recorded in segment data, and a subset of these translated codes is recorded in WebDB. -- Best regards, Andrzej Bialecki

Re: Speed up indexing?

2005-07-21 Thread Andrzej Bialecki

performance penalty. Some disk subsystems are good with burstable traffic (because of large cache) but quite bad with sustained traffic. -- Best regards, Andrzej Bialecki

Re: Cookies, etc.

2005-08-09 Thread Andrzej Bialecki

regards, Andrzej Bialecki

Re: Collapsing segments

2005-08-10 Thread Andrzej Bialecki

segments into one, and much more. -- Best regards, Andrzej Bialecki

Re: injecting outlinks?

2005-08-10 Thread Andrzej Bialecki

pathSuffix (may be empty) and contentType. -- Best regards, Andrzej Bialecki

VOTE: (Re: RSS Feed Parser)

2005-08-11 Thread Andrzej Bialecki

measure against this short testing period I would leave it disabled by default. Please vote +1 if I should commit it before the release, or -1 if after. -- Best regards, Andrzej Bialecki

Re: Nutch 0.7 released

2005-08-19 Thread Andrzej Bialecki

with -noParsing option. This way we should be able to eliminate problems related to parsing. -- Best regards, Andrzej Bialecki

Re: parser for xsl, ppt and zip

2005-08-31 Thread Andrzej Bialecki

occurs, and where occasional breakage may happen and may last even for longer time, and this is acceptable there. -- Best regards, Andrzej Bialecki

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki

/on in the config, if it's off, then the unknown content is skipped and logged, if it's on - then make the best effort to extract text. -- Best regards, Andrzej Bialecki

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki

. html parser may claim that it supports plaintext, but there is another plugin specifically for plaintext. Which of them wins? -- Best regards, Andrzej Bialecki

Re: Should type: and date: queries work with search.jsp?

2005-09-16 Thread Andrzej Bialecki

this, you need to re-index your segments. -- Best regards, Andrzej Bialecki

Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

2005-09-19 Thread Andrzej Bialecki

it first, or to use it as such. If there are no objections, I will change it in the trunk/ in a couple of days. -- Best regards, Andrzej Bialecki

Re: Is it possible to change the list of common words without crawling everything again

2005-09-20 Thread Andrzej Bialecki

the content, you just need to re-create segment indexes to reflect the changes. -- Best regards, Andrzej Bialecki

Re: Response content length is not known

2005-09-25 Thread Andrzej Bialecki

exceed available resources or limits. -- Best regards, Andrzej Bialecki

Re: fetch questions - freezing

2005-10-28 Thread Andrzej Bialecki

the -noParse flag to fetcher for all those experiments. In the past it was common for the fetcher to be stuck in a buggy parser plugin, so you will need to eliminate this factor. -- Best regards, Andrzej Bialecki

Re: Outlinks?

2005-11-07 Thread Andrzej Bialecki

regards, Andrzej Bialecki

Re: 20million documents, disk space size?

2005-11-14 Thread Andrzej Bialecki

Paul Harrison wrote: The 250GB is with cached pages. There is some dependency on your settings for maximum content size - if you allow content such as PDF, DOC, etc then the average disk space per page could increase to 20kB and more. -- Best regards, Andrzej Bialecki
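
The figures in this reply can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: the 20 million page count comes from the thread subject, the 250 GB and 20 kB figures from the message, and decimal units (1 GB = 10**9 bytes) are an assumption. The function names are made up for illustration.

```python
# Back-of-the-envelope disk estimate for a crawl.
# Assumption: decimal units, i.e. 1 kB = 1_000 bytes and 1 GB = 10**9 bytes.

def avg_bytes_per_page(total_bytes: float, pages: int) -> float:
    """Average stored size of one page, in bytes."""
    return total_bytes / pages


def total_gb(pages: int, kb_per_page: float) -> float:
    """Total disk space in GB for `pages` pages of `kb_per_page` kB each."""
    return pages * kb_per_page * 1_000 / 1e9


PAGES = 20_000_000  # from the thread subject

# Average kB per page if 20 million pages occupy 250 GB.
print(avg_bytes_per_page(250e9, PAGES) / 1_000)

# GB needed if the average grows to 20 kB per page.
print(total_gb(PAGES, 20))
```

Under these assumptions, 250 GB corresponds to about 12.5 kB per page, and an average of 20 kB per page would need about 400 GB.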

Re: Is NutchBean Class Thead/Process-Safe?

2005-11-14 Thread Andrzej Bialecki

Victor Lee wrote: How should I go around the problem? Don't use php-java bridge - use OpenSearch servlet to get RSS with results, and then parse RSS using PHP; the servlet container will cache NutchBean for you. -- Best regards, Andrzej Bialecki

Re: Nutch webapp not at root context.

2005-11-18 Thread Andrzej Bialecki

, and some broken links. -- Best regards, Andrzej Bialecki

Re: Images

2005-11-22 Thread Andrzej Bialecki

... -- Best regards, Andrzej Bialecki

Re: Adding Field to be Searched

2005-11-22 Thread Andrzej Bialecki

/apache/nutch/searcher and I imagine I'll eventually figure it out, but if someone could point me in the right direction, I'd appreciate it. You need a query plugin - please see e.g. query-host or query-more plugins. -- Best regards, Andrzej Bialecki

Re: Stats

2005-11-28 Thread Andrzej Bialecki

and LinkDB). Also luke is every-time a good tool to browse a lucene index. (Andrzej: it is really cool! :D I use it several times in the week) Thx :) There are some bugs there, of which I'm aware, but I'm waiting with the new release for the official Lucene release. -- Best regards, Andrzej

Re: Crawl auto updated in nutch?

2005-11-30 Thread Andrzej Bialecki

function? The crawl command is just for those who are too lazy to run all 4 steps by hand... ;-) There is nothing magical about this. Just follow the standard workflow: generate, fetch, updatedb, invertlinks, generate, fetch ... dedup index search -- Best regards, Andrzej Bialecki
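
The workflow named in this reply can be written down as a plain list of steps. The sketch below does not run Nutch and assumes nothing about command-line arguments; it only repeats the step names from the message. The split into "per round" and "closing" steps, and the helper name `crawl_plan`, are assumptions made for illustration, and the closing steps are kept in the order the message lists them.

```python
from typing import List

# Step names are taken verbatim from the message; nothing here invokes Nutch.
ROUND_STEPS = ["generate", "fetch", "updatedb", "invertlinks"]
# Closing steps, in the order the message lists them (assumed grouping).
FINAL_STEPS = ["dedup", "index", "search"]


def crawl_plan(rounds: int) -> List[str]:
    """Return the step names for `rounds` crawl rounds plus the closing steps."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    return ROUND_STEPS * rounds + FINAL_STEPS


for step in crawl_plan(2):
    print(step)
```

The exact arguments of each step depend on the Nutch version, so the tutorial for that version remains the reference.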

Re: Segment Slicer

2005-12-03 Thread Andrzej Bialecki

Matt Zytaruk wrote: Hi all, Just a quick question for you all. Is the segment slicer tool compatible with the map reduce version of nutch? Not yet. Any help is appreciated - it shouldn't be hard to do. Take a look at the CrawlDBReader or LinkDBReader. -- Best regards, Andrzej Bialecki

Re: Merging two sets of crawled data.

2005-12-06 Thread Andrzej Bialecki

nutch merge. If you expect that there are some duplicates, you will need to run dedup. That's all. -- Best regards, Andrzej Bialecki

Re: ATB: Merging two sets of crawled data.

2005-12-06 Thread Andrzej Bialecki

that inside each segment directory you have a per-segment index, because the nutch merge command will use them to create the master index. -- Best regards, Andrzej Bialecki

Re: Display on non-ASCII Characters in Search Results?

2005-12-06 Thread Andrzej Bialecki

the archives) is to change your Tomcat server.xml, and add useBodyEncodingForURI='true' in your Connector definition. And then consistently use UTF-8 in all JSPs. -- Best regards, Andrzej Bialecki
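
The server.xml change described here amounts to one extra attribute on each Connector element. The sketch below applies it to a tiny in-memory document; the sample XML is hypothetical (a real server.xml has many more elements and attributes), and the helper name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal stand-in for a Tomcat server.xml.
SAMPLE = (
    '<Server><Service name="Catalina">'
    '<Connector port="8080"/>'
    "</Service></Server>"
)


def enable_body_encoding_for_uri(server_xml: str) -> str:
    """Return the XML with useBodyEncodingForURI="true" set on every Connector."""
    root = ET.fromstring(server_xml)
    for connector in root.iter("Connector"):
        connector.set("useBodyEncodingForURI", "true")
    return ET.tostring(root, encoding="unicode")


print(enable_body_encoding_for_uri(SAMPLE))
```

In practice the attribute is usually added by hand. The JSP side (using UTF-8 consistently) is a separate step that this sketch does not cover.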

Re: NDFS problem on mapred branch

2005-12-07 Thread Andrzej Bialecki

/bin$ ./nutch ndfs -put nutch nutch Could you try the same, but using absolute paths? NDFS client has no notion of relative or current directory, so the file names must always be absolute, i.e. starting with the leading / . -- Best regards, Andrzej Bialecki

Re: Luke and Indexes

2005-12-08 Thread Andrzej Bialecki

. Most likely you encountered either protocol errors or parsing errors, so there was nothing to index from these entries. In addition, if you ran the deduplication, some of the entries in your index may have been deleted because they were considered duplicates. -- Best regards, Andrzej Bialecki

Re: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Andrzej Bialecki

to perform in order to find all occurrences of these words when processing a phrase query. -- Best regards, Andrzej Bialecki

Re: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Andrzej Bialecki

the estimated total hits, and also the first couple of hits. -- Best regards, Andrzej Bialecki

Re: Nutch freezing on fetch

2005-12-30 Thread Andrzej Bialecki

investigating? Do you use the parse-pdf plugin? Please do a thread dump of the stuck process (Ctrl-E, if I'm not mistaken). -- Best regards, Andrzej Bialecki

Re: Nutch freezing on fetch

2005-12-30 Thread Andrzej Bialecki

that are not fetched won't be processed at all, so those Pages in WebDB won't get updated and you will have to wait another week (or use -adddays). -- Best regards, Andrzej Bialecki

Re: upgrade to version 0.8

2006-01-04 Thread Andrzej Bialecki

documented. But remember that the 0.7 branch is now in the maintenance mode, so no new features and only small bugfixes will show up there; most of the effort goes now to development of the 0.8 (trunk) branch. -- Best regards, Andrzej Bialecki

Re: Dedup - works on single file

2006-01-06 Thread Andrzej Bialecki

K.A.Hussain Ali wrote: Hi all, Does delete duplicates (dedup) work on a single segment? Dedup works on multiple indexes. Please see the source of Crawl.main() for an example of its use. -- Best regards, Andrzej Bialecki

Re: url outlink problem

2006-01-08 Thread Andrzej Bialecki

outlinks point to the correct page Is it for the reason that the site has to have a base URL value Yes. It's enough to add this somewhere in the head element of the HTML. -- Best regards, Andrzej Bialecki

Re: fresh fedora core4 install tomcat5 nutch .7.0.1 error

2006-01-10 Thread Andrzej Bialecki

(yet?). Please use the Sun JVM. -- Best regards, Andrzej Bialecki

Re: other newbies like me

2006-01-11 Thread Andrzej Bialecki

. There's only a couple things that are missing, everything else should already be context-independent. -- Best regards, Andrzej Bialecki

Re: Improving Nutch throughput w/MapReduce

2006-01-15 Thread Andrzej Bialecki

of a JobSubmissionProtocol - but I think there is no way now for the arbitrary code to reference its JobClient.. bummer. Some food for thought, anyway. -- Best regards, Andrzej Bialecki

Re: ATB: filtering content/results

2006-01-16 Thread Andrzej Bialecki

for health-related information? -- Best regards, Andrzej Bialecki

Re: is it safe to inject into fetchlist directly?

2006-01-16 Thread Andrzej Bialecki

will be injected anyway when you update the DB. Some time ago I added a tool (in JIRA) to create such fetchlists, it works with 0.7. -- Best regards, Andrzej Bialecki

Re: Common Lucene Queries for PruneIndexTool -- GROUPS of files or folders

2006-01-16 Thread Andrzej Bialecki

, checking them, etc. -- Best regards, Andrzej Bialecki

Re: throttling bandwidth

2006-01-16 Thread Andrzej Bialecki

, but there are similar Linux solutions, or commercial routers with built-in traffic shaping. I think that you could also play some tricks with a bandwidth-limiting proxy server, because protocol-httpclient can use a proxy. -- Best regards, Andrzej Bialecki

Re: throttling bandwidth

2006-01-17 Thread Andrzej Bialecki

need no stinking TCP, we route good ol' IP ;) Please check your facts before claiming something about all ISPs around the world. -- Best regards, Andrzej Bialecki

Re: Running crawl on nutch 0.8

2006-01-18 Thread Andrzej Bialecki

newline at the end of the urllist.txt. -- Best regards, Andrzej Bialecki

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Andrzej Bialecki

, second, or both? I suspect only the second change was really needed, i.e. the change in config files, and not the change of protocol-httpclient - protocol-http ... It would be very helpful if you could confirm/deny this. -- Best regards, Andrzej Bialecki

Re: How do we get the last modified date in a file

2006-01-27 Thread Andrzej Bialecki

Is this possible ?? Not yet. This will be added soon to 0.8 (trunk). -- Best regards, Andrzej Bialecki

Re: Search setup

2006-01-30 Thread Andrzej Bialecki

, and segments. -- Best regards, Andrzej Bialecki

Re: Problems with MapRed-

2006-02-01 Thread Andrzej Bialecki

you perhaps check what is the exception (if any) from the JS parser when it's failing? It could be emitted into one of the tasktracker logs. -- Best regards, Andrzej Bialecki

Re: crawl fetch interval doubt

2006-02-02 Thread Andrzej Bialecki

is a float value, in seconds. -- Best regards, Andrzej Bialecki

Re: crawl fetch interval doubt

2006-02-02 Thread Andrzej Bialecki

. -- Best regards, Andrzej Bialecki

Re: crawler

2006-02-03 Thread Andrzej Bialecki

most of the time... ;-) -- Best regards, Andrzej Bialecki

Re: Categorizing content

2006-02-07 Thread Andrzej Bialecki

as spam. This is the purpose of the CrawlDatum metadata patch... coming soon, I hope :-) -- Best regards, Andrzej Bialecki

Re: writing modified date in crawl datum

2006-02-14 Thread Andrzej Bialecki

in the CrawlDatum metadata. I'm working on this patch, I'll update it soon on JIRA. -- Best regards, Andrzej Bialecki

Re: writing modified date in crawl datum

2006-02-14 Thread Andrzej Bialecki

in that patch. -- Best regards, Andrzej Bialecki

Re: HTTPS Protocol Implementation

2006-02-14 Thread Andrzej Bialecki

Vanderdray, Jacob wrote: Is there an HTTPS protocol implementation for nutch? Yes, protocol-httpclient supports https. -- Best regards, Andrzej Bialecki

Re: Removing URLs from Web DB

2006-02-18 Thread Andrzej Bialecki

, i.e. don't collect the result. That's all. :-) -- Best regards, Andrzej Bialecki

Re: swf -tilte

2006-02-20 Thread Andrzej Bialecki

it otherwise I'll be happy to correct it. -- Best regards, Andrzej Bialecki

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Andrzej Bialecki

Elwin wrote: Yes, it's true, although it's not the cause of my problem. Did you try to use the alternative HTML parser (TagSoup) supported by the plugin? You need to set a property parser.html.impl to tagsoup. -- Best regards, Andrzej Bialecki
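
Setting the property mentioned here means adding one name/value entry to the Nutch configuration. The sketch below only renders that entry as XML. The property name and value come from the message; the helper name is made up, and where the entry goes (typically a site-specific override file such as nutch-site.xml) is an assumption to check against your version.

```python
import xml.etree.ElementTree as ET


def nutch_property(name: str, value: str) -> str:
    """Render one <property> entry with <name> and <value> children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Property name and value as given in the message.
print(nutch_property("parser.html.impl", "tagsoup"))
```

The rendered entry still has to be placed inside the root element of the configuration file by hand.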

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Andrzej Bialecki

on truckin'. This is especially true for pages with multiple html elements, where Neko ignores all elements but the first one, while TagSoup just treats any html elements inside a document like any other nested element. -- Best regards, Andrzej Bialecki

CBIR (Re: Jpeg and Exif Plugin)

2006-03-03 Thread Andrzej Bialecki

a suitable front-end. -- Best regards, Andrzej Bialecki

Re: find duplicate urls in webdb

2006-03-06 Thread Andrzej Bialecki

carefully, most probably they differ only in a single character, or a whitespace. -- Best regards, Andrzej Bialecki

Re: HTTPS support?

2006-03-06 Thread Andrzej Bialecki

David Odmark wrote: Hi, Does Nutch 0.8 support https fetches? If not, are there any active efforts to support it? It does, using protocol-httpclient plugin. -- Best regards, Andrzej Bialecki

Re: Help with bin/nutch server 8081 crawl

2006-03-07 Thread Andrzej Bialecki

that implement Configurable? Perhaps it should, using the current JobConf. -- Best regards, Andrzej Bialecki

Re: Problem running Nutch Mapred after applying patch for Adaptive refetch

2006-03-07 Thread Andrzej Bialecki

), which removes obsolete versions of pages from indexes. Pages are still present in segments until you delete old segments, but they won't appear in searchable index. -- Best regards, Andrzej Bialecki

Re: Help with bin/nutch server 8081 crawl

2006-03-07 Thread Andrzej Bialecki

will apply this fix too. -- Best regards, Andrzej Bialecki

Re: still not so clear to me

2006-03-07 Thread Andrzej Bialecki

a fetch list first based on the seed urls, then on the links found on that page (for each subsequent iteration), then on the links on those pages, and so forth and so on until the entire domain is crawled, if you limit the domains with a filter. Yes. -- Best regards, Andrzej Bialecki

Re: retry later

2006-03-08 Thread Andrzej Bialecki

refetch anyway, and if it doesn't succeed we just increase the interval by 50%. Now, fixing this the same way in 0.7 would mean that pages no longer end up in PAGE_GONE state. Is this a fix of broken behavior or a new behavior (new feature)? I'm not sure... -- Best regards, Andrzej Bialecki
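
The "increase the interval by 50%" rule can be illustrated with a short calculation. This is only a sketch of the arithmetic, not the Nutch implementation: the function name, the 30-day starting interval, and the assumption that the increase compounds on every failed refetch are all hypothetical.

```python
def interval_after_failures(initial: float, failures: int, growth: float = 0.5) -> float:
    """Interval after `failures` unsuccessful refetches, growing by `growth` each time."""
    if failures < 0:
        raise ValueError("failures must not be negative")
    return initial * (1 + growth) ** failures


# Hypothetical starting interval of 30 days.
for n in range(4):
    print(n, interval_after_failures(30.0, n))
```

Under these assumptions the interval grows geometrically, so a page that keeps failing is retried less and less often instead of being dropped.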

Re: Adaptive Refetching

2006-03-08 Thread Andrzej Bialecki

me know if my inferences are correct and sorry for a bigger mail. No problem with the size. Yes, your conclusions seem correct. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetching

2006-03-09 Thread Andrzej Bialecki

Doug Cutting wrote: Andrzej Bialecki wrote: Doug Cutting wrote: are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only

Re: try to parse pdf

2006-03-13 Thread Andrzej Bialecki

properties in any place except the currently running process. All properties are read anew from the config files whenever you start any nutch processing. -- Best regards, Andrzej Bialecki

Re: Distributed Search - config issue?

2006-03-17 Thread Andrzej Bialecki

... It is not wise to put IP addresses in your emails. Agreed. -- Best regards, Andrzej Bialecki

Re: Can small segments be combined?

2006-03-20 Thread Andrzej Bialecki

me to it ... -- Best regards, Andrzej Bialecki

Re: Adaptive fetch schedule

2006-03-22 Thread Andrzej Bialecki

is also good. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Andrzej Bialecki

located? Apparently Nutch doesn't find one of the input directories, so it's either not there, or the config is wrong, but without more information it's impossible to tell. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-23 Thread Andrzej Bialecki

/GettingNutchRunningWithWindows). When using Open Source software you should be prepared to do some basic research on your own. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-23 Thread Andrzej Bialecki

regards, Andrzej Bialecki

Re: fetching https pages

2006-03-27 Thread Andrzej Bialecki

- and then you should remove protocol-http from your config. -- Best regards, Andrzej Bialecki

Re: error help

2006-03-28 Thread Andrzej Bialecki

) at net.nutch.db.WebDBWriter.createWebDB (WebDBWriter.java:1425) at net.nutch.tools.WebDBAdminTool.main (WebDBAdminTool.java:159) You are using incompatible GNU Java. Either upgrade your GCC/GCJ to 4.x.x, or use Sun Java. Besides, Nutch 0.6 is ancient history, you should use 0.7.1 (or 0.7.2). -- Best regards, Andrzej

Re: problem with starting injection...

2006-03-28 Thread Andrzej Bialecki

if it's an option, or revert to revision 388299 . -- Best regards, Andrzej Bialecki

Re: adaptive fetch

2006-03-28 Thread Andrzej Bialecki

to function in the above manner right. Did i miss out anything??? Yes, this is how it's supposed to work. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-03-30 Thread Andrzej Bialecki

in CrawlDbReducer.java:86 in both versions), instead it should be initialized with the value from old.getFetchInterval(), if available. Please fix this in your version, I'll fix this in the un-patched version. Thanks for spotting this! -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-03-30 Thread Andrzej Bialecki

Andrzej Bialecki wrote: Mehmet Tan wrote: Hi, I want to ask a question about redirections. Correct me if I'm wrong but if a page is redirected to a page that is already in the webdb, then the next updatedb operation will overwrite all previous info about refetch, because it is a newly

Re: Adaptive fetch

2006-03-31 Thread Andrzej Bialecki

a feeling that not too many people really reviewed this patch. So, IMHO these patches need more testing, because the potential for disruption is rather large. -- Best regards, Andrzej Bialecki

Re: Adaptive fetch

2006-03-31 Thread Andrzej Bialecki

bring this patch up to date over the weekend. -- Best regards, Andrzej Bialecki

Re: hi all

2006-04-02 Thread Andrzej Bialecki

regards, Andrzej Bialecki

Re: thanks, but what I wanted to do is to merge segments from multiple crawls

2006-04-03 Thread Andrzej Bialecki

on my TODO list, but with a low priority. -- Best regards, Andrzej Bialecki

Re: more questions on this - please advice

2006-04-03 Thread Andrzej Bialecki

is not compatible with 0.7. With (significant) effort suitable converters could be made, but it would be way less expensive to just bite the bullet and implement missing functionality in 0.8. -- Best regards, Andrzej Bialecki

Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)

2006-04-04 Thread Andrzej Bialecki

). This is technically possible, but simply not implemented (yet). -- Best regards, Andrzej Bialecki

Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)

2006-04-04 Thread Andrzej Bialecki

, merging indexes) are already supported if you use individual command-line tools and a single DB. So, I'm not planning to do anything about it. -- Best regards, Andrzej Bialecki

Re: generate failes - class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper

2006-04-05 Thread Andrzej Bialecki

Byron Miller wrote: Got the following dump at 100% of generate cycle (.8 svn release) Just fixed this. Sorry. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-04-05 Thread Andrzej Bialecki

if they come from redirection or directly from the outlinks. If you make an exception for such urls, next time you generate a fetchlist or updatedb these urls will be filtered out anyway. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-04-05 Thread Andrzej Bialecki

regards, Andrzej Bialecki

Re: please help!! inverlinks not work properly with more than 5 input parts (0.8)

2006-04-06 Thread Andrzej Bialecki

in the tutorial (http://wiki.apache.org/nutch/NutchTutorial). Please follow the tutorial where it says about Step-by-Step or Whole-web Crawling - you will save yourself (and us) a lot of grief. -- Best regards, Andrzej Bialecki

Re: latest build throws error - critical

2006-04-06 Thread Andrzej Bialecki

, Andrzej Bialecki Index: src/java/org/apache/nutch/crawl

Re: fetching stuck in the middle of processing

2006-04-08 Thread Andrzej Bialecki

as unlimited <name>http.content.limit</name> <value>-1</value>. Will that be the reason? These particular problems happen, among other cases, when you run out of disk space - please check that you have enough disk space, also on your /tmp partition. -- Best regards, Andrzej Bialecki
