PDFBox (Re: Nutch Lockup/Freeze (Fetcher) - HELP!!)

2005-06-28 Thread Andrzej Bialecki
to create fetchlists based on a list of arbitrary URLs. This comes in handy if you want to test various parts of Nutch with arbitrary URLs that do not come from the DB. -- Best regards, Andrzej Bialecki

Re: PDFBox (Re: Nutch Lockup/Freeze (Fetcher) - HELP!!)

2005-06-28 Thread Andrzej Bialecki
of April. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

Re: [nutch 0.5] frames

2005-07-07 Thread Andrzej Bialecki
the frame contents. Please download the nightly snapshot and try it out. -- Best regards, Andrzej Bialecki

Re: Getting info about failed fetches (404, 500, HostNotFound, etc.)

2005-07-18 Thread Andrzej Bialecki
, translated error codes are recorded in segment data, and a subset of these translated codes is recorded in WebDB. -- Best regards, Andrzej Bialecki

Re: Speed up indexing?

2005-07-21 Thread Andrzej Bialecki
performance penalty. Some disk subsystems are good with burstable traffic (because of large cache) but quite bad with sustained traffic. -- Best regards, Andrzej Bialecki

Re: Cookies, etc.

2005-08-09 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: Collapsing segments

2005-08-10 Thread Andrzej Bialecki
segments into one, and much more. -- Best regards, Andrzej Bialecki

Re: injecting outlinks?

2005-08-10 Thread Andrzej Bialecki
pathSuffix (may be empty) and contentType. -- Best regards, Andrzej Bialecki

VOTE: (Re: RSS Feed Parser)

2005-08-11 Thread Andrzej Bialecki
measure against this short testing period I would leave it disabled by default. Please vote +1 if I should commit it before the release, or -1 if after. -- Best regards, Andrzej Bialecki

Re: Nutch 0.7 released

2005-08-19 Thread Andrzej Bialecki
with -noParsing option. This way we should be able to eliminate problems related to parsing. -- Best regards, Andrzej Bialecki

Re: parser for xsl, ppt and zip

2005-08-31 Thread Andrzej Bialecki
occurs, and where occasional breakage may happen and may even last for a longer time, and this is acceptable there. -- Best regards, Andrzej Bialecki

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki
/on in the config, if it's off, then the unknown content is skipped and logged, if it's on - then make the best effort to extract text. -- Best regards, Andrzej Bialecki

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki
. html parser may claim that it supports plaintext, but there is another plugin specifically for plaintext. Which of them wins? -- Best regards, Andrzej Bialecki

Re: Should type: and date: queries work with search.jsp?

2005-09-16 Thread Andrzej Bialecki
this, you need to re-index your segments. -- Best regards, Andrzej Bialecki

Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

2005-09-19 Thread Andrzej Bialecki
it first, or to use it as such. If there are no objections, I will change it in the trunk/ in a couple of days. -- Best regards, Andrzej Bialecki

Re: Is it possible to change the list of common words without crawling everything again

2005-09-20 Thread Andrzej Bialecki
the content, you just need to re-create segment indexes to reflect the changes. -- Best regards, Andrzej Bialecki

Re: Response content length is not known

2005-09-25 Thread Andrzej Bialecki
exceed available resources or limits. -- Best regards, Andrzej Bialecki

Re: Response content length is not known

2005-09-25 Thread Andrzej Bialecki
exceed available resources or limits. -- Best regards, Andrzej Bialecki

Re: fetch questions - freezing

2005-10-28 Thread Andrzej Bialecki
the -noParse flag to fetcher for all those experiments. In the past it was common for the fetcher to be stuck in a buggy parser plugin, so you will need to eliminate this factor. -- Best regards, Andrzej Bialecki

Re: Outlinks?

2005-11-07 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: 20million documents, disk space size?

2005-11-14 Thread Andrzej Bialecki
Paul Harrison wrote: The 250GB is with cached pages. There is some dependency on your settings for maximum content size - if you allow content such as PDF, DOC, etc then the average disk space per page could increase to 20kB and more. -- Best regards, Andrzej Bialecki
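The disk-space reasoning in this reply can be checked with simple arithmetic. The sketch below is only an illustration: the 20 million page count is taken from the thread subject, kB is assumed to mean 1,000 bytes, and the function name is hypothetical rather than anything from Nutch.

```python
def estimate_disk_bytes(num_pages: int, avg_bytes_per_page: int) -> int:
    """Rough total storage: number of pages times average bytes per page."""
    return num_pages * avg_bytes_per_page


# Assumption: 20 million pages (from the thread subject), 20 kB = 20,000 bytes.
total = estimate_disk_bytes(20_000_000, 20_000)
print(total / 1e9, "GB")  # total expressed in decimal gigabytes

# Going the other way: 250 GB spread over the same 20 million pages.
print(250e9 / 20_000_000, "bytes per page on average")
```

Under these assumptions, raising the average page size to 20 kB pushes the total well above the 250 GB quoted in the question.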

Re: Is NutchBean Class Thead/Process-Safe?

2005-11-14 Thread Andrzej Bialecki
Victor Lee wrote: How should I go around the problem? Don't use php-java bridge - use OpenSearch servlet to get RSS with results, and then parse RSS using PHP; the servlet container will cache NutchBean for you. -- Best regards, Andrzej Bialecki
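The suggested approach (fetch search results as RSS, then parse the RSS on the client side) might look roughly like the sketch below. It uses Python in place of PHP, and the sample feed, its element contents, and the function name are made up for illustration; the real servlet output may contain additional elements and namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of an RSS 2.0 result feed; not actual servlet output.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>search results (sample)</title>
    <item><title>First hit</title><link>http://example.com/a</link></item>
    <item><title>Second hit</title><link>http://example.com/b</link></item>
  </channel>
</rss>"""


def parse_rss_items(xml_text: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


for title, link in parse_rss_items(SAMPLE_RSS):
    print(title, link)
```

In a real client the XML text would come from an HTTP request to the servlet instead of a string constant.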

Re: Nutch webapp not at root context.

2005-11-18 Thread Andrzej Bialecki
, and some broken links. -- Best regards, Andrzej Bialecki

Re: Images

2005-11-22 Thread Andrzej Bialecki
... -- Best regards, Andrzej Bialecki

Re: Adding Field to be Searched

2005-11-22 Thread Andrzej Bialecki
/apache/nutch/searcher and I imagine I'll eventually figure it out, but if someone could point me in the right direction, I'd appreciate it. You need a query plugin - please see e.g. query-host or query-more plugins. -- Best regards, Andrzej Bialecki

Re: Stats

2005-11-28 Thread Andrzej Bialecki
and LinkDB). Also luke is every-time a good tool to browse a lucene index. (Andrzej: it is really cool! :D I use it several times in the week) Thx :) There are some bugs there, of which I'm aware, but I'm waiting with the new release for the official Lucene release. -- Best regards, Andrzej

Re: Crawl auto updated in nutch?

2005-11-30 Thread Andrzej Bialecki
function? The crawl command is just for those who are too lazy to run all 4 steps by hand... ;-) There is nothing magical about this. Just follow the standard workflow: generate, fetch, updatedb, invertlinks, generate, fetch ... dedup index search -- Best regards, Andrzej Bialecki
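The step-by-step workflow named in this reply can be written down as a simple sequence. The sketch below only lists step names in the order the message gives them; it does not run any commands, and the function name and the idea of a fixed number of rounds are assumptions made for illustration.

```python
def crawl_steps(rounds: int) -> list[str]:
    """List workflow step names: a repeated crawl cycle, then dedup and index."""
    steps: list[str] = []
    for _ in range(rounds):
        # One crawl cycle, in the order given in the message.
        steps += ["generate", "fetch", "updatedb", "invertlinks"]
    # Final steps run once, after the last cycle.
    steps += ["dedup", "index"]
    return steps


print(" -> ".join(crawl_steps(2)))
```

Each name would correspond to one invocation of the matching command-line tool, with arguments that depend on the Nutch version in use.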

Re: Segment Slicer

2005-12-03 Thread Andrzej Bialecki
Matt Zytaruk wrote: Hi all, Just a quick question for you all. Is the segment slicer tool compatible with the map reduce version of nutch? Not yet. Any help is appreciated - it should be hard to do. Take a look at the CrawlDBReader or LinkDBReader. -- Best regards, Andrzej Bialecki

Re: Merging two sets of crawled data.

2005-12-06 Thread Andrzej Bialecki
nutch merge. If you expect that there are some duplicates, you will need to run dedup. That's all. -- Best regards, Andrzej Bialecki

Re: ATB: Merging two sets of crawled data.

2005-12-06 Thread Andrzej Bialecki
that inside each segment directory you have a per-segment index, because the nutch merge command will use them to create the master index. -- Best regards, Andrzej Bialecki

Re: Display on non-ASCII Characters in Search Results?

2005-12-06 Thread Andrzej Bialecki
the archives) is to change your Tomcat server.xml, and add useBodyEncodingForURI='true' in your Connector definition. And then consistently use UTF-8 in all JSPs. -- Best regards, Andrzej Bialecki

Re: NDFS problem on mapred branch

2005-12-07 Thread Andrzej Bialecki
/bin$ ./nutch ndfs -put nutch nutch Could you try the same, but using absolute paths? NDFS client has no notion of relative or current directory, so the file names must always be absolute, i.e. starting with the leading / . -- Best regards, Andrzej Bialecki
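The rule stated here (file names must start with a leading /) is easy to check before calling the client. The helper below is a hypothetical pre-flight check, not part of Nutch or NDFS, and the example paths are invented.

```python
def is_absolute_ndfs_path(path: str) -> bool:
    """True if the path starts with a leading slash, as the reply requires."""
    return path.startswith("/")


def relative_paths(paths: list[str]) -> list[str]:
    """Return the arguments that would be rejected as not absolute."""
    return [p for p in paths if not is_absolute_ndfs_path(p)]


# The two "nutch" arguments from the quoted command are both relative.
print(relative_paths(["nutch", "nutch"]))
print(relative_paths(["/home/user/nutch", "/nutch"]))
```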

Re: Luke and Indexes

2005-12-08 Thread Andrzej Bialecki
. Most likely you encountered either protocol errors or parsing errors, so there was nothing to index from these entries. In addition, if you ran the deduplication, some of the entries in your index may have been deleted because they were considered duplicates. -- Best regards, Andrzej Bialecki

Re: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Andrzej Bialecki
to perform in order to find all occurrences of these words when processing a phrase query. -- Best regards, Andrzej Bialecki

Re: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Andrzej Bialecki
the estimated total hits, and also the first couple of hits. -- Best regards, Andrzej Bialecki

Re: Nutch freezing on fetch

2005-12-30 Thread Andrzej Bialecki
investigating? Do you use the parse-pdf plugin? Please do a thread dump of the stuck process (Ctrl-E, if I'm not mistaken). -- Best regards, Andrzej Bialecki

Re: Nutch freezing on fetch

2005-12-30 Thread Andrzej Bialecki
that are not fetched won't be processed at all, so those Pages in WebDB won't get updated and you will have to wait another week (or use -adddays). -- Best regards, Andrzej Bialecki

Re: upgrade to version 0.8

2006-01-04 Thread Andrzej Bialecki
documented. But remember that the 0.7 branch is now in the maintenance mode, so no new features and only small bugfixes will show up there; most of the effort goes now to development of the 0.8 (trunk) branch. -- Best regards, Andrzej Bialecki

Re: Dedup - works on single file

2006-01-06 Thread Andrzej Bialecki
K.A.Hussain Ali wrote: Hi all, does delete duplicates (dedup) work on a single segment? Dedup works on multiple indexes. Please see the source of Crawl.main() for an example of its use. -- Best regards, Andrzej Bialecki

Re: url outlink problem

2006-01-08 Thread Andrzej Bialecki
outlinks point to the correct page Is it for the reason that the site has to have a base URL value Yes. It's enough to add this somewhere in the head element of the HTML. -- Best regards, Andrzej Bialecki

Re: fresh fedora core4 install tomcat5 nutch .7.0.1 error

2006-01-10 Thread Andrzej Bialecki
(yet?). Please use the Sun JVM. -- Best regards, Andrzej Bialecki

Re: other newbies like me

2006-01-11 Thread Andrzej Bialecki
. There are only a couple of things missing; everything else should already be context-independent. -- Best regards, Andrzej Bialecki

Re: Improving Nutch throughput w/MapReduce

2006-01-15 Thread Andrzej Bialecki
of a JobSubmissionProtocol - but I think there is no way now for the arbitrary code to reference its JobClient.. bummer. Some food for thought, anyway. -- Best regards, Andrzej Bialecki

Re: ATB: filtering content/results

2006-01-16 Thread Andrzej Bialecki
for health-related information? -- Best regards, Andrzej Bialecki

Re: is it safe to inject into fetchlist directly?

2006-01-16 Thread Andrzej Bialecki
will be injected anyway when you update the DB. Some time ago I added a tool (in JIRA) to create such fetchlists, it works with 0.7. -- Best regards, Andrzej Bialecki

Re: Common Lucene Queries for PruneIndexTool -- GROUPS of files or folders

2006-01-16 Thread Andrzej Bialecki
, checking them, etc. -- Best regards, Andrzej Bialecki

Re: throttling bandwidth

2006-01-16 Thread Andrzej Bialecki
, but there are similar Linux solutions, or commercial routers with built-in traffic shaping. I think that you could also play some tricks with a bandwidth-limiting proxy server, because protocol-httpclient can use a proxy. -- Best regards, Andrzej Bialecki

Re: throttling bandwidth

2006-01-17 Thread Andrzej Bialecki
need no stinking TCP, we route good ol' IP ;) Please check your facts before claiming something about all ISPs around the world. -- Best regards, Andrzej Bialecki

Re: Running crawl on nutch 0.8

2006-01-18 Thread Andrzej Bialecki
newline at the end of the urllist.txt. -- Best regards, Andrzej Bialecki

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Andrzej Bialecki
, second, or both? I suspect only the second change was really needed, i.e. the change in config files, and not the change of protocol-httpclient -> protocol-http ... It would be very helpful if you could confirm/deny this. -- Best regards, Andrzej Bialecki

Re: How do we get the last modified date in a file

2006-01-27 Thread Andrzej Bialecki
Is this possible ?? Not yet. This will be added soon to 0.8 (trunk). -- Best regards, Andrzej Bialecki

Re: Search setup

2006-01-30 Thread Andrzej Bialecki
, and segments. -- Best regards, Andrzej Bialecki

Re: Problems with MapRed-

2006-02-01 Thread Andrzej Bialecki
you perhaps check what is the exception (if any) from the JS parser when it's failing? It could be emitted into one of the tasktracker logs. -- Best regards, Andrzej Bialecki

Re: crawl fetch interval doubt

2006-02-02 Thread Andrzej Bialecki
is a float value, in seconds. -- Best regards, Andrzej Bialecki

Re: crawl fetch interval doubt

2006-02-02 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: crawler

2006-02-03 Thread Andrzej Bialecki
most of the time... ;-) -- Best regards, Andrzej Bialecki

Re: Categorizing content

2006-02-07 Thread Andrzej Bialecki
as spam. This is the purpose of the CrawlDatum metadata patch... coming soon, I hope :-) -- Best regards, Andrzej Bialecki

Re: writing modified date in crawl datum

2006-02-14 Thread Andrzej Bialecki
in the CrawlDatum metadata. I'm working on this patch, I'll update it soon on JIRA. -- Best regards, Andrzej Bialecki

Re: writing modified date in crawl datum

2006-02-14 Thread Andrzej Bialecki
in that patch. -- Best regards, Andrzej Bialecki

Re: HTTPS Protocol Implementation

2006-02-14 Thread Andrzej Bialecki
Vanderdray, Jacob wrote: Is there an HTTPS protocol implementation for nutch? Yes, protocol-httpclient supports https. -- Best regards, Andrzej Bialecki

Re: Removing URLs from Web DB

2006-02-18 Thread Andrzej Bialecki
, i.e. don't collect the result. That's all. :-) -- Best regards, Andrzej Bialecki

Re: swf -tilte

2006-02-20 Thread Andrzej Bialecki
it otherwise I'll be happy to correct it. -- Best regards, Andrzej Bialecki

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Andrzej Bialecki
Elwin wrote: Yes, it's true, although it's not the cause of my problem. Did you try to use the alternative HTML parser (TagSoup) supported by the plugin? You need to set a property parser.html.impl to tagsoup. -- Best regards, Andrzej Bialecki

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Andrzej Bialecki
on truckin'. This is especially true for pages with multiple html elements, where Neko ignores all elements but the first one, while TagSoup just treats any html elements inside a document like any other nested element. -- Best regards, Andrzej Bialecki

CBIR (Re: Jpeg and Exif Plugin)

2006-03-03 Thread Andrzej Bialecki
a suitable front-end. -- Best regards, Andrzej Bialecki

Re: find duplicate urls in webdb

2006-03-06 Thread Andrzej Bialecki
carefully, most probably they differ only in a single character, or a whitespace. -- Best regards, Andrzej Bialecki
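URLs that look identical but differ by one character or by stray whitespace can be hard to spot by eye. The sketch below is one way to check; both helper names and the example URLs are hypothetical and not part of Nutch.

```python
from collections import defaultdict
from typing import Optional


def first_difference(a: str, b: str) -> Optional[int]:
    """Index of the first differing character, or None if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a prefix of the other (or they are identical).
    return None if len(a) == len(b) else min(len(a), len(b))


def whitespace_variants(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs that become identical once surrounding whitespace is stripped."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[url.strip()].append(url)
    return {key: vals for key, vals in groups.items() if len(set(vals)) > 1}


urls = ["http://example.com/x", "http://example.com/x ", "http://example.com/y"]
print(whitespace_variants(urls))
print(first_difference(urls[0], urls[2]))
# repr() makes trailing whitespace visible.
print([repr(u) for u in urls])
```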

Re: HTTPS support?

2006-03-06 Thread Andrzej Bialecki
David Odmark wrote: Hi, Does Nutch 0.8 support https fetches? If not, are there any active efforts to support it? It does, using protocol-httpclient plugin. -- Best regards, Andrzej Bialecki

Re: Help with bin/nutch server 8081 crawl

2006-03-07 Thread Andrzej Bialecki
that implement Configurable? Perhaps it should, using the current JobConf. -- Best regards, Andrzej Bialecki

Re: Problem running Nutch Mapred after applying patch for Adaptive refetch

2006-03-07 Thread Andrzej Bialecki
), which removes obsolete versions of pages from indexes. Pages are still present in segments until you delete old segments, but they won't appear in searchable index. -- Best regards, Andrzej Bialecki

Re: Help with bin/nutch server 8081 crawl

2006-03-07 Thread Andrzej Bialecki
will apply this fix too. -- Best regards, Andrzej Bialecki

Re: still not so clear to me

2006-03-07 Thread Andrzej Bialecki
a fetch list first based on the seed urls, then on the links found on that page (for each subsequent iteration), then on the links on those pages, and so forth and so on until the entire domain is crawled, if you limit the domains with a filter. Yes. -- Best regards, Andrzej Bialecki

Re: retry later

2006-03-08 Thread Andrzej Bialecki
refetch anyway, and if it doesn't succeed we just increase the interval by 50%. Now, fixing this the same way in 0.7 would mean that pages no longer end up in PAGE_GONE state. Is this a fix of broken behavior or a new behavior (new feature)? I'm not sure... -- Best regards, Andrzej Bialecki
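The back-off rule described in this reply (increase the interval by 50% when a refetch does not succeed) can be sketched as follows. The function name and the choice to leave the interval unchanged on success are assumptions for illustration, not the actual Nutch code.

```python
def next_fetch_interval(current_seconds: float, fetch_succeeded: bool,
                        growth: float = 1.5) -> float:
    """Return the next interval: unchanged on success, 50% longer on failure."""
    if fetch_succeeded:
        return current_seconds
    return current_seconds * growth


# Three failed refetches in a row, starting from a hypothetical 100-second interval.
interval = 100.0
for _ in range(3):
    interval = next_fetch_interval(interval, fetch_succeeded=False)
    print(interval)
```

With this rule a page that keeps failing is retried less and less often instead of being marked as gone.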

Re: Adaptive Refetching

2006-03-08 Thread Andrzej Bialecki
me know if my inferences are correct and sorry for a bigger mail. No problem with the size. Yes, your conclusions seem correct. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetching

2006-03-09 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Doug Cutting wrote: are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only

Re: try to parse pdf

2006-03-13 Thread Andrzej Bialecki
properties in any place except the currently running process. All properties are read anew from the config files whenever you start any nutch processing. -- Best regards, Andrzej Bialecki

Re: Distributed Search - config issue?

2006-03-17 Thread Andrzej Bialecki
... It is not wise to put IP addresses in your emails. Agreed. -- Best regards, Andrzej Bialecki

Re: Can small segments be combined?

2006-03-20 Thread Andrzej Bialecki
me to it ... -- Best regards, Andrzej Bialecki

Re: Adaptive fetch schedule

2006-03-22 Thread Andrzej Bialecki
is also good. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Andrzej Bialecki
located? Apparently Nutch doesn't find one of the input directories, so it's either not there, or the config is wrong, but without more information it's impossible to tell. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-23 Thread Andrzej Bialecki
/GettingNutchRunningWithWindows). When using Open Source software you should be prepared to do some basic research on your own. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-23 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: fetching https pages

2006-03-27 Thread Andrzej Bialecki
- and then you should remove protocol-http from your config. -- Best regards, Andrzej Bialecki

Re: error help

2006-03-28 Thread Andrzej Bialecki
) at net.nutch.db.WebDBWriter.createWebDB (WebDBWriter.java:1425) at net.nutch.tools.WebDBAdminTool.main (WebDBAdminTool.java:159) You are using incompatible GNU Java. Either upgrade your GCC/GCJ to 4.x.x, or use Sun Java. Besides, Nutch 0.6 is ancient history, you should use 0.7.1 (or 0.7.2). -- Best regards, Andrzej

Re: problem with starting injection...

2006-03-28 Thread Andrzej Bialecki
if it's an option, or revert to revision 388299. -- Best regards, Andrzej Bialecki

Re: adaptive fetch

2006-03-28 Thread Andrzej Bialecki
to function in the above manner right. Did i miss out anything??? Yes, this is how it's supposed to work. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-03-30 Thread Andrzej Bialecki
in CrawlDbReducer.java:86 in both versions), instead it should be initialized with the value from old.getFetchInterval(), if available. Please fix this in your version, I'll fix this in the un-patched version. Thanks for spotting this! -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-03-30 Thread Andrzej Bialecki
Andrzej Bialecki wrote: Mehmet Tan wrote: Hi, I want to ask a question about redirections. Correct me if I'm wrong but if a page is redirected to a page that is already in the webdb, then the next updatedb operation will overwrite all previous info about refetch, because it is a newly

Re: Adaptive fetch

2006-03-31 Thread Andrzej Bialecki
a feeling that not too many people really reviewed this patch. So, IMHO these patches need more testing, because the potential for disruption is rather large. -- Best regards, Andrzej Bialecki

Re: Adaptive fetch

2006-03-31 Thread Andrzej Bialecki
bring this patch up to date over the weekend. -- Best regards, Andrzej Bialecki

Re: hi all

2006-04-02 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: thanks, but what I wanted to do is to merge segments from multiple crawls

2006-04-03 Thread Andrzej Bialecki
on my TODO list, but with a low priority. -- Best regards, Andrzej Bialecki

Re: more questions on this - please advice

2006-04-03 Thread Andrzej Bialecki
is not compatible with 0.7. With (significant) effort suitable converters could be made, but it would be way less expensive to just bite the bullet and implement missing functionality in 0.8. -- Best regards, Andrzej Bialecki

Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)

2006-04-04 Thread Andrzej Bialecki
). This is technically possible, but simply not implemented (yet). -- Best regards, Andrzej Bialecki

Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)

2006-04-04 Thread Andrzej Bialecki
, merging indexes) are already supported if you use individual command-line tools and a single DB. So, I'm not planning to do anything about it. -- Best regards, Andrzej Bialecki

Re: generate failes - class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper

2006-04-05 Thread Andrzej Bialecki
Byron Miller wrote: Got the following dump at 100% of generate cycle (.8 svn release) Just fixed this. Sorry. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-04-05 Thread Andrzej Bialecki
if they come from redirection or directly from the outlinks. If you make an exception for such urls, next time you generate a fetchlist or updatedb these urls will be filtered out anyway. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-04-05 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: please help!! inverlinks not work properly with more than 5 input parts (0.8)

2006-04-06 Thread Andrzej Bialecki
in the tutorial (http://wiki.apache.org/nutch/NutchTutorial). Please follow the tutorial where it says about Step-by-Step or Whole-web Crawling - you will save yourself (and us) a lot of grief. -- Best regards, Andrzej Bialecki

Re: latest build throws error - critical

2006-04-06 Thread Andrzej Bialecki
, Andrzej Bialecki Index: src/java/org/apache/nutch/crawl

Re: fetching stuck in the middle of processing

2006-04-08 Thread Andrzej Bialecki
as unlimited <name>http.content.limit</name> <value>-1</value>. Will that be the reason? These particular problems happen, among other cases, when you run out of disk space - please check that you have enough disk space, also on your /tmp partition. -- Best regards, Andrzej Bialecki
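To double-check what a config file actually sets for the http.content.limit property quoted in the question, a small script can read it back. The sketch below assumes name/value pairs wrapped in property elements under a single root element; the sample document, the root element name, and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical config fragment; the root element name is an assumption.
SAMPLE_CONF = """<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>"""


def read_int_property(xml_text: str, wanted: str) -> Optional[int]:
    """Return the integer value of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == wanted:
            return int(prop.findtext("value", default="").strip())
    return None


limit = read_int_property(SAMPLE_CONF, "http.content.limit")
# The question describes -1 as the "unlimited" setting.
print("unlimited" if limit is not None and limit < 0 else limit)
```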
