PDFBox (Re: Nutch Lockup/Freeze (Fetcher) - HELP!!)

2005-06-28 Thread Andrzej Bialecki
to create fetchlists based on a list of arbitrary URLs. This comes in handy if you want to test various parts of Nutch with arbitrary URLs, not coming from the DB. -- Best regards, Andrzej Bialecki

Re: PDFBox (Re: Nutch Lockup/Freeze (Fetcher) - HELP!!)

2005-06-28 Thread Andrzej Bialecki
of April. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

Re: [nutch 0.5] frames

2005-07-07 Thread Andrzej Bialecki
the frame contents. Please download the nightly snapshot and try it out. -- Best regards, Andrzej Bialecki

Re: Getting info about failed fetches (404, 500, HostNotFound, etc.)

2005-07-18 Thread Andrzej Bialecki
, translated error codes are recorded in segment data, and a subset of these translated codes is recorded in WebDB. -- Best regards, Andrzej Bialecki

Re: Speed up indexing?

2005-07-21 Thread Andrzej Bialecki
performance penalty. Some disk subsystems are good with bursty traffic (because of a large cache) but quite bad with sustained traffic. -- Best regards, Andrzej Bialecki

Re: Cookies, etc.

2005-08-09 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: Collapsing segments

2005-08-10 Thread Andrzej Bialecki
segments into one, and much more. -- Best regards, Andrzej Bialecki

Re: injecting outlinks?

2005-08-10 Thread Andrzej Bialecki
pathSuffix (may be empty) and contentType. -- Best regards, Andrzej Bialecki

VOTE: (Re: RSS Feed Parser)

2005-08-11 Thread Andrzej Bialecki
measure against this short testing period I would leave it disabled by default. Please vote +1 if I should commit it before the release, or -1 if after. -- Best regards, Andrzej Bialecki

Re: Nutch 0.7 released

2005-08-19 Thread Andrzej Bialecki
somebody be interested looking into them and any new ones? Yes, I would. -- Best regards, Andrzej Bialecki

Re: Nutch 0.7 released

2005-08-19 Thread Andrzej Bialecki
with the -noParsing option. This way we should be able to eliminate problems related to parsing. -- Best regards, Andrzej Bialecki

Re: Fetcher, Query Strings,and Duplicate Hashes (Nutch 0.7)

2005-08-25 Thread Andrzej Bialecki
contained the content MD5 hash value from the previous fetch. If the current fetch step gets the same content, I will skip the parsing and indexing process. Please see the patches in http://issues.apache.org/jira/browse/NUTCH-61. -- Best regards, Andrzej Bialecki
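The idea quoted above (compare an MD5 hash of the newly fetched content with the hash stored from the previous fetch, and skip parsing and indexing when they match) can be sketched roughly as follows. This is only an illustration using the JDK's `MessageDigest`; the class and method names (`ContentHashCheck`, `md5Hex`, `shouldReprocess`) are hypothetical and are not taken from the NUTCH-61 patches.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: decides whether fetched content
// needs to be parsed and indexed again, based on an MD5 content hash.
class ContentHashCheck {

    // Returns the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // True if there is no stored hash yet, or the content hash has changed.
    static boolean shouldReprocess(String previousHash, byte[] newContent) {
        return previousHash == null || !previousHash.equals(md5Hex(newContent));
    }

    public static void main(String[] args) {
        byte[] page = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        String stored = md5Hex(page); // hash recorded at the previous fetch
        System.out.println("unchanged page reprocessed? " + shouldReprocess(stored, page));
        byte[] edited = "<html>edited page</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println("edited page reprocessed? " + shouldReprocess(stored, edited));
    }
}
```

A byte-for-byte hash only detects exact repeats; pages that embed timestamps or session IDs would still hash differently on every fetch.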

Re: parser for xsl, ppt and zip

2005-08-31 Thread Andrzej Bialecki
occurs, and where occasional breakage may happen and may last even for a longer time, and this is acceptable there. -- Best regards, Andrzej Bialecki

Re: RangQuery problem.

2005-09-03 Thread Andrzej Bialecki
the application). If you are confident that your setup can handle more terms in a query, then you can use BooleanQuery.setMaxClauseCount(xxx) to increase this limit. -- Best regards, Andrzej Bialecki

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki
/on in the config: if it's off, then the unknown content is skipped and logged; if it's on, then make the best effort to extract text. -- Best regards, Andrzej Bialecki

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki
. html parser may claim that it supports plaintext, but there is another plugin specifically for plaintext. Which of them wins? -- Best regards, Andrzej Bialecki

Re: JavaScript Urls

2005-09-07 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: Should type: and date: queries work with search.jsp?

2005-09-16 Thread Andrzej Bialecki
this, you need to re-index your segments. -- Best regards, Andrzej Bialecki

Re: indexing is very very very slow

2005-09-19 Thread Andrzej Bialecki
of the 20050916014401 segment. Then run nutch segread -fix 20050916014401. Then re-run mergesegs - it will now work at full speed. NB. if there are any more segments which give you this warning, do the same before you run mergesegs. -- Best regards, Andrzej Bialecki

Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

2005-09-19 Thread Andrzej Bialecki
it first, or to use it as such. If there are no objections, I will change it in the trunk/ in a couple of days. -- Best regards, Andrzej Bialecki

Re: Proposal: refuse to open partially trunc. MapFile, unless forced (Re: indexing is very very very slow)

2005-09-20 Thread Andrzej Bialecki
Gal Nitzan wrote: Andrzej Bialecki wrote: Hi all, Well I still get a very slow mergesegs: 050917 043332 - data in segment index/segments/20050916014401 is corrupt, using only 128115 entries. This is a common and recurring problem. What's worse is that an unfixed segment like

Re: Is it possible to change the list of common words without crawling everything again

2005-09-20 Thread Andrzej Bialecki
the content, you just need to re-create segment indexes to reflect the changes. -- Best regards, Andrzej Bialecki

Re: Response content length is not known

2005-09-25 Thread Andrzej Bialecki
exceed available resources or limits. -- Best regards, Andrzej Bialecki

Re: Response content length is not known

2005-09-25 Thread Andrzej Bialecki
exceed available resources or limits. -- Best regards, Andrzej Bialecki

Re: fetch questions - freezing

2005-10-28 Thread Andrzej Bialecki
the -noParse flag to fetcher for all those experiments. In the past it was common for the fetcher to be stuck in a buggy parser plugin, so you will need to eliminate this factor. -- Best regards, Andrzej Bialecki

Re: Outlinks?

2005-11-07 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: How to Get Exact Matches in Nutch?

2005-11-13 Thread Andrzej Bialecki
This will be parsed into a phrase query, but it will match only the documents which contain this exact phrase... -- Best regards, Andrzej Bialecki

Re: 20million documents, disk space size?

2005-11-14 Thread Andrzej Bialecki
Paul Harrison wrote: The 250GB is with cached pages. There is some dependency on your settings for maximum content size - if you allow content such as PDF, DOC, etc then the average disk space per page could increase to 20kB and more. -- Best regards, Andrzej Bialecki

Re: Is NutchBean Class Thead/Process-Safe?

2005-11-14 Thread Andrzej Bialecki
Victor Lee wrote: How should I go around the problem? Don't use php-java bridge - use OpenSearch servlet to get RSS with results, and then parse RSS using PHP; the servlet container will cache NutchBean for you. -- Best regards, Andrzej Bialecki

Re: url regex filter on redirects

2005-11-16 Thread Andrzej Bialecki
Mr. Udatny wrote: is it correct that urls which return a redirect to another url are not filtered anymore? possible to solve? It's not true. In each case the new URL is passed to URLFilters, and if it comes back empty it is skipped. -- Best regards, Andrzej Bialecki
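The behaviour described in the reply (a redirect target is passed through the URL filters and skipped if nothing comes back) can be illustrated with a small stand-alone sketch. It does not use Nutch's actual `URLFilters` class; `RedirectFilterSketch`, `filter`, `resolveRedirect` and the `example.org` pattern are all hypothetical names chosen for the example.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch code: a redirect target goes through the
// same URL filter as any other URL and is dropped if the filter rejects it.
class RedirectFilterSketch {

    // Assumed filter rule for illustration: accept only hosts under example.org.
    static final Pattern ALLOWED =
        Pattern.compile("^https?://([a-z0-9-]+\\.)*example\\.org(/.*)?$");

    // Returns the URL if it passes the filter, or null if it is rejected.
    static String filter(String url) {
        return ALLOWED.matcher(url).matches() ? url : null;
    }

    // Returns the URL to follow after a redirect, or null if it must be skipped.
    static String resolveRedirect(String redirectTarget, UnaryOperator<String> urlFilter) {
        String filtered = urlFilter.apply(redirectTarget);
        if (filtered == null || filtered.isEmpty()) {
            return null; // filter came back empty: skip this redirect
        }
        return filtered;
    }

    public static void main(String[] args) {
        System.out.println(resolveRedirect("http://www.example.org/new", RedirectFilterSketch::filter));
        System.out.println(resolveRedirect("http://other.test/", RedirectFilterSketch::filter));
    }
}
```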

Re: Nutch webapp not at root context.

2005-11-18 Thread Andrzej Bialecki
, and some broken links. -- Best regards, Andrzej Bialecki

Re: Images

2005-11-22 Thread Andrzej Bialecki
... -- Best regards, Andrzej Bialecki

Re: Adding Field to be Searched

2005-11-22 Thread Andrzej Bialecki
/apache/nutch/searcher and I imagine I'll eventually figure it out, but if someone could point me in the right direction, I'd appreciate it. You need a query plugin - please see e.g. query-host or query-more plugins. -- Best regards, Andrzej Bialecki

Re: Stats

2005-11-28 Thread Andrzej Bialecki
and LinkDB). Also luke is every-time a good tool to browse a lucene index. (Andrzej: it is really cool! :D I use it several times in the week) Thx :) There are some bugs there, of which I'm aware, but I'm waiting with the new release for the official Lucene release. -- Best regards, Andrzej

Re: Invalid fetcher or fetch_output

2005-11-30 Thread Andrzej Bialecki
recover from whatever happened? If I generate to get a new crawl list the Fetchertool doesn't have any urls listed. Which version are you using? 0.7, or the mapred branch? -- Best regards, Andrzej Bialecki

Re: Crawl auto updated in nutch?

2005-11-30 Thread Andrzej Bialecki
function? The crawl command is just for those who are too lazy to run all 4 steps by hand... ;-) There is nothing magical about this. Just follow the standard workflow: generate, fetch, updatedb, invertlinks, generate, fetch ... dedup index search -- Best regards, Andrzej Bialecki

Re: Segment Slicer

2005-12-03 Thread Andrzej Bialecki
Matt Zytaruk wrote: Hi all, Just a quick question for you all. Is the segment slicer tool compatible with the map reduce version of nutch? Not yet. Any help is appreciated - it should be hard to do. Take a look at the CrawlDBReader or LinkDBReader. -- Best regards, Andrzej Bialecki

Re: Merging two sets of crawled data.

2005-12-06 Thread Andrzej Bialecki
nutch merge. If you expect that there are some duplicates, you will need to run dedup. That's all. -- Best regards, Andrzej Bialecki

Re: ATB: Merging two sets of crawled data.

2005-12-06 Thread Andrzej Bialecki
that inside each segment directory you have a per-segment index, because the nutch merge command will use them to create the master index. -- Best regards, Andrzej Bialecki

Re: Display on non-ASCII Characters in Search Results?

2005-12-06 Thread Andrzej Bialecki
the archives) is to change your Tomcat server.xml, and add useBodyEncodingForURI='true' in your Connector definition. And then consistently use UTF-8 in all JSPs. -- Best regards, Andrzej Bialecki

Re: Returning all hits in a document

2005-12-07 Thread Andrzej Bialecki
requirements, it could be as simple as changing the configuration in nutch-default.xml / nutch-site.xml to allow infinitely long summaries (searcher.summary.context and searcher.summary.length). -- Best regards, Andrzej Bialecki

Re: NDFS problem on mapred branch

2005-12-07 Thread Andrzej Bialecki
/bin$ ./nutch ndfs -put nutch nutch Could you try the same, but using absolute paths? NDFS client has no notion of relative or current directory, so the file names must always be absolute, i.e. starting with the leading / . -- Best regards, Andrzej Bialecki
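Since the NDFS client is described above as having no notion of a current directory, a caller could validate paths before handing them over. The following is a minimal sketch of such a check; `NdfsPathCheck` and its methods are hypothetical helpers, not part of the NDFS client API.

```java
// Hypothetical helper, not part of Nutch/NDFS: rejects relative path names
// before they are passed to a client that only understands absolute ones.
class NdfsPathCheck {

    // A path counts as absolute here if it starts with the leading "/".
    static boolean isAbsolute(String path) {
        return path != null && path.startsWith("/");
    }

    // Returns the path unchanged, or throws if it is not absolute.
    static String requireAbsolute(String path) {
        if (!isAbsolute(path)) {
            throw new IllegalArgumentException("path must be absolute: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(isAbsolute("/user/nutch")); // absolute
        System.out.println(isAbsolute("nutch"));       // relative, would be rejected
    }
}
```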

Re: Luke and Indexes

2005-12-08 Thread Andrzej Bialecki
. Most likely you encountered either protocol errors or parsing errors, so there was nothing to index from these entries. In addition, if you ran the deduplication, some of the entries in your index may have been deleted because they were considered duplicates. -- Best regards, Andrzej Bialecki

Re: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Andrzej Bialecki
to perform in order to find all occurrences of these words when processing a phrase query. -- Best regards, Andrzej Bialecki

Re: Why does Nutch use n-grams in analysis?

2005-12-28 Thread Andrzej Bialecki
the estimated total hits, and also the first couple of hits. -- Best regards, Andrzej Bialecki

Re: Nutch freezing on fetch

2005-12-30 Thread Andrzej Bialecki
investigating? Do you use the parse-pdf plugin? Please do a thread dump of the stuck process (Ctrl-\ on Unix or Ctrl-Break on Windows, if I'm not mistaken). -- Best regards, Andrzej Bialecki

Re: Nutch freezing on fetch

2005-12-30 Thread Andrzej Bialecki
that are not fetched won't be processed at all, so those Pages in WebDB won't get updated and you will have to wait another week (or use -adddays). -- Best regards, Andrzej Bialecki

Re: upgrade to version 0.8

2006-01-04 Thread Andrzej Bialecki
documented. But remember that the 0.7 branch is now in the maintenance mode, so no new features and only small bugfixes will show up there; most of the effort goes now to development of the 0.8 (trunk) branch. -- Best regards, Andrzej Bialecki

Re: Dedup - works on single file

2006-01-06 Thread Andrzej Bialecki
K.A.Hussain Ali wrote: Hi all, Does delete duplicates (dedup) work on a single segment? Dedup works on multiple indexes. Please see the source of Crawl.main() for an example of its use. -- Best regards, Andrzej Bialecki

Re: url outlink problem

2006-01-08 Thread Andrzej Bialecki
outlinks point to the correct page Is it for the reason that the site has to have a base URL value Yes. It's enough to add this somewhere in the head element of the HTML. -- Best regards, Andrzej Bialecki

Re: fresh fedora core4 install tomcat5 nutch .7.0.1 error

2006-01-10 Thread Andrzej Bialecki
(yet?). Please use the Sun JVM. -- Best regards, Andrzej Bialecki

Re: other newbies like me

2006-01-11 Thread Andrzej Bialecki
. There's only a couple things that are missing, everything else should already be context-independent. -- Best regards, Andrzej Bialecki

Re: Improving Nutch throughput w/MapReduce

2006-01-15 Thread Andrzej Bialecki
of a JobSubmissionProtocol - but I think there is no way now for the arbitrary code to reference its JobClient... bummer. Some food for thought, anyway. -- Best regards, Andrzej Bialecki

Re: ATB: filtering content/results

2006-01-16 Thread Andrzej Bialecki
for health-related information? -- Best regards, Andrzej Bialecki

Re: is it safe to inject into fetchlist directly?

2006-01-16 Thread Andrzej Bialecki
will be injected anyway when you update the DB. Some time ago I added a tool (in JIRA) to create such fetchlists, it works with 0.7. -- Best regards, Andrzej Bialecki

Re: Common Lucene Queries for PruneIndexTool -- GROUPS of files or folders

2006-01-16 Thread Andrzej Bialecki
, checking them, etc. -- Best regards, Andrzej Bialecki

Re: throttling bandwidth

2006-01-16 Thread Andrzej Bialecki
, but there are similar Linux solutions, or commercial routers with built-in traffic shaping. I think that you could also play some tricks with a bandwidth-limiting proxy server, because protocol-httpclient can use a proxy. -- Best regards, Andrzej Bialecki

Re: throttling bandwidth

2006-01-17 Thread Andrzej Bialecki
need no stinking TCP, we route good ol' IP ;) Please check your facts before claiming something about all ISPs around the world. -- Best regards, Andrzej Bialecki

Re: Running crawl on nutch 0.8

2006-01-18 Thread Andrzej Bialecki
newline at the end of the urllist.txt. -- Best regards, Andrzej Bialecki

Re: interesting paper with competing index systems

2006-01-19 Thread Andrzej Bialecki
that the hotspots are recompiled... This alone discredits the results in my eyes. -- Best regards, Andrzej Bialecki

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Andrzej Bialecki
, second, or both? I suspect only the second change was really needed, i.e. the change in config files, and not the change of protocol-httpclient - protocol-http ... It would be very helpful if you could confirm/deny this. -- Best regards, Andrzej Bialecki

Re: So many Unfetched Pages using MapReduce

2006-01-23 Thread Andrzej Bialecki
by these two plugins. The most important message from these tests is that neither plugin is horribly broken, it seems this was a problem with setting the mapred values in the wrong file... Thank you very much for checking this! -- Best regards, Andrzej Bialecki

Re: How do we get the last modified date in a file

2006-01-27 Thread Andrzej Bialecki
Is this possible? Not yet. This will be added soon to 0.8 (trunk). -- Best regards, Andrzej Bialecki

Re: Search setup

2006-01-30 Thread Andrzej Bialecki
, and segments. -- Best regards, Andrzej Bialecki

Re: Problems with MapRed-

2006-02-01 Thread Andrzej Bialecki
you perhaps check what is the exception (if any) from the JS parser when it's failing? It could be emitted into one of the tasktracker logs. -- Best regards, Andrzej Bialecki

Re: crawl fetch interval doubt

2006-02-02 Thread Andrzej Bialecki
is a float value, in seconds. -- Best regards, Andrzej Bialecki

Re: crawl fetch interval doubt

2006-02-02 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: crawler

2006-02-03 Thread Andrzej Bialecki
most of the time... ;-) -- Best regards, Andrzej Bialecki

Re: Categorizing content

2006-02-07 Thread Andrzej Bialecki
as spam. This is the purpose of the CrawlDatum metadata patch... coming soon, I hope :-) -- Best regards, Andrzej Bialecki

Re: writing modified date in crawl datum

2006-02-14 Thread Andrzej Bialecki
in the CrawlDatum metadata. I'm working on this patch, I'll update it soon on JIRA. -- Best regards, Andrzej Bialecki

Re: writing modified date in crawl datum

2006-02-14 Thread Andrzej Bialecki
in that patch. -- Best regards, Andrzej Bialecki

Re: HTTPS Protocol Implementation

2006-02-14 Thread Andrzej Bialecki
Vanderdray, Jacob wrote: Is there an HTTPS protocol implementation for nutch? Yes, protocol-httpclient supports https. -- Best regards, Andrzej Bialecki

Re: Removing URLs from Web DB

2006-02-18 Thread Andrzej Bialecki
, i.e. don't collect the result. That's all. :-) -- Best regards, Andrzej Bialecki

Re: Out of Memory while fetching

2006-02-18 Thread Andrzej Bialecki
encountered this error when I ran out of disk space. -- Best regards, Andrzej Bialecki

Re: swf -tilte

2006-02-20 Thread Andrzej Bialecki
it otherwise I'll be happy to correct it. -- Best regards, Andrzej Bialecki

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Andrzej Bialecki
Elwin wrote: Yes, it's true, although it's not the cause of my problem. Did you try to use the alternative HTML parser (TagSoup) supported by the plugin? You need to set a property parser.html.impl to tagsoup. -- Best regards, Andrzej Bialecki

Re: AW: extract links problem with parse-html plugin

2006-02-20 Thread Andrzej Bialecki
on truckin'. This is especially true for pages with multiple html elements, where Neko ignores all elements but the first one, while TagSoup just treats any html elements inside a document like any other nested element. -- Best regards, Andrzej Bialecki

CBIR (Re: Jpeg and Exif Plugin)

2006-03-03 Thread Andrzej Bialecki
a suitable front-end. -- Best regards, Andrzej Bialecki

Re: find duplicate urls in webdb

2006-03-06 Thread Andrzej Bialecki
carefully, most probably they differ only in a single character, or a whitespace. -- Best regards, Andrzej Bialecki

Re: HTTPS support?

2006-03-06 Thread Andrzej Bialecki
David Odmark wrote: Hi, Does Nutch 0.8 support https fetches? If not, are there any active efforts to support it? It does, using the protocol-httpclient plugin. -- Best regards, Andrzej Bialecki

Re: Help with bin/nutch server 8081 crawl

2006-03-07 Thread Andrzej Bialecki
that implement Configurable? Perhaps it should, using the current JobConf. -- Best regards, Andrzej Bialecki

Re: Problem running Nutch Mapred after applying patch for Adaptive refetch

2006-03-07 Thread Andrzej Bialecki
), which removes obsolete versions of pages from indexes. Pages are still present in segments until you delete old segments, but they won't appear in the searchable index. -- Best regards, Andrzej Bialecki

Re: Help with bin/nutch server 8081 crawl

2006-03-07 Thread Andrzej Bialecki
will apply this fix too. -- Best regards, Andrzej Bialecki

Re: still not so clear to me

2006-03-07 Thread Andrzej Bialecki
a fetch list first based on the seed urls, then on the links found on that page (for each subsequent iteration), then on the links on those pages, and so forth and so on until the entire domain is crawled, if you limit the domains with a filter. Yes. -- Best regards, Andrzej Bialecki

Re: retry later

2006-03-08 Thread Andrzej Bialecki
refetch anyway, and if it doesn't succeed we just increase the interval by 50%. Now, fixing this the same way in 0.7 would mean that pages no longer end up in PAGE_GONE state. Is this a fix of broken behavior or a new behavior (new feature)? I'm not sure... -- Best regards, Andrzej Bialecki
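The back-off rule mentioned above (refetch, and on failure increase the interval by 50%) amounts to multiplying the current interval by 1.5. A minimal sketch, assuming the interval is held as a float; `RetryIntervalSketch` and `nextInterval` are hypothetical names and this is not the actual Nutch update code.

```java
// Hypothetical sketch of the back-off rule: on a failed refetch the
// fetch interval grows by 50%, otherwise it is left unchanged.
class RetryIntervalSketch {

    // Returns the interval to use for the next fetch attempt.
    static float nextInterval(float currentInterval, boolean fetchSucceeded) {
        return fetchSucceeded ? currentInterval : currentInterval * 1.5f;
    }

    public static void main(String[] args) {
        float interval = 30f; // assumed starting interval, units are illustrative
        for (int attempt = 1; attempt <= 3; attempt++) {
            interval = nextInterval(interval, false); // simulate repeated failures
            System.out.println("after failure " + attempt + ": " + interval);
        }
    }
}
```

With repeated failures the interval grows geometrically, so a page that stays unreachable is retried less and less often rather than being dropped outright.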

Re: Adaptive Refetching

2006-03-08 Thread Andrzej Bialecki
me know if my inferences are correct and sorry for a bigger mail. No problem with the size. Yes, your conclusions seem correct. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetching

2006-03-09 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Doug Cutting wrote: are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only

Re: try to parse pdf

2006-03-13 Thread Andrzej Bialecki
properties in any place except the currently running process. All properties are read anew from the config files whenever you start any nutch processing. -- Best regards, Andrzej Bialecki

Re: Buggy fetchlist' urls

2006-03-13 Thread Andrzej Bialecki
really parse anything, just uses some heuristic to extract possible URLs. Unfortunately, often as not the strings it extracts don't have anything to do with URLs. If you have suggestions on how to improve it I'm all ears. -- Best regards, Andrzej Bialecki

Re: Distributed Search - config issue?

2006-03-17 Thread Andrzej Bialecki
... It is not wise to put IP addresses in your emails. Agreed. -- Best regards, Andrzej Bialecki

Re: Can small segments be combined?

2006-03-20 Thread Andrzej Bialecki
me to it ... -- Best regards, Andrzej Bialecki

Re: Adaptive fetch schedule

2006-03-22 Thread Andrzej Bialecki
is also good. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-22 Thread Andrzej Bialecki
located? Apparently Nutch doesn't find one of the input directories, so it's either not there, or the config is wrong, but without more information it's impossible to tell. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-23 Thread Andrzej Bialecki
/GettingNutchRunningWithWindows). When using Open Source software you should be prepared to do some basic research on your own. -- Best regards, Andrzej Bialecki

Re: .08 java.io.IOException: No input directories specified in: Configuration: defaults:

2006-03-23 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: fetching https pages

2006-03-27 Thread Andrzej Bialecki
- and then you should remove protocol-http from your config. -- Best regards, Andrzej Bialecki

Re: error help

2006-03-28 Thread Andrzej Bialecki
) at net.nutch.db.WebDBWriter.createWebDB (WebDBWriter.java:1425) at net.nutch.tools.WebDBAdminTool.main (WebDBAdminTool.java:159) You are using incompatible GNU Java. Either upgrade your GCC/GCJ to 4.x.x, or use Sun Java. Besides, Nutch 0.6 is ancient history, you should use 0.7.1 (or 0.7.2). -- Best regards, Andrzej

Re: problem with starting injection...

2006-03-28 Thread Andrzej Bialecki
if it's an option, or revert to revision 388299. -- Best regards, Andrzej Bialecki

Re: adaptive fetch

2006-03-28 Thread Andrzej Bialecki
to function in the above manner, right? Did I miss anything? Yes, this is how it's supposed to work. -- Best regards, Andrzej Bialecki

Re: Adaptive Refetch

2006-03-30 Thread Andrzej Bialecki
in CrawlDbReducer.java:86 in both versions), instead it should be initialized with the value from old.getFetchInterval(), if available. Please fix this in your version, I'll fix this in the un-patched version. Thanks for spotting this! -- Best regards, Andrzej Bialecki
