Re: Nutch and Hadoop not working proper

2009-06-24 Thread Andrzej Bialecki
this on Windows under cygwin then in your config files you MUST NOT use the cygwin paths (like /cygdrive/d/...) because Java can't see them. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic

Re: Nutch and Hadoop not working proper

2009-06-24 Thread Andrzej Bialecki
path). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

[ANN] Luke + Hadoop, alpha version

2009-07-10 Thread Andrzej Bialecki
that this is an early preview. Also, various UI glitches are probably related to the Thinlet toolkit - again, one day I may re-write Luke using something else, but for now I don't have the strength to do it. :) -- Best regards, Andrzej Bialecki

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2009-07-12 Thread Andrzej Bialecki
, or the problem I mentioned above. -- Best regards, Andrzej Bialecki

Re: Why cant I inject a google link to the database?

2009-07-17 Thread Andrzej Bialecki
insist on saying that it's RUDE to do this. Anyway, Google monitors such attempts and after you issue too many requests your IP will be blocked for a duration - so no matter if you go the polite or the impolite way you won't be able to do this. -- Best regards, Andrzej Bialecki

Re: nutch -threads in hadoop

2009-07-23 Thread Andrzej Bialecki
numThreads * numMapTasks per node. So be careful to set it to a number that doesn't overwhelm your network ;) -- Best regards, Andrzej Bialecki
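Reading the product above as the number of fetcher threads active on one node, the arithmetic can be sketched as follows. The class and method names are hypothetical and not part of Nutch; the figures are purely illustrative.

```java
// Hypothetical helper, not part of Nutch: estimates how many fetcher
// threads end up running concurrently on a single node.
class FetcherLoad {
    // threadsPerTask: the per-task thread count (the -threads value).
    // mapTasksPerNode: how many fetch map tasks one node runs at once.
    static int threadsPerNode(int threadsPerTask, int mapTasksPerNode) {
        return threadsPerTask * mapTasksPerNode;
    }

    public static void main(String[] args) {
        // 50 threads per task with 4 concurrent map tasks on the node.
        System.out.println(threadsPerNode(50, 4));
    }
}
```

A value that looks modest per task can therefore multiply into a heavy load once several map tasks share a node.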

Re: nutch -threads in hadoop

2009-07-24 Thread Andrzej Bialecki
. I strongly recommend setting up a local caching DNS. -- Best regards, Andrzej Bialecki

Re: Gracefull stop in the middle of a fetch phase ?

2009-07-25 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: Host specific parsing

2009-07-28 Thread Andrzej Bialecki
be enough. -- Best regards, Andrzej Bialecki

Re: Meaning of ProtocolStatus.ACCESS_DENIED

2009-08-03 Thread Andrzej Bialecki
is to be able to phase out old segments, so that you can be sure that you can delete old segments after N days, because all their pages have been surely scheduled for refetching and will be found in a newer segment. -- Best regards, Andrzej Bialecki
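A minimal sketch of the "delete after N days" check, assuming segment directory names are timestamps in the form yyyyMMddHHmmss (as in names like 20090903121951). The class name, method name, and the choice of N are hypothetical; this is not Nutch code.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: decides whether a segment is old enough to phase out.
class SegmentAge {
    // Assumed naming convention for segment directories.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // True when the segment was created more than `days` whole days before `now`.
    static boolean olderThanDays(String segmentName, LocalDateTime now, long days) {
        LocalDateTime created = LocalDateTime.parse(segmentName, FMT);
        return Duration.between(created, now).toDays() > days;
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2009, 10, 10, 0, 0);
        System.out.println(olderThanDays("20090903121951", now, 30));
    }
}
```

Passing the current time in as a parameter keeps the check easy to test; a real cleanup job would still need to confirm that every page in the segment has been refetched into a newer one.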

Re: Nutch updatedb Crash

2009-08-16 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: Nutch.SIGNATURE_KEY

2009-08-22 Thread Andrzej Bialecki
are not really the same page, so you need to be careful ... -- Best regards, Andrzej Bialecki

Re: R: Using Nutch for only retriving HTML

2009-09-30 Thread Andrzej Bialecki
, and not the content of crawldb. The command 'bin/nutch readseg -dump segmentName output' should do the trick. -- Best regards, Andrzej Bialecki

Re: how to upgrade a java application with nutch?

2009-10-01 Thread Andrzej Bialecki
dependencies needed to run Nutch except for Hadoop libraries (which are also required). -- Best regards, Andrzej Bialecki

Re: Nutch randomly skipping locations during crawl

2009-10-01 Thread Andrzej Bialecki
of fetching. -- Best regards, Andrzej Bialecki

Re: R: Using Nutch for only retriving HTML

2009-10-01 Thread Andrzej Bialecki
will dump just the content part: ./bin/nutch readseg -dump crawl/segments/20090903121951 toto -nofetch -nogenerate -noparse -noparsedata -noparsetext -- Best regards, Andrzej Bialecki

Re: Nutch randomly skipping locations during crawl

2009-10-01 Thread Andrzej Bialecki
in the logs why they aren't. See above. -- Best regards, Andrzej Bialecki

Re: Targeting Specific Links for Crawling

2009-10-05 Thread Andrzej Bialecki
Eric wrote: Does anyone know if it possible to target only certain links for crawling dynamically during a crawl? My goal would be to write a plugin for this functionality but I don't know where to start. URLFilter plugins may be what you want. -- Best regards, Andrzej Bialecki

Re: Incremental Whole Web Crawling

2009-10-05 Thread Andrzej Bialecki
multiple segments from one job, but it's not implemented yet. -- Best regards, Andrzej Bialecki

Re: Incremental Whole Web Crawling

2009-10-05 Thread Andrzej Bialecki
the parsing and updatedb just from these segments, without waiting for all 16 segments to be processed. -- Best regards, Andrzej Bialecki

Re: Targeting Specific Links

2009-10-06 Thread Andrzej Bialecki
a special flag (in metadata) that prevents fetching. This requires that you implement a custom scoring plugin. -- Best regards, Andrzej Bialecki

Re: Targeting Specific Links

2009-10-07 Thread Andrzej Bialecki
to implement than a ScoringFilter. -- Best regards, Andrzej Bialecki

Re: indexing just certain content

2009-10-09 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: indexing just certain content

2009-10-10 Thread Andrzej Bialecki
QueryFilter plugin.xml you declare that QueryParser should pass your special fields without treating them as terms, and in the implementation you create a BooleanClause to be added to the translated query. -- Best regards, Andrzej Bialecki

Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki
is this particular site, then you know the positions of navigation items, right? Then you can remove these elements in your HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these elements. -- Best regards, Andrzej Bialecki

Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki
of the smallest blocks, where link number is high - these are likely navigational elements. * reconstruct the whole page from the remaining blocks. -- Best regards, Andrzej Bialecki
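The two steps visible above (drop small, link-heavy blocks, then rebuild the page from what remains) can be sketched roughly as below. The Block type, the thresholds, and the method names are all hypothetical; how a page is split into blocks is not shown in the snippet and is assumed to have happened already.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a link-density filter; not taken from Nutch.
class BlockFilter {
    // One block of page text plus the number of links found inside it.
    static class Block {
        final String text;
        final int links;
        Block(String text, int links) { this.text = text; this.links = links; }
    }

    // A block counts as navigation when it is short and carries many links.
    static boolean looksLikeNavigation(Block b, int maxChars, int minLinks) {
        return b.text.length() <= maxChars && b.links >= minLinks;
    }

    // Rebuild the page text from the blocks that survive the filter.
    static String rebuild(List<Block> blocks, int maxChars, int minLinks) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (!looksLikeNavigation(b, maxChars, minLinks)) {
                kept.add(b.text);
            }
        }
        return String.join("\n", kept);
    }
}
```

The thresholds would need tuning per site; a fixed cut-off is only a starting point.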

Re: Incremental Whole Web Crawling

2009-10-13 Thread Andrzej Bialecki
/nutch/trunk -- Best regards, Andrzej Bialecki

Re: Incremental Whole Web Crawling

2009-10-13 Thread Andrzej Bialecki
Eric Osgood wrote: So the trunk contains the most recent nightly update? It's the other way around - nightly build is created from a snapshot of the trunk. The trunk is always the most recent. -- Best regards, Andrzej Bialecki

Re: http keep alive

2009-10-14 Thread Andrzej Bialecki
for the server side. -- Best regards, Andrzej Bialecki

Re: Nutch Enterprise

2009-10-17 Thread Andrzej Bialecki
I agree with Dennis - use Nutch if you need to do a larger-scale discovery such as when you crawl the web, but if you already know all target pages in advance then Solr will be a much better (and much easier to handle) platform. -- Best regards, Andrzej Bialecki

Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException

2009-10-17 Thread Andrzej Bialecki
is valid, and cannot be written to. Are you sure you are running a single datanode process per machine? -- Best regards, Andrzej Bialecki

Re: How to run a complete crawl?

2009-10-17 Thread Andrzej Bialecki
- when crawling filesystems each file in a directory is treated as an outlink, and this limit is then applied. -- Best regards, Andrzej Bialecki

Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread Andrzej Bialecki
, to keep track of the parent URL. The rest should be handled automatically, although there are some other complications that need to be handled as well (e.g. don't recrawl sub-documents). -- Best regards, Andrzej Bialecki

Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Andrzej Bialecki
? It is. This problem is rare - I think I crawled cumulatively ~500mln pages in various configs and it didn't occur to me personally. It requires a few things to go wrong (see the issue comments). -- Best regards, Andrzej Bialecki

Re: Accessing an Index from a shared location

2009-10-21 Thread Andrzej Bialecki
missed here? Does Nutch allow us to put the index on a network location? UNC paths are not supported in Java - you need to mount this location as a local volume. -- Best regards, Andrzej Bialecki

Re: Targeting Specific Links

2009-10-23 Thread Andrzej Bialecki
; } -- Best regards, Andrzej Bialecki

Re: Deleting stale URLs from Nutch/Solr

2009-10-26 Thread Andrzej Bialecki
such URLs directly from CrawlDb (using e.g. CrawlDbReader API) and then uses SolrJ API to send the same delete requests + commit. -- Best regards, Andrzej Bialecki

Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Andrzej Bialecki
Gora Mohanty wrote: On Mon, 26 Oct 2009 17:26:23 +0100 Andrzej Bialecki a...@getopt.org wrote: [...] Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in Nutch crawldb to prevent their re-discovery (through stale links pointing to these URL-s from other pages

Re: How to index files only with specific type

2009-10-27 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: unbalanced fetching

2009-10-29 Thread Andrzej Bialecki
the longest is assigned a lot of URLs from a single host. A workaround for this is to limit the max number of URLs per host (in nutch-site.xml) to a more reasonable number, e.g. 100 or 1000, whatever works best for you. -- Best regards, Andrzej Bialecki
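What a per-host cap does to a fetch list can be sketched as below. In Nutch 1.0-era configurations the setting appears to be the generate.max.per.host property, but treat that name as an assumption and check nutch-default.xml for your version. The class and method here are hypothetical and only illustrate the effect; they are not how the generator is implemented.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep at most maxPerHost URLs for each host.
class HostCap {
    // Assumes well-formed absolute URLs; input order is preserved.
    static List<String> capPerHost(List<String> urls, int maxPerHost) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            // Count this URL against its host, then keep it if under the cap.
            int count = seen.merge(host, 1, Integer::sum);
            if (count <= maxPerHost) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

With a cap in place, no single host can dominate one task's fetch list, which is what makes the slowest task finish closer to the others.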

Re: updatedb is talking long long time

2009-11-02 Thread Andrzej Bialecki
. * minor issue - when specifying the path names of segments and crawldb, do NOT append the trailing slash - it's not harmful in this particular case, but you could have a nasty surprise when doing e.g. copy / mv operations ... -- Best regards, Andrzej Bialecki

Re: including code between plugins

2009-11-02 Thread Andrzej Bialecki
without actually using the language-identifier plugin? You need to add the language-identifier plugin to the requires section in your plugin.xml, like this: <requires> <import plugin="nutch-extensionpoints"/> <import plugin="language-identifier"/> </requires> -- Best regards, Andrzej

Re: could you unsubscribe me from this mailing list pls. tks

2009-11-02 Thread Andrzej Bialecki
respond to it from the same email account that you were subscribed from? -- Best regards, Andrzej Bialecki

Unsubscribe step-by-step (Re: could you unsubscribe me from this mailing list pls. tks)

2009-11-02 Thread Andrzej Bialecki
Andrzej Bialecki wrote: doesn't work, as reported by me and others last week. Thanks, Did you get the message with the subject of confirm unsubscribe from nutch-user@lucene.apache.org and did you respond to it from the same email account that you were subscribed from? .. I just verified

Re: Direct Access to Cached Data

2009-11-05 Thread Andrzej Bialecki
), and you can use its API to retrieve either all or individual records from a segment (using URL as key). -- Best regards, Andrzej Bialecki

Nutch near future - strategic directions

2009-11-09 Thread Andrzej Bialecki
platform to develop and experiment with such components. - Briefly ;) that's what comes to my mind when I think about the future of Nutch. I invite you all to share your thoughts and suggestions! -- Best regards, Andrzej Bialecki

Re: changing/addding field in existing index

2009-11-09 Thread Andrzej Bialecki
the index. -- Best regards, Andrzej Bialecki

Re: Problems with Hadoop source

2009-11-11 Thread Andrzej Bialecki
of the file:// schema FileSystem. Now you probably forgot to put hadoop-default.xml on your classpath. Go to Build Path and add this file to your classpath, and all should be ok. -- Best regards, Andrzej Bialecki

Re: Nutch Hadoop question

2009-11-13 Thread Andrzej Bialecki
them to use different ports AND different local paths. -- Best regards, Andrzej Bialecki

Re: Synonym Filter with Nutch

2009-11-13 Thread Andrzej Bialecki
, Andrzej Bialecki

Re: Nutch near future - strategic directions

2009-11-16 Thread Andrzej Bialecki
depends on the last modified timestamp being present on the webpage that is being crawled, which I believe is not mandatory. Still those who do set it would benefit. This is already implemented - see the Signature / MD5Signature / TextProfileSignature. -- Best regards, Andrzej Bialecki

Re: decoding nutch readseg -dump 's output

2009-11-16 Thread Andrzej Bialecki
characters outside this encoding will be replaced by question marks. If you want to get an exact copy of the raw binary content then please use the SegmentReader API. -- Best regards, Andrzej Bialecki
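The question-mark substitution is ordinary Java behaviour when text is encoded into a charset that cannot represent it. The sketch below shows the effect with ISO-8859-1, chosen purely for illustration; which encoding the dump actually uses is not visible in the snippet above, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of lossy encoding; not Nutch code.
class LossyDump {
    // Encode to ISO-8859-1 and decode again, as a text dump in that
    // charset effectively would.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The accented e survives; the two CJK characters do not.
        System.out.println(roundTrip("caf\u00e9 \u65e5\u672c"));
    }
}
```

Once the characters have been replaced, the original bytes cannot be recovered from the dump, which is why reading the raw content through an API is the safer route.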

Re: Scalability for one site

2009-11-16 Thread Andrzej Bialecki
are their victims). The source code is there, if you choose you can modify it to bypass these restrictions, just be aware of the consequences (and don't use Nutch as your user agent ;) ). -- Best regards, Andrzej Bialecki

Re: Nutch upgrade to Hadoop

2009-11-20 Thread Andrzej Bialecki
) - and I agree that we should have a 1.1 release in the near future. -- Best regards, Andrzej Bialecki

Re: Nutch near future - strategic directions

2009-11-20 Thread Andrzej Bialecki
own extended DB-s. -- Best regards, Andrzej Bialecki

Re: Nutch upgrade to Hadoop

2009-11-20 Thread Andrzej Bialecki
Dennis Kubes wrote: I would like to get a couple things in this release as well. Let me know if you want help with the upgrade. You mean you want to do the Hadoop upgrade? I won't stand in your way :) -- Best regards, Andrzej Bialecki

Re: Nutch upgrade to Hadoop

2009-11-21 Thread Andrzej Bialecki
! -- Best regards, Andrzej Bialecki

Re: AbstractFetchSchedule

2009-11-22 Thread Andrzej Bialecki
, indeed this looks like a bug - we should instead do like this: if (datum.getFetchInterval() > maxInterval) { datum.setFetchInterval(maxInterval * 0.9); } -- Best regards, Andrzej Bialecki
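A minimal, self-contained sketch of the check described above, assuming the comparison is "greater than" and using a plain float in place of the interval stored on the datum. The class and method names are hypothetical stand-ins, not the Nutch classes.

```java
// Hypothetical stand-in for the interval handling; not the Nutch class.
class IntervalClamp {
    // Returns the interval to store: an interval above the maximum is
    // pulled back to 90% of the maximum, anything else is left alone.
    static float clamp(float fetchInterval, float maxInterval) {
        if (fetchInterval > maxInterval) {
            return maxInterval * 0.9f;
        }
        return fetchInterval;
    }

    public static void main(String[] args) {
        System.out.println(clamp(100f, 50f)); // above the maximum, so reduced
        System.out.println(clamp(30f, 50f));  // within bounds, so unchanged
    }
}
```

Resetting to slightly below the maximum, rather than exactly to it, leaves some margin before the same condition triggers again.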

Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki
in your crawldb. -- Best regards, Andrzej Bialecki

Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki
relaxed Signature implementation, e.g. TextProfileSignature. -- Best regards, Andrzej Bialecki

Re: dedup dont delete duplicates !

2009-11-25 Thread Andrzej Bialecki
the db in order to update the signatures. -- Best regards, Andrzej Bialecki

Re: Nutch config IOException

2009-11-25 Thread Andrzej Bialecki
logging. ;) -- Best regards, Andrzej Bialecki

Re: 100 fetches per second?

2009-11-25 Thread Andrzej Bialecki
unique hosts are in the current working set. -- Best regards, Andrzej Bialecki

Re: Broken segments ?

2009-11-26 Thread Andrzej Bialecki
which thread you replied to and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. -- Best regards, Andrzej Bialecki

Re: Encoding the content got from Fetcher

2009-11-27 Thread Andrzej Bialecki
are unpredictable. -- Best regards, Andrzej Bialecki

Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki
tasks tend to hang around, but still some of them finish and make space for new tasks. As time goes on, the majority of your tasks become slow tasks, so the overall speed continues to drop. -- Best regards, Andrzej Bialecki

Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki
week I will be working on integrating the patches from Julien, and if time permits I could perhaps start working on a speed monitoring to lock out slow servers. -- Best regards, Andrzej Bialecki

Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki
that the process is in a (single) reduce phase sorting the data - with larger jobs in local mode the sorting phase may take a very long time, due to heavy disk IO (and in disk-wait state it may be uninterruptible). Try to generate a thread dump to see what code is being executed. -- Best regards, Andrzej

Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki
Paul Tomblin wrote: On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki a...@getopt.org wrote: Paul Tomblin wrote: -bash-3.2$ jstack -F 32507 Attaching to process ID 32507, please wait... Hm, I can't see anything obviously wrong with that thread dump. What's the CPU and swap usage

Re: odd warnings

2009-12-01 Thread Andrzej Bialecki
partial indexes, so you need to specify each /part- dir as an input to dedup. -- Best regards, Andrzej Bialecki

Re: org.apache.hadoop.util.DiskChecker$DiskErrorExceptio

2009-12-02 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: How does generate work ?

2009-12-03 Thread Andrzej Bialecki
during generation. See ScoringFilter.generatorSortValue(..), you can modify this method in scoring-opic (or in your own scoring filter) to prioritize certain urls over others. -- Best regards, Andrzej Bialecki

Re: Nutch 1.0 wml plugin

2009-12-07 Thread Andrzej Bialecki
, please create a JIRA issue in Nutch, and attach the patch. -- Best regards, Andrzej Bialecki

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Andrzej Bialecki
the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)? -- Best regards, Andrzej Bialecki

Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Andrzej Bialecki
, that contain an outlink to that page. Very good explanation, that's exactly the reasons why Nutch never discards such pages. If you really want to ignore certain pages, then use URLFilters and/or ScoringFilters. -- Best regards, Andrzej Bialecki

Re: domain vs www.domain?

2009-12-10 Thread Andrzej Bialecki
that changes the matching urls to e.g. always lose the 'www.' part. -- Best regards, Andrzej Bialecki
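The kind of rewrite described above can be sketched with a single regular expression. The class and method names are hypothetical, and the pattern is only one plausible rule; in Nutch itself such a rule would normally live in the URL normalizer configuration rather than in custom code.

```java
// Hypothetical sketch of a "drop the leading www." normalization rule.
class WwwNormalizer {
    // Removes a leading "www." from the host of an http or https URL,
    // ignoring case; anything else in the URL is left untouched.
    static String stripWww(String url) {
        return url.replaceFirst("(?i)^(https?://)www\\.", "$1");
    }

    public static void main(String[] args) {
        System.out.println(stripWww("http://www.example.com/index.html"));
    }
}
```

Both spellings of a host then map to one URL. This is only safe when the two hosts really do serve the same content.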

Re: Luke reading index in hdfs

2009-12-11 Thread Andrzej Bialecki
part-N partial indexes). -- Best regards, Andrzej Bialecki

Re: OR support

2009-12-14 Thread Andrzej Bialecki
On 2009-12-14 16:05, BrunoWL wrote: Nobody? Please, any answer would good. Please check this issue: https://issues.apache.org/jira/browse/NUTCH-479 That's the current status, i.e. this functionality is available only as a patch. -- Best regards, Andrzej Bialecki

Re: Nutch Hadoop 0.20 - AlreadyBeingCreatedException

2009-12-17 Thread Andrzej Bialecki
should commit the change? Thanks for reporting this - could you perhaps try to apply that patch and see if it helps? I hesitated to commit it because it's really a workaround and not a solution ... but if it works for you then it's better than nothing. -- Best regards, Andrzej Bialecki

Re: Large files - nutch failing to fetch

2009-12-21 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki
that you are looking for is an IndexingFilter - this receives a copy of the document with all fields collected just before it's sent to the indexing backend - and you can freely modify the content of NutchDocument, e.g. do additional analysis, add/remove/modify fields, etc. -- Best regards, Andrzej

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki
On 2009-12-22 16:07, Claudio Martella wrote: Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'am aware of that. The problem is that i have some fields of the SolrDocument that i want to compute by text analysis (basically i want to do some smart keywords extraction

Re: Dedup remove all duplicates

2010-01-06 Thread Andrzej Bialecki
(2 documents), and if the problem persists please report this in JIRA. -- Best regards, Andrzej Bialecki

Re: Purging from Nutch after indexing with Solr

2010-01-08 Thread Andrzej Bialecki
linkdb with new links from a new segment. -- Best regards, Andrzej Bialecki

Re: Purging from Nutch after indexing with Solr

2010-01-09 Thread Andrzej Bialecki
in development test phases, less in production though. Right. Also, a common practice is to keep the raw data for a while just to make sure that the parsing and indexing went smoothly (in case you need to re-parse the raw content). -- Best regards, Andrzej Bialecki

Re: Adding additional metadata

2010-01-11 Thread Andrzej Bialecki
is configurability - if you put this code in a separate plugin, you can easily turn it on/off, but if it sits in HtmlParser this would be more difficult to do. -- Best regards, Andrzej Bialecki

Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Andrzej Bialecki
is slightly less expressive but much much faster. -- Best regards, Andrzej Bialecki

Re: Post Injecting ?

2010-01-15 Thread Andrzej Bialecki
On 2010-01-15 20:09, MilleBii wrote: Inject is meant to seed the database at the start. But I would like to inject new urls on a production crawldb, I think it works but I was wondering if somebody could confirm that. Yes. New urls are merged with the old ones. -- Best regards, Andrzej

Re: merge not working anymore

2010-01-18 Thread Andrzej Bialecki
hdfs.DFSClient - DFS Read: java.io.IOException: Could not obtain block: blk_-6931814167688802826_9735 file=/user/root/crawl/indexed-segments/20100117235244/part-0/_1lr.prx This error is commonly caused by running out of disk space on a datanode. -- Best regards, Andrzej Bialecki

Re: About HBase Integration

2010-02-09 Thread Andrzej Bialecki
On 2010-02-09 03:08, Hua Su wrote: Thanks. But heritrix is another project, right? Please see this Git repository, it contains the latest work in progress on Nutch+HBase: git://github.com/dogacan/nutchbase.git -- Best regards, Andrzej Bialecki

Re: SegmentFilter

2010-02-20 Thread Andrzej Bialecki
params, such as sessionId, print=yes, etc) or completely unrelated (human errors, peculiarities of the content management system, or mirrors). In your case it seems that the same page is available under different values of g2_highlightId. -- Best regards, Andrzej Bialecki
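When duplicates differ only in a query parameter that does not change the page, one option is to drop that parameter before URLs are compared. The sketch below is hypothetical (class, method, and example URL are made up for illustration) and ignores URL fragments and percent-encoding for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: remove selected query parameters from a URL.
class ParamStripper {
    static String strip(String url, Set<String> drop) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to do
        }
        String base = url.substring(0, q);
        List<String> kept = new ArrayList<>();
        for (String pair : url.substring(q + 1).split("&")) {
            String name = pair.split("=", 2)[0];
            // Keep every non-empty pair whose name is not on the drop list.
            if (!pair.isEmpty() && !drop.contains(name)) {
                kept.add(pair);
            }
        }
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }
}
```

For example, stripping g2_highlightId from http://example.com/main.php?g2_page=8&g2_highlightId=123 leaves http://example.com/main.php?g2_page=8, so all highlight variants collapse to one URL.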

Re: SegmentFilter

2010-02-21 Thread Andrzej Bialecki
On 2010-02-20 23:32, reinhard schwab wrote: Andrzej Bialecki schrieb: On 2010-02-20 22:45, reinhard schwab wrote: the content of one page is stored even 7 times. http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 i believe this comes from Recno:: 383 URL:: http://www.cinema-paradiso.at

Re: Nutch v0.4

2010-02-25 Thread Andrzej Bialecki
no longer exists. Sorry :( However, you can still check out that code from CVS repository at nutch.sf.net . -- Best regards, Andrzej Bialecki

Re: Update on ignoring menu divs

2010-02-28 Thread Andrzej Bialecki
/boilerpipe/ . -- Best regards, Andrzej Bialecki

Re: New version of nutch?

2010-03-03 Thread Andrzej Bialecki
still a few months away. -- Best regards, Andrzej Bialecki

Re: Content of redirected urls empty

2010-03-08 Thread Andrzej Bialecki
, is there really no content for the redirected url? -- Best regards, Andrzej Bialecki

Re: form-based authentication? Any progress

2010-03-10 Thread Andrzej Bialecki
generating the response ... it was a total mess. So, if you target 10 sites, you can make it work. If you target 10,000 sites all using slightly different methods, then forget it. -- Best regards, Andrzej Bialecki

Re: Where are new linked entries added

2010-03-11 Thread Andrzej Bialecki
, it's complex and fragile. -- Best regards, Andrzej Bialecki

Re: Avoid indexing common html to all pages, promoting page titles.

2010-03-12 Thread Andrzej Bialecki
define these weights in the configuration, look for query boost properties. -- Best regards, Andrzej Bialecki
