Re: Wildcard search with nutch distributed search

2010-05-09 Thread Andrzej Bialecki
On 2010-05-06 22:39, JohnRodey wrote:
 
 I'm running the Distributed Search's IndexServer.  I'm trying to figure out a
 way to improve the index search to basically work for wildcards.
 
 Index- name:Bobby
 
 Ex. query for name:Bob will return nothing
 Ex. query for name:Bob* will be converted to the same as above and return
 nothing.

The Nutch query syntax doesn't support wildcards. This will change soon
in the Nutch trunk, where we will delegate query parsing to the particular
type of search backend (e.g. Solr).

 
 Looks like the Lucene Query object does provide this, however nutch's
 distributed search does not.

Neither local nor distributed Nutch search supports this - it's a (purposeful)
limitation of the Nutch query syntax.

 
 Is there any solution (that hopefully doesn't require major refactoring)
 that could provide this functionality?

Use Nutch for crawling and indexing to Solr, and then use Solr directly
for searching.
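
For example, once the pages are indexed in Solr you can send the wildcard
query straight to Solr's select handler (a sketch - the host, port and core
name depend on your Solr setup, and 'name' must be an indexed field there):

  curl 'http://localhost:8983/solr/select?q=name:Bob*'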

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: full text search for java sources and subversion repository

2010-05-09 Thread Andrzej Bialecki
On 2010-05-09 12:23, Rafael Kubina wrote:
 Hi
 
 I'm trying to do a full-text search on my java sources (.java) via nutch
 (1.0), svn and http (mod_dav_svn).
 
 other documents like html are searchable just fine, my sources are not.
 
 currently the output is the following:
 
 fetching http://s025/svn/java/foo/trunk/src/main/java/Bar.java
 Pre-configured credentials with scope - host: s025; port: 80; found for
 url: http://s025/svn/java/foo/trunk/src/main/java/Bar.java
 url: http://s025/svn/java/foo/trunk/src/main/java/Bar.java; status
 code: 200; bytes received: 5829; Content-Length: 5829
 
 the content-type for this file is text/plain
 
 there are no exceptions, no other problems.
 
 I really appreciate any help that I can get. Thanks a lot!

You need to check the following:

* parse_text in your segment (you can dump this with the readseg command).
It should contain the plain-text content of your file.

* use Luke (www.getopt.org/luke) to examine your Lucene index. You
should be able to retrieve terms coming from your Java documents - use
Reconstruct & Edit in Luke.
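
For example, something along these lines (the segment name is a placeholder
for one of your own segment directories; if I remember the readseg switches
correctly, the -no* flags below limit the dump to parse_text):

  bin/nutch readseg -dump crawl/segments/20100509123456 dump_text \
      -nocontent -nofetch -nogenerate -noparse -noparsedata
  less dump_text/dump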

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: JobTracker gets stuck with DFS problems

2010-05-03 Thread Andrzej Bialecki
On 2010-05-03 19:59, Emmanuel de Castro Santana wrote:
 Unfortunately, no. You should at least crawl without parsing, so that
 when you download the content you can run the parsing separately, and
 repeat it if it fails.
 
 I've just found this in the FAQ - can it be done?
 
 http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F

The first method of recovering that is mentioned there works only
(sometimes) for crawls performed using LocalJobTracker and local file
system. It does not work in any other case.

 
 By the way, about not parsing: isn't it necessary to parse the content anyway
 in order to generate links for the next segment? If this is true, one would
 have to run the parse separately, which would amount to the same thing.

Yes, but if the parsing fails you still have the downloaded content,
which you can re-parse after you've fixed the config or the code...
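
For example (paths are placeholders - adjust them to your crawl layout),
re-parsing a fetched segment and updating the crawldb is just:

  bin/nutch parse crawl/segments/20100503123456
  bin/nutch updatedb crawl/crawldb crawl/segments/20100503123456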


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: JobTracker gets stuck with DFS problems

2010-05-03 Thread Andrzej Bialecki
On 2010-05-03 22:58, Emmanuel de Castro Santana wrote:
 The first method of recovering that is mentioned there works only
 (sometimes) for crawls performed using LocalJobTracker and local file
 system. It does not work in any other case.
 
 if I stop the crawling process, take the crawled content from the dfs onto
 my local disk, do the fix and then put it back into hdfs, would it work? Or
 would there be a problem with dfs replication of the new files?

Again, this procedure does NOT work when using HDFS - you won't even see
the partial output (without some serious hacking).

 
 Yes, but if the parsing fails you still have the downloaded content,
 which you can re-parse again after you fixed the config or the code...
 
 Interesting ... I did not see any option like -noParsing in the bin/nutch
 crawl command; that means I will have to code my own .sh for crawling, one
 that uses the -noParsing option of the fetcher, right?

You can simply set the fetcher.parsing config option to false.
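
For example, in conf/nutch-site.xml (this overrides the default from
nutch-default.xml):

  <property>
    <name>fetcher.parsing</name>
    <value>false</value>
    <description>Don't parse while fetching; run bin/nutch parse on the
    segment afterwards.</description>
  </property>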


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: JobTracker gets stuck with DFS problems

2010-04-30 Thread Andrzej Bialecki
On 2010-04-30 20:09, Emmanuel de Castro Santana wrote:
 Hi All
 
 We are using Nutch to crawl ~500K pages with a 3-node cluster; each node
 features a dual-core processor with 4 GB RAM and circa 100 GB of storage.
 All nodes run CentOS.
 
 These 500K pages are scattered across several sites, each of them having
 from 5k up to 200k pages. For each site we start a different crawl process
 (using bin/nutch crawl), but they are all started almost simultaneously.
 
 We are trying to tune Hadoop's configuration in order to have a reliable
 daily crawling process. After a while of crawling we see some problems
 occurring, mainly on the TaskTracker nodes; most of them are related to
 access to the HDFS. We often see "Bad response 1 for block" and "Filesystem
 closed", among others. When these errors start to get more frequent, the
 JobTracker gets stuck and we have to run stop-all. If we lower the maximum
 number of map and reduce tasks, the process takes longer to get
 stuck, but we haven't found the adequate configuration yet.
 
 Given that setup, there are some questions we have been struggling to find
 answers to:
 
 1. What could be the most probable reason for the hdfs problems?

I suspect the following issues in this order:

* too small a number of file handles on your machines (run ulimit -n; this
should be set to 16k or more, the default is 1k - see the example below the list).

* do you use a SAN or other type of NAS as your storage?

* network equipment, such as a router, switch, or network card: quite
often low-end equipment cannot handle a high volume of traffic, even
though it's equipped with all the right ports ;)
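
As an example for the file handle limit mentioned in the first point (the
user name is a placeholder for whatever account runs the Hadoop daemons):

  # check the current limit
  ulimit -n
  # raise it persistently on Linux in /etc/security/limits.conf, e.g.:
  #   hadoop  soft  nofile  16384
  #   hadoop  hard  nofile  16384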

 
 2. Is it better to start a single crawl with all sites included, or to just
 keep it the way we are doing (i.e. start a different crawl process for each
 site)?

It's much, much better to crawl all sites at the same time. This allows
you to benefit from parallel crawling - otherwise your fetcher will
always be stuck waiting on the politeness crawl delay.

 
 3. When it all goes down, is there a way to restart crawling from where the
 process stopped?

Unfortunately, no. You should at least crawl without parsing, so that
when you download the content you can run the parsing separately, and
repeat it if it fails.
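
A rough sketch of such a cycle (directory names are examples; alternatively
set fetcher.parsing to false in nutch-site.xml instead of passing -noParsing):

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000
  SEG=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $SEG -noParsing
  bin/nutch parse $SEG          # repeat just this step if parsing fails
  bin/nutch updatedb crawl/crawldb $SEG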

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Hadoop Disk Error

2010-04-27 Thread Andrzej Bialecki
On 2010-04-26 22:31, Joshua J Pavel wrote:
 
 Sending this out to close the thread if anyone else experiences this
 problem: nutch 1.0 is not AIX-friendly (0.9 is).
 
 I'm not 100% sure which command it may be, but by modifying my path so
 that /opt/freeware/bin has precedence, I no longer get the hadoop disk
 error.  While I thought this meant the problem comes from the nutch script,
 not the code itself, manually trying to set system calls
 to /opt/freeware/bin didn't fix it.  I assume until detailed debugging is
 done, further releases will also require a workaround similar to what I'm
 doing.

Ahhh ... now I understand. The problem lies in Hadoop's use of utilities
such as /bin/whoami, /bin/ls and /bin/df. These are used to obtain some
filesystem and permissions information that is otherwise not available
from the JVM.

However, these utilities are expected to produce POSIX-style output on
Unix, or Cygwin output on Windows. I guess the native commands in AIX
don't conform to either, so the output of these utilities can't be
parsed, which ultimately results in errors. The output of the
/opt/freeware/bin utilities, on the other hand, does follow the POSIX format.
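
For the archives, the workaround then boils down to putting the POSIX tools
first on the PATH before starting anything Nutch/Hadoop-related, e.g.:

  export PATH=/opt/freeware/bin:$PATH
  bin/nutch crawl ...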

I'm not sure what the difference was in 0.9 that still made it work ...
perhaps the parsing of these outputs was more lenient, or some errors
were ignored.

In any case, we in Nutch can't do anything about this; we can just add
your workaround to the documentation. The problem should be reported to
the Hadoop project.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



ANNOUNCE: Nutch becomes an Apache Top-Level Project (TLP)

2010-04-26 Thread Andrzej Bialecki
Hi all,

I'm happy to announce that the ASF Board has accepted the resolution to
separate Nutch from the Lucene project and make it into a top-level
project (full text of the resolution can be viewed here [1]). Thanks to
all who voted and who worked on preparing this proposal!

This means that in the upcoming days/weeks we will start moving our web
site and mailing lists to a new prefix, @nutch.apache.org. AFAIK it's
possible to automatically move the mailing list subscriptions to the new
addresses, so you won't have to do anything (apart from changing your
mail filters, perhaps).

This change involves the Nutch repository being moved eventually under
svn://svn.apache.org/repos/asf/nutch . We will let you know when this
happens.

JIRA setup will remain the same.

[1]
http://search.lucidimagination.com/search/document/443c3cf9f67b4f42/vote_2_board_resolution_for_nutch_as_tlp

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Andrzej Bialecki
On 2010-04-26 16:24, David M. Cole wrote:
 At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote:
 Most folks that use Nutch are likely
 familiar with running ant IMHO.
 
 I guess then I fall into the category of not most folks. Have been
 running Nutch for about 14 months and I haven't a clue how to run ant.
 
 If there's a place to vote to suggest that compiled versions still be
 distributed, I vote for that.

Actually, we don't have a build target (yet) that produces a binary-only
distribution that we can ship and which you can run out of the box (not
counting the build/nutch.job alone, because it needs the Hadoop
infrastructure to run).

The current mixed (source+binary) distribution worked well enough so
far, but the size of the distribution is becoming a concern, hence the
idea to ship only the source. We may have been too hasty with that,
though... What do others think?

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to do faceting on data indexed by Nutch

2010-04-25 Thread Andrzej Bialecki
On 2010-04-25 15:03, KK wrote:
 Hi All,
 I might be repeating a question asked by someone else, but googling didn't
 help in tracking down any such mail responses.
 I'm pretty much aware of Solr/Lucene and its basic architecture. I've done
 hit highlighting in Lucene and have an idea of the faceting support in Solr, but never
 tried it actually. I want to implement faceting on Nutch's indexed data. I
 already have some MBs of data indexed by Nutch and just want to
 implement faceting on those. Can someone give me pointers on how to proceed
 further in this regard? Or is it the case that I have to query using the Solr
 interface and redirect all the queries to the index already created by
 Nutch? What is the best possible, simplest way of achieving this?
 Please help in this regard.

Nutch has two indexing/searching backends - the one configured by default
uses plain Lucene, and it does not support faceting. The other backend
uses Solr, which of course supports faceting and all other Solr features.

So in your case you need to switch to Solr indexing (and searching).
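
A minimal sketch, assuming a local Solr instance set up with the schema.xml
shipped with Nutch (URLs, paths and the 'site' facet field are examples):

  # index the existing crawl data into Solr
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb \
      crawl/segments/*
  # then ask Solr for facets
  curl 'http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=site&rows=0'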

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: About Apache Nutch 1.1 Final Release

2010-04-17 Thread Andrzej Bialecki
On 2010-04-17 05:45, Phil Barnett wrote:
 On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:
 
 More details on this (your environment, OS, JDK version) and
 logs/stacktraces would be highly appreciated! You mentioned that you
 have some scripts - if you could extract relevant portions from them (or
 copy the scripts) it would help us to ensure that it's not a simple
 command-line error.
 
 I posted another thread tonight with the fixed code.

See here: https://issues.apache.org/jira/browse/NUTCH-812

 
 Can you please commit it for all of us?

I'm traveling today ... Chris, can you perhaps apply the patch before
you roll another RC?

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: About Apache Nutch 1.1 Final Release

2010-04-10 Thread Andrzej Bialecki
On 2010-04-10 17:49, Phil Barnett wrote:
 On Thu, 2010-04-08 at 21:31 -0700, Mattmann, Chris A (388J) wrote:
 Hi there,

 Well as soon as we have 3 +1 binding VOTEs. Right now I'm the only PMC 
 member that's VOTE'd +1 on the release.

 Hopefully in the next few days someone will have a chance to check...
 
 I tried to get the Release Candidate (latest nightly build) running
 yesterday and I ran into problems with both of the scripts that I use to
 crawl with 1.0.
 
 But the smaller bin/crawl method finished the crawl and then immediately
 had a java exception when starting the next step.
 
 Sorry I don't have more specifics, but I'm at home, the setup is at work
 and I had to revert to get things back running. But I built a dev
 machine so I can play with 1.1 and get more specific.

More details on this (your environment, OS, JDK version) and
logs/stacktraces would be highly appreciated! You mentioned that you
have some scripts - if you could extract relevant portions from them (or
copy the scripts) it would help us to ensure that it's not a simple
command-line error.



-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-09 Thread Andrzej Bialecki
On 2010-04-07 07:14, Mattmann, Chris A (388J) wrote:
 Hi Folks,
 
 I have posted a candidate for the Apache Nutch 1.1 release. The source code
 is at:
 
 http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/
 
 See the included CHANGES.txt file for details on release contents and latest
 changes. The release was made using the Nutch release process, documented on
 the Wiki here:
 
 http://bit.ly/d5ugid
 
 A Nutch 1.1 tag is at:
 
 http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/
 
 Please vote on releasing these packages as Apache Nutch 1.1. The vote is
 open for the next 72 hours. Only votes from Lucene PMC are binding, but
 everyone is welcome to check the release candidate and voice their approval
 or disapproval. The vote passes if at least three binding +1 votes are cast.
 
 [ ] +1 Release the packages as Apache Nutch 1.1.
 
 [ ] -1 Do not release the packages because...
 

+1 - tested both local and distributed workflows, all looks good.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[VOTE RESULTS] Nutch to become a top-level project (TLP)

2010-04-08 Thread Andrzej Bialecki
Hi all,

I'm happy to announce that this vote is closed and the proposal has
passed with 4 +1 binding votes and 0 -1 binding votes - in fact, there
were only +1-s both from the committers and the community.

Thanks to all who expressed their opinion - we will now proceed with the
remaining formal steps to become a TLP.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch segment merge is very slow

2010-04-05 Thread Andrzej Bialecki
On 2010-04-05 16:54, ashokkumar.raveendi...@wipro.com wrote:
 Hi,
   Thank you for your suggestion. I have around 500+ internet urls
 configured for crawling and crawl process is running in Amazon cloud.  I
 have already reduced my depth to 8, topN to 1000 and also increased
 fetcher threads to 150 and limited 50 urls per  host using
 generate.max.per.host property. With this configuration Generate, Fetch,
 Parse, Update completes in max 10 hrs. When comes to segment merge it
 takes lot of time. As a temporary solution I am not doing the segment
 merge and directly indexing the fetched segments. With this solution I
 am able to finish the crawl process with in 24hrs. Now I am looking for
 long term solution to optimize segment merge process.

Segment merging is not strictly necessary, unless you have a hundred
segments or so. If this step takes too much time, but the number of
segments is still well below a hundred, just don't merge them.


-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[VOTE] Nutch to become a top-level project (TLP)

2010-04-01 Thread Andrzej Bialecki
Hi all,

According to an earlier [DISCUSS] thread on the nutch-dev list I'm
calling for a vote on the proposal to make Nutch a top-level project.

To quickly recap the reasons and consequences of such move: the ASF
board is concerned about the size and diversity of goals across various
subprojects under the Lucene TLP, and suggests that each subproject
should evaluate whether becoming its own TLP would better serve the
project itself and the Lucene TLP.

We discussed this issue and expressed opinions that ranged from positive
(easier management, better exposure, better focus on the mission, not
really dependent on Lucene development) to neutral (no significant
reason, only political change) to moderately negative (increased admin
work, decreased exposure).

Therefore, the proposal is to separate Nutch from under Lucene TLP and
form a top-level project with its own PMC, own svn and own site.

Please indicate one of the following:

[ ] +1 - yes, I vote for the proposal
[ ] -1 - no, I vote against the proposal (because ...)

(Please note that anyone in the Nutch community is invited to express
their opinion, though only Nutch committers cast binding votes.)

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [VOTE] Nutch to become a top-level project (TLP)

2010-04-01 Thread Andrzej Bialecki
On 2010-04-01 19:40, Robert Hohman wrote:
 +1 yes, and I also vote we try to somehow make nutch easier to work with
 maven-based projects. I've had a heck of a time integrating it (although
 I've more or less gotten it to work)

Patches are welcome - I realize this could be beneficial, but I'm not
familiar with Maven, so I won't be able to make this change myself...

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Can't open a nutch 1.0 index with luke

2010-04-01 Thread Andrzej Bialecki
On 2010-04-01 21:09, Magnús Skúlason wrote:
 Hi,
 
 I am getting the following exception when I try to open a nutch 1.0 (I am
 using the official release) index with Luke (0.9.9.1)
 
 java.io.IOException: read past EOF
 at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
 at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
 at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:36)
 at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:70)
 at org.apache.lucene.store.IndexInput.readLong(IndexInput.java:93)
 at org.apache.lucene.index.SegmentInfo.<init>(SegmentInfo.java:203)
 at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:256)
 at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
 at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
 at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:375)
 at org.getopt.luke.Luke.openIndex(Unknown Source)
 at org.getopt.luke.Luke.openOk(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at thinlet.Thinlet.invokeImpl(Unknown Source)
 at thinlet.Thinlet.invoke(Unknown Source)
 at thinlet.Thinlet.handleMouseEvent(Unknown Source)
 at thinlet.Thinlet.processEvent(Unknown Source)
 at java.awt.Component.dispatchEventImpl(Unknown Source)
 at java.awt.Container.dispatchEventImpl(Unknown Source)
 at java.awt.Component.dispatchEvent(Unknown Source)
 at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
 at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
 at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
 at java.awt.Container.dispatchEventImpl(Unknown Source)
 at java.awt.Window.dispatchEventImpl(Unknown Source)
 at java.awt.Component.dispatchEvent(Unknown Source)
 at java.awt.EventQueue.dispatchEvent(Unknown Source)
 at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
 at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
 at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
 at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
 at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
 at java.awt.EventDispatchThread.run(Unknown Source)
 
 Any ideas why this happens and how to fix it?

Can Nutch itself open this index and use it? I'm not getting any such
errors with the above combination and a small test index ...



-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: hamid sefrani

2010-03-29 Thread Andrzej Bialecki

On 2010-03-29 16:52, Pedro Bezunartea López wrote:

Are there any anti-spam measures? The same sender has posted a few spam
messages already...

Pedro.

2010/3/25 Mike Hayscpun...@hotmail.com


http://SPAM...porr...com/...lndex.html


Normally only users that subscribed to the list are allowed to post, 
unless a moderator adds them. It appears that this user slipped through 
... I'll try to forcibly unsubscribe him. Sorry!




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Fetch Stuck

2010-03-13 Thread Andrzej Bialecki

On 2010-03-13 00:12, Abhi Yerra wrote:

So I had -noParsing set. So parsing was not part of the fetch. The
pages have been crawled, but the reducers have crashed. So if I
restart the fetch will it try to crawl all those pages again?


Yes. It would be good to investigate first why it crashed; otherwise
it's likely to happen again. Are you running this on a cluster? Check
the logs of the crashed tasks (in logs/userlogs/ on the respective
tasktracker nodes).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Avoid indexing common html to all pages, promoting page titles.

2010-03-12 Thread Andrzej Bialecki

On 2010-03-12 12:52, Pedro Bezunartea López wrote:

Hi,

 I'm developing a site that shows the dynamic content in a <div
 id="content">; the rest of the page doesn't really change. I'd like to store
 and index only the contents of this <div>, basically to avoid re-indexing
 the same content (header, footer, menu) over and over.

I've checked the WritingPluginExample-0.9 howto, but I couldn't figure out a
couple of things:

1.- Should I extend the parse-html plugin, or should I just replace it?


You should write an HtmlParseFilter and extract only the portions that 
you care about, and then replace the output parseText with your 
extracted text.
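
A rough sketch of such a filter, assuming the Nutch 1.0 HtmlParseFilter
interface (class and method names are from memory - check the parse-html
plugin and the ParseResult sources before relying on them). It keeps only
the text under <div id="content"> and replaces the parse text with it:

  // Sketch only - package declaration, plugin.xml and build.xml wiring omitted.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.*;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.*;

  public class ContentDivFilter implements HtmlParseFilter {
    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
      StringBuilder text = new StringBuilder();
      collectText(doc, text, false);
      Parse parse = parseResult.get(content.getUrl());
      // replace the parse text, keep the original ParseData (title, outlinks, metadata)
      parseResult.put(content.getUrl(), new ParseText(text.toString()), parse.getData());
      return parseResult;
    }

    /** Append text nodes, switching 'inside' on once we enter div id="content". */
    private void collectText(Node node, StringBuilder sb, boolean inside) {
      if (node.getNodeType() == Node.ELEMENT_NODE) {
        Element el = (Element) node;
        if ("div".equalsIgnoreCase(el.getTagName()) && "content".equals(el.getAttribute("id")))
          inside = true;
      } else if (inside && node.getNodeType() == Node.TEXT_NODE) {
        sb.append(node.getNodeValue()).append(' ');
      }
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++)
        collectText(children.item(i), sb, inside);
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }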



 2.- The example talks about finding a meta tag, extracting some information
 from it, and adding a field to the index. I think I just need to get rid of
 all html except the <div id="content"> tag, and index its content. Can someone
 point me in the right direction?


See above.


 And just one more thing: I'd like to give a higher score to pages in which the
 search terms appear in the title. Right now pages that contain the terms in
 the body rank higher than those that contain the search terms in the title;
 how could I modify this behaviour?


You can define these weights in the configuration; look for the query boost
properties.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Fetch Stuck

2010-03-12 Thread Andrzej Bialecki

On 2010-03-12 23:39, Abhi Yerra wrote:

Hi,

We did a fetch and the maps are 100% done, but the reducers have crashed. We 
did a large fetch so is there a way to restart the reducers without restarting 
the fetch?


Unfortunately no. Was the fetcher running in parsing mode? If so, I
strongly recommend that you first fetch, and then run the parsing as a
separate step.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Where are new linked entries added

2010-03-11 Thread Andrzej Bialecki

On 2010-03-11 15:53, nikinch wrote:


Hi everyone

 I've been using nutch for a while now and I've hit a snag.
 
 I'm trying to find where newly linked pages are added to the segment as a
 specific entry.
 To make myself clear, I've been through the fetch class and the crawlDBFilter
 and reducer.
 But I'm looking for the initial entry where, for a given page, the links are
 transformed into segment entries; my objective here is to pass down the
 initial inject url to all its linked pages. So when I create an entry for
 the linked urls of a webpage I'll add metadata to their definition giving
 them this originating url.
 By the time I get to CrawlDBFilter I already have entries for linked pages
 and have lost the notion of which seed url brought us here.
 I thought the job would be done in the Fetcher, maybe in the output function,
 but I'm not finding where it happens. So if anyone knows and could point me
 in the right direction I'd appreciate it.


Currently the best place to do this is in your implementation of a 
ScoringFilter, in distributeScoreToOutlinks(). You can also modify one 
of the existing scoring plugins. I would advise against modifying the 
code directly in ParseOutputFormat - it's complex and fragile.
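
A sketch of just that method, with the signature as I remember it from the
1.0 ScoringFilter interface (double-check against the scoring-opic plugin;
the other interface methods are omitted and the metadata key is made up).
It copies a seed marker from the parent page into each outlink's CrawlDatum:

  // fragment of a ScoringFilter implementation - remaining methods and imports omitted
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
      throws ScoringFilterException {
    // the seed recorded on the parent page, if any; otherwise the parent is the seed
    String seed = parseData.getContentMeta().get("x-seed-url");   // hypothetical key
    if (seed == null) seed = fromUrl.toString();
    for (Entry<Text, CrawlDatum> target : targets) {
      target.getValue().getMetaData().put(new Text("x-seed-url"), new Text(seed));
    }
    return adjust;
  }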



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: form-based authentication? Any progress

2010-03-10 Thread Andrzej Bialecki

On 2010-03-10 19:26, conficio wrote:



Susam Pal wrote:


Hi,

Indeed the answer is negative and also, many people have often asked
this in this list. Martin has very nicely explained the problems and
possible solution. I'll just add what I have thought of. I have often
wondered what it would take to create a nice configurable cookie based
authentication feature. The following file would be needed:-
...
http://wiki.apache.org/nutch/HttpPostAuthentication


I was wondering if there has been any work done into this direction?

I guess the answer is still no.

Would the problem become easier, if one targets particular types of sites,
such as popular Wiki, Bug Trackers, Blogs, CMS, Forum, document management
systems (first)?



I was involved in a project to implement this (as a proprietary plugin). 
In short, it requires a lot of effort, and there are no generic 
solutions. If it works with one site, it breaks with another, and 
eventually you end up with a nasty heap of hacks upon hacks. In that 
project we gave up after discovering that a large number of sites use 
Javascript to create and name the input controls, and they used a 
challenge-response with client-side scripts generating the response ... 
it was a total mess.


So, if you target 10 sites, you can make it work. If you target 10,000 
sites all using slightly different methods, then forget it.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Content of redirected urls empty

2010-03-08 Thread Andrzej Bialecki

On 2010-03-08 14:55, BELLINI ADAM wrote:



 any ideas, guys??



From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: Content of redirected urls empty
Date: Fri, 5 Mar 2010 22:01:05 +



 hi,
 the content of my redirected urls is empty... but I still have the other
 metadata...
 I have an http url that is redirected to https.
 in my index I find the http URL but with empty content...
 could you explain it please?


There are two ways to redirect - one is with protocol, and the other is 
with content (either meta refresh, or javascript).


When you dump the segment, is there really no content for the redirected 
url?



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: New version of nutch?

2010-03-03 Thread Andrzej Bialecki

On 2010-03-03 20:12, John Martyniak wrote:

 Does anybody have an idea of when a new version of nutch will be
 available, specifically one supporting the latest version of hadoop? And
 possibly hbase?

Thank you for any information.


We should roll out a 1.1 soon (a few weeks); the nutch+hbase work is imho 
still a few months away.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Update on ignoring menu divs

2010-02-28 Thread Andrzej Bialecki

On 2010-02-28 18:42, Ian M. Evans wrote:

Using Nutch as a crawler for solr.

I've been digging around the nutch-user archives a bit and have seen
some people discussing how to ignore menu items or other unnecessary div
areas like common footers, etc. I still haven't come across a full
answer yet.

 Is there a way to define a div by id that nutch will strip out before
 tossing the content into solr?


There is no such functionality out of the box. One direction that is 
worth pursuing would be to create an HtmlParseFilter plugin that wraps 
the Boilerpipe library http://code.google.com/p/boilerpipe/ .


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch v0.4

2010-02-25 Thread Andrzej Bialecki

On 2010-02-24 17:34, Pedro Bezunartea López wrote:

Hi Ashley,

Hi,

I'm looking to reproduce program analysis results based on Nutch v0.4. I
realize this is a very old release, but is it possible to obtain the source
from somewhere? I see some of the classes I'm looking for in v0.7, but I
need the older version to confirm it.
Thanks,
Ashley



You can get version 0.6 and higher from apache's archive:

http://archive.apache.org/dist/lucene/nutch/

... but I haven't found anything older,


AFAIK older releases of Nutch were archived only on that old SF site, 
and apparently that site no longer exists. Sorry :( However, you can 
still check out that code from the CVS repository at nutch.sf.net.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SegmentFilter

2010-02-21 Thread Andrzej Bialecki

On 2010-02-20 23:32, reinhard schwab wrote:

Andrzej Bialecki schrieb:

On 2010-02-20 22:45, reinhard schwab wrote:

the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from

Recno:: 383
URL::
http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519


Duplicate content is usually related to the fact that indeed the same
content appears under different urls. This is common enough, so I
don't see this necessarily as a bug in Nutch - we won't know that the
content is identical until we actually fetch it...

Urls may differ in certain systematic ways (e.g. by a set of URL
params, such as sessionId, print=yes, etc) or completely unrelated
(human errors, peculiarities of the content management system, or
mirrors). In your case it seems that the same page is available under
different values of g2_highlightId.



 I know. I have implemented several url filters to filter duplicate content.
 There is a difference here:
 in this case the same content is stored
 under the same url several times.
 It is stored under
 http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
 and not under
 http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519
 
 The content for the latter url is empty.
Content:


OK, then the answer can be found in the protocol status or parse status. 
You can get the protocol status by doing a segment dump of only the 
crawl_fetch part (disable all other parts; the output is then less 
confusing). Similarly, the parse status can be found in crawl_parse.
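
For example (the segment path is a placeholder; the -no* switches should
leave only crawl_fetch in the dump - check bin/nutch readseg for the exact
flag names):

  bin/nutch readseg -dump crawl/segments/20100220123456 dump_fetch \
      -nocontent -nogenerate -noparse -noparsedata -noparsetext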





--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: SegmentFilter

2010-02-20 Thread Andrzej Bialecki

On 2010-02-20 22:45, reinhard schwab wrote:

the content of one page is stored even 7 times.
http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8
i believe this comes from

Recno:: 383
URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519


Duplicate content is usually related to the fact that indeed the same 
content appears under different urls. This is common enough, so I don't 
see this necessarily as a bug in Nutch - we won't know that the content 
is identical until we actually fetch it...


Urls may differ in certain systematic ways (e.g. by a set of URL params, 
such as sessionId, print=yes, etc) or completely unrelated (human 
errors, peculiarities of the content management system, or mirrors). In 
your case it seems that the same page is available under different 
values of g2_highlightId.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: About HBase Integration

2010-02-09 Thread Andrzej Bialecki

On 2010-02-09 03:08, Hua Su wrote:

Thanks. But heritrix is another project, right?



Please see this Git repository; it contains the latest work in progress 
on Nutch+HBase:


git://github.com/dogacan/nutchbase.git

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: merge not working anymore

2010-01-18 Thread Andrzej Bialecki

On 2010-01-18 21:56, MilleBii wrote:

Help !!!

My production environment is blocked by this error.
 I deleted the segment altogether and restarted crawl/fetch/parse... and I'm
 still stuck, so I cannot add segments anymore.
 Looks like an HDFS problem?

2010-01-18 19:53:00,785 WARN  hdfs.DFSClient - DFS Read:
java.io.IOException: Could not obtain block: blk_-6931814167688802826_9735
file=/user/root/crawl/indexed-segments/20100117235244/part-0/_1lr.prx


This error is commonly caused by running out of disk space on a datanode.
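
A quick way to check, assuming a standard Hadoop installation:

  hadoop dfsadmin -report    # shows capacity and remaining space per datanode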


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Post Injecting ?

2010-01-15 Thread Andrzej Bialecki

On 2010-01-15 20:09, MilleBii wrote:

Inject is meant to seed the database at the start.

 But I would like to inject new urls into a production crawldb; I think it
 works, but I was wondering if somebody could confirm that.



Yes. New urls are merged with the old ones.
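
For example (paths are placeholders for your crawldb and a directory with
the new seed lists):

  bin/nutch inject crawl/crawldb new_urls/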


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Adding additional metadata

2010-01-11 Thread Andrzej Bialecki

On 2010-01-11 13:18, Erlend Garåsen wrote:


First of all: I didn't know about the list archive, so sorry for not
searching that resource before I sent a new post.

MilleBii wrote:

For lastModified just enable the index|query-more plugins it will do
the job for you.


Unfortunately not. Our pages include Dublin core metadata which has a
Norwegian name.


 For other meta, search the mailing list - it's explained many times how to
 do it


 I found several posts concerning metadata, but for me one question is
 still unanswered: do I really have to create a lot of new classes/xml
 files in order to store the content of just two metadata fields? I have not
 managed to parse the content of the lastModified metadata after I tried
 to rewrite the HtmlParser class. So I tried to add hard-coded metadata
 values in HtmlParser like this instead:
 entry.getValue().getData().getParseMeta().set("dato.endret", "01.01.2008");

My modified MoreIndexingFilter managed to pick up the hard coded values,
and the dates were successfully stored into my Solr Index after running
the solrindex option.

This means that it is not necessary to write a new MoreIndexingFilter
class, but I'm still unsure about the HtmlParser class since I haven't
managed to parse the content of the metadata.


You can of course hack your way through HtmlParser and add/remove/modify 
as you see fit - it's straightforward and likely you will get the result 
that you want.


However, as MilleBii suggests, the preferred way to do this would be to 
write a plugin. The reason is the cost of long-term maintenance - if 
you ever want to sync up your locally modified version of Nutch with a 
newer public release, your hacked copy of HtmlParser won't merge nicely, 
whereas if you put your code in a separate plugin it might. Another 
reason is configurability - if you put this code in a separate plugin, 
you can easily turn it on/off, but if it sits in HtmlParser this would 
be more difficult to do.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Andrzej Bialecki

On 2010-01-11 18:40, Godmar Back wrote:

On Mon, Jan 11, 2010 at 12:30 PM, Fuad Efendif...@efendi.ca  wrote:

Googling reveals
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4675952 and
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5050507 so you
could try increasing the Java stack size in bin/nutch (-Xss), or use
an alternate regexp if you can.

Just out of curiosity, why does a performance critical program such as
Nutch use Sun's backtracking-based regexp implementation rather than
an efficient Thompson-based one?  Do you need the additional
expressiveness provided by PCRE?



Very interesting point... we should use it for BIXO too.


BTW, SUN has memory leaks with LinkedBlockingQueue,
http://bugs.sun.com/view_bug.do?bug_id=6806875
http://tech.groups.yahoo.com/group/bixo-dev/message/329



I don't think we use this class in Nutch.



And, of course, URL is synchronized; Apache Tomcat uses simplified version
of URL class.
And, RegexUrlNormalizer is synchronized in Nutch...
And, in order to retrieve plain text from HTML we are creating fat DOM
object (instead of using, for instance, filters in NekoHtml)


We are creating a DOM tree because it's much easier to write filtering 
plugins that work with a DOM tree than to implement Neko filters. Besides, 
we provide an option to use TagSoup for HTML parsing, which is not only 
more resilient to HTML errors but also more efficient.


Besides, Nutch is built around plugins. Deactivate parse-html and write 
your own HTML plugin that avoids these inefficiencies, and we'll be 
happy to include it in the distribution.



And more...



I'm no expert, but the reason I brought this up for discussion was
that I recently encountered a paper that pointed out that regular
expression matching accounts for a significant fraction of total
runtime in search engine indexers [1] and thus it's something that's
usually optimized.

  - Godmar

[1] http://portal.acm.org/citation.cfm?id=1542275.1542284


This StackOverflowError probably came from urlfilter-regex, which indeed 
uses Java regex - definitely one of the worst implementations. The reason 
it's used by default in Nutch is that it's standard in the JDK, FWIW.


For high-performance crawlers I usually do the following:

* avoid regex filtering completely, if possible, instead using a 
combination of prefix/suffix/domain/custom filters


* use urlfilter-automaton, which is slightly less expressive but much 
much faster.
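
For example, swapping the filter in conf/nutch-site.xml - the value below is
only a sketch mirroring the default plugin.includes, so keep whatever other
plugins your setup already lists:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-automaton|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>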


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Purging from Nutch after indexing with Solr

2010-01-09 Thread Andrzej Bialecki

On 2010-01-09 10:18, MilleBii wrote:

@Andrzej,

 To be more specific: if one uses cached content (which I do), what is the
 minimal stuff to keep? I guess:
 + crawl_fetch
 + parse_data
 + parse_text
 
 the rest is not used ... I guess. Before I start testing, could you confirm?


crawl_fetch you can ignore - it's just the status of fetching, which 
should by that time already be integrated into the crawldb (if you ran 
updatedb).


It's the content/ part that you need to display the cached view.



@Ulysse,

 The other reason to keep all the data is if you will need to reindex all
 segments, which does happen in development & test phases, though less so in
 production.


Right. Also, a common practice is to keep the raw data for a while just 
to make sure that the parsing and indexing went smoothly (in case you 
need to re-parse the raw content).
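
So once you're confident in the parses and the index (and updatedb has been
run for those segments), removing the parts you don't need can be as simple
as (HDFS paths are examples; keep content/, parse_data and parse_text if you
serve cached views):

  hadoop fs -rmr 'crawl/segments/*/crawl_fetch'
  hadoop fs -rmr 'crawl/segments/*/crawl_generate'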



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Purging from Nutch after indexing with Solr

2010-01-08 Thread Andrzej Bialecki

On 2010-01-08 19:07, Ulysses Rangel Ribeiro wrote:

I'm crawling with Nutch 1.0 and indexing with Solr 1.4, and came with some
questions regarding data redundancy with this setup.

Considering the following sample segment:

 2.0G    content
 196K    crawl_fetch
 152K    crawl_generate
 376K    crawl_parse
 392K    parse_data
 441M    parse_text

 1. From what I have found through searches, content holds the raw fetched
 content. Is there any problem if I remove it, i.e. does nutch need it to
 apply any sort of logic when re-crawling that content/url?


No, they are no longer needed, unless you want to provide a cached 
view of the content.




 2. The previous question applies to parse_data and parse_text after I've called
 nutch solrindex on that segment.


It depends on how you set up your search. If you search using NutchBean 
(i.e. the Nutch web application) then you need them. If you search using 
Solr, then you don't need them.




 3. Using sample scripts and tutorials I'm always seeing invertlinks being
 called over all segments, but its output mentions merging. When I
 fetch/parse new segments, can I call invertlinks only over them?


Yes, invertlinks will incrementally merge the existing linkdb with new 
links from a new segment.
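
For example, either form works (paths are placeholders):

  # add just the newly fetched segment to the existing linkdb
  bin/nutch invertlinks crawl/linkdb crawl/segments/20100108123456
  # or re-scan the whole segments directory
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments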


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Dedup remove all duplicates

2010-01-06 Thread Andrzej Bialecki

On 2010-01-06 18:56, Pascal Dimassimo wrote:


Hi,

After I run the index command, my index contains 2 documents with the same
boost, digest, segment and title, but with different tstamp and url. When I
run the dedup command on that index, both documents are removed.

Should the document with the latest tstamp be kept?


It should; out of multiple documents with the same URL (url duplicates) 
only the most recent is retained - unless it was removed because there 
was another document in the index with the same content (a content 
duplicate).


Could you please verify this on a minimal index (2 documents), and if 
the problem persists please report it in JIRA.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki

On 2009-12-22 13:16, Claudio Martella wrote:

 Yes, I'm aware of that. The problem is that I have some fields of the
 SolrDocument that I want to compute by text analysis (basically I want
 to do some smart keyword extraction), so I have to get in the middle
 between crawling and indexing! My actual solution is to dump the content
 to a file through the segreader, parse it and then use SolrJ to send the
 documents. Probably the best solution is to set my own analyzer for the
 field on the solr side, and do the keyword extraction there.
 
 Thanks for the script, I'll use it!


Likely the solution that you are looking for is an IndexingFilter - this 
receives a copy of the document with all fields collected just before 
it's sent to the indexing backend - and you can freely modify the 
content of NutchDocument, e.g. do additional analysis, add/remove/modify 
fields, etc.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki

On 2009-12-22 16:07, Claudio Martella wrote:

Andrzej Bialecki wrote:

On 2009-12-22 13:16, Claudio Martella wrote:

 Yes, I'm aware of that. The problem is that I have some fields of the
 SolrDocument that I want to compute by text analysis (basically I want
 to do some smart keyword extraction), so I have to get in the middle
 between crawling and indexing! My actual solution is to dump the content
 to a file through the segreader, parse it and then use SolrJ to send the
 documents. Probably the best solution is to set my own analyzer for the
 field on the solr side, and do the keyword extraction there.
 
 Thanks for the script, I'll use it!


Likely the solution that you are looking for is an IndexingFilter -
this receives a copy of the document with all fields collected just
before it's sent to the indexing backend - and you can freely modify
the content of NutchDocument, e.g. do additional analysis,
add/remove/modify fields, etc.


This sounds very interesting. So the idea is to take the NutchDocument
as it comes out of the crawling and modify it (inside of an
IndexingFilter) before it's sent to indexing (inside of nutch),  right?


Correct - IndexingFilters work regardless of whether you use Nutch or Solr 
indexing.



So how does it relate to nutch schema and solr schema? Can you give me
some pointers?



Please take a look at how e.g. the index-more filter is implemented - 
basically you need to copy this filter and make whatever modifications 
you need ;)


Keep in mind that any fields that you create in NutchDocument need to be 
properly declared in schema.xml when using Solr indexing.
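
A bare-bones sketch of such a filter, assuming the Nutch 1.0 IndexingFilter
interface (compare with index-more for the authoritative signatures; the
'keywords' field and extractKeywords() are made up for illustration):

  // Sketch only - package declaration, plugin.xml and build.xml wiring omitted.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.*;
  import org.apache.nutch.parse.Parse;

  public class KeywordIndexingFilter implements IndexingFilter {
    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) throws IndexingException {
      // run your own analysis on the parsed text and add the result as a new field;
      // when indexing to Solr, declare the field in schema.xml as well
      doc.add("keywords", extractKeywords(parse.getText()));
      return doc;
    }

    private String extractKeywords(String text) {
      // placeholder for the real keyword-extraction logic
      return text.length() > 100 ? text.substring(0, 100) : text;
    }

    // part of the 1.0 interface, if I recall correctly; nothing backend-specific needed here
    public void addIndexBackendOptions(Configuration conf) { }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }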


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Large files - nutch failing to fetch

2009-12-21 Thread Andrzej Bialecki

On 2009-12-21 17:15, Sundara Kaku wrote:

Hi,

 Nutch is throwing errors while fetching large files (files with a size of more
 than 100mb). I have a website with pages that point to large files (file
 sizes vary from 10mb to 500mb) and there are several large files on that
 website. I want to fetch all the files using Nutch, but nutch is throwing an
 outofmemory exception for large files (I have set the heap size to 2500m); with
 a 2500m heap, files of 250mb are retrieved but larger than that
 are failing,
 and nutch takes a lot of time after printing
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0
 
 if there are three files of 100mb each then it fails (at the
 same depth, with heap size 2500m) to fetch the files.
 
 i have set http.content.limit to -1
 
 is there a way to fetch several large files using nutch?
 
 I am using nutch as a webcrawler, I am not using indexing. I want to download
 web resources and scan them for viruses using ClamAV.


Probably Nutch is not the right tool for you - you should probably use 
wget. Nutch was designed to fetch many pages of limited size - as a 
temporary step it caches the downloaded content in memory, before 
flushing it out to disk.


(I had to solve this limitation once for a specific case - the solution 
was to implement a variant of the protocol and Content that stored data 
into separate HDFS files without buffering in memory - but it was a 
brittle hack that only worked for that particular scenario).


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Hadoop 0.20 - AlreadyBeingCreatedException

2009-12-17 Thread Andrzej Bialecki

On 2009-12-17 10:13, Eran Zinman wrote:

Hi,

I'm getting Nutch/Hadoop exception: AlreadyBeingCreatedException on some of
Nutch parser reduce tasks.

I know this is a known issue with Nutch (
https://issues.apache.org/jira/browse/NUTCH-692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717058#action_12717058
)

 And as far as I can see that patch wasn't committed yet because we wanted to
 examine it on the new Hadoop 0.20 version. I am using the latest Nutch with
 Hadoop 0.20 and I can confirm this exception still occurs (rarely - but it
 does) - maybe we should commit the change?


Thanks for reporting this - could you perhaps try to apply that patch 
and see if it helps? I hesitated to commit it because it's really a 
workaround and not a solution ... but if it works for you then it's 
better than nothing.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: OR support

2009-12-14 Thread Andrzej Bialecki

On 2009-12-14 16:05, BrunoWL wrote:


 Nobody?
 Please, any answer would be good.


Please check this issue:

https://issues.apache.org/jira/browse/NUTCH-479

That's the current status, i.e. this functionality is available only as 
a patch.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Luke reading index in hdfs

2009-12-11 Thread Andrzej Bialecki

On 2009-12-11 22:21, MilleBii wrote:

 Guys, is there a way you can get Luke to read the index from hdfs://?
 Or do you have to copy it out to the local filesystem?



Luke 0.9.9 can open indexes directly from HDFS hosted on Hadoop 0.19.x.
Luke 0.9.9.1 can do the same, but uses Hadoop 0.20.1.

Start Luke, dismiss the open dialog, and then go to Plugins / Hadoop, 
and enter the full URL of the index directory (including the hdfs:// 
part). You can also open multiple parts of the index (e.g. if you follow 
the Nutch naming convention, you can directly open the indexes/ 
directory that contains part-N partial indexes).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Andrzej Bialecki

On 2009-12-10 20:33, Kirby Bohling wrote:

On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAMmbel...@msn.com  wrote:


hi,

 i have a page with <meta name="robots" content="noindex,nofollow" />. now i know
 that nutch obeys this tag because i don't find the content and the title in my index, but i was
 expecting that this document would not be present in the index at all. why does it keep the document in my index with
 no title and no content??
 
 i'm using the index-basic and index-more plugins, and i want to understand why
 nutch is still filling in the url, date, boost, etc. since it didn't do it for title and
 content.
 
 i was thinking that if nutch obeys nofollow and noindex it would skip
 the whole document!
 
 or maybe i misunderstood something, can you please explain this behavior to me?

best regards.



My guess is that the page is recorded to note that the page shouldn't
be fetched, I'm guessing the status is one of the magic values.  It
probably re-fetches the page periodically to ensure it has the list.
So the URL and the date make sense to me as to why they populate them.
  I don't know why it is computing the boost, other than the fact that
it might be part of the OPIC scoring algorithm.  If the scoring
algorithm ever uses the scores/boost of the pages that you point at as
a contributing factor, it would make total sense.  So even though it
doesn't index "http://example/foo/bar", knowing which pages point
there, and what their scores are could contribute scores of pages that
you do index, that contain an outlink to that page.


Very good explanation - that's exactly the reason why Nutch never 
discards such pages. If you really want to ignore certain pages, then 
use URLFilters and/or ScoringFilters.
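
As a concrete illustration, here is a minimal sketch of such a URLFilter 
plugin, assuming the Nutch 1.0 org.apache.nutch.net.URLFilter interface 
(return null to drop a URL, return the URL to keep it); the package name 
and the pattern being skipped are made up for the example:

package org.example.urlfilter;                 // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class SkipPrivateSectionFilter implements URLFilter {

  private Configuration conf;

  /** Return null to drop the URL completely, or the URL itself to keep it. */
  public String filter(String urlString) {
    // Illustrative rule: never record anything under /private/.
    if (urlString.contains("/private/")) {
      return null;
    }
    return urlString;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

The plugin still has to be declared in a plugin.xml against the URLFilter 
extension point and enabled via plugin.includes, just like the URL filters 
that ship with Nutch.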



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: domain vs www.domain?

2009-12-10 Thread Andrzej Bialecki

On 2009-12-10 19:59, Jesse Hires wrote:

I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically I am seeing www.domain.com and
domain.com being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting, if
not, what would you recommend doing to prevent this?


This is a surprisingly difficult problem to solve in general case, 
because it's not always true that 'www.domain' equals 'domain'. If you 
do know this is true in your particular case, you can add a rule to 
regex-urlnormalizer that changes the matching urls to e.g. always lose 
the 'www.' part.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Andrzej Bialecki

Eran Zinman wrote:

Hi,

Sorry to bother you guys again, but it seems that no matter what I do I
can't run the new version of Nutch with Hadoop 0.20.

I am getting the following exceptions in my logs when I execute
bin/start-all.sh


Do you use the scripts in place, i.e. without deploying the nutch*.job 
to a separate Hadoop cluster? Could you please try it with a standalone 
Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)?



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 1.0 wml plugin

2009-12-07 Thread Andrzej Bialecki

yangfeng wrote:

I have completed a plugin for parsing WML (Wireless Markup Language). I
would like to contribute it to Lucene - what should I do?



The best long-term option would be to submit this work to the Tika 
project - see http://lucene.apache.org/tika/. If you already implemented 
this as a Nutch plugin, please create a JIRA issue in Nutch, and attach 
the patch.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How does generate work ?

2009-12-03 Thread Andrzej Bialecki

MilleBii wrote:

Oops continuing previous mail.

So I wonder if there could be a better 'generate' algorithm which
would maintain a constant ratio of hosts per 100 URLs ... Below a certain
threshold it would stop, or better, start including URLs with lower scores.


That's exactly how the max.urls.per.host limit works.



Using scores is de-optimizing the fetching process... Having said that
I should first read the code and try to understand it.


That wouldn't hurt in any case ;)

There is also a method in ScoringFilter-s (e.g. the default 
scoring-opic) that determines the priority of a URL during 
generation. See ScoringFilter.generatorSortValue(..) - you can modify 
this method in scoring-opic (or in your own scoring filter) to 
prioritize certain URLs over others.
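
For instance, a minimal sketch of such a custom scoring filter could extend 
the stock OPIC plugin and only override generatorSortValue(); the class and 
package names and the boosted URL pattern below are just placeholders:

package org.example.scoring;                   // hypothetical package

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.opic.OPICScoringFilter;

public class PriorityScoringFilter extends OPICScoringFilter {

  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    float sort = super.generatorSortValue(url, datum, initSort);
    // Boost the pages we care about so they are generated first;
    // the "/product/" pattern is purely illustrative.
    if (url.toString().contains("/product/")) {
      sort *= 10.0f;
    }
    return sort;
  }
}

Such a plugin would need a compile-time dependency on the scoring-opic 
plugin and would typically replace it in plugin.includes.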


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: org.apache.hadoop.util.DiskChecker$DiskErrorException

2009-12-02 Thread Andrzej Bialecki

BELLINI ADAM wrote:

Hi,
I have this error when crawling:

org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid 
local directory for 
taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out


Most likely you ran out of tmp disk space.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: odd warnings

2009-12-01 Thread Andrzej Bialecki

Jesse Hires wrote:

What is segments.gen and segments_2 ?
The warning I am getting happens when I dedup two indexes.

I create index1 and index2 through generate/fetch/index/...etc
index1 is an index of 1/2 the segments. index2 is an index of the other 1/2

The warning is happening on both datanodes.

The command I am running is bin/nutch dedup crawl/index1 crawl/index2

If segments.gen and segments_2 are supposed to be directories, then why are
they created as files?

They are created as files from the start
bin/nutch index crawl/index1 crawl/crawldb /crawl/linkdb crawl/segments/XXX
crawl/segments/YYY

I don't see any errors or warnings about creating the index.


The command that you quote above produces multiple partial indexes, 
located in crawl/index1/part-N - the Lucene indexes can be found only in 
these subdirectories. However, the deduplication process doesn't 
accept partial indexes, so you need to specify each part- dir as an 
input to dedup.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki

Paul Tomblin wrote:

My nutch crawl just stopped.  The process is still there, and doesn't
respond to a kill -TERM or a kill -HUP, but it hasn't written
anything to the log file in the last 40 minutes.  The last thing it
logged was some calls to my custom url filter.  Nothing has been
written in the hadoop directory or the crawldir/crawldb or the
segments dir in that time.

How can I tell what's going on and why it's stopped?


If you run in distributed / pseudo-distributed mode, you can check the 
status in the JobTracker UI. If you are running in local mode, then 
it's likely that the process is in a (single) reduce phase sorting the 
data - with larger jobs in local mode the sorting phase may take a very 
long time due to heavy disk IO (and in the disk-wait state it may be 
uninterruptible).


Try to generate a thread dump to see what code is being executed.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki

Paul Tomblin wrote:

On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki a...@getopt.org wrote:

Paul Tomblin wrote:



-bash-3.2$ jstack -F 32507
Attaching to process ID 32507, please wait...

Hm, I can't see anything obviously wrong with that thread dump. What's the
CPU and swap usage, and loadavg?


The process is using a lot of CPU.  loadavg is up over 5.

top - 15:12:19 up 22 days,  4:06,  2 users,  load average: 5.01, 5.00, 4.93
Tasks:  48 total,   2 running,  45 sleeping,   0 stopped,   1 zombie
Cpu(s):  1.0% us, 99.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   3170584k total,  2231700k used,   938884k free,0k buffers
Swap:0k total,0k used,0k free,0k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
32507 discover  16   0 1163m 974m 8604 S 394.7 31.5 719:40.71 java

Actually, the memory is a real annoyance - the hosting company doesn't
give me any swap, so when hadoop does a fork/exec just to do a
whoami, I have to leave as much memory free as the crawl reserves
with -Xmx for itself.


Hm, the curious thing here is that the java process is sleeping, and 99% 
of cpu is in system time ... usually this would indicate swapping, but 
since there is no swap in your setup I'm stumped. Still, this may be 
related to the weird memory/swap setup on that machine - try decreasing 
the heap size and see what happens.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Encoding the content got from Fetcher

2009-11-27 Thread Andrzej Bialecki

Santiago Pérez wrote:

Yes, I tried setting the Latin encoding Windows-1250 in that configuration
file, but the value of this property does not affect the encoding
of the content (I also tried with a nonexistent encoding and the result is the
same...)

<property>
  <name>parser.character.encoding.default</name>
  <value>Windows-1250</value>
  <description>The character encoding to fall back to when no other
  information is available</description>
</property>

Has anyone had the same problem? (Hungarian or Polish people surely have...)


The appearance of characters that you quoted in your other email 
indicates that the problem may be the opposite - your pages seem to use 
UTF-8, and you are trying to convert them using Windows-1250 ... Try 
putting UTF-8 in this property, and see what happens.


Generally speaking, pages should declare their encoding, either in HTTP 
headers or in meta tags, but often this declaration is either missing 
or completely wrong. Nutch uses ICU4J CharsetDetector plus its own 
heuristic (in util.EncodingDetector and in HtmlParser) that tries to 
detect character encoding if it's missing or even if it's wrong - but 
this is a tricky issue and sometimes results are unpredictable.
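
For reference, this is roughly what the detection step looks like when using 
the ICU4J CharsetDetector directly - a sketch only, not the actual Nutch 
code; the confidence threshold and the fallback handling are illustrative:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectEncoding {

  /** Decode raw page bytes, falling back to a configured default charset. */
  public static String decode(byte[] raw, String fallback) throws Exception {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(raw);
    CharsetMatch match = detector.detect();      // best guess, may be null
    String charset = (match != null && match.getConfidence() > 35)
        ? match.getName()                        // trust confident guesses
        : fallback;                              // e.g. parser.character.encoding.default
    return new String(raw, charset);
  }
}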



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki

MilleBii wrote:

Interesting updates on the current run of 450K urls :
+ 30minutes @ 3Mbits/s
+ drop to 1Mbit/s (1/X shape)
+ gradual improvement to 1.5 Mbit/s and steady for 7 hours
+ sudden drop to 0.9 Mbits/s and steady for 4 hours
+ up to 1.7 Mbits for 1hour
+ staircasing down to 0.5 Mbit/s by steps of 1 hour

I don't know what conclusion to draw, but it is quite strange to have
those sudden variations of bandwidth, and overall it is very slow.
I can post the graph if people are interested.


This most likely comes from the allocation of URLs to map tasks, and the 
maximum number of map tasks that you can run on your cluster. When tasks 
finish their run, you see a sudden drop in speed, until the next task 
starts running. Initially, I suspect that you have more tasks available 
than the capacity of your cluster, so it's easy to fill the slots and 
max out the speed. Later on, slow map tasks tend to hang around, but still 
some of them finish and make space for new tasks. As time goes on, the 
majority of your tasks become slow tasks, so the overall speed 
continues to drop.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki

MilleBii wrote:

You mean map/reduce tasks ???


Yes.


Being in pseudo-distributed / single node I only have two maps during the
fetch phase... so it would be back to the URLs distribution.


Well, yes, but my explanation is still valid. Which unfortunately 
doesn't change the situation.


Next week I will be working on integrating the patches from Julien, and 
if time permits I could perhaps start working on speed monitoring to 
lock out slow servers.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Broken segments ?

2009-11-26 Thread Andrzej Bialecki

Mischa Tuffield wrote:

Hello All,


http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply 
to an existing message, instead start a fresh email.  Even if you change 
the subject line of your email, other mail headers still track which 
thread you replied to and your question is hidden in that thread and 
gets less attention.   It makes following discussions in the mailing 
list archives particularly difficult.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: dedup dont delete duplicates !

2009-11-25 Thread Andrzej Bialecki

BELLINI ADAM wrote:

hi,

My two URLs point to the same page!


Please, no need to shout ...

If the MD5 signatures are different, then the binary content of these 
pages is different, period.


Use readseg -dump utility to retrieve the page content from the segment, 
extract just the two pages from the dump, and run a unix diff utility.



Can you please tell me more about TextProfileSignature? How should I
use it?


Configure this type of signature in your nutch-site.xml - please see the 
nutch-default.xml for instructions. Please note that you will have to 
re-parse segments and update the db in order to update the signatures.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch config IOException

2009-11-25 Thread Andrzej Bialecki

Mischa Tuffield wrote:
Hello Again, 

Following my previous post below, I have noticed that I get the following IOException every time I attempt to use Nutch. 


<!--
2009-11-25 12:19:18,760 DEBUG conf.Configuration - java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:176)
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:164)
at org.apache.hadoop.hdfs.protocol.FSConstants.<clinit>(FSConstants.java:51)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2757)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

-->

Any pointers would be great. I wonder, is there a way for me to validate my conf 
options before I deploy Nutch?


This exception is innocuous - it helps to debug at which points in the 
code the Configuration instances are being created. And you wouldn't 
have seen this if you didn't turn on the DEBUG logging. ;)



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: 100 fetches per second?

2009-11-25 Thread Andrzej Bialecki

MilleBii wrote:

I have to say that I'm still puzzled. Here is the latest. I just restarted a
run and then guess what :

got ultra-high speed : 8Mbits/s sustained for 1 hour where I could only get
3Mbit/s max before (note: bits and not bytes, as I said before).
A few samples show that I was running at 50 Fetches/sec ... not bad. But why
this high-speed on this run I haven't got the faintest idea.


Then it drops and I get this kind of logs:

2009-11-25 23:28:28,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:29,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:29,584 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516
2009-11-25 23:28:30,227 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=120
2009-11-25 23:28:30,585 INFO  fetcher.Fetcher - -activeThreads=100,
spinWaiting=100, fetchQueues.totalSize=516

I don't fully understand why it is oscillating between two queue sizes - never
mind - but it is likely the end of the run since Hadoop shows 99.99%
complete for the 2 maps it generated.

Would that be explained by a better URL mix?


I suspect that you have a bunch of hosts that slowly trickle the 
content, i.e. requests don't time out, crawl-delay is low, but the 
download speed is very very low due to the limits at their end (either 
physical or artificial).


The solution in that case would be to track a minimum avg. speed per 
FetchQueue, and lock out the queue if this number falls below the threshold 
(similarly to what we do when we discover a crawl-delay that is too high).


In the meantime, you could add the number of FetchQueue-s to that 
diagnostic output, to see how many unique hosts are in the current 
working set.
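
None of this exists in the Fetcher yet - the following is only a sketch of 
the lock-out idea above, with made-up class and field names, to show how 
little state a per-host speed monitor would actually need:

import java.util.HashMap;
import java.util.Map;

public class QueueSpeedMonitor {

  private static class Stats { long bytes; long millis; }

  private final Map<String, Stats> perHost = new HashMap<String, Stats>();
  private final long minBytesPerSec;  // below this average, lock the host out

  public QueueSpeedMonitor(long minBytesPerSec) {
    this.minBytesPerSec = minBytesPerSec;
  }

  /** Called after each completed fetch for the given host. */
  public synchronized void record(String host, long bytes, long elapsedMillis) {
    Stats s = perHost.get(host);
    if (s == null) {
      s = new Stats();
      perHost.put(host, s);
    }
    s.bytes += bytes;
    s.millis += elapsedMillis;
  }

  /** The fetcher would check this before taking more items from the host's queue. */
  public synchronized boolean shouldLockOut(String host) {
    Stats s = perHost.get(host);
    if (s == null || s.millis < 30000) {
      return false;                     // not enough data yet
    }
    long avgBytesPerSec = s.bytes * 1000 / s.millis;
    return avgBytesPerSec < minBytesPerSec;
  }
}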


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki

BELLINI ADAM wrote:


Hi,

Dedup doesn't work for me.
I have read that "duplicates have either the same contents (via MD5 hash) or
the same URL".
In my case I don't have the same URLs, but I still have the same content for those
URLs.
To give you an example: I have three URLs that have the same content:

1- www.domaine/folder/
2- www.domaine/folder/index.html
3- www.domaine/folder/index.html?lang=fr

but I find all of them in my index :(
I was expecting that dedup would delete 1 and 2.


Dedup doesn't work correctly!


Please check the value of the Signature field for all the above urls in 
your crawldb.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki

BELLINI ADAM wrote:
Yes, I checked the signatures and they're not the same! It's really weird.


The URL www.domaine/folder/index.html?lang=fr is just the same page as
www.domaine/folder/index.html


Apparently it isn't a bit-exact replica of the page, so its MD5 hash is 
different. You need to use a more relaxed Signature implementation, e.g. 
TextProfileSignature.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: AbstractFetchSchedule

2009-11-22 Thread Andrzej Bialecki

reinhard schwab wrote:

There is a piece of code I don't understand:

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    // pages are never truly GONE - we have to check them from time to time.
    // pages with too long fetchInterval are adjusted so that they fit within
    // maximum fetchInterval (segment retention period).
    if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {
      datum.setFetchInterval(maxInterval * 0.9f);
      datum.setFetchTime(curTime);
    }
    if (datum.getFetchTime() > curTime) {
      return false;   // not time yet
    }
    return true;
  }


First, concerning the segment retention - we want to enforce that pages 
that were not refreshed for longer than maxInterval are retried, no 
matter what their status is - because we want to obtain a copy of the 
page in a newer segment in order to be able to delete the old segment.




why is the fetch time set here to curTime?


Because we want to fetch it now - see the next line where this condition 
is checked.



and why is the fetch interval set to maxInterval * 0.9f without
checking the current value of fetchInterval?


Hm, indeed this looks like a bug - we should do this instead:

if (datum.getFetchInterval() > maxInterval) {
  datum.setFetchInterval(maxInterval * 0.9f);
}



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch upgrade to Hadoop

2009-11-21 Thread Andrzej Bialecki

Dennis Kubes wrote:
I have created NUTCH-768.  I am in the middle of testing a few thousand 
page crawl for the most recent released version of Hadoop 0.20.1. 
Everything passes unit tests fine and there are no interface breaks. 
Looks like it will be an easy upgrade so far :)


Great, thanks!

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch upgrade to Hadoop

2009-11-20 Thread Andrzej Bialecki

John Martyniak wrote:
Does anybody know of any concrete plans to update Nutch to Hadoop 0.20,  
0.21?


Something like a Nutch 1.1 release, get in some bug fixes and get 
current on Hadoop?


I think that should be one of the goals.

My 2 cents.


I'm planning to do this upgrade soon (~a week) - and I agree that we 
should have a 1.1 release in the near future.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch near future - strategic directions

2009-11-20 Thread Andrzej Bialecki

Sami Siren wrote:

Lots of good thoughts and ideas, easy to agree with.

Something for the ease of use category:
-allow running on top of plain vanilla hadoop


What does "plain vanilla" mean here? Do you mean the current DB 
implementation? That's the idea - we should aim for an abstract layer 
that can accommodate both HBase and plain MapFile-s.



-split into reusable components with nice and clean public api
-publish mvn artifacts so developers can directly use mvn, ivy etc to 
pull required dependencies for their specific crawler


+1, with slight preference towards ivy.



My biggest concern is in execution of this (or any other) plan.
Some of the changes or improvements that have been proposed are quite 
heavy in nature and would require large changes. I am just wondering 
whether it would be better to make a fresh start instead of trying to 
do this incrementally on top of the existing code base.


Well ... that's (almost) what Dogacan did with the HBase port. I agree 
that we should not feel too constrained by the existing code base, but 
it would be silly to throw everything away and start from scratch - we 
need to find a middle ground. The crawler-commons and Tika projects 
should help us to get rid of the ballast and significantly reduce the 
size of our code.


In the history of Nutch this approach is not something new (remember map 
reduce?) and in my opinion it worked nicely then. Perhaps it is 
different this time since the changes we are discussing now have many 
abstract things hanging in the air, even fundamental ones.


Nutch 0.7 to 0.8 reused a lot of the existing code.



Of course the rewrite approach means that it will take some time before 
we actually get into the point where we can start adding real substance 
(meaning new features etc).


So to summarize, I would go ahead and put together a branch nutch N.0 
that would consist of (a.k.a my wish list, hope I am not being too 
aggressive here):


-runs on top of plain hadoop


See above - what do you mean by that?

-use osgi (or some other more optimal extension mechanism that fits and 
is easy to use)
-basic http/https crawling functionality (with db abstraction or hbase 
directly and smart data structures that allow flexible and efficient 
usage of the data)

-basic solr integration for indexing/search
-basic parsing with tika

After the basics are ok we would start adding and promoting any of the 
hidden gems we might have, or some solutions for the interesting 
challenges.


I believe that's more or less where Dogacan's port is right now, except 
it's not merged with the OSGI port.


P.S. Many of the interesting challenges in your proposal seem to fall into 
the category of data analysis and manipulation that is mostly used 
after the data has been crawled, or between the fetch cycles, so many of 
those could be implemented on top of the current code base as well. Somehow I just 
feel that things could be made more efficient and understandable if the 
foundation (e.g. data structures and extensibility) were in 
better shape. Also, if written nicely, other projects could use them too!


Definitely agree with this. Example: the PageRank package - it works 
quite well with the current code, but its design is obscured by the 
ScoringFilter API and the need to maintain its own extended DB-s.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch upgrade to Hadoop

2009-11-20 Thread Andrzej Bialecki

Dennis Kubes wrote:
I would like to get a couple things in this release as well.  Let me 
know if you want help with the upgrade.


You mean you want to do the Hadoop upgrade? I won't stand in your way :)

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch near future - strategic directions

2009-11-16 Thread Andrzej Bialecki

Subhojit Roy wrote:

Hi,

Would it be possible to include in Nutch the ability to crawl & download a
page only if the page has been updated since the last crawl? I had read
sometime back that there were plans to include such a feature. It would be a
very useful feature to have IMO. This of course depends on the last
modified timestamp being present on the webpage that is being crawled,
which I believe is not mandatory. Still those who do set it would benefit.


This is already implemented - see the Signature / MD5Signature / 
TextProfileSignature.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: decoding nutch readseg -dump 's output

2009-11-16 Thread Andrzej Bialecki

Yves Petinot wrote:

Hi,

I'm trying to build a small perl (could be any scripting language) 
utility that takes nutch readseg -dump 's output as its input, decodes 
the content field to utf-8 (independent of what encoding the raw page 
was in) and outputs that decoded content. After a little bit of 
experimentation, I find myself unable to decode the content field, even 
when i try using the various charset hints that are available either in 
the content metadata, or in the raw content itself.


I was wondering if someone on the list has already succeeded in building 
this type of functionality, or is the content returned by readseg using 
a specific encoding that i don't know of ?


The dump functionality is not intended to provide a bit-by-bit copy of 
the segment, it's mostly for debugging purposes. It uses System.out, 
which in turn uses the default platform encoding - any characters 
outside this encoding will be replaced by question marks.


If you want to get an exact copy of the raw binary content then please 
use the SegmentReader API.
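
A rough sketch of that approach is below: it reads the raw bytes of a single 
record straight out of a segment's content directory with the Hadoop MapFile 
API (which is what SegmentReader uses underneath). The segment path, the URL 
and the output file name are placeholders:

import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class DumpRawContent {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // nutch/hadoop config on classpath
    FileSystem fs = FileSystem.get(conf);
    // One part of the segment's content data; loop over part-* dirs in practice.
    Path part = new Path("crawl/segments/20091116000000/content/part-00000");
    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    Text key = new Text("http://www.example.com/");
    Content content = new Content();
    if (reader.get(key, content) != null) {
      FileOutputStream out = new FileOutputStream("page.raw");
      out.write(content.getContent());        // exact bytes as fetched
      out.close();
      System.out.println("Content-Type: " + content.getContentType());
    }
    reader.close();
  }
}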


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Scalability for one site

2009-11-16 Thread Andrzej Bialecki

Mark Kerzner wrote:

Hi,

I want to politely crawl a site with 1-2 million pages. With the speed of
about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop,
and can I coordinate the crawlers so as not to cause a DOS attack?


Your Hadoop cluster does not increase the scalability of the target 
server and that's the crux of the matter - whether you use Hadoop or 
not, multiple threads or a single thread, if you want to be polite you 
will be able to do just 1 req/sec and that's it.


You can prioritize certain pages for fetching so that you get the most 
interesting pages first (whatever interesting means).



I know that URLs from one domain are assigned to one fetch segment, and
polite crawling is enforced. Should I use lower-level parts of Nutch?


The built-in limits are there to avoid causing pain for inexperienced 
search engine operators (and webmasters who are their victims). The 
source code is there, if you choose you can modify it to bypass these 
restrictions, just be aware of the consequences (and don't use Nutch 
as your user agent ;) ).


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Hadoop question

2009-11-13 Thread Andrzej Bialecki

TuxRacer69 wrote:

Hi Eran,

mapreduce has to store its data on HDFS file system.


More specifically, it needs read/write access to a shared filesystem. If 
you are brave enough you can use NFS, too, or any other type of 
filesystem that can be mounted locally on each node (e.g. a NetApp).


But if you want to separate the two groups of servers, you could build 
two separate HDFS filesystems. To separate the two setups, you will need 
to make sure there is no cross communication between the two parts,


You can run two separate clusters even on the same set of machines, just 
 configure them to use different ports AND different local paths.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Synonym Filter with Nutch

2009-11-13 Thread Andrzej Bialecki

Dharan Althuru wrote:

Hi,


We are trying to incorporate a synonym filter during indexing using Nutch. As
per my understanding, Nutch doesn't have a synonym indexing plug-in by default.
Can we extend IndexingFilter in Nutch to incorporate the synonym filter
available in Lucene (using WordNet) or a custom synonym plug-in, without any
negative impact on existing Nutch indexing (i.e., considering bigrams etc.)?


Synonym expansion should be done when the text is analyzed (using 
Analyzers), so you can reuse Lucene's synonym filter.


Unfortunately, this happens at different stages depending on whether you 
use the built-in Lucene indexer, or the Solr indexer.


If you use the Lucene indexer, this happens in LuceneWriter, and the 
only way to affect it is to implement an analysis plugin, so that it's 
returned from AnalyzerFactory, and use your analysis plugin instead of 
the default one. See e.g. analysis-fr for an example of how to implement 
such plugin.


However, when you index to Solr you need to configure the Solr's 
analysis chain, i.e. in your schema.xml you need to define for your 
fieldType that it has the synonym filter in its indexing analysis chain.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Problems with Hadoop source

2009-11-11 Thread Andrzej Bialecki

Pablo Aragón wrote:

Hej,

I am developing a project based on Nutch. It works great (in Eclipse), but
due to new requirements I have to replace the library hadoop-0.12.2-core.jar
with the original source code.

I successfully downloaded that code from:
http://archive.apache.org/dist/hadoop/core/hadoop-0.12.2/hadoop-0.12.2.tar.gz. 


After adding it to the project in Eclipse everything seems correct but the
execution shows:

Exception in thread main java.io.IOException: No FileSystem for scheme:
file
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:157)
at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:103)

Any idea?


Yes - when you worked with a pre-built jar it contained an embedded 
hadoop-default.xml that defines the implementation of the FileSystem for the 
file:// scheme. Now you probably forgot to put hadoop-default.xml on 
your classpath. Go to Build Path and add this file to your classpath, 
and all should be ok.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Nutch near future - strategic directions

2009-11-09 Thread Andrzej Bialecki
. We should make Nutch an attractive platform 
for such users, and we should discuss what this entails. Also, if we 
refactor Nutch in the way I described above, it will be easier for such 
users to contribute back to Nutch and other related projects.


3. Provide a platform for solving the really interesting issues
---
Nutch has many bits and pieces that implement really smart algorithms 
and heuristics to solve difficult issues that occur in crawling. The 
problem is that they are often well hidden and poorly documented, and 
their interaction with the rest of the system is far from obvious. 
Sometimes this is related to premature performance optimizations, in 
other cases this is just a poorly abstracted design. Examples would 
include the OPIC scoring, meta-tags & metadata handling, deduplication, 
redirection handling, etc.


Even though these components are usually implemented as plugins, this 
lack of transparency and poor design makes it difficult to experiment 
with Nutch. I believe that improving this area will result in many more 
users contributing back to the project, both from business and from 
academia.


And there are quite a few interesting challenges to solve:

* crawl scheduling, i.e. determining the order and composition of 
fetchlists to maximize the crawling speed.


* spam & junk detection (I won't go into details on this, there are tons 
of literature on the subject)


* crawler trap handling (e.g. the classic calendar page that generates 
an infinite number of pages).


* enterprise-specific ranking and scoring. This includes users' feedback 
(explicit and implicit, e.g. click-throughs)


* pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)

* near-duplicate detection, and the closely related issue of extracting 
the main content from a templated page.


* URL aliasing (e.g. www.a.com == a.com == a.com/index.html == 
a.com/default.asp), and what happens with inlinks to such aliased pages. 
Also related to this is the problem of temporary/permanent redirects and 
complete mirrors.


Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an 
attractive platform to develop and experiment with such components.


-
Briefly ;) that's what comes to my mind when I think about the future of 
Nutch. I invite you all to share your thoughts and suggestions!


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: changing/addding field in existing index

2009-11-09 Thread Andrzej Bialecki

fa...@butterflycluster.net wrote:

hi all,

I have an existing index - we have a custom field that needs to be added
or changed in every currently indexed document.

What's the best way to go about this without recreating the index?


There are ways to do it directly on the index, but this is complicated 
and involves hacking the low-level Lucene format. Alternatively, you 
could build a parallel index with just these fields but with synchronized 
internal docId-s, open both indexes with ParallelReader, and then create 
a new index using IndexWriter.addIndexes().


I suggest recreating the index.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Direct Access to Cached Data

2009-11-05 Thread Andrzej Bialecki

Hugo Pinto wrote:

Hello,

I am using Nutch for mirroring, rather than crawling and indexing.
I need to access directly the cached data in my Nutch index, but I am
unable to find an easy way to do so.
I browsed the documentation(wiki, javadocs, and skimmed the code), but
found no straightforward way to do it.
Would anyone suggest a place to look for more information, or perhaps
have done this before and could share a few tips?


Most likely what you need is not the Lucene index, but the segments 
(shards), right? There's a utility called SegmentReader (available from 
cmd-line as readseg), and you can use its API to retrieve either all or 
individual records from a segment (using URL as key).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: updatedb is talking long long time

2009-11-02 Thread Andrzej Bialecki

Kalaimathan Mahenthiran wrote:

I forgot to add the detail...

The segment I'm trying to do updatedb on has 1.3 million URLs fetched
and 1.08 million URLs parsed.

Any help related to this would be appreciated...


On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
matha...@gmail.com wrote:

hi everyone

I'm using Nutch 1.0. I have fetched successfully and am currently on the
updatedb step. I'm doing updatedb and it's taking very long. I don't
know why it's taking this long. I have a new machine with a quad core
processor and 8 GB of RAM.

I believe this system is really good in terms of processing power. I
don't think processing power is the problem here. I noticed that all
the RAM is getting used up - close to 7.7 GB by the updatedb process.
The computer is becoming really slow.

The updatedb process has been running for the last 19 days continually
with the message "merging segment data into db". Does anyone know why
it's taking so long? Is there any configuration setting I can change to
increase the speed of the updatedb process?


First, this process normally takes just a few minutes, depending on the 
hardware, and not several days - so something is wrong.


* do you run this in local or pseudo-distributed mode (i.e. running a 
real jobtracker and tasktracker)? Try the pseudo-distributed mode, 
because then you can monitor the progress in the web UI.


* how many reduce tasks do you have? With large updates it helps if you 
run > 1 reducer, to split the final sorting.


* if the task appears to be completely stuck, please generate a thread 
dump (kill -SIGQUIT) and see where it's stuck. This could be related to 
urlfilter-regex or urlnormalizer-regex - you can identify if these are 
problematic by removing them from the config and re-running the operation.


* minor issue - when specifying the path names of segments and crawldb, 
do NOT append the trailing slash - it's not harmful in this particular 
case, but you could have a nasty surprise when doing e.g. copy / mv 
operations ...


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: including code between plugins

2009-11-02 Thread Andrzej Bialecki

Eran Zinman wrote:

Hi,

I've written my own plugin that's doing some custom parsing.

I've needed language parsing in that plugin and the language-identifier
plugin is working great for my needs.

However, I can't use the language identifier plugin as it is, since I want
to parse only a small portion of the webpage.

I've used the language identifier functions and it worked great in eclipse,
but when I try to compile my plugin I'm unable to compile it since it
depends on the language-identifier source code.

My question is - how can I include the language identifier code in my plugin
code without actually using the language-identifier plugin?


You need to add the language-identifier plugin to the requires section 
in your plugin.xml, like this:


   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="language-identifier"/>
   </requires>


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: could you unsubscribe me from this mailing list pls. tks

2009-11-02 Thread Andrzej Bialecki

Nico Sabbi wrote:

On Mon, 02/11/2009 at 10.04 +0100, Heiko Dietze wrote:

Hello,

there is no administrator, but you can do the unsubscribe yourself. On 
the Nutch mailing list information site


http://lucene.apache.org/nutch/mailing_lists.html

you can find the following E-Mail address:

nutch-user-unsubscr...@lucene.apache.org

Then your unsubscribe requests should work.

regards,

Heiko Dietze


doesn't work, as reported by me and others last week.
Thanks,


Did you get the message with the subject of confirm unsubscribe from 
nutch-user@lucene.apache.org and did you respond to it from the same 
email account that you were subscribed from?




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Unsubscribe step-by-step (Re: could you unsubscribe me from this mailing list pls. tks)

2009-11-02 Thread Andrzej Bialecki

Andrzej Bialecki wrote:


doesn't work, as reported by me and others last week.
Thanks,


Did you get the message with the subject of confirm unsubscribe from 
nutch-user@lucene.apache.org and did you respond to it from the same 
email account that you were subscribed from?


.. I just verified that this process works correctly - I subscribed and 
unsubscribed successfully. Please make sure that you complete the 
unsubscription process as listed below:


1. make sure you are sending requests from the same email address that 
you were subscribed from!

2. send email to nutch-user-unsubscr...@lucene.apache.org.
3. you will get a confirm unsubscribe message - make sure your 
anti-spam filters don't block this message, and make sure you are still 
using the correct email account when responding.

4. you need to reply to the confirm unsubscribe message (duh...)
5. you will get a GOODBYE message.

Now, let me understand this clearly: did you go through all 5 steps 
listed above, and you are still getting messages from this list?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: unbalanced fetching

2009-10-29 Thread Andrzej Bialecki

Jesse Hires wrote:

I have a two datanode and one namenode setup. One of my datanodes is slower
than the other, causing the fetch to run significantly longer on it. Is
there a way to balance this out?


Most likely the number of URLs/host is unbalanced, meaning that the 
tasktracker that takes the longest is assigned a lot of URLs from a 
single host.


A workaround for this is to limit the max number of URLs per host (in 
nutch-site.xml) to a more reasonable number, e.g. 100 or 1000, whatever 
works best for you.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread Andrzej Bialecki

caezar wrote:

Some more information. Debugging the reduce method, I've noticed that before the code
if (fetchDatum == null || dbDatum == null
|| parseText == null || parseData == null) {
  return; // only have inlinks
}
my page has fetchDatum, parseText and parseData not null, but dbDatum is
null. That's why it's skipped :)
Any ideas about the reason?


Yes - you should run updatedb with this segment, and also run 
invertlinks with this segment, _before_ trying to index. Otherwise the 
db status won't be updated properly.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Andrzej Bialecki

Gora Mohanty wrote:

On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]

Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
They are kept in Nutch crawldb to prevent their re-discovery
(through stale links pointing to these URL-s from other pages).
If you really want to remove them from CrawlDb you can filter
them out (using CrawlDbMerger with just one input db, and setting
your URLFilters appropriately).

[...]

Thank you for your help. Your suggestions look promising, but I
think that I did not make myself adequately clear. Once we have
completed a site crawl with Nutch, ideally I would like to be
able to find stale links without doing a complete recrawl, i.e.,
only through restarting the crawl from where it last left off. Is
that possible?

I tried a simple test on a local webserver with five pages in a
three-level hierarchy. The crawl completes, and discovers all
five URLs as expected. Now, I remove a tertiary page. Ideally,
I would like to be able to run a recrawl, and have Nutch discover
the now-missing URL. However, when I try that, it finds no new
links, and exits.


I assume you mean that the generate step produces no new URL-s to 
fetch? That's expected, because they become eligible for re-fetching 
only after Nutch considers them expired, i.e. after the fetchTime + 
fetchInterval, and the default fetchInterval is 30 days.


You can pretend that the time moved on using the -adddays parameter. 
Then Nutch will generate a new fetchlist, and when it discovers that the 
page is missing it will mark it as gone - actually, you could then take 
that information directly from the Nutch segment and instead of 
processing the CrawlDb you could process the segment to collect a 
partial list of gone pages.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to index files only with specific type

2009-10-27 Thread Andrzej Bialecki

Dmitriy Fundak wrote:

If I disable the html parser (remove parse-(html from the plugin.includes
property), HTML files don't get parsed.
So I don't get the outlinks to kml files from the HTML,
so I can't parse and index the kml files.
I might not be right, but I have a feeling that it's not possible
without modifying the source code.


It's possible to do this with a custom indexing filter - see other 
indexing filters to get a feeling of what's involved. Or you could do 
this with a scoring filter, too, although the scoring API looks more 
complicated.


Either way, when you execute the Indexer, these filters are run in a 
chain, and if one of them returns null then that document is discarded, 
i.e. it's not added to the output index. So, it's easy to examine in 
your indexing filter the content type (or just a URL of the document) 
and either pass the document on or reject it by returning null.
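
A minimal sketch of such an indexing filter, assuming the Nutch 1.0 
IndexingFilter interface and keeping only documents whose URL ends in .kml 
(the package and class names are invented):

package org.example.indexer;                   // hypothetical package

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class KmlOnlyIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Returning null discards the document; anything else is passed on
    // to the next indexing filter in the chain.
    if (url.toString().toLowerCase().endsWith(".kml")) {
      return doc;
    }
    return null;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}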



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Deleting stale URLs from Nutch/Solr

2009-10-26 Thread Andrzej Bialecki

Gora Mohanty wrote:

Hi,

  We are using Nutch to crawl an internal site, and index content
to Solr. The issue is that the site is run through a CMS, and
occasionally pages are deleted, so that the corresponding URLs
become invalid. Is there any way that Nutch can discover stale
URLs during recrawls, or is the only solution a completely fresh
crawl? Also, is it possible to have Nutch automatically remove
such stale content from Solr?

  I am stumped by this problem, and would appreciate any pointers,
or even thoughts on this.


Hi,

Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are 
kept in Nutch crawldb to prevent their re-discovery (through stale links 
pointing to these URL-s from other pages). If you really want to remove 
them from CrawlDb you can filter them out (using CrawlDbMerger with just 
 one input db, and setting your URLFilters appropriately).


Now when it comes to removing them from Solr ... The simplest (no 
coding) way would be to dump the CrawlDb, use some scripting tools to 
collect just the URL-s with the status GONE, and send them as a delete 
command to Solr. A slightly more involved solution would be to implement 
a tool that reads such URLs directly from CrawlDb (using e.g. 
CrawlDbReader API) and then uses SolrJ API to send the same delete 
requests + commit.
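
A rough sketch of that second option, assuming the 1.0 crawldb layout 
(part-*/data SequenceFiles of <Text url, CrawlDatum>) and a SolrJ 1.x 
client; the paths, the Solr URL and the assumption that the Solr uniqueKey 
is the URL are all placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DeleteGoneUrls {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Loop over all crawldb/current/part-* dirs in practice.
    Path data = new Path("crawl/crawldb/current/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);

    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    List<String> gone = new ArrayList<String>();
    while (reader.next(url, datum)) {
      if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
        gone.add(url.toString());
      }
    }
    reader.close();

    if (!gone.isEmpty()) {
      SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      solr.deleteById(gone);                   // assumes uniqueKey == url
      solr.commit();
    }
  }
}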




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links

2009-10-23 Thread Andrzej Bialecki

Eric Osgood wrote:

Andrzej,

Based on what you suggested below, I have begun to write my own scoring 
plugin:


Great!



in distributeScoreToOutlinks(), if the link contains the string I'm 
looking for, I set its score to kept_score and add a flag to the 
metaData in parseData (KEEP, true). How do I check for this flag in 
generatorSortValue()? I only see a way to check the score, not a flag.


The flag should have been automagically added to the target CrawlDatum 
metadata after you have updated your crawldb (see the details in 
CrawlDbReducer). Then in generatorSortValue() you can check for the 
presence of this flag by using the datum.getMetaData().


BTW - you are right, the Generator doesn't treat Float.MIN_VALUE in any 
special way ... I thought it did. It's easy to add this, though - in 
Generator.java:161 just add this:


if (sort == Float.MIN_VALUE) {
return;
}


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Accessing an Index from a shared location

2009-10-21 Thread Andrzej Bialecki

JusteAvantToi wrote:

Hi all,

I am new to using Nutch and I have found that Nutch is really good. I have a
problem and hope somebody can shed some light on it.


I have built an index and a web application that makes use of that index. I
plan to have two web application servers running the application. Since I do
not want to replicate the application and the index  on each web application
server, I put the application and the index on a shared location and
configure nutch-site.xml as follows:

<property>
  <name>searcher.dir</name>
  <value>\\111.111.111.111\folder\index</value>
  <description>Path to root of crawl</description>
</property>

<property>
  <name>plugin.folders</name>
  <value>\\111.111.111.111\folder\plugins</value>
  <description></description>
</property>

However, it seems that my application cannot find the index. I have checked
that the web application servers have access to the shared location.


Is there something that I missed here? Does Nutch allow us to put the index
on a network location?


UNC paths are not supported in Java - you need to mount this location as 
a local volume.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread Andrzej Bialecki

malcolm smith wrote:

I am looking to create a parser for a groupware product that would read
pages message board type web site.  (Think phpBB).  But rather than creating
a single Content item which is parsed and indexed to a single lucene
document, I am planning to have the parser create a master document (for the
original post) and an additional document for each reply item.

I've reviewed the code for protocol plugins, parser plugins and indexing
plugins but each interface allows for a single document or content object to
be passed around.

Am I missing something simple?

My best bet at the moment is to implement some kind of new fake protocol
for the reply items then I would use the http client plugin for the first
request to the page and generate outlinks such as
fakereplyto://originalurl/reply1 fakereplyto://originalurl/reply2 to go
back through and fetch the sub page content.  But this seems round-about and
would probably generate an http request for each reply on the original
page.  But perhaps there is a way to lookup the original page in the segment
db before requesting it again.

Needless to say it would seem more straightforward to tackle this in some
kind of parser plugin that could break the original page into pieces that
are treated as standalone pages for indexing purposes.

Last but not least conceptually a plugin for the indexer might be able to
take a set of custom meta data for a replies collection and index it as
separate lucene documents - but I can't find a way to do this given the
interfaces in the indexer plugins.

Thanks in advance
Malcolm Smith


What version of Nutch are you using? This should already be possible to 
do using the 1.0 release or a nightly build. ParseResult (which is what 
parsers produce) can hold multiple Parse objects, each with its own URL.


The common approach to handle whole-part relationships (like zip/tar 
archives, RSS, and other compound docs) is to split them in the parser 
and parse each part, then give each sub-document its own URL (e.g 
file.tar!myfile.txt) and add the original URL in the metadata, to keep 
track of the parent URL. The rest should be handled automatically, 
although there are some other complications that need to be handled as 
well (e.g. don't recrawl sub-documents).
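
To make the whole-part pattern concrete, here is a sketch (not working plugin 
code) that builds one ParseResult holding an entry for the original post plus 
one entry per reply, each under a synthetic URL; the ParseData/ParseStatus 
constructors are recalled from the 1.0 parse API and may need small 
adjustments:

import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseText;

public class ThreadSplitter {

  /** Build a ParseResult with the main post plus one sub-document per reply. */
  public static ParseResult split(String pageUrl, String postText, String[] replies) {
    ParseStatus ok = new ParseStatus(ParseStatus.SUCCESS);
    Outlink[] noLinks = new Outlink[0];

    // Entry for the original post, keyed by the real page URL.
    ParseData postData = new ParseData(ok, "Original post", noLinks,
        new Metadata(), new Metadata());
    ParseResult result = ParseResult.createParseResult(pageUrl,
        new ParseImpl(postText, postData));

    // One additional entry per reply, each under its own synthetic URL,
    // with a pointer back to the parent page kept in the parse metadata.
    for (int i = 0; i < replies.length; i++) {
      String subUrl = pageUrl + "!reply" + i;
      Metadata parseMeta = new Metadata();
      parseMeta.add("parent.url", pageUrl);
      ParseData replyData = new ParseData(ok, "Reply " + i, noLinks,
          new Metadata(), parseMeta);
      result.put(new Text(subUrl), new ParseText(replies[i]), replyData);
    }
    return result;
  }
}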




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Andrzej Bialecki

Eric Osgood wrote:
This is the error I keep getting whenever I try to fetch more than 400K 
files at a time using a 4 node hadoop cluster running nutch 1.0.


org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to 
create file 
/user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index 
for DFSClient_attempt_200910131302_0011_r_15_2 on client 
192.168.1.201 because current leaseholder is trying to recreate file.


Please see this issue:

https://issues.apache.org/jira/browse/NUTCH-692

Apply the patch that is attached there, rebuild Nutch, and tell me if 
this fixes your problem.


(the patch will be applied to trunk anyway, since others confirmed that 
it fixes this issue).




Can anybody shed some light on this issue? I was under the impression 
that 400K was small potatoes for a nutch hadoop combo?


It is. This problem is rare - I think I have crawled cumulatively ~500 million 
pages in various configurations and it has never happened to me personally. It 
requires a few things to go wrong (see the issue comments).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch Enterprise

2009-10-17 Thread Andrzej Bialecki

Dennis Kubes wrote:
Depending on what you want to do, Solr may be a better choice as 
an enterprise search server. If you need crawling you can use 
Nutch or attach a different crawler to Solr. If you want to do 
more full-web type search, then Nutch is a better option. What are your 
requirements?


Dennis

fredericoagent wrote:
Does anybody have any information on using Nutch as Enterprise search 
?, and

what would I need ?
is it just a case of the current nutch package or do you need other 
addons.


And how does that compare against Google Enterprise ?

thanks


I agree with Dennis - use Nutch if you need to do a larger-scale 
discovery such as when you crawl the web, but if you already know all 
target pages in advance then Solr will be a much better (and much easier 
to handle) platform.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException

2009-10-17 Thread Andrzej Bialecki

Jesse Hires wrote:

Does anyone have any insight into the following error I am seeing in the
hadoop logs? Is this something I should be concerned with, or is it expected
that this shows up in the logs from time to time? If it is not expected,
where can I look for more information on what is going on?

2009-10-16 17:02:43,061 ERROR datanode.DataNode -
DatanodeRegistration(192.168.1.7:50010,
storageID=DS-1226842861-192.168.1.7-50010-1254609174303,
infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
Block blk_90983736382565_3277 is valid, and cannot be written to.


Are you sure you are running a single datanode process per machine?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to run a complete crawl?

2009-10-17 Thread Andrzej Bialecki

Vincent155 wrote:

I have a virtual machine running (VMware 1.0.7). Both host and guest run on
Fedora 10. In the virtual machine, I have Nutch installed. I can index
directories on my host as if they are websites.

Now I want to compare Nutch with another search engine. For that, I want to
index some 2,500 files in a directory. But when I execute a command like
crawl urls -dir crawl.test -depth 3 -topN 2500, or leave out the -topN
option, there are still only some 50 to 75 files indexed.


Check the value of db.max.outlinks.per.page in your nutch-site.xml; the 
default is 100. When crawling filesystems, each file in a directory is 
treated as an outlink, and this limit is then applied.
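
For example, you can raise the limit in nutch-site.xml (5000 here is just an 
arbitrary value; a negative value should remove the limit entirely):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>5000</value>
  <description>Maximum number of outlinks processed per page
  (a negative value means no limit).</description>
</property>

Keep in mind that raising this limit can noticeably increase the size of the 
crawldb and linkdb.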


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: http keep alive

2009-10-14 Thread Andrzej Bialecki

Marko Bauhardt wrote:

Hi,
is there a way to use http-keep-alive with Nutch?
Does protocol-http or protocol-httpclient support keep-alive?

I can't find any use of http-keep-alive in the code or in the 
configuration files.


protocol-httpclient can support keep-alive. However, I think that it 
won't help you much. Please consider that Fetcher needs to wait some 
time between requests, and in the meantime it will issue requests to 
other sites. This means that if you want to use keep-alive connections 
then the number of open connections will climb up quickly, depending on 
the number of unique sites on your fetchlist, until you run out of 
available sockets. On the other hand, if the number of unique sites is 
small, then most of the time the Fetcher will wait anyway, so the 
benefit from keep-alives (for you as a client) will be small - though 
there will still be some benefit for the server side.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental Whole Web Crawling

2009-10-13 Thread Andrzej Bialecki

Eric Osgood wrote:
Ok, I think I am on the right track now, but just to be sure: the code I 
want is the branch section of svn under nutchbase at 
http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/ correct?


No, you need the trunk from here:

http://svn.apache.org/repos/asf/lucene/nutch/trunk


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental Whole Web Crawling

2009-10-13 Thread Andrzej Bialecki

Eric Osgood wrote:

So the trunk contains the most recent nightly update?


It's the other way around - nightly build is created from a snapshot of 
the trunk. The trunk is always the most recent.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: indexing just certain content

2009-10-10 Thread Andrzej Bialecki

MilleBii wrote:

Andrzej,

The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.

I have a different use case: I want to keep everything as standard indexing
_AND_ also extract parts to be indexed in a dedicated field (which will
be boosted at search time). In my case, certain parts of a document have more
importance than others.

So I would like either
1. to access the HTML representation at indexing time... not possible, or I did
not find how
2. to create a dual representation of the document: the plain standard one and
a filtered one

I think option 2 is much better because it fits the model better and allows
for a lot of different other use cases.


Actually, the creativecommons plugin provides hints on how to do this, but to 
be more explicit (a rough sketch follows these steps):


* in your HtmlParseFilter you need to extract from the DOM tree the parts 
that you want, and put them inside ParseData.metadata. This way you will 
preserve both the original text and the special parts that you extracted.


* in your IndexingFilter you will retrieve the parts from 
ParseData.metadata and add them as additional index fields (don't forget 
to specify indexing backend options).


* in your QueryFilter plugin.xml you declare that QueryParser should 
pass your special fields without treating them as terms, and in the 
implementation you create a BooleanClause to be added to the translated 
query.
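
Putting the first two steps into code, here is a rough sketch assuming the 
Nutch 1.0 plugin interfaces (HtmlParseFilter, IndexingFilter, NutchDocument, 
LuceneWriter); the "special" field name and the extractSpecial() placeholder 
are made up, and both classes are shown in one file only for brevity:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

/** Step 1: pull the interesting parts out of the DOM, stash them in parse metadata. */
public class SpecialPartsParseFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    String special = extractSpecial(doc);   // placeholder for real DOM extraction
    Parse parse = parseResult.get(content.getUrl());
    parse.getData().getParseMeta().add("special", special);
    return parseResult;
  }

  private String extractSpecial(DocumentFragment doc) {
    // a real implementation would walk the DOM for the elements you care about
    return doc.getTextContent();
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

/** Step 2: copy the stashed text into its own index field. */
class SpecialPartsIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) {
    String special = parse.getData().getParseMeta().get("special");
    if (special != null) {
      doc.add("special", special);
    }
    return doc;
  }

  /** The "indexing backend options" mentioned above, here for the Lucene backend. */
  public void addIndexBackendOptions(Configuration conf) {
    LuceneWriter.addFieldOptions("special", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.TOKENIZED, conf);
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

The QueryFilter side (third step) is mostly plugin.xml declarations, so it is 
not shown here.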



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki

winz wrote:


Venkateshprasanna wrote:

Hi,

You can very well think of doing that if you know that you would crawl and
index only a selected set of web pages, which follow the same design.
Otherwise, it would turn out to be a never ending process - i.e., finding
out the sections, frames, divs, spans, css classes and the likes - from
each of the web pages. Scalability would obviously be an issue.



Hi,
Could I please know how we can ignore template items like the header, footer and
menu/navigation while crawling and indexing pages which follow the same
design?
I'm using a content management system called Infoglue to develop my website.
A standard template is applied for all the pages on the website.

The search results from Nutch shows content from menu/navigation bar
multiple times.
I need to get rid of menu/navigation content from the search result.


If all you index is this particular site, then you know the positions of 
navigation items, right? Then you can remove these elements in your 
HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these 
elements.




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki

BELLINI ADAM wrote:

hi guys, it's just what I'm talking about in my post 'indexing
just certain content'... you can read it, maybe it could help you... I
was asking how to get rid of the garbage sections in a document and
to parse only the important data... so I guess you will create your
own parser and indexer... but the problem is how we could delete those
garbage sections from the HTML... try to read my post... maybe we can
merge our two posts... I don't know if we can merge posts on this
mailing list... to keep tracking only one post...


What is garbage? Can you define it in terms of a regex pattern or an XPath 
expression that points to specific elements in the DOM tree? If you crawl a 
single site (or a few sites) with well-defined templates then you can hardcode 
some rules for removing unwanted parts of the page.


If you can't do this, then there are some heuristic methods to solve 
this. There are two groups of methods:


* page at a time (local): this group of methods considers only the 
current page that you analyze. The quality of filtering is usually limited.


* groups of pages (e.g. per site): these methods consider many pages at 
a time, and try to find a recurring theme among them. Since you first need 
to accumulate some pages, it can't be done on the fly, i.e. it requires 
a separate post-processing step.


The easiest to implement in Nutch is the first approach (page at a 
time). There are many possible implementations - e.g. based on text 
patterns, on the visual position of elements, on DOM tree patterns, on 
characteristics of content blocks, etc.


Here's a simple method, for example (a rough code sketch follows these steps):

* collect text from the page in blocks, where each block fits within 
structural tags (div and table tags). Also collect the number of 
links in each block.


* remove a percentage of the smallest blocks where the link count is high 
- these are likely navigational elements.


* reconstruct the whole page from the remaining blocks.
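
As an illustration only, here is a rough, self-contained sketch of that idea 
using plain JDK DOM classes (no Nutch types). The 50-word and links-per-words 
thresholds are arbitrary, and a real implementation would remove a percentage 
of the smallest link-heavy blocks rather than use fixed cutoffs:

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Rough sketch: drop small, link-heavy blocks (likely navigation), keep the rest. */
public class BlockFilter {

  private static class Block {
    final String text;
    final int links;
    Block(String text, int links) { this.text = text; this.links = links; }
  }

  /** Treat each top-most div/table as one block (nested blocks are not split further). */
  private static void collectBlocks(Node node, List<Block> blocks) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      String tag = node.getNodeName().toLowerCase();
      if (tag.equals("div") || tag.equals("table")) {
        blocks.add(new Block(node.getTextContent(), countLinks(node)));
        return;
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectBlocks(children.item(i), blocks);
    }
  }

  private static int countLinks(Node node) {
    int count = "a".equalsIgnoreCase(node.getNodeName()) ? 1 : 0;
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      count += countLinks(children.item(i));
    }
    return count;
  }

  /** Reconstruct the page text from the blocks that do not look like navigation. */
  public static String filter(Node root) {
    List<Block> blocks = new ArrayList<Block>();
    collectBlocks(root, blocks);
    StringBuilder sb = new StringBuilder();
    for (Block b : blocks) {
      String text = b.text.trim();
      int words = text.length() == 0 ? 0 : text.split("\\s+").length;
      // short blocks that are mostly links look like navigational elements
      boolean looksLikeNav = b.links > 0 && words < 50 && words / b.links < 5;
      if (!looksLikeNav) {
        sb.append(text).append('\n');
      }
    }
    return sb.toString();
  }
}

In Nutch this kind of logic would naturally live in an HtmlParseFilter (or a 
modified parse-html), as mentioned in the other replies.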

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: indexing just certain content

2009-10-09 Thread Andrzej Bialecki

BELLINI ADAM wrote:

Hi,

Thanks for your detailed answer... you saved me a lot of time. I was thinking 
of starting to create an HTML tag filter class.
Maybe I can create my own HTML parser, as I did for parsing and indexing 
Dublin Core metadata... it sounds possible, don't you think?

I just have to create, or find, a class that can filter an HTML page 
and delete certain tags from it.


Guys, please take a look at how HtmlParseFilters are implemented - for 
example the creativecommons plugin. I believe that's exactly the 
functionality that you are looking for.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links

2009-10-07 Thread Andrzej Bialecki

Eric Osgood wrote:

Andrzej,

How would I check for a flag during fetch?


You would check for a flag during generation - please check 
ScoringFilter.generatorSortValue(), that's where you can check for a 
flag and set the sort value to Float.MIN_VALUE - this way the link will 
never be selected for fetching.


And you would put the flag in CrawlDatum metadata when ParseOutputFormat 
calls ScoringFilter.distributeScoreToOutlinks().




Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page, but 
still needing a total of X links per page: if I find the links I want, I 
add them to the list up until X; if I don't reach X, I add other links 
until X is reached. This way, I don't waste crawl time on non-relevant 
links.


You can modify the collection of target links passed to 
distributeScoreToOutlinks() - this way you can affect both which links 
are stored and what kind of metadata each of them gets.


As I said, you can also use just plain URLFilters to filter out unwanted 
links, but that API gives you much less control because it's a simple 
yes/no that considers just the URL string. The advantage is that it's much 
easier to implement than a ScoringFilter.
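
To make the flag idea concrete, here is a partial sketch. It is not a complete 
plugin - a real one must implement the whole org.apache.nutch.scoring.ScoringFilter 
interface (the remaining methods would simply pass values through) - and the 
metadata key name and the isWanted() rule below are made up:

import java.util.Collection;
import java.util.Map.Entry;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;

public class LinkBlockingScoring {

  private static final Text BLOCK_KEY = new Text("_noFetch_");

  /** Decide which outlinks should ever be fetched: flag the unwanted ones. */
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
      Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount) {
    for (Entry<Text, CrawlDatum> target : targets) {
      if (!isWanted(target.getKey().toString())) {
        // the flag is stored in the CrawlDatum metadata of the outlink
        target.getValue().getMetaData().put(BLOCK_KEY, new IntWritable(1));
      }
    }
    return adjust;
  }

  /** Never select flagged URLs when generating a fetchlist. */
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort) {
    if (datum.getMetaData().containsKey(BLOCK_KEY)) {
      return Float.MIN_VALUE;
    }
    return initSort;
  }

  /** Placeholder policy - replace with whatever makes a link relevant for you. */
  private boolean isWanted(String url) {
    return url.contains("relevant");
  }
}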



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links

2009-10-06 Thread Andrzej Bialecki

Eric Osgood wrote:
Is there a way to inspect the list of links that Nutch finds per page 
and then, at that point, choose which links I want to include/exclude? 
That is the ideal remedy to my problem.


Yes, look at ParseOutputFormat, you can make this decision there. There 
are two standard extension points where you can hook up - URLFilters and 
ScoringFilters.


Please note that if you use URLFilters to filter out URL-s too early 
then they will be rediscovered again and again. A better method to 
handle this, but also more complicated, is to still include such links 
but give them a special flag (in metadata) that prevents fetching. This 
requires that you implement a custom scoring plugin.
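
For comparison, the URLFilter route really is just a yes/no on the URL string. 
A minimal sketch (the accept rule here is only a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/** Minimal URLFilter sketch: return the URL to keep it, null to drop it. */
public class ExampleURLFilter implements URLFilter {

  private Configuration conf;

  public String filter(String urlString) {
    // placeholder rule - keep only links that mention "forum"
    return urlString.contains("forum") ? urlString : null;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

As with any Nutch plugin, it still needs a plugin.xml and an entry in 
plugin.includes.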



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


