Re: Improvement of Nutch 0.7.2
I have created a 0.7.3 label in JIRA and I am willing to commit useful patches in this branch. I do not have time to develop new code myself (and if I did, I would rather spend it on trunk). So if you have anything to submit I would be willing to commit it. Regards Piotr On 2/12/07, carmmello [EMAIL PROTECTED] wrote: About 3 or 4 months ago, there was some discussion and, seemingly, a consensus that Nutch 0.7.2 was a better choice for those using a single computer and aiming for indexed sites of just a few thousand or, at most, a few million pages. There were plans to develop a new version, Nutch 0.7.3. Since then, I have not heard anything more about the subject. Does anyone know something about this? Thanks Carmmello
Re: 0.7.3 version
As no objections were raised I created a 0.7.3 version in JIRA so we can start assigning current JIRA issues to it. Regards Piotr Piotr Kosiorowski wrote: Hello committers, Based on a recent discussion on the nutch user list (Strategic Direction of Nutch) I would like to prepare a 0.7.3 release. The idea is to allow people who still use 0.7.2 to get rid of the most important bugs and to add some small features they need, as the claim is that 0.8.1 is not good for small crawls at the moment. It will also allow us to work on the 0.8 branch to make it more friendly to small installations. I would like to approach it this way: if no one objects I will create a 0.7.3 release in JIRA and ask people to assign issues with patches to it. I do not have a lot of time personally so I do not plan to do any development myself - just taking care of high quality patches and committing them. After some time, when we have gathered some amount of bugfixes/issues, I will prepare the 0.7.3 release. Any objections/comments? Regards Piotr
Re: Strategic Direction of Nutch
We should use JIRA for it, as Arun said. I will send a separate email to committers with a better title to get acceptance for preparation of the 0.7.3 version and create a 0.7.3 version in JIRA so you can assign issues to it. Regards Piotr On 11/16/06, Arun Kaundal [EMAIL PROTECTED] wrote: Hi Nitin, As far as I understand, the following answers the things you are looking for: On 11/16/06, Nitin Borwankar [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Nitin Borwankar wrote: Hi all, First an intro. I am another Nutch newbie and am finding 0.7.2 to be quite an effective single machine crawler. [..] The ability to keep db formats compatible would be nice to allow reuse of existing results but is not necessary. That's probably not going to happen - each branch has specific requirements for the db and segment formats, which are incompatible. However, given enough interest we could implement converters, even bi-directional. As a potential developer I would like to volunteer for the ongoing maintenance and evolution of 0.7.2 as an effective single machine crawler. That's excellent! I imagine the procedure to get you involved would be something like this: * start collecting issues related to maintenance, bugfixes or improvements of that branch, What is the mechanism for this collection process - do we create a separate email list, a separate alias ... or does everyone just send me email (this may get messy fast)? Well, once you create your JIRA user, you can choose which projects you want to contribute to. You can filter requests to see all the issues (bug fixes, improvements, patches, etc.), or else you can browse the Nutch project to see its issues. There you can choose version 0.7.2. * create JIRA issues, plus start collecting patches, tested and ready for committing. One of the existing developers will commit them on your behalf. Sounds good. * after a while we would consider giving you committer rights so that you could work directly with the code. Fair enough. Do we take this offline for further thrashing out? Or continue here? Nitin Borwankar Consider this a proposal to maintain two separate versions by continuing bug fix versions of 0.7 until one of two things happens: a) 0.8 evolves to something satisfactory for use also as a single machine search engine and everyone is happy moving to it; b) a critical mass of developers steps forward to support the ongoing development of 0.7.2 into, say, Nutch-lite, always and only meant for single machine use. I do hope that option a) becomes a reality sooner rather than later. But if there is sufficient interest (and enough developers) in developing the 0.7 branch, then go for it - keeping in mind, though, that eventually these code bases will diverge so much that maintaining them will require two mostly separate teams ... Well, I have also been working with nutch 0.7.2 occasionally for an in-house product for the last year. As nutch 0.8.x is heading in a different direction, we want to continue to enhance and upgrade the functionality of 0.7.2. I would like to offer my help in this regard and want to work with you to improve it. My JIRA id is: sharma_arun_se. I have created an issue for you. Keep it up!!!
Fwd: 0.7.3 version
Hello, I am forwarding an email sent to committers so nutch users are also aware of this initiative. Regards, Piotr -- Forwarded message -- From: Piotr Kosiorowski [EMAIL PROTECTED] Date: Nov 16, 2006 10:09 PM Subject: 0.7.3 version To: nutch-dev@lucene.apache.org Hello committers, Based on a recent discussion on the nutch user list (Strategic Direction of Nutch) I would like to prepare a 0.7.3 release. The idea is to allow people who still use 0.7.2 to get rid of the most important bugs and to add some small features they need, as the claim is that 0.8.1 is not good for small crawls at the moment. It will also allow us to work on the 0.8 branch to make it more friendly to small installations. I would like to approach it this way: if no one objects I will create a 0.7.3 release in JIRA and ask people to assign issues with patches to it. I do not have a lot of time personally so I do not plan to do any development myself - just taking care of high quality patches and committing them. After some time, when we have gathered some amount of bugfixes/issues, I will prepare the 0.7.3 release. Any objections/comments? Regards Piotr
Re: Strategic Direction of Nutch
I agree with Andrzej. For my part, if someone takes the effort of preparing and testing patches, I as a committer (not a very active one recently) may focus on 0.7.2 issues and commit the patches, and in the future prepare the 0.7.3 release. Regards, Piotr On 11/15/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: Nitin Borwankar wrote: Hi all, First an intro. I am another Nutch newbie and am finding 0.7.2 to be quite an effective single machine crawler. [..] The ability to keep db formats compatible would be nice to allow reuse of existing results but is not necessary. That's probably not going to happen - each branch has specific requirements for the db and segment formats, which are incompatible. However, given enough interest we could implement converters, even bi-directional. As a potential developer I would like to volunteer for the ongoing maintenance and evolution of 0.7.2 as an effective single machine crawler. That's excellent! I imagine the procedure to get you involved would be something like this: * start collecting issues related to maintenance, bugfixes or improvements of that branch, * create JIRA issues, plus start collecting patches, tested and ready for committing. One of the existing developers will commit them on your behalf. * after a while we would consider giving you committer rights so that you could work directly with the code. Consider this a proposal to maintain two separate versions by continuing bug fix versions of 0.7 until one of two things happens: a) 0.8 evolves to something satisfactory for use also as a single machine search engine and everyone is happy moving to it; b) a critical mass of developers steps forward to support the ongoing development of 0.7.2 into, say, Nutch-lite, always and only meant for single machine use. I do hope that option a) becomes a reality sooner rather than later. But if there is sufficient interest (and enough developers) in developing the 0.7 branch, then go for it - keeping in mind, though, that eventually these code bases will diverge so much that maintaining them will require two mostly separate teams ... -- Best regards, Andrzej Bialecki (Information Retrieval, Semantic Web; Embedded Unix, System Integration) http://www.sigram.com Contact: info at sigram dot com
Re: Strategic Direction of Nutch
Anthony, I do not think nutch can forget about small implementations. It was one of its strong points and I do think we will want to support them. For any issues please report them in JIRA and I am sure they will be taken care of. Regards Piotr On 11/12/06, Anthony May [EMAIL PROTECTED] wrote: Greetings all, I have just been handed the administration of our nutch implementation. We are currently using nutch 0.7 and it very badly needs updating. However, we are evaluating several options, and I wanted to know where nutch is going as a project. I have not been able to find anything in the wiki or in the mailing list archives with this information (forgive me if I have missed it). The central issue is that our needs are for crawling our own website, with about 200,000 pages and documents, on a single machine running nutch - not for crawling the web with a massively scalable architecture. I have heard nutch is moving towards the latter and that the former usage has become very slow in 0.8 compared to 0.7 - is this correct? Thank you for helping me out. Regards, Anthony May Web Developer NZQA
Re: details: stackoverflow error
Hello Rajesh, I have run bin/nutch crawl urls -dir crawl.test -depth 3 on a standard nutch-0.7.2 setup. The urls file contains http://www.math.psu.edu/MathLists/Contents.html only. In crawl-urlfilter I have changed the url pattern to: # accept hosts in MY.DOMAIN.NAME +^http:// JVM: java version 1.4.2_06, Linux. It runs without problems. Please reinstall from the distribution, make only the required changes, and retest. If it still fails we will try to track it down. Regards Piotr Rajesh Munavalli wrote: Forgot to mention one more parameter. Modify the crawl-urlfilter to accept any URL. On 4/6/06, Rajesh Munavalli [EMAIL PROTECTED] wrote: Java version: JSDK 1.4.2_08 URL Seed: http://www.math.psu.edu/MathLists/Contents.html I even tried allocating more stack memory using the -Xss option and more process memory with -Xms. However, if I run the individual tools (fetchlisttool, fetcher, updatedb, etc.) separately from the shell, it works fine. Thanks, --Rajesh On 4/6/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Which Java version do you use? Is it the same for all urls or only for a specific one? If the URL you are trying to crawl is public you can send it to me (off list if you wish) and I can check it on my machine. Regards Piotr Rajesh Munavalli wrote: I had earlier posted this message to the list but haven't gotten any response. Here are more details. Nutch version: nutch-0.7.2 URL File: contains a single URL. File name: urls Crawl-url-filter: is set to grab all URLs Command: bin/nutch crawl urls -dir crawl.test -depth 3 Error: java.lang.StackOverflowError The error occurs while it executes the UpdateDatabaseTool. One solution I can think of is to provide more stack memory. But is there a better solution to this? Thanks, Rajesh
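For reference, the reproduction described above boils down to the following shell session (a sketch assuming a standard nutch-0.7.2 install; the -Xss value is illustrative):

  # seed with the single URL from the report
  echo 'http://www.math.psu.edu/MathLists/Contents.html' > urls
  # in conf/crawl-urlfilter.txt, relax the accept pattern to:
  #   +^http://
  bin/nutch crawl urls -dir crawl.test -depth 3
  # if StackOverflowError persists, try a larger per-thread stack, e.g. by adding
  # -Xss2m to the java invocation in the bin/nutch script (a plain JVM flag, not a nutch option)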
Re: details: stackoverflow error
Which Java version do you use? Is it the same for all urls or only for a specific one? If the URL you are trying to crawl is public you can send it to me (off list if you wish) and I can check it on my machine. Regards Piotr Rajesh Munavalli wrote: I had earlier posted this message to the list but haven't gotten any response. Here are more details. Nutch version: nutch-0.7.2 URL File: contains a single URL. File name: urls Crawl-url-filter: is set to grab all URLs Command: bin/nutch crawl urls -dir crawl.test -depth 3 Error: java.lang.StackOverflowError The error occurs while it executes the UpdateDatabaseTool. One solution I can think of is to provide more stack memory. But is there a better solution to this? Thanks, Rajesh
Re: Nutch 0.7.2 release
Yes. The correct link is http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158 It was used on the Web site but I made a mistake while pasting it into the email (I used the one for the 0.7.1 release). Thanks for spotting it. Regards Piotr On 4/1/06, TDLN [EMAIL PROTECTED] wrote: Is this the correct revision of the release notes? http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158 Rgrds, Thomas On 4/1/06, TDLN [EMAIL PROTECTED] wrote: Yes! This is great news, thank you so much. By the way: in the revision of the release notes that you posted (292986), the changes for 0.7.2 are missing. Rgrds, Thomas On 4/1/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for the 0.7 branch. See CHANGES.txt ( http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986 ) for details. The release is available on http://lucene.apache.org/nutch/release/ . Regards, Piotr
Re: Nutch 0.7.2 release | upgrading from 0.7.1?
The 0.7.2 release should work without problems with 0.7.1 data. Regards Piotr On 4/2/06, Håvard W. Kongsgård [EMAIL PROTECTED] wrote: What about upgrading from 0.7.1? Can I use my existing db and segments? Piotr Kosiorowski wrote: Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for 0.7 branch. See CHANGES.txt ( http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986 ) for details. The release is available on http://lucene.apache.org/nutch/release/. Regards, Piotr
Nutch 0.7.2 release
Hello all, The 0.7.2 release of Nutch is now available. This is a bug fix release for 0.7 branch. See CHANGES.txt (http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986) for details. The release is available on http://lucene.apache.org/nutch/release/. Regards, Piotr
Re: A possible error in the tutorial
Thanks. Fixed in SVN. Will be deployed on the Web site with the 0.7.2 release. fabrizio silvestri wrote: Hi guys.. Just a quick question about the tutorial: the line bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 dmoz/urls - shouldn't it be bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 dmoz/urls ? -- Dr. Fabrizio SILVESTRI High Performance Computing Laboratory Information Science and Technologies Institute (ISTI), Italian National Research Council (CNR) Via G. Moruzzi, 1, 56126 Pisa, Italy Phone: +39 050 315-3011 (Direct) Mobile: +39 328-9552152 FAX: +39 050 3138091 (G3) WWW: http://miles.isti.cnr.it/~silvestr
Re: Getting java.io.IOException: Couldn't rename \tmp\nutch\mapred\local\map_n68li2\part-0.out with Nutch 0.8
As I stated in a recent email on a similar subject - disable your antivirus software if you have one. I have seen many cases where AV was keeping a file locked on Windows. Regards Piotr On 1/6/06, Arun Kaundal [EMAIL PROTECTED] wrote: Anybody plz reply, I am waiting for it. On 1/6/06, Arun Kaundal [EMAIL PROTECTED] wrote: No, I don't have any explorer or file opened. I have been facing this error for the last two days with no good response. Anyone there? On 1/5/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Windows: Do you still have explorer or any file open located in this directory? Linux: Did you maybe start nutch once as root and now as another user? Stefan Am 05.01.2006 um 08:08 schrieb Arun Kumar Sharma: Hi, I am facing this error while running Nutch on the local file system.
java.io.IOException: Couldn't rename \tmp\nutch\mapred\local\map_n68li2\part-0.out
at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:81)
060105 122537 map 100%
java.io.IOException: Job failed!
at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:52)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Exception in thread main
Regards, Arun Kumar Sharma (Tech Lead - Java/J2EE) Mob: +91.981.529.5761
Re: java.io.IOException: already exists
It looks like the majority of people who get it run it on Windows - is it the same in your case? Maybe some kind of antivirus software is preventing the folder from being deleted? Regards Piotr On 1/4/06, Nguyen Ngoc Giang [EMAIL PROTECTED] wrote: Hi all, I'd like to bring back this topic, which has been ignored several times on the Nutch mailing list as well as in JIRA ( http://issues.apache.org/jira/browse/NUTCH-94, http://issues.apache.org/jira/browse/NUTCH-96, http://issues.apache.org/jira/browse/NUTCH-117 ). Here is my error stack:
060104 110314 Finishing update
060104 110314 Processing pagesByURL: Sorted 11 instructions in 0.016 seconds.
060104 110314 Processing pagesByURL: Sorted 687.5 instructions/second
java.io.IOException: already exists: C:\tomcat\webapps\ROOT\data\db\webdb.new\pagesByURL
at org.apache.nutch.io.MapFile$Writer.init(MapFile.java:86)
at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:375)
This error happens not only at update time, but also at fetchlist time. And the weird thing is that it happens nondeterministically. I debugged around and it seems the problem is that some CloseProcessors didn't terminate correctly, leaving webdb.new undeletable. I then tried reducing to only 1 thread, with a lightweight load (as suggested in the JIRA discussion), but it doesn't help. Yet when I run step by step in the debugging mode of the IDE, there is no problem. Can anyone help me figure out this issue? Thanks very much. Regards, Giang
Re: build instructions?
It is a known bug in the 0.7.1 distribution. You can get the sources directly from svn and they build fine. It is also fixed in preparation for the 0.7.2 release and in trunk. Or you can fix it locally by creating an empty src/java folder. I am not sure if it is the only empty folder missing in the nutch-extensionpoints folder, but there should not be many of them. Regards Piotr Teruhiko Kurosaka wrote: Where can I find the build instructions for Nutch? Just typing ant ended with an error complaining that there is no such directory as ...\src\plugin\nutch-extensionpoints\src\java This is the Nutch 0.7.1 download and I'm trying to build on Windows XP Professional with Cygwin and JDK 1.5. (I tried JDK 1.4.1 but I saw the same failure.) -Kuro
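In shell terms the local fix is (path taken from the error message above; any other missing empty folders, if they exist, can be created the same way):

  mkdir -p src/plugin/nutch-extensionpoints/src/java
  ant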
Re: try to restart aborted crawl
Hi, I had the same problems with JVM crashes and it was in fact a hardware problem (memory). It can also be a problem with your software config (but as far as I remember you are using a quite standard configuration). I doubt it has anything to do with nutch (except that nutch stresses the JVM/whole box, so the problem shows up more easily than during normal system usage). Regards, Piotr wmelo wrote: The biggest problem is not to restart the crawl, but the problem that led to the failure itself, more precisely: Exception in thread main java.io.IOException: key out of order: http://web.mit.edu/is/about/index.html after http://web.mit.edu/is/?ut/index.html; This kind of problem occurs, for me, almost all the time (together with another that says that there is some problem with Java HotSpot), preventing me from really using Nutch. I have reported those two problems before, without any answer. I don't know whether this is a bug in Nutch (or in Lucene, I don't have any idea). The only thing I know is that both issues are very big non-conformities that should be corrected as soon as possible. Wmelo
Re: Nutch returns irrelevant site
You can use the explain page to find out why this page is scored the way it is. I would expect anchor text to be the main component of it. Regards Piotr Aled Jones wrote: Hi I'm currently setting up a nutch search engine that searches travel websites. It works quite well but sometimes returns odd results. One good example: One of the 100 or so sites I've asked it to crawl is http://www.hfholidays.co.uk/ . This site is mainly about walking holidays and has many pages with the word walking in it, so when I type walking into nutch I'd expect it to turn up; however, the first result I get back from using the keyword walking is http://www.hfholidays.co.uk/email.asp . This page doesn't have the word walking in it anywhere. Could someone please explain if this is a bug or the way nutch works. I've got an idea of how google works; if nutch works in a similar fashion, does this page appear because it is linked from many pages with the word walking in them? Thanks Aled
Re: Why Nutch 0.7.1 Does Not Compile???
I compiled it for the release - linux, jdk 1.4.2, ant 1.6.2. No problems - otherwise I would be unable to release it :). Please report your environment for such problems. Regards Piotr, Victor Lee wrote: Ok, it's weird now. If I use the command ant jar, it builds successfully. If I use ant tar, it has tons of errors. Maybe something is broken in package, war, or javadoc? Victor Lee [EMAIL PROTECTED] wrote: Hi, I extracted a fresh copy of the original nutch 0.7.1 and tried to compile it, but it doesn't compile at all. Why? Why can't an official release even compile? I ran ant tar in the outermost nutch directory where the build.xml is located. Please let me know what I did wrong. There are a whole bunch of errors besides warnings. Has anyone tried to compile it before? Please let me know what you did to make it compile. Many thanks.
Re: Using FetchListEntry -dumpurls
Hi, I think this is the reason: Exception in thread main java.lang.NoClassDefFoundError: net/nutch/pagedb/FetchListEntry In the 0.7 branch all classes were moved to the org.apache.nutch package structure and the scripts were updated, so you are probably using an old script with the new release. Regards Piotr Bryan Woliner wrote: Hi, I am trying to dump all of the URLs from a segment to a text file. I was able to do this successfully under Nutch 0.6 but am not able to do so under 0.7.1. Please take a look at the line below and let me know if you can figure out why I'm getting an error. Perhaps it is due to the change from version 0.6 to 0.7.1, or maybe I just have the wrong syntax. Note: the segments/20051107233629 directory is a valid segments directory, as is evidenced by the ls statement below.
-bash-2.05b$ bin/nutch net.nutch.pagedb.FetchListEntry -dumpurls segments/20051107233629 foo.txt
Exception in thread main java.lang.NoClassDefFoundError: net/nutch/pagedb/FetchListEntry
-bash-2.05b$ ls -la segments/20051107233629
total 8
drwxr-xr-x 8 bryan bryan 1024 Nov 7 23:36 .
drwxr-xr-x 3 bryan bryan 1024 Nov 7 23:36 ..
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 content
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetcher
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetchlist
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 index
-rw-r--r-- 1 bryan bryan 0 Nov 7 23:36 index.done
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_data
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_text
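The fix is just the updated package name in the command, keeping everything else the same:

  bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segments/20051107233629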
Re: Wrapping Nutch
Yes. You are right - the first exception indicates that some required plugins are missing. And this is probably because you do not have the plugins directory with all plugins in the classpath. P. On 10/10/05, Matt Clark [EMAIL PROTECTED] wrote: Sorry, one more clue. Here is the console output - 051010 143748 10 Plugins: directory not found: plugins My guess is that Nutch wants the plugins directory on my classpath. Am I on the right path? matt
Re: Dedup won't actually dedup
Hello Jon, As far as I remember dedup only marks the records as deleted, without physically removing them. And the first action of dedup is to clear old deletions (as written in the log). So if you repeat it you will get the same number of deleted records each time. Regards Piotr Jon Shoberg wrote: Any idea why dedup won't actually remove the items? Thoughts?
*** First Pass ***
051008 144843 Clearing old deletions in
051008 144843 Reading url hashes...
051008 144902 Sorting url hashes...
051008 144905 Deleting url duplicates...
051008 144907 Deleted 4082 url duplicates.
051008 144907 Reading content hashes...
051008 144918 Sorting content hashes...
051008 144923 Deleting content duplicates...
051008 144925 Deleted 228430 content duplicates.
051008 144925 Duplicate deletion complete locally. Now returning to NFS...
051008 144925 DeleteDuplicates complete
*** Second Pass ***
051008 144932 Reading url hashes...
051008 144949 Sorting url hashes...
051008 144953 Deleting url duplicates...
051008 144955 Deleted 4082 url duplicates.
051008 144955 Reading content hashes...
051008 145005 Sorting content hashes...
051008 145011 Deleting content duplicates...
051008 145012 Deleted 228430 content duplicates.
051008 145012 Duplicate deletion complete locally. Now returning to NFS...
051008 145012 DeleteDuplicates complete
Re: Fwd: problem about the fetch of dinamic page
You can use the nutch readdb command to check if the urls you are interested in were added to the WebDB - if yes, check whether the segments contain these urls. Please review the logs from the fetch to check if there was an attempt to fetch these urls (you might have some problem with authentication). Right now the description is too generic for me to help with more details. Regards Piotr [EMAIL PROTECTED] wrote: Hi, I have a question about the nutch crawler: I want to provide document search on a site that requires authentication (user/password). After the login, the first page displayed by the application is composed of two frames:
<HTML>
<HEAD><TITLE>Sistema Provvedimenti - SUPER</TITLE></HEAD>
<FRAMESET ROWS="14%,*">
<FRAME NORESIZE NAME="MENU" SRC="Servlet1?menu=1" SCROLLING="AUTO">
<FRAME NAME="PAGE" SRC="../a.html" SCROLLING="AUTO">
</FRAMESET>
</HTML>
The servlet Servlet1 publishes a table with 1 row and N columns, where every column contains an href with the URL of another servlet (Servlet2-ServletN). DESCRIPTION OF THE PROBLEM: My problem is that the crawler fetches the login page, the static page a.html, and the servlet Servlet1, but does not fetch any of the other servlets (Servlet2-ServletN). If I put the hrefs in the page a.html instead, Nutch manages to fetch the URLs and everything works. DESCRIPTION OF OUR CONFIGURATION OF NUTCH: I installed Nutch 0.6. I launch nutch like this: /usr/nutch-0.6/bin/nutch crawl url -dir index -depth 10 -threads 8 crawl.log where the url file contains only the url of the site with the login and password. I modified the Nutch configuration file crawl-urlfilter.txt like this:
-^(ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
+[?=]
+.
Please, somebody help me!!! It is very important for me. Adriano Palombo
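As a concrete starting point, the WebDB check suggested above looks roughly like this (the db path follows the -dir index layout from the report; the -stats and -pageurl options are from the WebDBReader tool of this era - run bin/nutch readdb without arguments to confirm the flags in your version):

  bin/nutch readdb index/db -stats
  bin/nutch readdb index/db -pageurl 'http://yoursite/Servlet1?menu=1'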
Re: a few questions
Earl Cahill wrote: Tempted to do each question as a separate email, but here you go. 1. Does nutch use pure lucene for its indexing? Does the nutch index = lucene + potentially ndfs? If I am going to run a search web service, I am just wondering what advantages nutch would have over lucene. Yes, it uses lucene for indexing. It is not only lucene + ndfs, as it also contains a fetcher, WebDB maintenance, map/reduce and plenty of plugins. 2. Turns out I am going to write a web service for search. I have played with the nutch search example, but if I want to do rather arbitrary key/value pairs and have a web service return xml, I am guessing I am going to have to write my own. Is that right? Is there an easy way to get results in xml format? Guessing I need to build it all myself. There is a servlet, OpenSearchServlet, that returns XML. 3. In another project, I want to use ndfs to store two+ distinct copies of a file, but I really don't want anything else to do with nutch on the project. Is that possible? Is there a clean break? I want to make a list of servers, then have an api call that takes a file and stores 2+ copies across my servers, and an api call that reads a file, with appropriate failover. I think that it is possible. Take a look at TestClient (not the best name) for an example of API usage. 4. Guessing I write a plugin, but I want to interject some code during the nutch crawl process that does some analysis and actually does the index insertion itself. Are there any good docs on how to do such a thing? Plugin writing is easy - you can take a look at one of the existing ones as an example. But it all depends on what and where you want to change, as plugins are executed at well specified points of processing, so it might happen that there is no possibility of using a plugin for the thing you want to achieve. Regards, Piotr
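As a quick check of the XML interface mentioned in point 2, something like the following should work against a running nutch webapp (host and port are hypothetical; /opensearch is the servlet mapping in the stock web.xml - verify it in your deployment):

  curl 'http://localhost:8080/opensearch?query=nutch'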
Nutch 0.7.1 release
The 0.7.1 release of Nutch is now available. This is a bug fix release. See CHANGES.txt (http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=292986) for details. The release is available here (http://lucene.apache.org/nutch/release/). Please report all problems in JIRA setting appropriate version number in bug description. Regards, Piotr
Re: link analysis and update segments
UpdateDB copies link information and score from the WebDB to segments, so it is important to have the score calculated before updatedb is run. One can use the current standard nutch score (based on the number of inlinks) or try to use analyze - I committed a patch for it some time ago that might help a bit with its disk space requirements, so the best approach would be to test it (it worked ok for me) and, if it is ok for you, report it so others can also try it out. Regards Piotr AJ Chen wrote: In a whole-web or vertical crawling setting, is it right that link analysis and updating segments from the DB should be performed in the right order before indexing the segments? There's not much talk about updating segments from the DB. I think it should be an important step. Could someone point out when it should be run and what the benefits are? I remember it was mentioned some time ago that the link analysis tool does not work yet and the number of in-links should be used instead. Any update? If it's still not working, how do I set it to use link numbers? Thanks, AJ
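In command form the ordering is: run the link analysis over the WebDB first, then the segment update, then indexing. The analyze invocation below matches the one used elsewhere on this list, with the trailing number giving the iteration count (the exact segment-update command varies by version, so check your bin/nutch script for it):

  bin/nutch analyze db 5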
Re: Re-indexing segments to add more field information
Hello, I think it is enough to delete the index.done file and the index folder. I did it this way some time ago. Regards Piotr Mike Berrow wrote: I would like to re-build the indexes I have in existing segments using a custom index filter plug-in (adds more field information to assist with a custom sort). Should that be just a matter of deleting the index.done files to let it go through? Or is there more to be done than that? Any known pitfalls? Thanks much, -- Mike Berrow
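Per segment, that amounts to something like the following sketch (segment name hypothetical; the index command re-runs segment indexing, now with your custom plugin enabled):

  rm segments/20051107233629/index.done
  rm -r segments/20051107233629/index
  bin/nutch index segments/20051107233629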
Re: Link Analysis Score..
There are many ways nutch can boost a document in the index. But I suspect you are referring to the analyze process - it uses a PageRank computation for the page score. For details read DistributedAnalysisTool - especially the computeRound method. Regards Piotr Rozina Sorathia wrote: I wanted to know where exactly the Link Analysis Score is calculated. Is there any code snippet available? How does the Link Analysis Score affect the overall final score of the document? Rozina Sorathia, Systems Executive, KPIT Cummins Infosystems Ltd., [EMAIL PROTECTED]
Re: Analyser error
I was never doing it this way - creating WebDB content based on segments only. So I do not know if it works - I do not have time at the moment to test it myself - sorry. Regards Piotr EM wrote: The problem is still there; maybe I'm doing something wrong?
1. 'rm -r db'
2. 'mkdir db'
3. 'bin/nutch admin db -create'
4. I'll then updatedb db from a fetched segment; this should fill it up with links?
5. 'bin/nutch analyze db 7'
And it fails here with three 'tmpsomething' directories and webdb.new -Original Message- From: Piotr Kosiorowski [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 30, 2005 3:07 PM To: nutch-user@lucene.apache.org Subject: Re: Analyser error It looks like you have temporary results from a previous run (probably killed or terminated unsuccessfully). It should be safe to remove the db\webdb.new directory and start again. Regards Piotr EM wrote: What does it mean if bin/nutch analyze db 7 fails with:
050830 024914 Target pages from init(): 27419
050830 024914 Processing pagesByURL: Sorted 27419 instructions in 0.172 seconds.
050830 024914 Processing pagesByURL: Sorted 159412.79069767444 instructions/second
Finished at Tue Aug 30 02:49:14 EDT 2005
Exception in thread main java.io.IOException: already exists: db\webdb.new\pagesByURL
at org.apache.nutch.io.MapFile$Writer.init(MapFile.java:86)
at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at org.apache.nutch.tools.DistributedAnalysisTool.completeRound(DistributedAnalysisTool.java:562)
at org.apache.nutch.tools.LinkAnalysisTool.iterate(LinkAnalysisTool.java:60)
at org.apache.nutch.tools.LinkAnalysisTool.main(LinkAnalysisTool.java:81)
Re: PDF support? Does crawl parse p
Hello Diane, There is a plugin to parse pdf files. You have to enable it in nutch-site.xml (just copy the entry from nutch-default.xml). You have to change the plugin.includes property to include the parse-pdf plugin: [...] parse-(text|html) [...] to [...] parse-(text|html|pdf) [...] Regards Piotr Diane Palla wrote: Does Nutch have a way to parse pdf files, that is, application/pdf content type files? I noticed a plugin variable setting in default.properties: plugin.pdf=org.apache.nutch.parse.pdf* I never changed this file. Is that the right value? I am using Nutch 0.7. What do I have to do to make it parse pdf files? When I do the crawl, I get this error with application/pdf files: 050831 145126 fetch okay, but can't parse mainurl/research/126900/126969/126969.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf If it's not possible, what future version of Nutch do developers expect to support application/pdf types and have such parsing of pdf files available? Diane Palla Web Services Developer Seton Hall University 973 313-6199 [EMAIL PROTECTED] Bryan Woliner [EMAIL PROTECTED] 08/23/2005 05:22 PM Please respond to nutch-user@lucene.apache.org To nutch-user@lucene.apache.org cc Subject: Adding small batches of fetched URLs to a larger aggregate segment/index Hi, I have a number of sites that I want to crawl, then merge their segments and create a single index. One of the main reasons I want to do this is that I want some of the sites in my index to be crawled on a daily basis, others on a weekly basis, etc. Each time I re-crawl a site, I want to add the fetched URLs to a single aggregate segment/index. I have a couple of questions about doing this: 1. Is it possible to use a different regex.urlfilter.txt file for each site that I am crawling? If so, how would I do this? 2. If I have a very large segment that is indexed (my aggregate index) and I want to add another (much smaller) set of fetched URLs to this index, what is the best way to do this? It seems like merging the small and large segments and then re-indexing the whole thing would be very time consuming -- especially if I wanted to add new small sets of fetched URLs frequently. Thanks for any suggestions you have to offer, Bryan
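In nutch-site.xml the override looks like the following (the value shown is illustrative - copy the actual plugin.includes value from your nutch-default.xml and just add pdf to the parse group):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>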
Re: permissions error with nutch 0.7
It looks like it has a problem creating the lucene lock file - I think it is usually created in /tmp if you are running it on Unix. Can you check if you can correctly access it? Regards Piotr On 8/25/05, Jason Martens [EMAIL PROTECTED] wrote: On Wed, 2005-08-24 at 18:20 -0700, Michael Ji wrote: did you try switching your user mode to su (Super User) mode? In nutch itself? I configured this all as root, and I would rather not run tomcat as root. Is there some way I can find out what files it is trying to write to when it gets the permission denied message? Jason Martens
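A quick check from the account tomcat runs under (the /tmp location is the usual default for lucene of this era; some versions let you move it with the org.apache.lucene.lockDir system property - treat both the path and the property name as things to verify for your setup):

  ls -ld /tmp
  ls -l /tmp/*.lock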
Re: FetchedSegments.getSummary() for a PDF
You can try it out, but I think parsing separately expects some directories in the segment to have different names than you have after a standard fetch with parsing. Regards Piotr Lucas Rockwell wrote: Hi Piotr, Thanks for the response. So, I can't use: bin/nutch parse segment directory and then reindex? -lucas On Aug 25, 2005, at 11:28 AM, Piotr Kosiorowski wrote: As I understand it, if you had parse-pdf disabled you have to reparse (and then reindex) the segments. There is no standard way to do it (I think it might be done with some tricks). The easiest way would be to refetch it with pdf parsing enabled. Piotr Lucas Rockwell wrote: Hi all, I have enabled the parse-pdf and index-more plugins and reindexed my segments, and then enabled those plus the query-more plugin in my front-end application, and when I do a query I still can not get at the contents of the PDFs in the index. And even when I search for pdf -- which gets me all PDF files because of the url -- and use FetchedSegments.getSummary(), there is nothing there. Any idea what I am doing wrong? Thanks. -lucas
Re: Nutch 0.7 released
Oops, you are right - we have not updated README.txt after changing our documentation format to Apache Forrest. Right now these files are deployed on the nutch site: http://lucene.apache.org/nutch/tutorial.html etc., in the Documentation section. Thanks for catching it. Piotr Lukas Vlcek wrote: Hi, Maybe I'm not understanding correctly, but according to the README.txt file included in the release pack I can't find the docs/en/tutorial.html and docs/en/developers.html files. Lukas On 8/17/05, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Hi, A new Nutch release was prepared today. This is the first Nutch release as an Apache Lucene sub-project. You can download it from http://lucene.apache.org/nutch/release/nutch-0.7.tar.gz. There was a package name change from net.nutch.* to org.apache.nutch.* for this release, so local modifications and configuration files containing class names may require an update. Release numbers were created in JIRA too, so please use them while reporting a bug. Regards, Piotr
Re: webdb - orphaned pages?
Hello, Pages are not deleted from the WebDB automatically. Nutch does not check if a page has inlinks during fetchlist generation - so an orphaned page will be refetched. It will stop refetching the page if the page becomes unavailable for some number of fetch attempts. Regards Piotr On 8/10/05, Raymond Creel [EMAIL PROTECTED] wrote: I have a question about the webdb and fetching. When a page that used to have incoming links is found to be orphaned (i.e. there are no longer any pages that link to it), is it deleted from the webdb? Or is it left in the webdb but set not to be refetched? Or will it continue to be refetched anyway (this doesn't seem right to me)? Conversely, what will happen when a link to it reappears later? One more thing - are pages injected with the webdb injector treated any differently (I see them as being sort of the root nodes of the webdb - they should never be deleted)? Thanks much for any clarity on this! raymond
Re: Problem in Incremental crawling with 4GB segment directories
Hello Ayyanar, Please be more specific with your setup and problem description. I recently fetched a segment that contains 73GB of data, so I do not think the size of your segment is the problem. Regards Piotr On 8/8/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi, I have 4 GB of existing data. I want to do incremental crawling on this 4 GB of data with a new URL, but the crawl does not succeed... why? Thanks in advance, Ayyanar...
Re: Is it possible to have multiple search.dir in nutch-site.xml, Please reply immediately
No, it is not possible. But you can search in both if you merge the segments or simply put them in one location. Regards Piotr On 8/8/05, Ayyanar Inbamohan [EMAIL PROTECTED] wrote: Hi all, Is it possible to have multiple search.dir entries in nutch-site.xml? Please reply immediately. I want to search the text in both of the directories /home/oss/docs/Kmportal_repository_search_data and /home/oss/spikesearch/src/nutch/var/nutch/dir-url.extranet Can we give something like the following?
<nutch-conf>
<property>
<name>searcher.dir</name>
<value>/home/oss/docs/Kmportal_repository_search_data</value>
<value>/home/oss/spikesearch/src/nutch/var/nutch/dir-url.extranet</value>
<description>Location of the Nutch Index.</description>
</property>
</nutch-conf>
But the above is not working... thanks, Ayyanar..
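The put them in one location route is the simpler of the two; roughly (combined path hypothetical - searcher.dir then points at the single parent directory holding the merged segments and index):

  mkdir -p /home/oss/combined/segments
  cp -r /home/oss/docs/Kmportal_repository_search_data/segments/* /home/oss/combined/segments/
  cp -r /home/oss/spikesearch/src/nutch/var/nutch/dir-url.extranet/segments/* /home/oss/combined/segments/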
Re: regex-url filter
Hello, I am not sure which way is better, but I would look at the dot:
original: http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
modified: http://([a-z0-9]*\.)*(com|org|net|biz|edu|biz|mil|us|info|cc)/
In my opinion the dot before com, org etc. is already included in ([a-z0-9]*\.)* and the additional one (not escaped) means any character, so it would match e.g. http://www.abc.xcom/ but not http://www.abc.com/. Regards, P. Chirag Chaman wrote: Here's a better way: http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/ FYI, this will not remove non-English sites -- but international sites that follow the two-letter convention. CC- -Original Message- From: Jay Pound [mailto:[EMAIL PROTECTED] Sent: Monday, August 08, 2005 2:37 PM To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org Subject: regex-url filter I would like a confirmation from someone that this will work. I've edited the regex filter in hopes of weeding out non-English sites from my search results. I'll be testing pruning on my current 40mil index to see if it works there, or maybe there is a way to set the search to return only English results, but I'm trying it this way now. Is this the right way to add just extensions without sites? I'll try it soon but just wanted to not waste my time if it's not correct!!! Thanks, -Jay Pound
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# accept US only sites
+^http://([a-z0-9]*\.)*.com/
+^http://([a-z0-9]*\.)*.org/
+^http://([a-z0-9]*\.)*.edu/
+^http://([a-z0-9]*\.)*.net/
+^http://([a-z0-9]*\.)*.mil/
+^http://([a-z0-9]*\.)*.us/
+^http://([a-z0-9]*\.)*.info/
+^http://([a-z0-9]*\.)*.cc/
+^http://([a-z0-9]*\.)*.biz/
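For reference, the accept lines with the stray dot removed (the same fix as the modified pattern above, applied per extension; note that biz appears twice in the original alternation and only needs to be listed once):

  +^http://([a-z0-9]*\.)*com/
  +^http://([a-z0-9]*\.)*org/
  +^http://([a-z0-9]*\.)*edu/
  +^http://([a-z0-9]*\.)*net/
  +^http://([a-z0-9]*\.)*mil/
  +^http://([a-z0-9]*\.)*us/
  +^http://([a-z0-9]*\.)*info/
  +^http://([a-z0-9]*\.)*cc/
  +^http://([a-z0-9]*\.)*biz/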
Re: distributed search
If you have two search servers, search1.mydomain.com and search2.mydomain.com, then on each of them run:
./bin/nutch server 1234 /index
Now go to your tomcat box. In the directory where you used to have the segments dir (either the tomcat startup directory or the directory specified in the nutch config xml), create a search-servers.txt file containing:
search1.mydomain.com 1234
search2.mydomain.com 1234
And move your old segment/index directories somewhere else so they are not used by accident. You should now see search activity in your search servers' logs. Regards Piotr On 8/2/05, webmaster [EMAIL PROTECTED] wrote: I read the wiki on the server option; how does it talk with tomcat for the search? It says ./bin/nutch server port index_dir, e.g. ./bin/nutch server 1234 /index. How do the servers talk with each other to find the other servers in the cluster? -Jay
Re: Preventing the fetch command from going to certain URLs
Hello Joe, If you are using whole web crawling you should change regex-urlfilter.txt instead of crawl-urlfilter.txt. Piotr On 7/28/05, Vacuum Joe [EMAIL PROTECTED] wrote: I have a simple question: I'm using Nutch to do some whole-web crawling (just a small dataset). Somehow Nutch has gotten a lot of URLs from af.wikipedia.org into its segments, and when I generate another segment (using -topN 2) it wants to crawl a bunch more urls from af.wikipedia.org. I don't want to crawl any of the Afrikaans Wikipedia. Is there a way to block that? Also, I want to block it from ever crawling domains like 33.44.55.66, because those are usually very badly configured servers with worthless content. I tried to put those things into the crawl-urlfilter.txt file and the banned-hosts.txt file, but it seems that the fetch command doesn't pay attention to those two files. Should I be using crawl instead of fetch?
Re: [Nutch-general] number of indexed pages
Hello, The first one will give you the number of pages in the WebDB, and not all of them are indexed. Regards, Piotr On 7/29/05, Erik Hatcher [EMAIL PROTECTED] wrote: Two options: bin/nutch readdb crawl/db -stats or use Luke (Google for luke lucene) to open the Lucene index. Erik On Jul 28, 2005, at 9:44 PM, blackwater dev wrote: After I finish a crawl... what is the best way to go into my crawl directory and get the number of indexed pages? Thanks!
Re: Problem Starting Nutch (Tutorial like)
Hello, I would rather suspect some misconfiguration of networking. According to the JavaDoc, InetAddress.getLocalHost() throws UnknownHostException if no IP address for the host could be found. Regards Piotr On 7/28/05, blackwater dev [EMAIL PROTECTED] wrote: Are you sure your urls file doesn't have an extension? I had a similar problem and found my urls file was .rtf, which I didn't see until I viewed the file via the command line. On 8/28/05, Nils Hoeller [EMAIL PROTECTED] wrote: Hi, my problem is: I've done everything as described in the Getting Started tutorial at nutch.org. When I now run the command bin/nutch crawl urls -dir crawl.test -depth 3 crawl.log I get this exception in the log file:
run java in /usr/java/jdk1.5.0_04
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-default.xml
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/crawl-tool.xml
050828 104004 parsing file:/home/nils/Studienarbeit/nutch-nightly/conf/nutch-site.xml
050828 104004 No FS indicated, using default:local
050828 104004 crawl started in: crawl.test
050828 104004 rootUrlFile = urls
050828 104004 threads = 10
050828 104004 depth = 3
Exception in thread main java.lang.RuntimeException: java.net.UnknownHostException: linux: linux
at org.apache.nutch.io.SequenceFile$Writer.init(SequenceFile.java:67)
at org.apache.nutch.io.MapFile$Writer.init(MapFile.java:94)
at org.apache.nutch.db.WebDBWriter.init(WebDBWriter.java:1507)
at org.apache.nutch.db.WebDBWriter.createWebDB(WebDBWriter.java:1438)
at org.apache.nutch.tools.WebDBAdminTool.main(WebDBAdminTool.java:172)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:133)
Caused by: java.net.UnknownHostException: linux: linux
at java.net.InetAddress.getLocalHost(InetAddress.java:1308)
at org.apache.nutch.io.SequenceFile$Writer.init(SequenceFile.java:64)
... 5 more
My urls file looks like this: http://www.nutch.org/ I've also tried http://www.ifis.uni-luebeck.de/ which I'd like to get nutched. Also in the urlfilter conf is written +^http://([a-z0-9]*\.)*ifis.uni-luebeck.de/ +^http://([a-z0-9]*\.)*nutch.org/ Can anyone give me a hint? Where is the error? Thanks Nils
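A common fix for the UnknownHostException above is to make the machine's hostname resolvable locally, e.g. via /etc/hosts (the name linux is taken from the exception message; substitute your actual hostname):

  127.0.0.1   localhost linux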
Re: total pages
Hello, I assume the second counts are printed by some tool accessing the WebDB. Right? If so, 2 250 000 is the number of pages generated to be fetched (all fetched pages plus fetch attempts with errors) - simply the total number of pages in the segments. The second number is the count of pages/links in the WebDB - pages/links known to nutch, gathered by extracting links from already fetched pages. Some of these pages have already been fetched, but some of them are still to be fetched in the future. Regards Piotr Ilia S. Yatsenko wrote: Hello. Sorry for my poor English. How does nutch count documents in the search index? I have 90 segments with 25000 pages in each segment. The total is 2 250 000 pages in the index (this number I see when I execute mergesegs). But at the same time nutch reports: Number of pages: 4318557 Number of links: 5541456 Why do I see twice as many pages as I have in the real index?
Re: Merge Crawl results
Hello, You can merge segments for these two crawls using nutch mergesegs; in fact, you can simply copy all segment directories to one place. But it would not be a full merge of the crawls, as right now there is no way to merge the WebDBs of the two crawls. You can deduplicate using nutch dedup (nutch mergesegs does deduplication for you too). But in fact you should probably try a different approach - intranet crawling was meant as an easy way to crawl small sites, starting every time from scratch. If you do not want to start from scratch, you should follow the whole-web crawling tutorial, limiting it to your site/sites only in the config file. Regards Piotr Benny Lin wrote: Hi, I am looking to see if there are ways to merge different crawl results. Let's say I have two url sets in two different files, and I use the following commands: bin/nutch crawl URLs1.txt -dir test1 -depth 0 test1.log bin/nutch crawl URLs2.txt -dir test2 -depth 0 test2.log Then I have two folders, test1 and test2. My questions are: 1. Is there a way to merge the two sets of results above? If so, what's the command string? 2. If the above two sets have duplicate urls, how do I make the merged results unique? Or maybe I can do it a different way - I want to do accumulative indexing without doing everything again and again from the beginning. Can someone help out? Thanks a lot. Benny
Re: Searching by content type
Hello, Please have a look at index-more and query-more plugins for content-type handling. Regards Piotr Vacuum Joe wrote: I have been looking through the API docs and I can't figure this out. Here is my question: Is there a way to search based on meta-information, such as content type, or even the value of header fields? For example, let's say I would like to find only PDFs, or perhaps put higher weight on PDFs vs. other kinds of documents. Can this be done? I looked at the query interface. It looks like NutchBean allows me to specify a Query, and a Query is basically made up of Strings which are in the content. I can't find any way to specify meta-information I'm looking for. Any ideas on this? Thanks
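With index-more and query-more enabled (via plugin.includes, as in the PDF thread above), content type becomes queryable from the search box; e.g. to restrict results to PDFs (the type field is the one the query-more plugin of this era exposes - verify against your plugin version):

  type:application/pdf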
Re: prioritizing newly injected urls for fetching
Hello Kamil, Do you want to generate a fetchlist with urls that are present in the WebDB but were not fetched until now? I am not sure what you are trying to achieve, but you can generate any fetchlist you want using the latest tool by Andrzej Bialecki (http://issues.apache.org/jira/browse/NUTCH-68) (I have not tried it myself). There was also (some time ago) a discussion on the nutch mailing list about a refetchonly param for the fetchlist generator - some ideas are still not implemented but you can read how it works currently. Regards Piotr Kamil Wnuk wrote: Hi All, I have recently started using nutch and I am looking for a method of prioritizing urls injected during an ongoing crawl process (similar to the whole-web crawl scenario described in the tutorial) so that they are guaranteed to be included at the top of the next fetchlist generated. The purpose of this is so that I can give nutch the urls of newly created web pages that I want indexed as quickly as possible. I have looked through the nutch documentation and the mailing list archives and have not been able to find a solution. Does a good method for doing this exist? Thanks in advance, Kamil
Re: Skipping the final indexing step?
Hello Otis, If you are only reading ParseData and FetcherOutput from a nutch segment you do not need the lucene index at all. So you can safely skip the -i switch. Regards Piotr On 7/21/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hello, I'm using SegmentMergeTool to merge some large segments, and I see that the final index optimization (below) takes a looong time. I think this index creation and optimization is triggered by the -i param to SegmentMergeTool. From what I saw in SegmentMergeTool.java, this is an optional parameter, but I'm wondering if I can just skip this final indexing and index optimization step altogether. Right now I'm not making use of this final Nutch index, as I'm just reading from ParseData.DIR_NAME and FetcherOutput.DIR_NAME using ArrayFile.Reader.
Re: How to view the URLs stored in a segment
Hello, Bryan Woliner wrote: A couple more (basic) questions: When FetchListEntry is called with the -dumpurls option, where does the fetchlist get dumped, in what format, and how do I access it? The list of urls is dumped to stdout (System.out). The format of a single line is: Recno record_number: url On a related note: How do I know what method of FetchListEntry.java is called (and with what parameters) when I type bin/nutch net.nutch.pagedb.FetchListEntry -dumplist $s1? You have to read the source code of FetchListEntry. In general, how do I find out what methods (and parameters) are called by command line calls to nutch java files? In the bin/nutch shell script there is a mapping from command names to the java classes that are executed for a command. You have to read the source code of the java class (mainly the parsing of command line options) to find out what methods are called and their params. Regards Piotr
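Since the dump goes to stdout, capturing it to a file is just a shell redirect (segment path hypothetical; package name as used in this pre-0.7 thread):

  bin/nutch net.nutch.pagedb.FetchListEntry -dumpurls segments/20051107233629 > urls.txt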
Re: ndfs stuff
Hello Ferenc, Some documentation on running ndfs can be found on wiki: http://wiki.apache.org/nutch/NutchDistributedFileSystem Regards, Piotr [EMAIL PROTECTED] wrote: Have any location the ndfs usage documentation? Regards, Ferenc