PruneRegexTool

2006-12-14 Thread Bryan Woliner

Nutchers,

I know that I have seen many posts to this list regarding the usage of
nutch's prune tool (org.apache.nutch.tools.PruneIndexTool) and that many of
those posts noted the difficulty of having to pass Lucene queries as
parameters (for those of us who don't already have a firm understanding of
Lucene queries). I also know that many users wish there were a pruning
tool that could prune an index based on regex rules (like those used by
regex-urlfilter.txt).

A while back a friend of mine who is more Java savvy than I am (I am not Java
savvy at all) wrote such a tool, which we called PruneRegexTool. The reason
for this post is twofold: first, to share this tool with the nutch community
and second, to ask for help in updating this tool to work with nutch 0.8.1,
since it currently only works with 0.7.1. I'm not exactly sure why it
doesn't work with 0.8.1, but I can think of at least two probable reasons: the
tool uses Lucene 1.4.3, which is now outdated, and it is based
on the PruneIndexTool, which probably changed between nutch 0.7.1 and nutch
0.8.1.

As I have already noted, my understanding of Java is quite poor, so I don't
know if updating this tool for use with nutch 0.8.1 is a 20 minute or a 20
hour task (I'm guessing somewhere in between). Anyway, I have included the
.java and .class files for this tool (as attachments), as well as the
instructions that my friend provided me on how to use it.

If anyone updates it for use with nutch 0.8.1, I would be eternally grateful
to get a copy of the updated code, as well as any changes in the
instructions for how to use this tool properly.

Final Disclaimer: I cannot make any assurances about this tool or the
instructions provided, except that it worked fine for me when I was still
using nutch 0.7.1.

Instructions:

*Stage 1:* Compiling PruneRegexTool

Everything in this stage except for the very last step (running javac on
PruneRegexTool.java) only needs to be done once. The last step only needs to
be done again if changes are made to PruneRegexTool.java.

If you haven't already, make sure that the Java compiler is in your PATH. If
it isn't, you can add it by putting the following in your ~/.bash_profile:

  - PATH=$PATH:/usr/local/j2sdk1.4.2_08/bin/
  - export PATH

Then run:

  - source ~/.bash_profile

to make the change take effect for your current session. (The ~ is an alias
for your home directory, so you can run this command from anywhere.)
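
You can verify that the compiler is now visible with:

  - which javac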

PruneRegexTool uses code from Nutch and two other libraries that are not
included with the Nutch package: Lucene and ORO. Download the source code
for these libraries into your nutch-0.7.1 directory with the commands (run
from within the nutch directory):

  - wget http://apache.mirrors.pair.com/jakarta/lucene/source/lucene-1.4.3-src.tar.gz
  - wget http://apache.mirrors.versehost.com/jakarta/oro/source/jakarta-oro-2.0.8.tar.gz

And then extract them with:

  - tar xzvf lucene-1.4.3-src.tar.gz
  - tar xzvf jakarta-oro-2.0.8.tar.gz

Now that we've got these libraries set up, we need to tell the Java compiler
where they (and Nutch's own source code) live. This is done by setting the
CLASSPATH environment variable. Add the following to your ~/.bash_profile,
replacing USERNAME with your username:

CLASSPATH=/home/USERNAME/nutch-0.7.1/jakarta-oro-2.0.8/src/java:/home/USERNAME/nutch-0.7.1/lucene-1.4.3/src/java:/home/USERNAME/nutch-0.7.1/src/java:/home/USERNAME/nutch-0.7.1/src/plugin/urlfilter-regex/src/java:.

export CLASSPATH

Next, copy PruneRegexTool.java and PruneRegexTool.class to your nutch
directory.

You should then be able to compile the tool with the command:

javac PruneRegexTool.java
*Stage 2:* Set Up conf/regex-prune.txt

The file conf/regex-prune.txt is the default file that PruneRegexTool reads
to determine which pages to keep and which to discard, based on their URLs.
It uses the same format as conf/regex-urlfilter.txt, with + indicating
that a URL will be pruned and - indicating that it will not be pruned. A
custom file can be specified when running the tool with the flag
-regexfile filename.
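
A minimal conf/regex-prune.txt might look like this (the patterns are purely
illustrative; following the description above, + marks URLs to prune and -
marks URLs to leave in the index):

  # prune everything under the /testdir/ area of this site
  +^http://www\.testsite\.com/testdir/
  # leave every other url in the index
  -.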

*Stage 3:* Run The Tool

The full set of command line arguments that can be given to the tool is as
follows:

  - PruneRegexTool indexDir | segmentsDir [-regexfile filename]
  [-dryrun] [-force] [-output filename]
  - NOTE: exactly one of indexDir or segmentsDir must be provided
  - -regexfile: specify the file containing the regexes to be used
  during pruning. Defaults to conf/regex-prune.txt
  - -dryrun: don't do anything, just show what would be done.
  - -force: force index unlock, if locked. Use with caution!
  - -output: store pruned URLs in a text file


An example run of the tool (from your nutch-0.7.1 directory):

  - bin/nutch PruneRegexTool crawl/segments -dryrun -output
  prunedurls.log
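
Another illustrative invocation, this time pruning an index directory with a
custom regex file (all paths here are examples only, built from the flags
listed above):

  - bin/nutch PruneRegexTool crawl/index -regexfile conf/my-prune.txt -dryrun -output prunedurls.log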

End of Instructions

Again, if anyone who uses nutch 0.8.1 thinks this tool would be useful, I
would love for you to update it so it works with nutch 0.8.1. For those of
you using 

Can PruneIndexTool still be used in Nutch 0.8.1?

2006-12-12 Thread Bryan Woliner

Hi,

When using 0.7.x I often used the PruneIndexTool, but I noticed that calling
bin/nutch Prune no longer works and Prune is not included in the 0.8
command line options section of the nutch wiki. Furthermore, when I call the
command locate PruneIndexTool, all the returned files start with
nutch-0.7.1/docs/api/org/apache/nutch/tools/ and nothing comes up from my
nutch-0.8.1 directory.

Can the PruneIndexTool still be used with nutch 0.8.1? If so, is the usage
the same as it was under nutch 0.7.x and where can the source files be
found?

Thanks for any help anyone can provide!

-Bryan


Does nutch 0.8.x have a command like bin/nutch fetchlist -dumpurls

2006-11-12 Thread Bryan Woliner

Hi,

When I was using nutch 0.7, I found the bin/nutch fetchlist -dumpurls
command to be very useful. However, I have not been able to find an
equivalent command in nutch 0.8.x.

Essentially all I want to do is dump all urls stored in a certain segment
(or group of segments) into a text file.

In nutch 0.7.x I would call a command like this:

$ bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls $s1 foo.txt

Any suggestions for how this can be accomplished in nutch 0.8.x are very
much appreciated.

Thanks,
Bryan
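
One possible 0.8.x equivalent is the segment reader exposed as bin/nutch
readseg; the flags and dump format below are from memory, so check the usage
that bin/nutch readseg prints before relying on them:

$ bin/nutch readseg -dump crawl/segments/20060613200226 segdump -nocontent -nofetch -nogenerate -noparse -noparsedata -noparsetext
$ grep "URL::" segdump/dump > urls.txt

The segment path is only an example, and the "URL::" field label inside the
dump file may vary between builds.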


Two Errors in Nutch 0.8 Tutorial?

2006-07-25 Thread Bryan Woliner

I am certainly far from a nutch expert, but it appears to me that there are
two errors in the current Nutch 0.8 tutorial.

First off, here is the version of Nutch 0.8 that I am using, in case there
have been changes made in newer versions that invalidate my comments:

-bash-2.05b$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 414318
Node Kind: directory
Schedule: normal
Last Changed Author: siren
Last Changed Rev: 414306
Last Changed Date: 2006-06-14 11:08:28 -0500 (Wed, 14 Jun 2006)
Properties Last Updated: 2006-06-14 12:00:57 -0500 (Wed, 14 Jun 2006)

Error #1:

Towards the end of the tutorial, the following command is found:

bin/nutch invertlinks crawl/linkdb crawl/segments


When I call this command verbatim, I get the following error:

2006-07-25 08:44:40,503 WARN  mapred.LocalJobRunner (LocalJobRunner.java:run(119)) - job_8ly5hf
java.io.IOException: No input directories specified in: Configuration: defaults: hadoop-default.xml , mapred-default.xml , /home/bryan/nutch-8d/hadoop/mapred/local/localRunner/job_8ly5hf.xml final: hadoop-site.xml
   at org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:96)
   at org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:37)
   at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:106)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
   at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)

I think the correct syntax for the command should be:

bin/nutch invertlinks crawl/linkdb crawl/segments/* (with the /* added
to the end).

Error #2:

The tutorial says that to index, the following command should be called:

bin/nutch index indexes crawl/linkdb crawl/segments/*

However, when I call that command I get the following error:

Usage: index crawldb linkdb segment ...

I believe the correct syntax should be:

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

If these are indeed errors in the tutorial, perhaps someone with the
authority to do so would be kind enough to make the necessary
changes.

My two cents,
Bryan


Dissecting the Nutch Search Page (Please Help!)

2006-07-23 Thread Bryan Woliner

I am trying to modify the standard nutch search page (for nutch 0.8-dev) and
have several questions:

1. Do most people modify the search.html file directly, or is it better to
modify the files that are used to automatically generate the search.html
page? If the latter is the case, are there any files besides these that are
involved in the creation of the search page:

  - ../nutch-8d/src/web/jsp/search.jsp
  - ../nutch-8d/src/web/include/style.html
  - ../nutch-8d/src/web/include/footer.html
  - ../nutch-8d/src/web/include/en/header.xml
  - ../nutch-8d/src/web/pages/en/search.xml


2. I have looked at the source of the search.html page that comes up when
you open :8080, and it appears that this page is mostly generated from
search.jsp and certain other HTML pages it includes (listed above).
However, I cannot figure out where the menu is being imported from. This is
the section of code that follows the imported style sheet, but precedes the
input box and button used to search. Where does this code come from? Also,
I cannot figure out where the nutch_logo.gif image is coming from (that file
name doesn't even appear in the source for search.html).

Any help is much appreciated.

Thanks,
Bryan


Re: Problems switching over from nutch 0.7.1 to nutch 0.8 (dev) -- zero search results problem with invertlinks

2006-06-20 Thread Bryan Woliner

Kuro,

Thanks for the tip. I made the changes you suggested and took a look at the
debug output, which allowed me to realize that my difficulties were actually
occurring before I got the Job Failed exception. Specifically, when I call
the bin/nutch inject command at the beginning of my whole-web crawl, I
get the following error, which I haven't been able to figure out (any
insights are much appreciated):

2006-06-14 18:02:32,401 DEBUG conf.Configuration (Configuration.java:init(67)) - java.io.IOException: config()
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:67)
   at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
   at org.apache.nutch.crawl.Injector.main(Injector.java:148)

This problem seemed like a different issue, so I posted separately about it
yesterday:
http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200606.mbox/[EMAIL
PROTECTED]

Thanks,
Bryan


On 6/16/06, Teruhiko Kurosaka [EMAIL PROTECTED] wrote:


Bryan,
Some recent changes in the logging code changed the default logging
behavior; nutch doesn't output anything to the console. (It supposedly sends
the logging output to a file described as ${nutch.log.dir}/${nutch.log.file},
but I don't know what the default values of these variables are.)

You can change conf/log4j.properties to change the logging behavior of
the nutch command line. (There is a separate logging properties file for the
search GUI.) I changed conf/log4j.properties as outlined below, to enable
full debug logging. (Only changed lines are shown.)
#log4j.rootLogger=INFO,DRFA
log4j.rootLogger=DEBUG, stdout
#log4j.logger.org.apache.nutch=INFO
#log4j.logger.org.apache.hadoop=WARN

I hope this helps.
-kuro

 From: Bryan Woliner [mailto:[EMAIL PROTECTED]
 Sent: 2006-6-15 18:21

 $ bin/nutch crawl test -dir crawl3 -depth 2 -topN 50

 It seemed like everything worked correctly (although unlike nutch 0.7.1,
 no output was generated)




Error when calling bin/nutch inject -- java.io.IOException: config()

2006-06-19 Thread Bryan Woliner

On June 13th, I downloaded the trunk version of nutch-0.8-dev and then built
it using ant.

I then created a valid urls file and put it in the urlsdir subdirectory of
my nutch directory. I also made sure that my conf/regex-urlfilter.txt file
was valid.

At that point, I tried to do my first whole-web crawl using 0.8, so I called
the command:

bin/nutch inject testcrawl/crawldb urlsdir

However, when calling this command, the logged output included the following
error (more logged output included below):

2006-06-14 18:02:32,401 DEBUG conf.Configuration (Configuration.java:init(67)) - java.io.IOException: config()
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:67)
   at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
   at org.apache.nutch.crawl.Injector.main(Injector.java:148)

What am I doing wrong?

Are there any configuration files or environment variables that I need to
modify?

Thanks for any helpful suggestions anyone can provide!

-Bryan

LOGGED OUTPUT:


2006-06-19 18:02:32,401 DEBUG conf.Configuration (Configuration.java:init(67)) - java.io.IOException: config()
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:67)
   at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
   at org.apache.nutch.crawl.Injector.main(Injector.java:148)

2006-06-19 18:02:32,407 INFO  crawl.Injector (Injector.java:inject(110)) - Injector: starting
2006-06-19 18:02:32,410 INFO  crawl.Injector (Injector.java:inject(111)) - Injector: crawlDb: crawl8/crawldb
2006-06-19 18:02:32,410 INFO  crawl.Injector (Injector.java:inject(112)) - Injector: urlDir: urlsdir
2006-06-19 18:02:32,696 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing jar:file:/home/bryan/nutch-8d/lib/hadoop-0.3.2.jar!/hadoop$
2006-06-19 18:02:32,816 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/nutch-default.xml
2006-06-19 18:02:32,861 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/nutch-site.xml
2006-06-19 18:02:32,872 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/hadoop-site.xml
2006-06-19 18:02:32,873 INFO  crawl.Injector (Injector.java:inject(120)) - Injector: Converting injected urls to crawl db entries.
2006-06-19 18:02:32,876 DEBUG conf.Configuration (Configuration.java:init(76)) - java.io.IOException: config(config)
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:76)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:86)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:97)
   at org.apache.nutch.util.NutchJob.init(NutchJob.java:26)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:121)
   at org.apache.nutch.crawl.Injector.main(Injector.java:155)

2006-06-19 18:02:32,889 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing jar:file:/home/bryan/nutch-8d/lib/hadoop-0.3.2.jar!/hadoop$
2006-06-19 18:02:32,914 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/nutch-default.xml
2006-06-19 18:02:32,930 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing jar:file:/home/bryan/nutch-8d/lib/hadoop-0.3.2.jar!/mapred$
2006-06-19 18:02:32,933 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing jar:file:/home/bryan/nutch-8d/lib/hadoop-0.3.2.jar!/mapred$
2006-06-19 18:02:32,937 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/nutch-site.xml
2006-06-19 18:02:32,940 INFO  conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/hadoop-site.xml
2006-06-19 18:02:33,503 DEBUG conf.Configuration (Configuration.java:init(76)) - java.io.IOException: config(config)
   at org.apache.hadoop.conf.Configuration.init(Configuration.java:76)
   at org.apache.hadoop.mapred.JobConf.init(JobConf.java:86)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.init(LocalJobRunner.java:57)
   at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:181)
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:277)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:312)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:131)
   at org.apache.nutch.crawl.Injector.main(Injector.java:155)


Problems switching over from nutch 0.7.1 to nutch 0.8 (dev) -- zero search results problem with invertlinks

2006-06-15 Thread Bryan Woliner

Hi All,

I have been using Nutch 0.7.1 for some time (although I am certainly not an
expert) and am now in the process of switching over to Nutch 0.8. However, I
have run into a couple of problems along the way and am hoping that those of
you who have been using nutch 0.8 for a while will take a quick look at what
I have done and see if you can figure out why I am running into these
problems. Thanks ahead of time for any help you can offer!!

__

The two problems I am having are essentially as follows (more detail
provided below):

1. So far I have been able to run a test crawl using bin/nutch crawl, but
when I go to my nutch search page (:8080) and try a search, I always get zero
results returned, even though I am able to open the index using Luke and
verify that there are approximately 200 documents and approximately 40,000
search terms in my index and there are no errors in the Tomcat logs.

2. I am unable to get through the whole-web crawl in the nutch-0.8 tutorial.
Specifically, I get stuck on the bin/nutch invertlinks step, where I get
the message:

Exception in thread main java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
   at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)
___

** Details **

These are the steps I took to install nutch 0.8.

1. Downloaded Nutch 0.8 (dev)

I was previously using the release copy of nutch 0.7.1, so this was the
first time I had to build a release of nutch using ant. I downloaded ant and
then installed the current trunk of nutch 0.8 (thinking it would be more
stable than the nightly build). To do this I did the following from my home
directory:

$ svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk
$ mv trunk nutch-8d
$ export ANT_HOME=/usr/local/ant/apache-ant-1.6.5
$ export PATH=${PATH}:${ANT_HOME}/bin
$ cd nutch-8d
$ ant

2. Compiled Nutch 0.8 war file and then replaced ROOT Tomcat directory

I then did the following from my nutch-8d directory:

$ ant war
$ mv /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT_nutch-0.7/
$ cp build/nutch-0.8-dev.war /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT.war

3. Tried first Nutch 0.8 crawl using the CrawlTool

I first created an urls file at ../nutch-8d/test/urls and then set the
crawl-urlfilter.txt file to allow essentially all URLs.

I then did a round of fetching using the following call:

$ bin/nutch crawl test -dir crawl3 -depth 2 -topN 50

It seemed like everything worked correctly (although unlike nutch 0.7.1, no
output was generated).

I then did the following:

$cd crawl3

$ /usr/local/jakarta-tomcat-4.1.31/bin/catalina.sh stop
Using CATALINA_BASE:   /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_HOME:   /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_TMPDIR: /usr/local/jakarta-tomcat-4.1.31/temp
Using JAVA_HOME:   /usr/local/j2sdk1.4.2_08

$ /usr/local/jakarta-tomcat-4.1.31/bin/catalina.sh start
Using CATALINA_BASE:   /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_HOME:   /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_TMPDIR: /usr/local/jakarta-tomcat-4.1.31/temp
Using JAVA_HOME:   /usr/local/j2sdk1.4.2_08

Everything seemed to be working correctly, but when I went to my nutch
search page (i.e. :8080), no matter what search term I enter, I get zero
results returned.

I then did the following to troubleshoot the situation:

1. Reviewed the tomcat logs (no error messages of any sort).

2. Looked at the following segments stats:

$ bin/nutch segread -list -dir crawl3/segments

NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20060613200213  3          2006-06-13T20:02:20  2006-06-13T20:02:22  3        3
20060613200226  214        2006-06-13T20:02:32  2006-06-13T20:04:48  217      181

3. Opened the index I am trying to search using Luke, which allowed me to
verify that there are approximately 200 documents and approximately 40,000
search terms in my index (including search terms that were returning zero
results when I was searching for them).

I HAVE NO IDEA WHY ZERO SEARCH RESULTS ARE ALWAYS BEING RETURNED -- PLEASE
HELP.

4. Trying a Whole-Web Crawl

After I couldn't figure out why I was always getting zero search results, I
tried to follow the instructions for a whole-web crawl, just for the hell of
it. Things seemed to be going fine, until I got to the invertlinks step, at
which point I always get an error message. Below are the command calls that
I made (and the error message). Please let me know what I am doing wrong:

I first made sure that the test/urls file and regex-urlfilter.txt files had
valid entries, which they do.


-bash-2.05b$ bin/nutch inject testcrawl/crawldb test

-bash-2.05b$ bin/nutch generate testcrawl/crawldb testcrawl/segments

-bash-2.05b$ s1=`ls -d testcrawl/segments/2* | tail -1`
-bash-2.05b$ echo 

What are valid names and location(s) for segments

2006-03-09 Thread Bryan Woliner
I am using nutch 0.7.1 and have a couple questions about valid segment names
and locations:

I can get nutch to work fine when I store my segments, with their original
nutch assigned names in the folder: /usr/local/nutch-0.7.1/live/segments/
and then start tomcat from the /usr/local/nutch-0.7.1/live/ directory.

However, if I change the names of any of the segments then I get either zero
search results or a blank screen when I try to search.

Additionally, if I do not change the names, but move the segments to
sub-directories of the /live/segments/ folder (i.e. /live/segments/site1/),
then I always get zero search results.

Question: What is the easiest way to get nutch to recognize segments with
modified names, or those that are stored in a sub-directory of the segments
folder.

In General: The larger problem that I am trying to solve is that my nutch
search engine currently crawls and indexes a couple dozen sites and I want
to update (i.e. re-crawl) these sites independently and at different time
intervals. My current plan is to have a ../live/segments/ folder and store
an updated (and indexed) segment for each site in that folder. With this in
mind, I'm sure you can understand why it would be difficult to keep this
folder organized without being able to rename segments and/or store them in
sub-directories. If anyone has any ideas about how to organize these segments
without renaming them or storing them in sub-directories, I'm all ears.

Thanks ahead of time for any suggestions,

Bryan


Re: Do not index seed page?

2006-01-24 Thread Bryan Woliner
I have a similar issue and have begun working on a tool that would prune an
index using a file of regexes. When I get it working I will be happy to make
it publicly available.

-Bryan

On 1/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote:

 Blocking a page in a URL filter also means the page will not be fetched, so
 that doesn't solve your problem.
 You can remove the page manually from the index e.g. by using
 PruneIndexTool.
 However, I have something here that can also solve the problem, but I
 need some more time to prepare a patch.

 Stefan

 On 21.01.2006 at 16:54, Franz Werfel wrote:

  Yes, that is an option we are certainly considering, but we would
  rather have a start page and forget about it.
  Cheers, Fr
 
  On 1/20/06, Neal Whitley [EMAIL PROTECTED] wrote:
  Franz,
 
  Someone else will need to confirm this...
 
  FYI...why not simply inject the urls directly into Nutch?
 
  ./nutch inject db/ -urlfile seeds.txt
 
 
  At 03:49 PM 1/20/2006, you wrote:
 
  Thank you, but if I do that will the page be read for urls?
  Cheers, Frank
 
  On 1/20/06, Neal Whitley [EMAIL PROTECTED] wrote:
  Franz,
 
  I 'think' you could use the regex url filter to not index this page
  (regex-urlfilter.txt).
 
  Something like:  -^http://([a-z0-9]*\.)*tripod.com/
 
  I am new to Nutch so I make no guarantee... :-)
 
  Neal
 
 
 
  At 05:23 AM 1/20/2006, you wrote:
 
  Hello,
 
   We are trying to implement Nutch on an intranet and have set up a
   special page which has links to all the other pages of the site, since
   many are not linked together.
   We will start with this special page and then go from there to all the
   other pages, but we would like to not index the first page (so that it
   doesn't show up in search results), just use it for its links.
   Is this easily possible?
 
  Thank you.
 
 
 
 
 

 ---
 company:http://www.media-style.com
 forum:http://www.text-mining.org
 blog:http://www.find23.net






Common Lucene Queries for PruneIndexTool -- GROUPS of files or folders

2006-01-16 Thread Bryan Woliner
OK,

I have spent a fair amount of time trying to figure out how to create
the correct Lucene queries to use with the PruneIndexTool. I have read
the wiki page for bin/nutch Prune, looked at the Lucene Query Parser
Syntax page and browsed past mailing list discussions on the subject.

Accordingly, I have used bin/nutch org.apache.nutch.searcher.Query to
create queries for a specific URL or a specific directory. I enter the
URL or directory at the Query prompt and then copy the +(url:*)
section of the output into my queries.txt file.
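
For reference, that lookup step looks roughly like this; the query that gets
printed depends on how the url field is analyzed, so treat the output as a
placeholder:

$ bin/nutch org.apache.nutch.searcher.Query
Query: http://www.testsite.com/testdir/
(copy the +(url:...) portion of the printed output into queries.txt, one
query per line)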

However, I am still at a loss for how to create the proper lucene
queries for GROUPS of files and folders.

Here are some of the most common groupings of files and/or directories I
am trying to prune from my index. It would be great if anyone could
suggest the correct Lucene query to use and/or how to figure out these
types of queries.

1. I want to prune the URL "http://www.testsite.com/testdir/", but I
don't want to prune any other files in the /testdir/ directory.

2. I want to prune URLs in the range: http://www.testsite.com/[20-40]/

(meaning the following URLs would be pruned):

http://www.testsite.com/20/
http://www.testsite.com/21/
...
http://www.testsite.com/39/
http://www.testsite.com/40/

I would even settle for the following URLs being pruned:
http://www.testsite.com/??/

3. I want to prune the URLs "http://www.testsite.com/*.php"

Either just in this directory, or recursively through all
sub-directories (ideally I would like to know how to do both).

Any help is much appreciated!

-Bryan


How can no URLs be fetched until the 11th round of fetching?

2006-01-15 Thread Bryan Woliner
I am using Nutch 0.7.1 (no mapreduce) and did a whole-web crawl with 14
rounds of fetching and a urls file with one URL in it. No URLs were
fetched during the first 10 rounds, but then in the 11th round one URL was
fetched and increasingly more URLs were fetched in rounds 12-14. I am basing
the numbers of URLs fetched on the output from calling bin/nutch segread
(included below). I don't understand how this can happen. If a URL is not
fetched during a round, are its outlinks still added to the database for the
next round of fetching? Why would I have 10 rounds of fetching with no URLs
fetched and then suddenly have one fetched successfully in the 11th round?

Any suggestions are appreciated.
-Bryan

Here is the output when I call:

bin/nutch segread -list -dir segments

run java in /usr/local/j2sdk1.4.2_08
060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-default.xml
060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-site.xml
060115 205601 No FS indicated, using default:local
060115 205601 PARSED?  STARTED              FINISHED             COUNT  DIR NAME
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173409
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173413
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173417
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173421
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173424
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173428
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173432
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173436
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173440
060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173443
060115 205602 true     20060115-17:34:51    20060115-17:34:51    1      ../segments/20060115173447
060115 205602 true     20060115-17:34:57    20060115-17:41:07    42     ../segments/20060115173454
060115 205602 true     20060115-17:41:16    20060115-18:12:28    234    ../segments/20060115174113
060115 205602 true     20060115-18:12:37    20060115-19:51:07    738    ../segments/20060115181234
060115 205602 TOTAL: 1015 entries in 14 segments.


Re: How can no URLs be fetched until the 11th round of fetching?

2006-01-15 Thread Bryan Woliner
I don't think that I was completely clear in my first post. What you
are saying makes sense if I were doing a one-round fetch on a number of
different occasions. However, I am doing 14 rounds of fetching, each
called by one script, in the pattern outlined in the nutch tutorial,
where my script does 14 loops of the following:
--

bin/nutch generate db segments
s[$i]=`ls -d segments/2* | tail -1`
bin/nutch fetch ${s[$i]}
bin/nutch updatedb db ${s[$i]}
--

Do you think the possibilities you suggested make sense in light of
the fact that I am doing each of these rounds of fetching within
seconds of each other, each being called by the same script?

I also have a couple of related questions:

(1) In the first round of fetching, the fetchlist is generated from
the database, which was injected with the one URL that comprises my
urls file. If, in the first round of fetching, the one URL in the fetch
list can't be fetched and/or parsed, I am assuming that subsequent
rounds of fetching just use the same one-URL fetchlist until this URL
is successfully fetched and its outlinks added to the database. Is
that correct?

(2) When I call the following command, the resulting file has no
output for the rounds where no URLs were fetched. This leads me to
believe that the fact that no URLs were fetched is not a result of a
fetching or parsing error (since such errors usually show up in the
output of this command). Does this make sense? If it does, then what
caused no URLs to be fetched?

Thanks for any helpful suggestions,
Bryan

On 1/15/06, Fuad Efendi [EMAIL PROTECTED] wrote:
 Many things could happen.

 Sample1: website was unavailable during first 10 fetches
 Sample2: 11th fetch used different IP, DNS-to-IP mapping changed (or may be
 finally resolved!)
 Sample3: Something changed on a site, redirect added/changed, etc.
 Sample4: web-master modified robots.txt
 Sample5: big first HTML file, network errors during first 10 fetch attempts,
 etc.

 It should be very uncommon behaviour, but it may happen...


 -Original Message-
 From: Bryan Woliner

 I am using Nutch 0.7.1 (no mapreduce) and did a whole-web crawl with 14
 rounds of fetching and an urls files with one URL in it. No urls were
 fetched during the first 10 rounds, but then in the 11th round one URL was
 fetched and increasing more URLs were fetched in rounds 12-14. I am basing
 the numbers of URLs fetched  on the  output from calling bin/nutch segread
 (included below). I don't understand how this can happen. If a URL is not
 fetched during a round are its outlinks still added to the database for the
 next round of fetching? Why would I have 10 rounds of fetching with no URLs
 fetched and then suddenly have one fetched successfully in the 11th round?

 Any suggestions are appreciated.
 -Bryan

 Here is the output when I call:

 bin/nutch segread -list -dir segments

 run java in /usr/local/j2sdk1.4.2_08
 060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-default.xml
 060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-site.xml
 060115 205601 No FS indicated, using default:local
 060115 205601 PARSED?  STARTED              FINISHED             COUNT  DIR NAME
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173409
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173413
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173417
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173421
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173424
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173428
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173432
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173436
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173440
 060115 205601 true     19691231-18:00:00    19691231-18:00:00    0      ../segments/20060115173443
 060115 205602 true     20060115-17:34:51    20060115-17:34:51    1      ../segments/20060115173447
 060115 205602 true     20060115-17:34:57    20060115-17:41:07    42     ../segments/20060115173454
 060115 205602 true     20060115-17:41:16    20060115-18:12:28    234    ../segments/20060115174113
 060115 205602 true     20060115-18:12:37    20060115-19:51:07    738    ../segments/20060115181234
 060115 205602 TOTAL: 1015 entries in 14 segments.




Re: port :8080 no longer brings up Nutch search page!

2006-01-04 Thread Bryan Woliner
Nevermind, I was able to fix it by renaming the tomcat/webapps/ROOT/
directory and then restarting tomcat, which recreated the root directory
from the ROOT.war file. I must have messed up some of the permissions in the
ROOT folder.

On 1/4/06, Bryan Woliner [EMAIL PROTECTED] wrote:

 When I originally installed nutch and tomcat on my machine, I needed to
 change the ownership and permission of certain files in subdirectories of
 the ../jakarta-tomcat-4.1.31/ folder in order to be able to use tomcat
 and nutch together. I've had no problems with tomcat for some time, however
 I am currently in the process of setting up my server so several other users
 can test nutch and so I did the following:

 I originally had nutch installed in /home/user1/nutch-0.7.1/

 I copied the whole nutch folder to /home/user2/nutch-0.7.1/

 I got nutch to run fine from user2's account and the /home/user2/nutch-
 0.7.1/ folder, however I was getting some permission errors when trying to
 start tomcat from user2's account and nutch folder. Therefore, I did the
 following:


 All of the files in the /jakarta-tomcat-4.1.31/logs/ folder and the
 /jakarta-tomcat-4.1.31/webapps/ROOT folder, as well as the
 /jakarta-tomcat-4.1.31/webapps/ROOT.war file, had user1 as the user and
 group owner and had file permissions of 655.

 To enable user2 to access these files I changed ownership to
 webadmin:webadmin (a group that user1 and user2 both belong to) and changed
 the permission on all of these files to 665.

 PROBLEM: I am now able to start tomcat from either user1's or user2's
 account, BUT when I go to my :8080 port I no longer get the Nutch search
 page -- instead I get a listing of my /jakarta-tomcat-4.1.31/webapps/ROOT/
 directory:

 Directory Listing For / (on www.searchthenews.org:8080)
 --
 Filename / Size / Last Modified (all sizes and dates show as null):
 anchors.jsp/, ca/, cached.jsp/, cluster.jsp/, de/, en/, es/, explain.jsp/,
 fi/, fr/, hu/, img/, include/, index.jsp/, jp/, more.jsp/, ms/, nl/, pl/,
 pt/, refine-query-init.jsp/, refine-query.jsp/, search.jsp/, sv/, text.jsp/,
 th/, zh/
 --
 Apache Tomcat/4.1.31
 When I try to click on search.jsp/ or en/ I get the following error,
 even though I know these files/folders are in my ../webapps/ROOT/
 directory!!

 HTTP Status 404 - /search.jsp/
 --

  type: Status report

  message: /search.jsp/

  description: The requested resource (/search.jsp/) is not available.
 --
 Apache Tomcat/4.1.31
 PLEASE HELP!




port :8080 no longer brings up Nutch search page!

2006-01-04 Thread Bryan Woliner
When I originally installed nutch and tomcat on my machine, I needed to
change the ownership and permission of certain files in subdirectories of
the ../jakarta-tomcat-4.1.31/ folder in order to be able to use tomcat and
nutch together. I've had no problems with tomcat for some time, however I am
currently in the process of setting up my server so several other users can
test nutch and so I did the following:

I originally had nutch installed in /home/user1/nutch-0.7.1/

I copied the whole nutch folder to /home/user2/nutch-0.7.1/

I got nutch to run fine from user2's account and the /home/user2/nutch-0.7.1/
folder, however I was getting some permission errors when trying to start
tomcat from user2's account and nutch folder. Therefore, I did the
following:


All of the files in the /jakarta-tomcat-4.1.31/logs/ folder and the
/jakarta-tomcat-4.1.31/webapps/ROOT folder, as well as the
/jakarta-tomcat-4.1.31/webapps/ROOT.war file, had user1 as the user and group
owner and had file permissions of 655.

To enable user2 to access these files I changed ownership to
webadmin:webadmin (a group that user1 and user2 both belong to) and changed
the permission on all of these files to 665.

PROBLEM: I am now able to start tomcat from either user1's or user2's
account, BUT when I go to my :8080 port I no longer get the Nutch search
page -- instead I get a listing of my /jakarta-tomcat-4.1.31/webapps/ROOT/
directory:

Directory Listing For / (on www.searchthenews.org:8080)
--
Filename / Size / Last Modified (all sizes and dates show as null):
anchors.jsp/, ca/, cached.jsp/, cluster.jsp/, de/, en/, es/, explain.jsp/,
fi/, fr/, hu/, img/, include/, index.jsp/, jp/, more.jsp/, ms/, nl/, pl/,
pt/, refine-query-init.jsp/, refine-query.jsp/, search.jsp/, sv/, text.jsp/,
th/, zh/
--
Apache Tomcat/4.1.31
When I try to click on search.jsp/ or en/ I get the following error,
even though I know these files/folders are in my ../webapps/ROOT/
directory!!

HTTP Status 404 - /search.jsp/
--

type: Status report

message: /search.jsp/

description: The requested resource (/search.jsp/) is not available.
--
Apache Tomcat/4.1.31
PLEASE HELP!


Re: which files/directories are needed after a segment or index merge

2005-12-22 Thread Bryan Woliner
Thanks Stefan! I guess I should have looked at the searcher.dir entry in
nutch-site.xml to start with.

For the record, I was able to search the index of the merged segment
successfully after I created a /nutch-0.7.1/Live/segments/ folder, put my
segments in that directory and started tomcat from the /Live/ directory.

It was not even necessary to modify the searcher.dir entry in my
nutch-site.xml file.

Thanks again,
Bryan


On 12/22/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

 OK, sorry, in case you use 0.7 (my fault) the index itself is
 stored in the segments.
 So you need to copy the segments that include the indexes into a
 folder, maybe called finalSegments.
 In nutch-default.xml your search folder then should be /home/finalSegments
 or so.
 Sorry!


 From my point of view the sources are usable.

  Is there an estimated release date for 0.8?
 Not yet.
 Stefan



which files/directories are needed after a segment or index merge

2005-12-21 Thread Bryan Woliner
I am using nutch 0.7.1 (non-mapred) and am a little confused about how to
move the contents of several test crawls into a single live directory.
Any suggestions are very much appreciated!

I want to have a Live directory that contains all the indexes that
are ready to be searched.

The first index I want to add to the Live directory comes from a
crawl with 10 rounds of fetching, whose db and segments are stored in
the following directories:

/crawlA/db/
/crawlA/segments/

I can merge all of the segments in the segments directory (using
bin/nutch mergesegs), which results in the following (11th) segment
directory:

/crawlA/segments/20051219000754/

I can then index this 11th (i.e. merged) segment.

However, I have the following questions about which files and
directories should be moved to the Live directory:

1. If I copy /crawlA/db/ to /Live/db/ and copy
/crawlA/segments/20051219000754/ to /Live/segments/20051219000754/,
then I can start tomcat from /Live/ and I'm able to search the index
fine. However, I'm not sure if that can be duplicated for my crawlB
directory. I can't copy /crawlB/db/
to the Live directory because there is already a db directory there.
What are the correct files and directories to copy from each crawl
into the Live directory?

2. On a side note: am I even taking the correct approach in merging the 10
segments in the crawlA/segments/ directory before I index, or should I index
each segment first and then merge the 10 indexes? If I were to take the
latter approach (merging indexes instead of segments), which files from the
/crawlA/ directory would I need to move to the Live directory?

Thanks ahead of time for any helpful suggestions,


Re: which files/directories are needed after a segment or index merge

2005-12-21 Thread Bryan Woliner
Stefan,

Thanks so much for the speedy reply!

I have a couple of comments:

1. I am currently NOT using NDFS or map reduce because the number of sites I
am looking to fetch and index is relatively small (currently less than 1
million). Accordingly, I am using the 0.7.1 version of nutch available from
the Nutch website. Does this seem like the correct choice?

2. I currently do use a script which basically looks like this (I have
another version that includes indexing):

mkdir $DB_DIR
mkdir $SEG_DIR

bin/nutch admin $DB_DIR -create

bin/nutch inject $DB_DIR -urlfile $URL_FILE

i=$FETCH_ROUNDS

while [ $i -gt 0 ]
do

bin/nutch generate $DB_DIR $SEG_DIR $TOP_N $MAX_SITE

s[$i]=`ls -d $SEG_DIR/2* | tail -1`

bin/nutch fetch ${s[$i]}

bin/nutch updatedb $DB_DIR ${s[$i]}

i=`expr $i - 1`

done

3. It is my understanding that Nutch 0.7.1 (no NDFS or mapred) only has a
webdb and not the linkdb/crawldb structure. If that is correct, then if I'm
trying to add two merged segments (and the index of each) to my live
folder, do I also need the webdb of each (and if so, do I need to merge
them)?

Thanks again for the help,
Bryan

On 12/21/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

 In general I suggest using a shell script and doing the commands
 manually instead of using the crawl command, maybe something like:

 NUTCH_HOME=$HOME/nutch-0.8-dev

 while [ 1 ]
 # or maybe just 10 rounds
 do
 DATE=$(date +%d%B%Y_%H%M%S)

 $NUTCH_HOME/bin/nutch generate /user/nutchUser/crawldb /user/nutchUser/segments -topN 500
 s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 1-38`
 $NUTCH_HOME/bin/nutch fetch $s
 $NUTCH_HOME/bin/nutch updatedb /user/nutchUser/crawldb $s
 # only when indexing
 $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/segments
 # what to index, maybe the merged segment from the 10 rounds
 s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 24-38`
 # index
 $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$s /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/segments/$s

 done

 This prevents you from having to merge crawl db's.
 Then you only need the merged segment, the linkdb and the index from
 the merged segment.
 The 10 segments used to build the merged segment can be removed.

 Hope this helps; you may only need to change the script to have a 10
 round loop to create your 10 segments, and the merging command is also
 not in the script.
 Stefan

 On 21.12.2005 at 18:28, Bryan Woliner wrote:

  I am using nutch 0.7.1 (non-mapred) and am a little confused about
  how to
  move the contents of several test crawls into a single live
  directory.
  Any suggestions are very much appreciated!
 
  I want to have a Live directory that contains all the indexes that
  are ready to be searched.
 
  The first index I want to add to the Live directory comes from a
  crawl with 10 rounds of fetching, whose db and segments are stored in
  the following directories:
 
  /crawlA/db/
  /crawlA/segments/
 
  I can merge all of the segments in the segments directory (using
  bin/nutch mergesegs), which results in the following (11th) segment
  directory:
 
  /crawlA/segments/20051219000754/
 
  I can then index this 11th (i.e. merged) segment.
 
  However, I have the following questions about which files and
  directories should be moved to the Live directory:
 
  1. If I copy /crawlA/db/ to /Live/db/  and copy
  /crawlA/segments/20051219000754/ to /Live/segments/20051219000754/ ,
  then I can start tomcat from /Live/ and I'm able to search the index
  fine. However, I'm note sure if that can be duplicated for my crawlB
  directory. I can't copy /crawlB/db/
  to the Live directory because there is already a db directory there.
  What are the correct files and directories to copy from each crawl
  into the Live directory?
 
  2. On a side note: am I even taking the correct approach in merging
  the 10
  segments in
  the crawlA/segments/ directory before I index, or should I index each
  segment first and then merge the 10 indexes? If I was to take the
  latter approach (merging indexes instead of segments), which files
  from the
  /crawlA/ directory would I need
  to move to the Live directory.
 
  Thanks ahead of time for any helpful suggestions,

 ---
 company:http://www.media-style.com
 forum:http://www.text-mining.org
 blog:http://www.find23.net






Re: Luke and Indexes

2005-12-08 Thread Bryan Woliner
Thank you very much for the helpful answers. Most of the pages that
didn't make it into the index were indeed due to protocol errors
(mostly exceeding http.max.delay).

One quick side note. When I was looking at the Nutch wiki page for
bin/nutch segread, I noticed an error on the page and wasn't sure how
to go about fixing it, or alerting someone who can. The page currently
reads:

...

-nocontent

  ignore content data

-noparsedata

  ignore parse_data data

-nocontent

  ignore parse_text data

...

The 2nd -nocontent should probably be -noparsetext, right?

Thanks again for the help,
Bryan

On 12/8/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Bryan Woliner wrote:

 I have a couple very basic questions about Luke and indexes in
 general. Answers to any of these questions are much appreciated:
 
 1. In the Luke overview tab, what does Index version refer to?
 
 

 It's the time (as in System.currentTimeMillis()) when the index was last
 modified.

 2. Also in the overview tab, if Has Deletions? is equal to yes,
 where are the possible sources of deletions? Dedup? Manual deletions
 through luke?
 
 
 

 Either. Both.

 3. Is there any way (w/ Luke or otherwise) to get a file listing all
 of the docs in an index. Basically is there an index equivalent of
 this command (which outputs all the URLs in a segment):
 
 bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir
 
 

 You can browse through documents on the Document tab. But there is no
 option to dump all documents to a file. Besides, some fields which are
 not stored are no longer accessible, so you cannot retrieve them from
 the index (you may be able to reconstruct them, but it's a lossy operation).

 4. Finally, my last question is the one I'm most perplexed by:
 
 I called bin/nutch segread -list -dir for a particular segments
 directory and found out that one directory had 93 entries. BUT, when I
 opened up the index of that segment in Luke, there were only 23
 documents (and 3 deletions)! Where did the rest of the URLs go??
 
 

 Do a segread -dump and check what is the protocol status and parse
 status for the pages that didn't make it to the index. Most likely you
 encountered either protocol errors or parsing errors, so there was
 nothing to index from these entries.

 In addition, if you ran the deduplication, some of the entries in your
 index may have been deleted because they were considered duplicates.

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





Luke and Indexes

2005-12-07 Thread Bryan Woliner
I have a couple very basic questions about Luke and indexes in
general. Answers to any of these questions are much appreciated:

1. In the Luke overview tab, what does Index version refer to?

2. Also in the overview tab, if Has Deletions? is equal to yes,
where are the possible sources of deletions? Dedup? Manual deletions
through luke?

3. Is there any way (w/ Luke or otherwise) to get a file listing all
of the docs in an index. Basically is there an index equivalent of
this command (which outputs all the URLs in a segment):

bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

4. Finally, my last question is the one I'm most perplexed by:

I called bin/nutch segread -list -dir for a particular segments
directory and found out that one directory had 93 entries. BUT, when I
opened up the index of that segment in Luke, there were only 23
documents (and 3 deletions)! Where did the rest of the URLs go??

Thanks ahead of time for any helpful suggestions,
Bryan


Number of URLs in segment fetchlist vs. Number of URLs in index

2005-12-05 Thread Bryan Woliner
How is the number of URLs in a group of segments' fetchlists related
to the number of URLs in an index?

Specifically, when I call the following command using the segments2
directory, I find out that there are 166 entries in 15 segments:

$ bin/nutch segread -list -dir segments

However, when I tried to prune the index of the same segments2
directory, using the following command, it tells me that 15 of 45
documents have been deleted:

$ bin/nutch org.apache.nutch.tools.PruneIndexTool segments2 -dryrun
-queries queries.txt -showfields url,title

-

What I don't understand is how the number of entries went from 166
in the fetchlists for this folder of segments, to only 45 in the
indexes. I'm positive that there were not 121 duplicate URLs (or
anywhere near that amount).

Thanks,
Bryan


Re: RegexURLFilter / testing regex-urlfilter.txt

2005-11-30 Thread Bryan Woliner
Sorry if the answer to this question should be obvious, but where in
the bin/nutch script do you need to add the following line to be able
to test your regex-urlfilter.txt file from the command line?

CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar



On 11/29/05, Thomas Delnoij [EMAIL PROTECTED] wrote:
 For the sake of the archives, I will answer my own question here: I had to
 add the following line to the bin/nutch script to be able to run
 org.apache.nutch.net.RegexURLFilter from the command line:

 CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar

 The nutch script overrides the classpath environment variable, so adding the
 jar there didn't help.

 Rgrds, Thomas Delnoij


 On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote:
 
  All.
 
  The problem is actually a bit different. I was a bit in a hurry when I
  posted the previous message, apologies.
 
  I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath.
 
  When I run java org.apache.nutch.net.RegexURLFilter, I am getting
 
  051005 221040 parsing jar:file:/C:/Personal/vvdb/Nutch/nutch-0.7.1/nutch-
  0.7.1.jar!/nutch-default.xml
  051005 221040 parsing jar:file:/C:/Personal/vvdb/Nutch/nutch-0.7.1/nutch-
  0.7.1.jar!/nutch-site.xml
  051005 221040 Plugins: directory not found: plugins
  Exception in thread main java.lang.ExceptionInInitializerError
  Caused by: java.lang.NullPointerException
  at org.apache.nutch.net.RegexURLFilter.clinit(
  RegexURLFilter.java:64)
 
  when I run nutch org.apache.nutch.net.RegexURLFilter, I am getting
 
  Exception in thread main java.lang.NoClassDefFoundError:
  org/apache/nutch/net/RegexURLFilter
 
  I know I am missing something obvious, but your help is really
  appreciated.
 
  Kind regards, Thomas Delnoij
 
 
  On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote:
  
   I was a bit in a hurry when I posted this message, apologies.
  
    The problem is actually a bit different.
  
   I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath.
  
   When I run java org.apache.nutch.net.RegexURLFilter,
  
   On 10/5/05, Thomas Delnoij  [EMAIL PROTECTED] wrote:
   
All.
   
I want to run the RegexURLFilter's main() method for testing the
regex-urlfilter.txt.
   
I set up NUTCH_HOME and NUTCH_CONF_DIR so I think I set up my
environment correctly.
   
When I run nutch org.apache.nutch.net.RegexURLFilter I get Exception
in thread main java.lang.NoClassDefFoundError:
org/apache/nutch/net/RegexURLFilter.
   
Assuming this was a classpath issue, I added
NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar to my
classpath.
   
This did not solve the problem, as I am still getting the
NoClassDefFoundError.
   
So my first question is how to set up my environment correctly for
testing the regex-urlfilter.
   
Secondly, I want to tune my regex-urlfilter for maximum relevancy of
the crawl result. By now, I have around 50 entries. My second question is
whether I can expect any performance impact.
   
Your help is greatly appreciated.
   
Kind regards, Thomas Delnoij.
   
   
   
   
   
   
  
 




Has anyone gotten the date query to function properly?

2005-11-21 Thread Bryan Woliner
If people have gotten the date query to work properly, it would be
great to know the steps they used to get it working.

I added the following property entry to my nutch-site.xml file and
used the search phrase:
url:http date:19000101-20051231 (which returned zero results).


<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

Thanks,
Bryan


Re: Using FetchListEntry -dumpurls

2005-11-13 Thread Bryan Woliner
Thanks for the tip. It turns out that the command worked fine when I
replaced bin/nutch net.nutch... with bin/nutch org.apache.nutch...

Accordingly, the correct command call is:

bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls $s1 foo.txt

Thanks,
Bryan




On 11/13/05, Piotr Kosiorowski [EMAIL PROTECTED] wrote:

 Hi,
 I think this is the reason:
  Exception in thread main java.lang.NoClassDefFoundError:
  net/nutch/pagedb/FetchListEntry
 In the 0.7 branch all classes were moved to the org.apache.nutch package
 structure and the scripts were updated, so you are probably using an old
 script with the new release.
 Regards
 Piotr


 Bryan Woliner wrote:
  Hi,
 
  I am trying to dump all of the URLS from a segment to a text file. I was
  able to do this successfully under Nutch 0.6 but am not able to do so
 under
  0.7.1
 
  Please take a look at the line below and let me know if you can figure out
  why I'm getting an error. Perhaps it's due to a change from version 0.6 to
  0.7.1, or maybe I just have the wrong syntax.
 
  Note: the segments/20051107233629 directory is a valid segments directory,
  as is evidenced by the ls statement below.
 
 
 ___-
 
 
  -bash-2.05b$ bin/nutch net.nutch.pagedb.FetchListEntry -dumpurls
  segments/20051107233629 foo.txt
  Exception in thread main java.lang.NoClassDefFoundError:
  net/nutch/pagedb/FetchListEntry
 
 
  -bash-2.05b$ ls -la segments/20051107233629
  total 8
  drwxr-xr-x 8 bryan bryan 1024 Nov 7 23:36 .
  drwxr-xr-x 3 bryan bryan 1024 Nov 7 23:36 ..
  drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 content
  drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetcher
  drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetchlist
  drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 index
  -rw-r--r-- 1 bryan bryan 0 Nov 7 23:36 index.done
  drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_data
  drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_text
 




A couple of questions about the date: query

2005-11-13 Thread Bryan Woliner
OK, I believe that I correctly included the more indexing and more query
plugins, which should allow searches using the date: query field. However,
I am currently unable to search by date ranges. I tried to use the search
string that Doug Cutting suggested in an e-mail to the list on 9/12/2005. He
suggested,

If you want to see all documents in a date range, then perhaps try something
like url:http date:19000101-20051231.


However, when I try the suggested search string, zero pages are returned. I
believe the fact that I have included the more plugin is evidenced by the
content of my nutch-site.xml file and part of the output Nutch generates
during a whole-web crawl, both of which are listed below. Any suggestions
are much appreciated.

On a separate, but related, note, I am unable to interpret the number that
apparently represents the date that a page was last modified. When I click
on the (explain) link next to a search result link, there is a field like
this: lastModified = 1131904304000

What date does this represent?
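
Assuming lastModified is a standard epoch timestamp in milliseconds (the value
Java's System.currentTimeMillis() produces), you can drop the last three digits
to get epoch seconds and let GNU date convert it:

  date -u -d @1131904304
  # prints a UTC time on 13 Nov 2005

So 1131904304000 corresponds to roughly November 13, 2005 -- in this case the
same day as the crawl, which may just mean it is the fetch time rather than a
true Last-Modified header.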

Here is the contents of my nutch-site.xml file:

___

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

</nutch-conf>



Also, here is part of the output I get when I do a whole-web crawl:


051113 232339 parsing: /usr/local/nutch-0.7.1/plugins/query-basic/plugin.xml
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.basic.BasicQueryFilter
051113 232339 parsing: /usr/local/nutch-0.7.1/plugins/query-more/plugin.xml
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.more.TypeQueryFilter
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.more.DateQueryFilter
051113 232339 parsing: /usr/local/nutch-0.7.1/plugins/query-site/plugin.xml
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.site.SiteQueryFilter
051113 232339 parsing: /usr/local/nutch-0.7.1/plugins/query-url/plugin.xml
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.url.URLQueryFilter
_

Thanks for any suggestions,
Bryan


Using FetchListEntry -dumpurls

2005-11-07 Thread Bryan Woliner
Hi,

I am trying to dump all of the URLs from a segment to a text file. I was
able to do this successfully under Nutch 0.6 but am not able to do so under
0.7.1.

Please take a look at the line below and let me know if you can figure out
why I'm getting an error. Perhaps it's due to a change from version 0.6 to
0.7.1, or maybe I just have the wrong syntax.

Note: the segments/20051107233629 directory is a valid segments directory,
as is evidenced by the ls statement below.

___-


-bash-2.05b$ bin/nutch net.nutch.pagedb.FetchListEntry -dumpurls
segments/20051107233629 foo.txt
Exception in thread main java.lang.NoClassDefFoundError:
net/nutch/pagedb/FetchListEntry


-bash-2.05b$ ls -la segments/20051107233629
total 8
drwxr-xr-x 8 bryan bryan 1024 Nov 7 23:36 .
drwxr-xr-x 3 bryan bryan 1024 Nov 7 23:36 ..
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 content
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetcher
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetchlist
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 index
-rw-r--r-- 1 bryan bryan 0 Nov 7 23:36 index.done
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_data
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_text


Re: Collections.

2005-10-25 Thread Bryan Woliner
The regular expressions that you use in your regex-urlfilter.txt file allow
you to specify that Nutch should only crawl certain parts of a domain.

For example, you could limit your crawl to URLs that start with
news.domain.com or www.domain.com/news.

If you search the mailing list archive or the Nutch wiki you should be able
to find more info on what type of regular expressions the regex-urlfilter.txt
file uses.
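
As a minimal sketch (the rules file is a list of lines starting with + or -,
each followed by a regular expression; the first matching rule wins), entries
for the two examples above might look like:

+^http://news\.domain\.com/
+^http://www\.domain\.com/news/
-.

with the final -. rejecting everything the earlier lines did not accept.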

-Bryan

On 10/25/05, XIN LING [EMAIL PROTECTED] wrote:

 No, what I mean is a set of URLs in a collection. For example, a finance
 web site might divide the web pages into 2 collections, news and
 analysis. This way if I am only interested in news, I can refine my
 search to this collection, without bothering with the analysis part.

 I know other search engines can do this, google, htdig, etc.

 Thanks.

 -Original Message-
 From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, October 25, 2005 1:38 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Collections.

 What do you mean with collections? java.lang.collections?

 Am 25.10.2005 um 20:27 schrieb XIN LING:

  Hi, does anyone know if Nutch supports collections? How to set
  collections in nutch?
 
  Thanks.
 
 

 ---
 company: http://www.media-style.com
 forum: http://www.text-mining.org
 blog: http://www.find23.net





Where are indexes stored and where to store indexes

2005-08-24 Thread Bryan Woliner
I know that this is a really basic question, but once you index segment(s), 
where is the index stored?

On a related note, I read in numerous emails to the list that you can search 
more than one index at the same time if they are in the same location when 
you start tomcat. Where is the correct location (or type of location) to 
store these indexes? Based on the fact that you need to create new db and 
segment directories each time you do a crawl, it seems like you would have 
to move indexes after they are created if you want multiple indexes in the 
same location.
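
If it helps, a hedged sketch of the layout the 0.7 web app expects: the
searcher.dir property points at one directory that holds the index (or the
segments with their per-segment indexes), roughly like this -- the directory
names below are only illustrative:

  crawl/              <-- point searcher.dir here before starting Tomcat
    db/
    segments/
      20051107233629/
      20051108120000/
    index/            <-- merged index, searched if present

So after separate crawls you would typically copy or merge things under one
such directory before starting Tomcat.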

Thanks for the help,
Bryan


Re: Where are indexes stored and where to store indexes

2005-08-24 Thread Bryan Woliner
An update to my question:

1. I found where the index is located, so never mind on that one.

2. In terms of using bin/nutch merge -- the wiki indicates that the correct
syntax is something like this:

bin/nutch merge index segments/*

However, that seems to suggest that all of your indexes need to be in
the same segments directory in order to merge them.

What is the best practice for merging indexes that are in different
segment directories? Do you have to copy all of your segments to the
same segments directory first? Do you need to use mergesegs before you
call merge? (It doesn't seem likely that this is the case.)
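
One hedged possibility, assuming the merge tool simply takes an output index
followed by any list of segment directories (as the wiki's example implies), is
to name the segments from each parent directory explicitly -- the paths here
are purely illustrative:

bin/nutch merge merged-index crawl-daily/segments/* crawl-weekly/segments/*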

Thanks,
Bryan




On 8/24/05, Bryan Woliner [EMAIL PROTECTED] wrote:
 I know that this is a really basic question, but once you index segment(s), 
 where is the index stored?
  
  On a related note, I read in numerous emails to the list that you can search 
 more than one index at the same time if they are in the same location when 
 you start tomcat. Where is the correct location (or type of location) to 
 store these indexes? Based on the fact that you need to create new db and 
 segment directories each time you do a crawl, it seems like you would have to 
 move indexes after they are created if you want multiple indexes in the same 
 location.
  
  Thanks for the help,
  Bryan



Adding small batches of fetched URLs to a larger aggregate segment/index

2005-08-23 Thread Bryan Woliner
Hi,

I have a number of sites that I want to crawl, then merge their segments and 
create a single index. One of the main reasons I want to do this is that I 
want some of the sites in my index to be crawled on a daily basis, others on 
a weekly basis, etc. Each time I re-crawl a site, I want to add the fetched 
URLs to a single aggregate segment/index. I have a couple questions about 
doing this:

1. Is it possible to use a different regex-urlfilter.txt file for each site 
that I am crawling? If so, how would I do this? (See the sketch after these 
two questions.)

2. If I have a very large segment that is indexed (my aggregate index) and I 
want to add another (much smaller) set of fetched URLs to this index, what 
is the best way to do this? It seems like merging the small and large 
segments and then re-indexing the whole thing would be very time consuming 
-- especially if I wanted to add new small sets of fetched URLs frequently. 
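
On question 1, a hedged sketch: nutch-default.xml has a urlfilter.regex.file
property (default regex-urlfilter.txt, assuming that property name) that tells
the regex filter which rules file to load from the conf directory, so one
approach is to keep a separate rules file (or a whole separate conf directory)
per site and select it per crawl by overriding the property in that crawl's
nutch-site.xml -- the file name here is only illustrative:

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter-siteA.txt</value>
  <description>Per-site rules file (illustrative name).</description>
</property>

and then pointing NUTCH_CONF_DIR at the matching conf directory when launching
that site's crawl.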


Thanks for any suggestions you have to offer,
Bryan


Two Questions: Refetching and searching the archive of this list

2005-08-03 Thread Bryan Woliner
Two questions:

1. Is there a way to search all archived messages from this mailing list?

2. Is there a way to configure the fetcher to refetch only those pages that 
either: (i) didn't exist during the last fetch; or (ii) have been modified 
since the last fetch? I know people have asked questions similar to this one 
before (hence my first question), but I could not find the relevant thread(s).

Thanks,
Bryan