[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-21 Thread Jerome Charron (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361041 ] 

Jerome Charron commented on NUTCH-139:
--

OK, Chris and I will implement MetadataNames in this way.
Just a few comments:

I plan to move the MetadataNames to a class rather than an interface. Two 
reasons:

1.1 I don't like the design of implementing an interface just to import 
constants into a class: it clutters the javadoc with lots of classes exposing 
many public constants that there is no real need to show there.

1.2 I want to add a utility method in MetadataNames that tries to find the 
appropriate normalized Nutch metadata name for a given string. It will be based 
on the Levenshtein distance (available in commons-lang). More about the 
Levenshtein distance at http://www.merriampark.com/ld.htm
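
A rough sketch of what such a helper could look like (only 
StringUtils.getLevenshteinDistance comes from commons-lang; the constants, the 
STANDARD_NAMES array, the resolve() name and the distance threshold are all 
illustrative assumptions, not part of any patch):

  import org.apache.commons.lang.StringUtils;

  public class MetadataNames {

    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";
    // ... other standard names ...

    private static final String[] STANDARD_NAMES = { CONTENT_TYPE, CREATOR };

    /**
     * Returns the standard name closest (by edit distance) to the given name,
     * or the given name itself if nothing is reasonably close.
     */
    public static String resolve(String name) {
      String candidate = name.toLowerCase();
      String best = name;
      int bestDistance = Integer.MAX_VALUE;
      for (int i = 0; i < STANDARD_NAMES.length; i++) {
        int distance = StringUtils.getLevenshteinDistance(candidate, STANDARD_NAMES[i]);
        if (distance < bestDistance) {
          bestDistance = distance;
          best = STANDARD_NAMES[i];
        }
      }
      // Accept only near matches; the threshold of 2 is arbitrary, for illustration.
      return bestDistance <= 2 ? best : name;
    }
  }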




 Standard metadata property names in the ParseData metadata
 --

  Key: NUTCH-139
  URL: http://issues.apache.org/jira/browse/NUTCH-139
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, 
 although bug is independent of environment
 Reporter: Chris A. Mattmann
 Assignee: Chris A. Mattmann
 Priority: Minor
  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

 Currently, people are free to name their string-based properties anything 
 they want, with names such as Content-type, content-TyPe, and CONTENT_TYPE 
 all having the same meaning. Stefan G., I believe, proposed a solution in 
 which all property names are converted to lower case, but in essence this 
 only fixes half the problem (identifying that CONTENT_TYPE, conTeNT_TyPE, and 
 all their case permutations are really the same). What about if I named it 
 Content Type, or ContentType?
  I propose that a way to correct this would be to create a standard set of 
 named Strings in the ParseData class that the protocol framework and the 
 parsing framework could use to identify common properties such as 
 Content-type, Creator, Language, etc.
  The properties would be defined at the top of the ParseData class, something 
 like:

  public class ParseData {
    ...
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";
    ...
  }

 In this fashion, users could at least know the names of the standard 
 properties they can obtain from the ParseData, for example by making a call 
 to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content 
 type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
 "text/xml"). Of course, this wouldn't preclude users from doing what they are 
 currently doing; it would just provide a standard way of obtaining some of 
 the more common, critical metadata without poring over the code base to 
 figure out what it is named.
  I'll contribute a patch near the end of this week, or the beginning of next 
 week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: nutch-0.8-dev *mapred.input.subdir* problem ?

2005-12-21 Thread Stefan Groschupf

Lukas,
the input folders are normally set by the tools, so you cannot change that.
However, in case you use a unix box, check that the user that runs nutch has 
read and write access to all the folders defined in nutch-site.xml / 
nutch-default.xml.
(I guess that could be the problem; nutch uses e.g. /tmp to write some data.)
If this does not solve the problem, just run the commands manually step by 
step; there is a tutorial in the wiki on how to run the mapred commands step 
by step.


Stefan

On 21.12.2005 at 06:56, Lukas Vlcek wrote:


Hi,

I am trying to use nutch-0.8-dev and I have a problem with the crawl run.
I did a checkout from SVN and prepared a fresh package (ant package - all
went fine). Then I installed nutch on linux and made only minor
changes to the nutch-site.xml file (turned on some plugins and increased
several constants), prepared a file with urls and started bin/nutch
crawl.

This worked for nutch-0.7.x but for nutch-0.8-dev I am receiving the
following exception in the log file:

051220 204248 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 crawl started in: ./crawl.test
051220 204249 rootUrlDir = urls
051220 204249 threads = 10
051220 204249 depth = 6
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 Injector: starting
051220 204249 Injector: crawlDb: ./crawl.test/crawldb
051220 204249 Injector: urlDir: urls
051220 204249 Injector: Converting injected urls to crawl db entries.
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml

java.io.IOException: No input directories specified in: NutchConf:
nutch-default.xml , mapred-default.xml ,
/home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml ,
nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)

051220 204249 Running job: job_4zwds6
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)

It seems that the problem is that Nutch is not able to find the
mapred.input.subdir setting in any of the config files. I found that
there is a mapred.input.dir property defined in the config for the particular
job (job_4zwds6.xml), with a value equal to the name of my urls file, but
I don't understand where I should define the mapred.input.subdir property
and what value to assign to it (if it needs to be defined manually at all -
note that mapred.input.dir seems to be configured automatically).

Does anybody know the answer?

p.s.: Note that the line numbers in the exception trace above for the
InputFormatBase.java file (85, 95) can differ a bit, as I tried to
insert some more LOG.debug() calls there in search of the root
cause and then removed them again, but it is possible that I left
some extra empty lines there.

Thanks,
Lukas





[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-21 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361043 ] 

Andrzej Bialecki  commented on NUTCH-139:
-

Regarding the move to a class with public static fields: I don't have any 
problem with that.

Regarding the Levenshtein distance: I think we can do even better, before we 
resort to such generic methods:

1) bring all property names to lowercase
2) remove any non-letters

Example: Content-type vs. ContentType:

1) content-type vs. contenttype.
2) contenttype vs. contenttype - match

These two steps could be implemented simply in a custom Comparator for the 
ContentProperties.
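
For illustration, a minimal sketch of such a comparator (the class name is an 
assumption, and whether ContentProperties is actually backed by a sorted map 
is not decided here):

  import java.util.Comparator;

  /**
   * Compares property names after lowercasing and stripping non-letters,
   * so that "Content-type" and "ContentType" are treated as equal keys.
   */
  public class MetadataNameComparator implements Comparator {

    public int compare(Object o1, Object o2) {
      return normalize((String) o1).compareTo(normalize((String) o2));
    }

    private static String normalize(String name) {
      StringBuffer buf = new StringBuffer(name.length());
      for (int i = 0; i < name.length(); i++) {
        char c = name.charAt(i);
        if (Character.isLetter(c)) {
          buf.append(Character.toLowerCase(c));
        }
      }
      return buf.toString();
    }
  }

Passed to e.g. a TreeMap holding the properties, this would make lookups under 
any spelling of the same name hit the same entry.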




[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-21 Thread Jerome Charron (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361045 ] 

Jerome Charron commented on NUTCH-139:
--

Andrzej,

Do you read my mind?
Yes, of course, that's the way I want to do it: first check for the most common 
cases (lower case + keep only letters), then use the Levenshtein distance if 
needed (as a last chance).
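
Putting the two comments together, a sketch of that two-stage lookup could 
look like the following (it reuses the illustrative normalize(), resolve() and 
STANDARD_NAMES from the sketches above, none of which are part of an actual 
patch):

  // Hypothetical two-stage lookup, building on the sketches above.
  public static String lookup(String name) {
    // Stage 1: cheap normalization (lower case, letters only).
    String normalized = normalize(name);
    for (int i = 0; i < STANDARD_NAMES.length; i++) {
      if (normalize(STANDARD_NAMES[i]).equals(normalized)) {
        return STANDARD_NAMES[i];
      }
    }
    // Stage 2, the last chance: closest standard name by Levenshtein distance.
    return resolve(name);
  }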
Regards

Jérôme




Re: nutch-0.8-dev *mapred.input.subdir* problem ?

2005-12-21 Thread Lukas Vlcek
Stefan,

Nutch created folders in /tmp, so I think it should be able to create
files there as well. I also tried to change all /tmp* entries in the conf file
to my home folder, with the same result (i.e. folders were created and
several files were dumped there, but it yielded the same exception).

Are you able to run nutch from an up-to-date trunk package build?
Maybe I didn't explain it clearly - I am using nutch-0.8-dev which I
got from nutch-trunk.

Regards,
Lukas





Re: nutch-0.8-dev *mapred.input.subdir* problem ?

2005-12-21 Thread Stefan Groschupf
Yes, I'm able to run it, no problem, but I'm using the step-by-step 
commands, not the crawl (all-in-one) command.

Can you try an ant test - do all the tests pass?









IndexSorter optimizer

2005-12-21 Thread Andrzej Bialecki

Hi,

I'm happy to report that further tests performed on a larger index seem 
to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.


The reason why result quality seems better is quite interesting, and it 
shows that the simple top-N measures that I was using in my benchmarks 
may have been too simplistic.


Using the original index, it was possible for pages with a high tf/idf of 
a term, but with a low boost value (the OPIC score), to outrank pages 
with a high boost but a lower tf/idf of the term. This phenomenon quite 
often leads to results that are perceived as junk, e.g. pages with a 
lot of repeated terms but little other real content, such as 
navigation bars.


Using the optimized index, I reported previously that some of the 
top-scoring results were missing. As it happens, the missing results 
were typically the junk pages with high tf/idf but low boost. Since 
we collect up to N hits, going from higher to lower boost values, the 
junk pages with low boost values were automatically eliminated. So, 
overall the subjective quality of results was improved. On the other 
hand, some of the legitimate results with decent boost values were 
also skipped because they didn't fit within the fixed number of hits... 
ah, well. Perhaps we should limit the number of hits in LimitedCollector 
using a cutoff boost value rather than the maximum number of hits (or 
maybe both?).
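
As an illustration of the cutoff idea only - this is not Nutch's 
LimitedCollector, and the class name, constructor and per-document boost 
lookup are all assumptions - a collector over a boost-sorted index could 
simply stop accepting hits once either limit is reached:

  import org.apache.lucene.search.HitCollector;

  /**
   * Hypothetical sketch: collect at most maxHits documents, and ignore any
   * document whose query-independent boost falls below minBoost. With an
   * IndexSorter-sorted index, documents arrive in decreasing boost order,
   * so everything past either limit can be dropped cheaply.
   */
  public class CutoffCollector extends HitCollector {
    private final int maxHits;
    private final float minBoost;
    private final float[] boosts;   // assumed: per-document boost values cached by the caller
    private int collected = 0;

    public CutoffCollector(int maxHits, float minBoost, float[] boosts) {
      this.maxHits = maxHits;
      this.minBoost = minBoost;
      this.boosts = boosts;
    }

    public void collect(int doc, float score) {
      if (collected >= maxHits || boosts[doc] < minBoost) {
        return;   // past either limit: drop the hit
      }
      // ... record (doc, score) in a list or priority queue here ...
      collected++;
    }
  }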


This again brings to attention the importance of the OPIC score: it 
represents a query-independent opinion about the quality of the page - 
whichever way you calculate it. If you use PageRank, it (allegedly) 
corresponds to other people's opinions about the page, thus providing an 
objective quality opinion. If you use a simple list of 
white/black-listed sites that you like/dislike, then it represents your 
own subjective opinion on the quality of the site; etc, etc... In this 
way, running a search engine that provides good results is not just a matter 
of plain precision, recall, tf/idf and other tangible measures; it's also a 
sort of political statement by the engine's operator. ;-)


To conclude, I will add IndexSorter.java to the core classes, and I 
suggest we continue the experiments ...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Crawling a nutch index with Lucene

2005-12-21 Thread Oliver Hummel
Hi,

I'm rather new to nutch, but is there something wrong with the idea of
creating an index with nutch (using the intranet search from the nutch
tutorial) and searching this index with Lucene? I.e. doing something
like this:

import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.index.IndexReader;
...
Searcher searcher = new IndexSearcher(IndexReader.open(indexDir));

For my setup this leads to the exception pasted at the end of this
message. :-(

I found a similar question on another list
(http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2641952.html)
but it looks like people there didn't really get the question, and hence
it doesn't help much.

Any help for this is greatly appreciated!

Best,

  Oliver



java.lang.ArrayIndexOutOfBoundsException: -1
        at java.util.ArrayList.get(Unknown Source)
        at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
        at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
        at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:14
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
        at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:8
        at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:45)
        at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:112)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:89)
        at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
        at org.apache.lucene.store.Lock$With.run(Lock.java:109)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
        at Lucy.main(Lucy.java:21)
search: (ioe) -1


Re: Crawling a nutch index with Lucene

2005-12-21 Thread Daniel Naber
On Wednesday, 21 December 2005 17:13, Oliver Hummel wrote:

 java.lang.ArrayIndexOutOfBoundsException: -1

That's the error you get when you open a Lucene 1.9 index with Lucene 1.4. 
But I don't know if that's also the case here.

Regards
 Daniel

-- 
http://www.danielnaber.de


Re: Crawling a nutch index with Lucene

2005-12-21 Thread Oliver Hummel
Yep, that's it. Nutch has Lucene 1.9 in its lib.

Many thanks!

  Oliver



Daniel Naber wrote:
 On Wednesday, 21 December 2005 17:13, Oliver Hummel wrote:
 
 
java.lang.ArrayIndexOutOfBoundsException: -1
 
 
 That's the error you get when you open a Lucene 1.9 index with Lucene 1.4. 
 But I don't know if that's also the case here.
 
 Regards
  Daniel
 


Re: nutch-0.8-dev *mapred.input.subdir* problem ?

2005-12-21 Thread Paul Baclace

You can ignore mapred.input.subdir; I find it is an unneeded option.

Now that the mapred branch is merged to be the trunk, there is a need
to clarify the documentation, since a change was made to have the
input specified as a directory; all files in that directory
are then considered input files (no wildcard needed).  I will put that on
my ToDo list.

mapred.input.dir is an abstract path that is either the OS filesystem
or NDFS, depending on which is in use (if fs.default.name is local then
the local OS fs is being used, otherwise fs.default.name is something
like domainOfMyMasterNode:port).
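
For illustration only (the element names follow the other nutch config files, 
and the host and port below are placeholders), switching from the local 
filesystem to NDFS would mean overriding fs.default.name in nutch-site.xml, 
roughly like this:

  <?xml version="1.0"?>
  <nutch-conf>
    <property>
      <name>fs.default.name</name>
      <!-- "local" selects the plain OS filesystem; host:port points at an NDFS name node -->
      <value>domainOfMyMasterNode:9000</value>
    </property>
  </nutch-conf>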

To use NDFS, you need to copy your input file(s) from your local fs to NDFS:

  bin/nutch ndfs -put /home/peb/urls_localfs/oneFILENAME  /urls

The destination path /urls is arbitrary and is created as a side effect
of the -put.  Repeat this for each file you have.

Paul

Lukas Vlcek wrote:
 java.io.IOException: No input directories specified in: NutchConf:
 nutch-default.xml , mapred-default.xml ,
 /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml ,
 nutch-site.xml
 at 
org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
 at 
org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
 at 
org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
 051220 204249 Running job: job_4zwds6
 Exception in thread main java.io.IOException: Job failed!
 at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
 at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)



Re: IndexSorter optimizer

2005-12-21 Thread Stefan Groschupf

Hi Andrzej,

wow, that is really great news!
Using the optimized index, I reported previously that some of the  
top-scoring results were missing. As it happens, the missing  
results were typically the junk pages with high tf/idf but low  
boost. Since we collect up to N hits, going from higher to lower  
boost values, the junk pages with low boost value were  
automatically eliminated. So, overall the subjective quality of  
results was improved. On the other hand, some of the legitimate  
results with a decent boost values were also skipped because they  
didn't fit within the fixed number of hits... ah, well. Perhaps we  
should limit the number of hits in LimitedCollector using a cutoff  
boost value, and not the maximum number of hits (or maybe both?).


As far as the experiments go, it would be good to have both.

To conclude, I will add the IndexSorter.java to the core classes,  
and I suggest to continue the experiments ...


Maybe someone out there in the community has a commercial search engine  
running (e.g. a google appliance or similar), so we could set up a  
nutch instance with the same pages and compare the results.
I guess it will be difficult to compare nutch with yahoo or google  
since none of us has a 4 billion page index up and running. I would run  
one on my laptop, but I won't have the bandwidth to fetch for the next  
two days. :-D

Great work!

Cheers,
Stefan 


Re: IndexSorter optimizer

2005-12-21 Thread Byron Miller
I've got a 400 million page db I can run this against over the
next few days.

-byron




Re: Static initializers

2005-12-21 Thread Stefan Groschupf

Andrzej,
well, I'm not done digging into the problem yet, but I want to ask some  
more questions.
BTW, I counted 195 places that use NutchConf.get(), so this will be a  
bigger patch. :)


As I mentioned, I would love to go the inversion-of-control way: not using  
NutchConf in the constructor, but making classes implement the Configurable  
interface. This would make sense, for example, for all classes realizing an  
extension.
But there are also classes where this makes no sense. For example, I  
would suggest changing the PluginRegistry from a singleton to a  
'normal' object; in this case I guess it makes sense to use the  
NutchConf in the constructor, since the configuration here only needs  
to know the include and exclude regexes for the plugins.
So:

   Extension.getExtensionInstance() - getExtensionInstance(NutchConf)
This makes sense: here we can check whether the class implements the  
Configurable interface and, if so, instantiate the object and set the  
configuration (see the sketch after these points).



   ExtensionPoint.getExtensions() - getExtensions(NutchConf)
We don't need NutchConf here since, if I understand it correctly, it is  
only needed to identify the activated plugins, and that is done at  
registry instantiation, which in this case takes a NutchConf as parameter.


   PluginRepository.getExtensionPoint(String) - getExtensionPoint(String, NutchConf)
We don't need it here either, since we use NutchConf at registry  
instantiation.
The alternative would be that we have to build up the plugin  
dependency graph for each method call.
Would you agree to have several plugin registries, possibly with  
different NutchConfs, but to instantiate extensions with a NutchConf  
and not pass it when querying ExtensionPoints etc.?
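
A rough sketch of the first point above (the interface, the setConf method 
and the getExtensionClass helper are all illustrative assumptions, not the 
actual Nutch plugin API):

  // Hypothetical sketch of configuration injection at extension instantiation time.
  public interface Configurable {
    void setConf(NutchConf conf);
  }

  // Somewhere in a hypothetical Extension class:
  public Object getExtensionInstance(NutchConf conf) throws Exception {
    Object instance = getExtensionClass().newInstance();   // assumed helper
    if (instance instanceof Configurable) {
      // Inversion of control: the caller injects the configuration instead of
      // the extension pulling it from a static NutchConf.get().
      ((Configurable) instance).setConf(conf);
    }
    return instance;
  }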



etc, etc...

The way this would work would be similar to the mechanism described  
above: if plugin instances are not created yet, they would be  
created once (based on the current NutchConf argument), and then  
cached in this NutchConf instance.

I guess this is difficult.
First, we have the plugin class instances; most, or maybe all, plugins I  
know do not have a plugin class implementation. Second, we have the  
extension classes, which do not need to implement any specific  
interface from the plugin registry's point of view (only things  
like the index filter interface etc.).
Caching plugin class instances makes sense, since there is actually  
only one plugin class instance per plugin in the JVM. However, there  
will be many instances of each extension class, since e.g. the  
parser or protocol runs multithreaded.




And also the plugin implementations would have to extend  
NutchConfigured, taking NutchConf as the argument to their  
constructors - because now Extension.getExtensionInstance would  
pass the current NutchConf instance to their constructors.


In general, my point of view is this:
If we touch this issue anyway, I would love to do a radical  
solution, since I have a different understanding of handling parameters  
than collecting them in a kind of map and making the map generally  
accessible.
Instead of giving any object access to the configuration object and  
handling properties like a bazaar, I would prefer to handle configuration  
only in the first object in the stack, which in our case would be, for  
example, the indexing tool.
Then the indexing tool instantiates the plugin registry with only the  
required properties as constructor arguments, e.g. the plugin folders,  
the include and exclude regexes, and the auto-activation flag.
Later, the extension instances can also get some more values  
injected, but in general they have no access to the configuration object.
This would first of all make things better testable, but it also allows  
much more flexibility to run several different fetchers in one  
JVM.
Anyway, this would be an improvement suggestion from me for nutch  
2.0 or 3.0; for now we would already be some steps forward by just  
changing NutchConf access to a non-static style.



I hope to find some time over the next few days to do some experiments  
and will come back with some more details.


However, we should find a general agreement about the way we go,  
since changing code in 195 places (and the lines that depend on it) for  
nothing would not be much fun.


Stefan




Re: IndexSorter optimizer

2005-12-21 Thread American Jeff Bowden

Andrzej Bialecki wrote:


Hi,

I'm happy to report that further tests performed on a larger index 
seem to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.



This is very interesting.  What's the computational complexity and disk 
I/O for index sorting compared to other operations on an index (e.g. 
adding/deleting N documents and running optimize)?




Re: IndexSorter optimizer

2005-12-21 Thread Andrzej Bialecki

American Jeff Bowden wrote:


Andrzej Bialecki wrote:


Hi,

I'm happy to report that further tests performed on a larger index 
seem to show that the overall impact of the IndexSorter is definitely 
positive: performance improvements are significant, and the overall 
quality of results seems at least comparable, if not actually better.




This is very interesting.  What's the computational complexity and 
disk I/O for index sorting compared to other operations on an index 
(e.g. adding/deleting N documents and running optimize)?



Comparable to optimize(). All index data needs to be read and copied, so 
the whole process is I/O bound.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com