Re: [DISCUSS] Board resolution for Nutch as TLP

2010-04-10 Thread Dennis Kubes

I think it looks good after the minor changes.  +1.

Dennis

Andrzej Bialecki wrote:

Hi,

I was told that the next step is to come up with the proposed Board
resolution and vote on it among committers. Here's the proposed text
(shameless copy-paste from the Tika and Mahout proposals).

IMPORTANT NOTE: I removed from the members of the PMC those existing
Nutch committers that haven't been active for more than 1 year, with the
intention of moving them to Emeritus status. If any one of these people
feels left out and would like to become an active committer in the
project, please let us know and we will gladly welcome you back :)

The text of the resolution follows. Committers, please read it and
optionally comment on the salient points of the text, the rest is
boilerplate. If there's an overall consensus I will call for a formal
vote to submit this proposal to the Board.


==
X. Establish the Apache Nutch Project

WHEREAS, the Board of Directors deems it to be in the best
interests of the Foundation and consistent with the
Foundation's purpose to establish a Project Management
Committee charged with the creation and maintenance of
open-source software related to a large-scale web crawling
platform for distribution at no charge to the public.

NOW, THEREFORE, BE IT RESOLVED, that a Project Management
Committee (PMC), to be known as the Apache Nutch Project,
be and hereby is established pursuant to Bylaws of the
Foundation; and be it further

RESOLVED, that the Apache Nutch Project be and hereby is
responsible for the creation and maintenance of software
related to a large-scale web crawling platform; and be it further

RESOLVED, that the office of Vice President, Apache Nutch be
and hereby is created, the person holding such office to
serve at the direction of the Board of Directors as the chair
of the Apache Nutch Project, and to have primary responsibility
for management of the projects within the scope of
responsibility of the Apache Nutch Project; and be it further

RESOLVED, that the persons listed immediately below be and
hereby are appointed to serve as the initial members of the
Apache Nutch Project:

• Andrzej Bialecki a...@...
• Otis Gospodnetic o...@...
• Dogacan Guney doga...@...
• Dennis Kubes ku...@...
• Chris Mattmann mattm...@...
• Julien Nioche jnio...@...
• Sami Siren si...@...

RESOLVED, that the Apache Nutch Project be and hereby
is tasked with the migration and rationalization of the Apache
Lucene Nutch sub-project; and be it further

RESOLVED, that all responsibilities pertaining to the Apache
Lucene Nutch sub-project encumbered upon the
Apache Nutch Project are hereafter discharged.

NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki
be appointed to the office of Vice President, Apache Nutch, to
serve in accordance with and subject to the direction of the
Board of Directors and the Bylaws of the Foundation until
death, resignation, retirement, removal or disqualification,
or until a successor is appointed.
=






[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20

2009-12-14 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790162#action_12790162
 ] 

Dennis Kubes commented on NUTCH-768:


The older jetty jar file was not removed with this patch.  It will need to be 
removed from the nutch lib directory if applying the patch versus pulling from 
trunk.  There is also a second patch that updates unit tests for the Jetty 
interfaces.  Neither of these will need to be applied if pulling from Trunk as 
those problems have been corrected.

 Upgrade Nutch 1.0 to use Hadoop 0.20
 

 Key: NUTCH-768
 URL: https://issues.apache.org/jira/browse/NUTCH-768
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-768-1-20091125.patch


 Upgrade Nutch 1.0 to use the Hadoop 0.20 release.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Build failed in Hudson: Nutch-trunk #1011

2009-12-14 Thread Dennis Kubes
This is failing because of the older jetty jar being removed and the 
Jetty interfaces changes.  I am currently working to fix the interfaces 
for the new Jetty version.  Hope to have a patch committed later today 
and this should be back to normal.


Dennis

Apache Hudson Server wrote:

See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1011/

--
[...truncated 4728 lines...]
jar:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

compile-test:

compile:
 [echo] Compiling plugin: urlfilter-regex
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar

deps-test:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] 

[jira] Closed: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20

2009-12-01 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-768.
--

Resolution: Fixed

Weird.  The hsqldb License file was the same checksum as that pulled from 
hadoop.  It must have had the windows EOL in hadoop distribution as well.  I 
changed it anyways.  Everything committed with revision 885778.

 Upgrade Nutch 1.0 to use Hadoop 0.20
 

 Key: NUTCH-768
 URL: https://issues.apache.org/jira/browse/NUTCH-768
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-768-1-20091125.patch


 Upgrade Nutch 1.0 to use the Hadoop 0.20 release.  




[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20

2009-11-30 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784066#action_12784066
 ] 

Dennis Kubes commented on NUTCH-768:


If no objections I will commit this tomorrow sometime?

 Upgrade Nutch 1.0 to use Hadoop 0.20
 

 Key: NUTCH-768
 URL: https://issues.apache.org/jira/browse/NUTCH-768
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-768-1-20091125.patch


 Upgrade Nutch 1.0 to use the Hadoop 0.20 release.  




Re: svn commit: r884075 - /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java

2009-11-25 Thread Dennis Kubes

Oops.  Sorry about that.

a...@apache.org wrote:

Author: ab
Date: Wed Nov 25 12:44:34 2009
New Revision: 884075

URL: http://svn.apache.org/viewvc?rev=884075&view=rev
Log:
Change access from private to public - this fixes Crawl.java breakage.

Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java

Modified: 
lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
URL: 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java?rev=884075&r1=884074&r2=884075&view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java Wed Nov 25 12:44:34 2009
@@ -50,7 +50,7 @@
     super(conf);
   }
 
-  private void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
+  public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
       List<Path> segments) throws IOException {
     LOG.info("SolrIndexer: starting");
 





[jira] Updated: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20

2009-11-25 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-768:
---

Attachment: NUTCH-768-1-20091125.patch

I thought I was going to be able to do this without code changes.  No such 
luck.  

There are many, many deprecations as a result of this upgrade.  Anything that 
used the old Mapper and Reducer interfaces seems to have deprecated methods in 
it.  The NutchBean class needed to implement the two RPC*Bean interfaces to 
handle changes in Hadoop RPC (that could have been a leftover from 1.0 changes 
but I don't think so).  Also there are numerous changes to build scripts and 
the nutch bin script to support different hadoop jars.

There are also many new files for the conf directory as Hadoop has split out 
files and has new configuration files for new capabilities.

After all changes I was able to run everything in local and pseudo-distributed 
mode as well as test out local and distributed searching.  Everything seems to 
work fine.  After we make this upgrade I would recommend going back and 
updating all of the tool interfaces for the most recent APIs.

 Upgrade Nutch 1.0 to use Hadoop 0.20
 

 Key: NUTCH-768
 URL: https://issues.apache.org/jira/browse/NUTCH-768
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-768-1-20091125.patch


 Upgrade Nutch 1.0 to use the Hadoop 0.20 release.  




[jira] Created: (NUTCH-771) Add WebGraph classes to the bin/nutch script

2009-11-24 Thread Dennis Kubes (JIRA)
Add WebGraph classes to the bin/nutch script


 Key: NUTCH-771
 URL: https://issues.apache.org/jira/browse/NUTCH-771
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All, shell script
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1


Currently the webgraph jobs are called on the command line by calling main 
methods on their classes.  I propose to upgrade the bin/nutch shell script to 
allow calling these jobs as well.  This would include the webgraphdb, linkrank, 
scoreupdater, and nodedumper jobs.
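
A minimal sketch of the kind of command-to-class dispatch such a script change implies. The four command names come from the issue text above; the fully qualified class names, the `build_java_command` helper, and the `nutch.job` classpath are assumptions made for illustration, not the contents of the actual bin/nutch script.

```python
# Sketch: map bin/nutch sub-commands to the classes whose main methods they run.
# The four command names come from the issue; the fully qualified class names
# and the "nutch.job" classpath are assumptions made for this example.
WEBGRAPH_PACKAGE = "org.apache.nutch.scoring.webgraph"

WEBGRAPH_COMMANDS = {
    "webgraphdb": WEBGRAPH_PACKAGE + ".WebGraph",
    "linkrank": WEBGRAPH_PACKAGE + ".LinkRank",
    "scoreupdater": WEBGRAPH_PACKAGE + ".ScoreUpdater",
    "nodedumper": WEBGRAPH_PACKAGE + ".NodeDumper",
}


def build_java_command(command, args, classpath="nutch.job"):
    """Return the argv list that would launch the job for `command`."""
    try:
        class_name = WEBGRAPH_COMMANDS[command]
    except KeyError:
        raise ValueError("unknown webgraph command: " + command)
    # Everything after the class name is passed through to the job untouched.
    return ["java", "-cp", classpath, class_name] + list(args)
```

Keeping the mapping in one table means a new job needs only one new entry, which is roughly what a case statement in a shell script would provide.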




[jira] Commented: (NUTCH-768) Upgrade Nutch 1.0 to use Hadoop 0.20

2009-11-24 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782172#action_12782172
 ] 

Dennis Kubes commented on NUTCH-768:


I have tested the upgrade with Hadoop 0.20.  To upgrade this correctly we do 
need to upgrade Xerces both in the main lib jars and within the lib-xml plugin. 
 I have upgraded to the most recent version of Xerces 2.9.x.  Having run 
through multiple full crawl and index cycles both on the new and old indexing 
frameworks, including the webgraphdb, and the solr indexing process, I didn't 
find any errors within the process.  If no one has any objections I will commit 
these changes within the next 24 hours.

 Upgrade Nutch 1.0 to use Hadoop 0.20
 

 Key: NUTCH-768
 URL: https://issues.apache.org/jira/browse/NUTCH-768
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1


 Upgrade Nutch 1.0 to use the Hadoop 0.20 release.  




[jira] Assigned: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer

2009-11-21 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes reassigned NUTCH-765:
--

Assignee: Dennis Kubes

 Allow Crawl class to call Either Solr or Lucene Indexer
 ---

 Key: NUTCH-765
 URL: https://issues.apache.org/jira/browse/NUTCH-765
 Project: Nutch
  Issue Type: Improvement
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0, 1.1

 Attachments: NUTCH-765-2009112-1.patch


 Change to the crawl class to have a -solr option which will call the solr 
 indexer instead of the lucene indexer.  This also allows it to ignore dedup 
 and merge for solr indexing and to point to a specific solr instance.




[jira] Resolved: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer

2009-11-21 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-765.


Resolution: Fixed

Committed.

 Allow Crawl class to call Either Solr or Lucene Indexer
 ---

 Key: NUTCH-765
 URL: https://issues.apache.org/jira/browse/NUTCH-765
 Project: Nutch
  Issue Type: Improvement
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.1, 1.0.0

 Attachments: NUTCH-765-2009112-1.patch


 Change to the crawl class to have a -solr option which will call the solr 
 indexer instead of the lucene indexer.  This also allows it to ignore dedup 
 and merge for solr indexing and to point to a specific solr instance.




[jira] Closed: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer

2009-11-21 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-765.
--


 Allow Crawl class to call Either Solr or Lucene Indexer
 ---

 Key: NUTCH-765
 URL: https://issues.apache.org/jira/browse/NUTCH-765
 Project: Nutch
  Issue Type: Improvement
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0, 1.1

 Attachments: NUTCH-765-2009112-1.patch


 Change to the crawl class to have a -solr option which will call the solr 
 indexer instead of the lucene indexer.  This also allows it to ignore dedup 
 and merge for solr indexing and to point to a specific solr instance.




Re: Plugin Help

2009-11-14 Thread Dennis Kubes
It depends on how you are building and on your classpath.  Let's call your 
plugin myhtmlfilter.  If you are running on a single server and you added it 
to your src/plugin/build.xml under the deploy section, a myhtmlfilter 
folder containing the plugin should show up under the build/plugins folder 
after the build.  Then you would just have to copy that myhtmlfilter 
folder over to your deployment plugins directory.


If running on a cluster, even in pseudo-distributed mode you would need 
to copy over the nutch-*.job file.  It has the plugins inside of it and 
it gets distributed out to the cluster.  If referencing from a webapp or 
the nutch war file, you would need to copy to web-inf/classes/plugins.
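
A minimal sketch of the single-server copy step described above, assuming the build output sits under a `build` directory. The `deploy_plugin` helper and its parameter names are hypothetical, not part of Nutch; the cluster case (redistributing the nutch-*.job file) is not covered.

```python
import shutil
from pathlib import Path


def deploy_plugin(build_dir, deploy_plugins_dir, plugin_name):
    """Copy build/plugins/<plugin_name> into the deployment plugins directory."""
    source = Path(build_dir) / "plugins" / plugin_name
    if not source.is_dir():
        # The plugin was probably not listed in the deploy section of build.xml.
        raise FileNotFoundError("plugin was not built: " + str(source))
    target = Path(deploy_plugins_dir) / plugin_name
    if target.exists():
        # Remove the stale copy so classes from an old build do not linger.
        shutil.rmtree(target)
    shutil.copytree(source, target)
    return target
```

For example, `deploy_plugin("build", "/path/to/deployment/plugins", "myhtmlfilter")` would copy the built plugin folder, where the deployment path is a placeholder for your own installation.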


Dennis

david.stu...@progressivealliance.co.uk wrote:

  Hi,

I am trying to write a plugin for nutch and am having real trouble 
getting it registered in the system. I have created it in src/plugin and 
added it to both the build.xml in plugin and to nutch-site.xml. It 
seems to build OK, but when I try to run a basic crawl urls -dir crawl 
-depth 3 -topN 2 I see the plugin registered in the hadoop.log


2009-11-14 14:57:45,739 INFO  plugin.PluginRepository -  Html Filter 
Parse Plug-in (parse-htmlfilter)


But then I get the error message below. I have followed all of the 
tutorials, but they are mostly for nutch 0.9 and have errors in them, which 
I have worked through.


Thanks for your help

regards,
Dave
java.lang.RuntimeException: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.htmlfilter.HtmlfilterIndexer
    at org.apache.nutch.indexer.IndexingFilters.<init>(IndexingFilters.java:100)
    at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:61)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.parse.htmlfilter.HtmlfilterIndexer
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:166)
    at org.apache.nutch.indexer.IndexingFilters.<init>(IndexingFilters.java:70)
    ... 8 more
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.parse.htmlfilter.HtmlfilterIndexer
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:319)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:254)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:156)


[jira] Updated: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer

2009-11-12 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-765:
---

Attachment: NUTCH-765-2009112-1.patch

 Allow Crawl class to call Either Solr or Lucene Indexer
 ---

 Key: NUTCH-765
 URL: https://issues.apache.org/jira/browse/NUTCH-765
 Project: Nutch
  Issue Type: Improvement
 Environment: All
Reporter: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0, 1.1

 Attachments: NUTCH-765-2009112-1.patch


 Change to the crawl class to have a -solr option which will call the solr 
 indexer instead of the lucene indexer.  This also allows it to ignore dedup 
 and merge for solr indexing and to point to a specific solr instance.




[jira] Created: (NUTCH-765) Allow Crawl class to call Either Solr or Lucene Indexer

2009-11-12 Thread Dennis Kubes (JIRA)
Allow Crawl class to call Either Solr or Lucene Indexer
---

 Key: NUTCH-765
 URL: https://issues.apache.org/jira/browse/NUTCH-765
 Project: Nutch
  Issue Type: Improvement
 Environment: All
Reporter: Dennis Kubes
Priority: Minor
 Fix For: 1.1, 1.0.0
 Attachments: NUTCH-765-2009112-1.patch

Change to the crawl class to have a -solr option which will call the solr 
indexer instead of the lucene indexer.  This also allows it to ignore dedup and 
merge for solr indexing and to point to a specific solr instance.




Re: Server suggestion

2009-07-25 Thread Dennis Kubes
My mistake, you're right.  The last processing clusters we built were 
using Xeon quad cores, not i7s.  The i7s were search servers which 
didn't need ecc memory.  AFAICT, wikipedia is correct and the i7s don't 
yet support ECC.


So my suggestion would be to stick with Xeon procs or something that 
supports ECC for the processing clusters.  I would never build a 
processing cluster that doesn't have ECC memory.  When we first started, 
we spent a few weeks trying to track down weird corruption checksum 
bugs that were ultimately related to using non-ECC memory on a cluster.


Dennis

Doğacan Güney wrote:

Hi Dennis,

On Fri, Jul 24, 2009 at 16:46, Dennis Kubes <ku...@apache.org> wrote:


fredericoagent wrote:

If I want to set up nutch with, let's say, 400 million urls in the database.

Is it better to have 4-5 super fast and loaded servers or 12-15
smaller, cheaper servers?

More smaller servers.  Make sure they are energy efficient though and have a
decent amount of Ram.  If a server goes down, you aren't affected as much.


By superfast I mean cpu is latest quad core or latest six core processor
with 6 Gigs Ram and 1. or 1.5 TB HD.

By cheap I mean something like a Xeon quad core 2.26 cpu with 3 Gig Ram
and
500 Sata HD.


or if anyone can suggest a better spec ideal

Our first servers were 1Ghz (Yes really) running hadoop 0.04 way back when.
 Our first production clusters were core2, 4G ECC, 1 750G hard drive.  These
days we have been building i7 8-core, 12G ECC, 4T raid-5 machines with up to 8
disks, 2U for around 2200.00 each.  If you are looking for a good server
builder check out swt.com. They are supermicro resellers and build solid
machines.



It suggests here:

http://en.wikipedia.org/wiki/Core_i7#Drawbacks

that core i7's do not support ECC RAM. Have you run into any issues or is WP
wrong here?



Suggestions.  Don't skimp on the hard drive, do at least 750G or more. Price
difference is negligible.  Do at least 2G Ram, 4G is better, 8G is better
than that.  You can get up to 12G on regular motherboards these days.  After
that it gets much more expensive.  Do more recent processors, such as core2
or i7.  They are more power efficient per processing unit.  If you want a
really fast machine, do multiple disks in a raid-5 format.

Dennis







Re: Nutch dev. plans

2009-07-17 Thread Dennis Kubes



Doğacan Güney wrote:

On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki <a...@getopt.org> wrote:

Doğacan Güney wrote:

Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki <a...@getopt.org> wrote:

Hi all,

I think we should be creating a sandbox area, where we can collaborate
on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
will
be importing his HBase work as 'nutchbase'. Tika work is the least
disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
like to tackle) means significant refactoring so I'd rather put this on a
branch too.


Thanks for starting the discussion, Andrzej.

Can you detail your OSGI plugin framework design? Maybe I missed the
discussion but
updating the plugin system has been something that I wanted to do for
a long time :)
so I am very much interested in your design.

There's no specific design yet except I can't stand the existing plugin
framework anymore ... ;) I started reading on OSGI and it seems that it
supports the functionality that we need, and much more - it certainly looks
like a better alternative than maintaining our plugin system beyond 1.x ...


I think I remember a conversation a while back about this :)  Not OSGI 
specifically but changing the plugin framework.  I am all for changing 
it to something like OSGI though.


Dennis





Couldn't agree more with the can't stand plugin framework :D

Any good links on OSGI stuff?


Oh, an additional comment about the scoring API: I don't think the claimed
benefits of OPIC outweigh the widespread complications that it caused in the
API. Besides, getting the static scoring right is very very tricky, so from
the engineer's point of view IMHO it's better to do the computation offline,
where you have more control over the process and can easily re-run the
computation, rather than rely on an online unstable algorithm that modifies
scores in place ...



Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will
feel very natural in a hbase-backed nutch. Give me a couple more days to polish
the scoring API then we can change it if you are not happy with it.


Dogacan, you mentioned that you would like to work on Katta integration.
Could you shed some light on how this fits with the abstract indexing &
searching layer that we now have, and how distributed Solr fits into this
picture?


I haven't yet given much thought to Katta integration. But basically,
I am thinking of
indexing newly-crawled documents as lucene shards and uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.

Me too..


About distributed solr: I would very much like to do this and again, I
think, this should be possible to
do within nutch. However, distributed solr is ultimately uninteresting
to me because (AFAIK) it doesn't have the reliability and
high-availability that hadoop and hbase have, i.e. if a machine dies you
lose that part of the index.

Grant Ingersoll is doing some initial work on integrating distributed Solr
and Zookeeper, once this is in a usable shape then I think perhaps it's more
or less equivalent to Katta. I have a patch in my queue that adds direct
Hadoop-Solr indexing, using Hadoop OutputFormat. So there will be many
options to push index updates to distributed indexes. We just need to offer
the right API to implement the integration, and the current API is IMHO
quite close.


Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?

There is the Bailey.sf.net project that fits this description, but it's
dormant - either it was too early, or there were just too many design
questions (or simply the committers moved to other things).


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com








Re: Ranking Scoring Algorithm Pseudocode

2009-05-31 Thread Dennis Kubes
There isn't any pseudocode for this.  The code for the main algorithm is 
in the LinkRank class.  It is similar in nature to PageRank except it 
has the ability to filter reciprocal links.  If the Link Loops program 
is run it also has the ability to filter out link cycles, but that 
program is O(n) running time so not very efficient.
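
A minimal sketch of a PageRank-style iteration with reciprocal links filtered out, to make the idea concrete. This is not the LinkRank class itself: the `link_rank` function, the damping value of 0.85, the default of 10 iterations, and the choice to drop both directions of a reciprocal pair are assumptions made for illustration. Pages without outlinks simply lose their rank share here, which a real implementation would handle more carefully.

```python
def link_rank(links, iterations=10, damping=0.85, ignore_reciprocal=True):
    """Score pages from (source, target) link pairs, PageRank style."""
    links = list(links)
    # Collect every page first so filtered pages still receive a base score.
    nodes = {page for pair in links for page in pair}
    if not nodes:
        return {}
    edges = {(s, t) for s, t in links if s != t}  # ignore self-links
    if ignore_reciprocal:
        # Drop A -> B whenever B -> A also exists (assumed interpretation).
        edges = {(s, t) for s, t in edges if (t, s) not in edges}
    outlinks = {page: [] for page in nodes}
    for source, target in edges:
        outlinks[source].append(target)

    count = len(nodes)
    scores = {page: 1.0 / count for page in nodes}
    for _ in range(iterations):
        # Every page starts each round with the damping "teleport" share.
        updated = {page: (1.0 - damping) / count for page in nodes}
        for source, targets in outlinks.items():
            if not targets:
                continue  # simplification: dangling pages pass on nothing
            share = damping * scores[source] / len(targets)
            for target in targets:
                updated[target] += share
        scores = updated
    return scores
```

With links a->c, b->c, c->d and d->c, the c/d pair is reciprocal, so with filtering on d ends up with the same score as a page that has no inlinks, while c still benefits from a and b.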


The LinkRank class is just a single score factor though, the setup of 
the new indexing system allows multiple factors to be combined where the 
LinkRank may be only a single factor in that.


If looking for how the algorithm works I suggest looking at the early 
PageRank algorithm papers.  Here are some links which you may find useful:


http://en.wikipedia.org/wiki/PageRank
http://www.ianrogers.net/google-page-rank/


Dennis

atencorps wrote:

Hi,

I came across the Ranking & Scoring system in Nutch 1.0 (which includes the
webgraph, linkrank etc).

My question is, where can I find the pseudocode for the Ranking & Scoring
Algorithm/System in place in Nutch 1.0?

Thanks 



Re: Ranking Algorithms

2009-05-18 Thread Dennis Kubes
The answer is simple and not so simple at the same time.  Last year we 
put in quite a bit of work to implement a stable PageRank-like algorithm 
in Nutch.  This was released as the new scoring and indexing 
frameworks.  That gives a good general relevancy score, but it is really 
a starting point.


Many people look at search engines and see a single algorithm, such as 
PageRank.  In reality, a modern search engine, such as google or yahoo, 
will have hundreds of algorithms and jobs that contribute to the relevancy 
of search results.  This is because of two factors:


1) After getting good general relevancy (i.e. link analysis and such), 
search relevancy is about handling specific relevancy issues.  For 
example handling reciprocal links, near duplicate detection, 
organizations that own 100k domains, template pages, blogs and echo 
chambers, hacked pages and blogs with link and keyword spam, malware, 
etc.  All of these types of issues, and there are many more, require
specific algorithms to handle them.


Google and Yahoo would have algorithms (and people who specialize in 
certain areas) to handle all of these types of issues usually through 
statistical analysis and machine learning jobs.  These jobs would then 
be aggregated together (think pipeline) to form final search engine 
relevancy scores.


In all fairness, this is offline relevancy.  There would also be a 
considerable amount of work done on query parsing and online relevancy.


2) Relevancy scores change over time due to people and companies
attempting to manipulate search results through SEO (both good and bad),
through culture in general, and through search engines working out
better algorithms.


So this is a long way of explaining that while Nutch currently has, IMO,
good general relevancy, taking it to the next level, where results are
as good as Google's, is going to take many different specialized
MapReduce jobs that we currently don't have.


Dennis

atencorps wrote:

Nutch is a great search engine. I was recently pleased when the large
multinational I work for ran some trials of Nutch vs. Google while we were
evaluating enterprise search. I am glad to say Nutch was a worthy
competitor; Google Enterprise was chosen only due to office politics
(preferring a large company over a smaller one, etc.).

In terms of enterprise search I think Nutch already has it covered; my
question is about Internet search.

PageRank has been around for over 10 years and is what built Google. Are
there any newer, more capable ranking algorithms available, and is there
any vision for implementing a truly worthy ranking algorithm in Nutch that
could deliver quality Internet search results like Google?






Re: LinkRank why 10 iterations?

2009-03-27 Thread Dennis Kubes
You are running LinkRank on a comparatively small webgraph.  LinkRank
is meant, in principle, to be run on very large webgraphs, millions or 
perhaps 100s of millions of urls.  On that scale 10 iterations was what 
we saw as a good default for the webgraph to converge while not taking 
an excessive amount of time.  For smaller webgraphs 10 iterations might 
not be necessary.


You can use the link.analyze.num.iterations configuration variable to 
set the number of iterations you would like to run.  As a general rule I 
don't think I would ever go below 5 iterations as all but the very 
smallest webgraphs wouldn't have enough chance to converge.
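
To make the convergence point concrete, here is a minimal Java sketch (hypothetical names, not Nutch code) that records the largest per-node score change in each pass of a damped rank iteration. The fixed pass count plays the role of link.analyze.num.iterations; the array-based graph layout and the damping factor are assumptions chosen for the example.

```java
import java.util.Arrays;

final class ConvergenceSketch {

    /**
     * outLinks[i] lists the node indexes that node i links to.
     * Returns, for each pass, the largest absolute change in any node's score.
     */
    static double[] deltas(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] score = new double[n];
        Arrays.fill(score, 1.0);
        double[] deltas = new double[iterations];
        for (int pass = 0; pass < iterations; pass++) {
            double[] next = new double[n];
            Arrays.fill(next, 1.0 - damping);
            for (int src = 0; src < n; src++) {
                if (outLinks[src].length == 0) {
                    continue; // no outlinks, nothing to distribute
                }
                double share = damping * score[src] / outLinks[src].length;
                for (int dst : outLinks[src]) {
                    next[dst] += share;
                }
            }
            double max = 0.0;
            for (int i = 0; i < n; i++) {
                max = Math.max(max, Math.abs(next[i] - score[i]));
            }
            deltas[pass] = max;
            score = next;
        }
        return deltas;
    }
}
```

Watching the returned values shrink gives a feel for how many passes a particular graph needs; once the changes are negligible for your purposes, further passes only cost time.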


Here is a good paper on PageRank and convergence.  Principles are the same:

http://www.webworkshop.net/pagerank.html

Dennis



Bartosz Gadzimski wrote:

Hello,

Why are you running so many iterations in LinkRank? Is this necessary
for a certain number of websites?


Thanks,
Bartosz


[jira] Closed: (NUTCH-291) OpenSearchServlet should return date as well as lastModified

2009-03-25 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-291.
--

Resolution: Fixed

The open search servlet has been superseded by formatters for serving results 
in xml and json format.  Closing issue.

 OpenSearchServlet should return date as well as lastModified
 

 Key: NUTCH-291
 URL: https://issues.apache.org/jira/browse/NUTCH-291
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.8
Reporter: Stefan Neufeind
Assignee: Dennis Kubes
 Attachments: NUTCH-291-unfinished.patch


 Currently lastModified is provided by OpenSearchServlet, but only when
 the lastModified date is known.
 Since you can sort by date (which is lastModified or if not present the 
 fetchdate), it might be useful if OpenSearchServlet could provide date as 
 well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist

2009-03-25 Thread Dennis Kubes (JIRA)
NPE in FieldIndexer when BasicFields url doesn't exist
--

 Key: NUTCH-729
 URL: https://issues.apache.org/jira/browse/NUTCH-729
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0, 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1


There is a NullPointerException during a logging call in FieldIndexer when 
there isn't a url for a document.  Documents shouldn't be without urls but 
since the FieldIndexer doesn't validate fields it is possible for it to occur.  
Most often this happens when BasicFields is run with the wrong segments 
directory and doesn't complain.  It could also occur if using the FieldIndexer 
to index things other than basic fields.




[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist

2009-03-25 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-729:
---

Attachment: NUTCH-729-1-20090235.patch

Simple patch.  Changes the logging to use the key (which should be url and 
which should always exist).

 NPE in FieldIndexer when BasicFields url doesn't exist
 --

 Key: NUTCH-729
 URL: https://issues.apache.org/jira/browse/NUTCH-729
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0, 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-729-1-20090235.patch


 There is a NullPointerException during a logging call in FieldIndexer when 
 there isn't a url for a document.  Documents shouldn't be without urls but 
 since the FieldIndexer doesn't validate fields it is possible for it to 
 occur.  Most often this happens when BasicFields is run with the wrong 
 segments directory and doesn't complain.  It could also occur if using the 
 FieldIndexer to index things other than basic fields.




Re: [VOTE] Release Apache Nutch 1.0

2009-03-25 Thread Dennis Kubes

+1, is this binding? :)

Doğacan Güney wrote:

Another non-binding +1 from me.

Hope this one is a keeper :D

On Mon, Mar 23, 2009 at 22:28, Sami Siren wrote:


Hello,

I have packaged the third release candidate for Apache Nutch 1.0
release at http://people.apache.org/~siren/nutch-1.0/rc2/

See the CHANGES.txt[1] file for details on release contents and
latest changes. The release was made from tag:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/

The following issues that were discovered during the review of last
rc have been fixed:

https://issues.apache.org/jira/browse/NUTCH-722
https://issues.apache.org/jira/browse/NUTCH-723
https://issues.apache.org/jira/browse/NUTCH-725
https://issues.apache.org/jira/browse/NUTCH-726
https://issues.apache.org/jira/browse/NUTCH-727

Please vote on releasing this package as Apache Nutch 1.0. The vote
is open for the next 72 hours. Only votes from Lucene PMC members
are binding, but everyone is welcome to check the release candidate
and voice their approval or disapproval. The vote  passes if at
least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1]

http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511
-- 
Sami Siren





--
Doğacan Güney


[jira] Created: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph

2009-03-25 Thread Dennis Kubes (JIRA)
NPE in LinkRank if no nodes with which to create the WebGraph
-

 Key: NUTCH-730
 URL: https://issues.apache.org/jira/browse/NUTCH-730
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0, 1.1


For LinkRank, if there are no nodes to process, then a NullPointerException is 
thrown when trying to count number of nodes.




[jira] Updated: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph

2009-03-25 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-730:
---

Attachment: NUTCH-730-1-20090325.patch

Throws a more detailed error message if there are no nodes to process.  This 
shouldn't happen on large web graphs but may happen on smaller webgraphs or 
webgraphs that are all inside one domain (including subdomains).
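
The kind of guard described here can be sketched as follows. This is an assumed illustration in plain Java, not the contents of the attached patch; the class name, method name, and error text are hypothetical.

```java
import java.util.Collection;
import java.util.HashSet;

final class NodeCountSketch {

    /** Counts distinct nodes, failing with a descriptive message instead of an NPE. */
    static int countNodes(Collection<String> nodeUrls) {
        if (nodeUrls == null || nodeUrls.isEmpty()) {
            throw new IllegalStateException(
                "No nodes to process: the web graph is empty. "
                + "This can happen when every link stays inside a single domain.");
        }
        return new HashSet<>(nodeUrls).size();
    }
}
```

Failing early with a message that names the likely cause is easier to act on than a NullPointerException raised deep inside the counting step.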

 NPE in LinkRank if no nodes with which to create the WebGraph
 -

 Key: NUTCH-730
 URL: https://issues.apache.org/jira/browse/NUTCH-730
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0, 1.1

 Attachments: NUTCH-730-1-20090325.patch


 For LinkRank, if there are no nodes to process, then a NullPointerException 
 is thrown when trying to count number of nodes.




Re: [VOTE] Release Apache Nutch 1.0

2009-03-08 Thread Dennis Kubes

Non-binding +1 too :)

Sami Siren wrote:

Hello,

I have packaged the first release candidate for Apache Nutch 1.0 release at

http://people.apache.org/~siren/nutch-1.0/rc0/

See the included CHANGES.txt file for details on release contents and 
latest changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 



Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote  passes if at least three 
binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!

--
Sami Siren








Re: planning for nutch-1.0-rc1

2009-03-08 Thread Dennis Kubes
Sorry about the docs being sparse on this.  I will write more about the
process as time permits.  I don't know about the problem below.  What
platform are you running on, Windows or Linux?


Dennis

Bartosz Gadzimski wrote:

Hello,

Thanks, Dennis, for updating the wiki; it helped a lot.

You gave an example with indexing but didn't say much about it. Can
you write some more? :)


Anyway, I have problems at the last step (Nutch from 07 March):

bin/nutch org.apache.nutch.indexer.field.FieldIndexer

It simply stops somewhere

2009-03-07 16:09:04,432 INFO  field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Basic Query Filter (query-basic)

 plugins

2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IFD [Thread-11]: 
setInfoStream 
deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@1b4a74b 

2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IW 0 [Thread-11]: 
setInfoStream: 
dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 
autoCommit=true 
mergepolicy=org.apache.lucene.index.logbytesizemergepol...@15356d5 
mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@69d02b 
ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 
maxFieldLength=1 index=

2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
   at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
   at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
   at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
   at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
   at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
   at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)





In crawl/indexes is only _temporary folder.

I will try to debug this but have problems with running nutch in eclipse

Thanks,
Bartosz



Dennis Kubes wrote:
I don't know if I would make this primary yet.  I need to check what 
is causing this as it worked fine for me, in fact we currently have it 
in production.  Also we would need to update the shell scripts to 
integrate this more tightly.


Dennis

Bartosz Gadzimski wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

Sami Siren wrote:
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 
in Jira. Neither of the two remaining _bugs_ seems too important 
to me, actually I only count the issues assigned to developers as 
real candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)


There's one Critical issue reported, related to NekoHTML 
(NUTCH-700). I'm not sure what are the feature differences 
(pertinent to Nutch) between 0.9.4 and 1.9.11 - perhaps downgrading 
is the safest course of action.

I will take care of that.



I am also volunteering to push all open issues to 1.1 before 
starting the RC build on Tuesday. Any objections on the proposed 
procedure or timing?


Sounds good.

great!

--
Sami Siren



What about the new scoring and new indexing? Will it be integrated as the
primary scoring algorithm? I have a problem with it in LinkRank:


2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException

Re: planning for nutch-1.0-rc1

2009-03-06 Thread Dennis Kubes
NUTCH-578 was a while back but as I remember it worked fine.  No 
objections to either including or pushing it.


Dennis

Sami Siren wrote:
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 in 
Jira. Neither of the two remaining _bugs_ seems too important to me, 
actually I only count the issues assigned to developers as real 
candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)

I am also volunteering to push all open issues to 1.1 before starting 
the RC build on Tuesday. Any objections on the proposed procedure or 
timing?


--
Sami Siren



Re: planning for nutch-1.0-rc1

2009-03-06 Thread Dennis Kubes
I don't know if I would make this primary yet.  I need to check what is 
causing this as it worked fine for me, in fact we currently have it in 
production.  Also we would need to update the shell scripts to integrate 
this more tightly.


Dennis

Bartosz Gadzimski wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

Sami Siren wrote:
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 in 
Jira. Neither of the two remaining _bugs_ seems too important to me, 
actually I only count the issues assigned to developers as real 
candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)


There's one Critical issue reported, related to NekoHTML (NUTCH-700). 
I'm not sure what are the feature differences (pertinent to Nutch) 
between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course 
of action.

I will take care of that.



I am also volunteering to push all open issues to 1.1 before 
starting the RC build on Tuesday. Any objections on the proposed 
procedure or timing?


Sounds good.

great!

--
Sami Siren



What about the new scoring and new indexing? Will it be integrated as the
primary scoring algorithm? I have a problem with it in LinkRank:


2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
   at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
   at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
   at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)


Another question: what about the indexing framework mentioned here:
http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg11764.html


Having all the new scoring and indexing would be a real step forward.

Thanks,
Bartosz



[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2009-02-23 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675907#action_12675907
 ] 

Dennis Kubes commented on NUTCH-477:


Same here.  I am not against having extra functionality, but I don't think I 
have ever used the chain options of normalizers either.  I guess the call is do 
we want it in 1.0 or not.  My thinking is we are going to be doing major 
redesign changes post 1.0 so doing lots of code refactoring wouldn't be a big 
deal.

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
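
One way to picture the two proposed changes together is the minimal Java sketch below. It is an assumption-laden illustration, not the attached urlfilters.patch: the scope names, the three return codes, and the class FilterChainSketch are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

final class FilterChainSketch {
    static final int PASS = 0;    // no decision, keep evaluating the chain
    static final int ACCEPT = 1;  // accept the url and stop the chain early
    static final int REJECT = -1; // reject the url and stop the chain early

    // One ordered chain of filters per scope (context), e.g. "fetcher" or "indexer".
    private final Map<String, List<ToIntFunction<String>>> chains = new HashMap<>();

    void add(String scope, ToIntFunction<String> filter) {
        chains.computeIfAbsent(scope, key -> new ArrayList<>()).add(filter);
    }

    /** Returns true when the url survives the chain registered for the scope. */
    boolean accepts(String scope, String url) {
        for (ToIntFunction<String> filter : chains.getOrDefault(scope, Collections.emptyList())) {
            int code = filter.applyAsInt(url);
            if (code == ACCEPT) {
                return true;
            }
            if (code == REJECT) {
                return false;
            }
        }
        return true; // every filter passed, or the scope has no filters
    }
}
```

With an int code, a filter near the front of a long chain can settle the outcome without the remaining filters ever being called.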




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Affects Version/s: (was: 1.0.0)
   1.1
Fix Version/s: (was: 1.0.0)
   1.1

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch,
 Russian, and Thai.  Also includes a new Language Identifier tool that uses
 the new indexing framework in NUTCH-646.




[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666484#action_12666484
 ] 

Dennis Kubes commented on NUTCH-666:


It is ok to move to 1.1.  

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch,
 Russian, and Thai.  Also includes a new Language Identifier tool that uses
 the new indexing framework in NUTCH-646.




Re: Site update

2009-01-05 Thread Dennis Kubes

http://www.mail-archive.com/d...@forrest.apache.org/msg15136.html

This might help.

Dennis

Andrzej Bialecki wrote:

Otis Gospodnetic wrote:
Below is what it spits out.  I'm not sure what the cause is.  I did 
try forrest seed  forrest validate as prescribed at 
https://issues.apache.org/jira/browse/FOR-984?focusedCommentId=12649593#action_12649593 
, but forrest validate failed.


validate-sitemap:
/home/otis/apache-forrest/main/webapp/resources/schema/relaxng/sitemap-v06.rng:72:31: error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized


[...]

No clue. I'd say that until we figure out what happens we can go forward 
- if it generates a consistent and usable output.





[jira] Closed: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2009-01-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-594.
--


 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch, 
 NUTCH-594-4-20081230.patch, NUTCH-594-5-20081231.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Commented: (NUTCH-572) Scoring and redirected Urls

2009-01-02 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660394#action_12660394
 ] 

Dennis Kubes commented on NUTCH-572:


I would like to close this issue.  Redirect handling has undergone significant 
changes since this issue was opened and we still need to take a hard look at 
redirects and possibly how scores are represented.  However, the newer scoring 
and indexing frameworks do work around this issue.

 Scoring and redirected Urls
 ---

 Key: NUTCH-572
 URL: https://issues.apache.org/jira/browse/NUTCH-572
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


 When a redirect is found for a given url, the new or end url is stored as the 
 content page and the old CrawlDatum get one of a few redirect codes.  The 
 page that gets indexed in Nutch is the end page and it gets indexed under the 
 end url.  Many times a site will have a significant number of links pointing 
 to start page and very few pointing to the redirected end page.  This is 
 especially true for external links.  OPIC scores do not get transferred to the
 end page but stay with the start page (the one doing the redirecting).  But 
 the start page doesn't get indexed.  Hence the end page will show up in the 
 index but under a usually much reduced score.  A good example of this is 
 cnn.com:
 URL: http://www.cnn.com/
 Version: 6
 Status: 5 (db_redir_perm)
 Fetch time: Tue Dec 04 11:02:09 CST 2007
 Modified time: Wed Dec 31 18:00:00 CST 1969
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 51.19438
 Signature: b5baaf80e9e10aa6205fc39051c362ff
 Metadata: _pst_:success(1), lastModified=0
 which redirects to http://www.cnn.com/?refresh=1
 URL: http://www.cnn.com/?refresh=1
 Version: 6
 Status: 2 (db_fetched)
 Fetch time: Tue Dec 04 11:02:11 CST 2007
 Modified time: Wed Dec 31 18:00:00 CST 1969
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 1.0
 Signature: b5baaf80e9e10aa6205fc39051c362ff
 Metadata: _pst_:success(1), lastModified=0
 Now, cnn which should be one of the highest, if not the highest ranking site 
 in the index for keywords such as news in fact doesn't show up in the index 
 and its redirected end page appears much farther down in search results.  My
 proposal is we somehow make OPIC scores follow redirects.  To do this we 
 would most likely need to store a start and end url for redirected urls.
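
A minimal Java sketch of what "scores follow redirects" could mean, using the start and end urls from the example above. This is only one possible reading of the proposal: the class and method names are hypothetical, and summing the start score into the end page's score is an assumption, not something the issue specifies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RedirectScoreSketch {

    /**
     * Moves each url's score onto the final target of its redirect chain.
     * redirects maps a start url to the url it redirects to.
     */
    static Map<String, Double> foldRedirects(Map<String, Double> scores,
                                             Map<String, String> redirects) {
        Map<String, Double> folded = new HashMap<>();
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            String url = entry.getKey();
            Set<String> seen = new HashSet<>(); // guards against redirect loops
            while (redirects.containsKey(url) && seen.add(url)) {
                url = redirects.get(url);
            }
            folded.merge(url, entry.getValue(), Double::sum);
        }
        return folded;
    }
}
```

Under these assumptions the end page would carry the combined score of both entries, so the page that actually gets indexed is the one that benefits from links pointing at the start url.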




[jira] Issue Comment Edited: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-30 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659825#action_12659825
 ] 

musepwizard edited comment on NUTCH-594 at 12/30/08 6:56 AM:
--

JSON-lib and EZMorph are both under Apache.  There is an optional Xom library
dependency for JSON-Lib which is not included, that is under LGPL, but 
everything else is Apache.

http://json-lib.sourceforge.net/license.html
http://ezmorph.sourceforge.net/license.html

I put comments about these in the plugin.xml file for response-json.  Is there 
anything else I need to do?

  was (Author: musepwizard):
JSON-LIb and EZMorph are both under Apache.  There is an optional Xom 
library dependency for JSON-Lib which is not included, that is under LGPL, but 
everything is Apache.

http://json-lib.sourceforge.net/license.html
http://ezmorph.sourceforge.net/license.html

I put comments about these in the plugin.xml file for response-json.  Is there 
anything else I need to do?
  
 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Commented: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-30 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659825#action_12659825
 ] 

Dennis Kubes commented on NUTCH-594:


JSON-lib and EZMorph are both under Apache.  There is an optional Xom library
dependency for JSON-Lib which is not included, that is under LGPL, but 
everything is Apache.

http://json-lib.sourceforge.net/license.html
http://ezmorph.sourceforge.net/license.html

I put comments about these in the plugin.xml file for response-json.  Is there 
anything else I need to do?

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-30 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: NUTCH-594-4-20081230.patch

Final patch.  Adds the ability to stop summaries from being returned and to 
only return a given set of fields by name.

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch, 
 NUTCH-594-4-20081230.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Resolved: (NUTCH-668) Domain URL Filter

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-668.


Resolution: Fixed

Committed with revision 729958.

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch, 
 NUTCH-668-3-20081213.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.
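
As a rough illustration of the filtering idea described above, here is a minimal sketch in plain Java. It is not the code from the attached patches: the class name DomainFilterSketch, the suffix-matching rule, and the convention of returning the URL when accepted and null when rejected are all assumptions made for the example.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the plugin from NUTCH-668.
class DomainFilterSketch {
    private final Set<String> accepted = new HashSet<>();

    // Each entry is a TLD ("org"), a domain ("example.com") or a hostname.
    DomainFilterSketch(Iterable<String> entries) {
        for (String entry : entries) {
            String s = entry.trim().toLowerCase(Locale.ROOT);
            if (!s.isEmpty()) {
                accepted.add(s);
            }
        }
    }

    // Returns the URL unchanged if accepted, or null if it is filtered out.
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparseable URLs are rejected in this sketch
        }
        if (host == null) {
            return null;
        }
        // Try the full hostname, then each shorter dot-suffix down to the TLD.
        String suffix = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (accepted.contains(suffix)) {
                return url;
            }
            int dot = suffix.indexOf('.');
            if (dot < 0) {
                return null;
            }
            suffix = suffix.substring(dot + 1);
        }
    }
}
```

With the entries "org" and "example.com" in the list, this sketch would accept any host under .org and any host under example.com, and reject everything else.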




[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: ezmorph-1.0.6.jar

ezmorph jar required for framework

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: ezmorph-1.0.6.jar, NUTCH-594-1-20071221.patch, 
 NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: NUTCH-594-3-20081229.patch

A completely reworked framework with an extension point for serving search 
results in different formats.  Included are plugins for serving results in XML 
and JSON format.  XML is the default.  Uses JSON-Lib to convert the results 
into JSON format.

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: ezmorph-1.0.6.jar, NUTCH-594-1-20071221.patch, 
 NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Summary: Serve Nutch search results in multiple formats including XML and 
JSON  (was: Serve Nutch search results in XML and JSON)

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: commons-beanutils-1.8.0.jar

commons beanutils

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: commons-collections-3.2.1.jar

commons collections

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: json-lib-2.2.2-jdk15.jar

json lib jar

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: (was: NUTCH-594-3-20081229.patch)

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-29 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: NUTCH-594-3-20081229.patch

Fixed some things.  Added the ability to set mime output type using the 
plugin.xml file.  That way people can have application/json or text/json or 
text/plain, however they want for their application.

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.




Re: [jira] Commented: (NUTCH-675) Reduce tasks do not report their status and are killed by jobtracker

2008-12-22 Thread Dennis Kubes

This is old.  It has been fixed in more recent versions of hadoop and nutch.

Otis Gospodnetic (JIRA) wrote:
[ https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658610#action_12658610 ] 


Otis Gospodnetic commented on NUTCH-675:


Sha Feng, could you please bring this up on the Nutch mailing list instead of 
JIRA?
It would also be good if you could upgrade your Nutch (including Hadoop) and 
see if it works then.  0.12 is a VERY old version of Hadoop.



Reduce tasks do not report their status and are killed by jobtracker


Key: NUTCH-675
URL: https://issues.apache.org/jira/browse/NUTCH-675
Project: Nutch
 Issue Type: Bug
 Components: fetcher
   Affects Versions: 0.9.0
Environment: OS : Linux
   Reporter: sha feng
Fix For: 0.9.0


We use Fetcher2 as our fetcher. The Fetcher2 map tasks fetch about 2,000,000 URLs, but at the reduce stage all reduce tasks fail to report their status and are killed by the jobtracker. Changing mapred.task.timeout from 60,000 to 1,800,000 did not help. Can anyone tell us why? By the way, the version of Nutch we use is 0.9 and the version of Hadoop is 0.12. 
Thanks for your help!




[jira] Commented: (NUTCH-668) Domain URL Filter

2008-12-19 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658118#action_12658118
 ] 

Dennis Kubes commented on NUTCH-668:


Anybody have a problem if I commit this today or tomorrow?

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch, 
 NUTCH-668-3-20081213.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.




Re: File system

2008-12-16 Thread Dennis Kubes
If you are talking about Nutch Contents which are stored in the segments 
during fetching of pages, then you would need to write a MapReduce job to 
read in the Content objects and do whatever processing you desire.


Dennis

oSilvio wrote:

Very useful information, thanks!
But I can find no algorithm in Nutch for extracting the data inside those
files (like HTML pages), nor a description of the process used to store the
data. Do you know if it is possible to extract it using Lucene?

 


Dennis Kubes-2 wrote:
The nutch databases are either SequenceFile or MapFile formats which 
store key and value pairs.  Their keys and values are Writable 
implementations which translate an object into its byte equivalent and 
vice versa.


The data and index files together make up a MapFile.  The data file is a 
SequenceFile; the index file is used by the MapFile for seeking to a specific key.


Please see the hadoop wiki for more information about Sequence and Map 
files and writable formats.


Dennis

oSilvio wrote:
Does somebody know how the file structure works, briefly? 
It seems that the data is compressed or something; it's not possible to
understand what's recorded in the data or index files.
Thanks
Silvio






Re: File system

2008-12-15 Thread Dennis Kubes
The nutch databases are either SequenceFile or MapFile formats which 
store key and value pairs.  Their keys and values are Writable 
implementations which translate an object into its byte equivalent and 
vice versa.


The data and index files together make up a MapFile.  The data file is a 
SequenceFile; the index file is used by the MapFile for seeking to a specific key.


Please see the hadoop wiki for more information about Sequence and Map 
files and writable formats.
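
To make the Writable idea a bit more concrete, here is a minimal sketch in plain Java that does not depend on Hadoop. The PageRecord class and its two fields are hypothetical; the write and readFields methods only mirror the shape of Hadoop's Writable interface, using java.io.DataOutput and java.io.DataInput.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type; not a Nutch or Hadoop class.
class PageRecord {
    String url = "";
    long fetchTime;

    // Object to bytes, same shape as Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(fetchTime);
    }

    // Bytes to object, same shape as Writable.readFields(DataInput).
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        fetchTime = in.readLong();
    }

    // Helper for the example: serialize a record to a byte array.
    static byte[] toBytes(PageRecord record) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            record.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the example: rebuild a record from its bytes.
    static PageRecord fromBytes(byte[] bytes) {
        try {
            PageRecord record = new PageRecord();
            record.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return record;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A SequenceFile is then, roughly speaking, a stream of such serialized key and value pairs, which is why the raw files are not human readable.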


Dennis

oSilvio wrote:
Does somebody know how the file structure works, briefly? 
It seems that the data is compressed or something; it's not possible to
understand what's recorded in the data or index files.
Thanks
Silvio


[jira] Closed: (NUTCH-448) Allow Plugin Includes and Excludes from File

2008-12-09 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-448.
--

Resolution: Later

This was some old functionality that seemed good at the time.  Not so much now.

 Allow Plugin Includes and Excludes from File
 

 Key: NUTCH-448
 URL: https://issues.apache.org/jira/browse/NUTCH-448
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: all platforms
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: plugin-fromfile.patch


 This functionality allows the plugin.includes and plugin.excludes values to 
 be moved out of the nutch-default.xml and nutch-site.xml files and loaded 
 from one or more text configuration files found in the classpath.  This is a 
 cleaner implementation than having one big long regular expression in the 
 configuration file as plugin.includes or plugin.excludes.
 Loads plugin configuration from files defined by the plugin.files 
 configuration variable.  Files must be available to be found in the classpath. 
  The plugin files consist of one regex per line.  Plugins starting with a - 
 will be excluded while lines starting with a # will be ignored.  All other 
 non-blank lines will be included as plugins, one per line. Any plugins 
 configured through plugin.includes and plugin.excludes in the configuration 
 are also added.  Any plugins that are excluded are removed from the includes.
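
The file format described above could be parsed along these lines. This is only a sketch, not the code in plugin-fromfile.patch: the class name PluginFileSketch is hypothetical, and removing excluded entries by exact string match is a simplification, since the real entries are regular expressions.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the one-entry-per-line plugin file format.
class PluginFileSketch {
    final Set<String> includes = new LinkedHashSet<>();
    final Set<String> excludes = new LinkedHashSet<>();

    static PluginFileSketch parse(List<String> lines) {
        PluginFileSketch result = new PluginFileSketch();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank lines and # comments are ignored
            }
            if (line.startsWith("-")) {
                result.excludes.add(line.substring(1).trim()); // "-" marks an exclude
            } else {
                result.includes.add(line); // everything else is an include
            }
        }
        // Excluded entries are removed from the includes (exact match only here).
        result.includes.removeAll(result.excludes);
        return result;
    }
}
```

Values coming from plugin.includes and plugin.excludes would be added to the two sets before the final removal step.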




[jira] Commented: (NUTCH-646) New Indexing Framework for Nutch

2008-12-06 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654154#action_12654154
 ] 

Dennis Kubes commented on NUTCH-646:


Not yet.  I need to write up some serious documentation about how to use both 
the new scoring and indexing systems.  I will try to get to that soon.

 New Indexing Framework for Nutch
 

 Key: NUTCH-646
 URL: https://issues.apache.org/jira/browse/NUTCH-646
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 0.9.0, 1.0.0

 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, 
 NUTCH-646-2-20081126.patch


 New indexing framework for Nutch that provides a more generic field 
 abstraction consistent with Lucene index semantics.  Allows multiple MR jobs 
 to be created for different fields and those fields to be aggregated and 
 indexed in the end.  Overcomes limitations of the current indexer that limits 
 what databases are passed into the indexer.  Creates a new extension point as 
 well for field-filters for manipulation of fields during the indexing process.




Domain URL filter Commit?

2008-12-05 Thread Dennis Kubes
Anybody have a problem with me committing the domain-urlfilter plugin in 
NUTCH-668?


Dennis


[jira] Commented: (NUTCH-668) Domain URL Filter

2008-12-05 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653881#action_12653881
 ] 

Dennis Kubes commented on NUTCH-668:


I agree.  Being able to search for tlds like .com would make it much more 
flexible.  Let me work up the changes and I will post a new patch (without my 
local path :)).  Although I do want to get this in quickly I think the new 
functionality is worth the wait.

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.




Builds are Failing

2008-12-04 Thread Dennis Kubes
After the upgrade to Hadoop, builds are failing because I think we have 
nutch set to build with Java 5 by default but I think Hadoop is built 
with Java 6 (At least the release version that I downloaded and used to 
upgrade Nutch).


I know we aren't requiring Nutch to use Java 6 yet.  This may force the 
point.  I don't know if Hadoop will build with Java 5.  I will test it 
out and post back results.  If it does, then options are:


1) Force Nutch to use Java 6
2) Rebuild Hadoop from source instead of release version using Java 5

Thoughts?

Dennis


Re: Builds are Failing

2008-12-04 Thread Dennis Kubes
I take it back.  Hadoop *requires* java 6 now as of 0.19.  Which means 
we should be making changes to require Nutch to use java 6.
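
For anyone who wants to check which Java version a jar was compiled for, the class-file header can be read directly. The sketch below assumes the build failure is a class-version mismatch, which is only my reading of the symptoms; the class name ClassVersionSketch is hypothetical. A class file starts with the magic number 0xCAFEBABE, followed by the minor and major version; major version 49 corresponds to Java 5 and 50 to Java 6.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper; reads only the 8-byte class-file header.
class ClassVersionSketch {
    // Returns the class-file major version: 49 is Java 5, 50 is Java 6.
    static int majorVersion(InputStream classBytes) {
        try {
            DataInputStream in = new DataInputStream(classBytes);
            if (in.readInt() != 0xCAFEBABE) {
                throw new IllegalArgumentException("not a class file");
            }
            in.readUnsignedShort();        // minor version, not needed here
            return in.readUnsignedShort(); // major version
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if a Java 5 VM could load this class, judging by version alone.
    static boolean loadableOnJava5(InputStream classBytes) {
        return majorVersion(classBytes) <= 49;
    }
}
```

Feeding it a .class entry extracted from the Hadoop jar would show whether the release was compiled for Java 6.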


Dennis

Dennis Kubes wrote:
After the upgrade to Hadoop, builds are failing because I think we have 
nutch set to build with Java 5 by default but I think Hadoop is built 
with Java 6 (At least the release version that I downloaded and used to 
upgrade Nutch).


I know we aren't requiring Nutch to use Java 6 yet.  This may force the 
point.  I don't know if Hadoop will build with Java 5.  I will test it 
out and post back results.  If it does, then options are:


1) Force Nutch to use Java 6
2) Rebuild Hadoop from source instead of release version using Java 5

Thoughts?

Dennis


[jira] Updated: (NUTCH-668) Domain URL Filter

2008-12-04 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-668:
---

Attachment: NUTCH-668-2-20081204.patch

Updated to include URLUtil methods that were missing.  Sorry.

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.




[jira] Commented: (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2008-12-04 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653404#action_12653404
 ] 

Dennis Kubes commented on NUTCH-207:


I think this would be an interesting addition.  It would also need to be ported 
to fetcher2 as well as fetcher.  If you want to take on the task of porting it 
that would be great.  If you have any questions feel free to ask.

 Bandwidth target for fetcher rather than a thread count
 ---

 Key: NUTCH-207
 URL: https://issues.apache.org/jira/browse/NUTCH-207
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8
Reporter: Rod Taylor
 Attachments: ratelimit.patch


 Increases or decreases the number of threads from the starting value 
 (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve 
 a target bandwidth (fetcher.threads.bandwidth).
 It seems to be able to keep within 10% of the target bandwidth even when 
 large numbers of errors are found or when a number of large pages is run 
 across.
 To achieve more accurate tracking Nutch should keep track of protocol 
 overhead as well as the volume of pages downloaded.
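
The adjustment loop described above might look roughly like the following. This is a sketch under assumptions, not the logic in ratelimit.patch: the class name, the one-thread-per-step policy, and the explicit lower bound are all made up for illustration, and the unit of the bandwidth values is left to the caller.

```java
// Hypothetical sketch of one step of a bandwidth-driven thread controller.
class BandwidthThreadSketch {
    // current: threads running now
    // min, max: bounds, e.g. 1 and the value of fetcher.threads.maximum
    // measured, target: observed and desired bandwidth in the same unit
    static int nextThreadCount(int current, int min, int max,
                               double measured, double target) {
        int next = current;
        if (measured < target) {
            next = current + 1; // below target: add a thread
        } else if (measured > target) {
            next = current - 1; // above target: drop a thread
        }
        // Clamp to the configured bounds.
        return Math.max(min, Math.min(max, next));
    }
}
```

Called once per measurement interval, starting from fetcher.threads.fetch, this nudges the thread count towards whatever value keeps the measured bandwidth near the target.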




[jira] Closed: (NUTCH-635) LinkAnalysis Tool for Nutch

2008-12-04 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-635.
--


 LinkAnalysis Tool for Nutch
 ---

 Key: NUTCH-635
 URL: https://issues.apache.org/jira/browse/NUTCH-635
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
 NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
 NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, 
 NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


 This is a basic pagerank type link analysis tool for nutch which simulates a 
 sparse matrix using inlinks and outlinks and converges after a given number 
 of iterations.  This tool is meant to replace the current scoring system in 
 nutch with a system that converges instead of exponentially increasing 
 scores.  Also includes a tool to create an outlinkdb.
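
For readers unfamiliar with this kind of scoring, here is a small in-memory sketch of a pagerank-style iteration. It is not the MapReduce implementation from the patches: the class name LinkRankSketch, the damping parameter, and the initial score of 1.0 are assumptions, and pages without outlinks simply pass no score on.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; adjacency lists stand in for the sparse link matrix.
class LinkRankSketch {
    static Map<String, Double> rank(Map<String, List<String>> outlinks,
                                    int iterations, double damping) {
        // Collect every page that appears as a source or as a target.
        Set<String> pages = new HashSet<>(outlinks.keySet());
        for (List<String> targets : outlinks.values()) {
            pages.addAll(targets);
        }
        Map<String, Double> scores = new HashMap<>();
        for (String page : pages) {
            scores.put(page, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : pages) {
                next.put(page, 1.0 - damping); // base score for every page
            }
            for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue;
                }
                // Each page splits its damped score evenly over its outlinks.
                double share = damping * scores.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            scores = next;
        }
        return scores;
    }
}
```

Because each page hands out at most a damped share of its current score, the totals stay bounded instead of growing with every pass.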




[jira] Resolved: (NUTCH-635) LinkAnalysis Tool for Nutch

2008-12-04 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-635.


Resolution: Fixed

Committed with revision 723441

 LinkAnalysis Tool for Nutch
 ---

 Key: NUTCH-635
 URL: https://issues.apache.org/jira/browse/NUTCH-635
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
 NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
 NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, 
 NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


 This is a basic pagerank type link analysis tool for nutch which simulates a 
 sparse matrix using inlinks and outlinks and converges after a given number 
 of iterations.  This tool is meant to replace the current scoring system in 
 nutch with a system that converges instead of exponentially increasing 
 scores.  Also includes a tool to create an outlinkdb.




[jira] Commented: (NUTCH-646) New Indexing Framework for Nutch

2008-12-04 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653489#action_12653489
 ] 

Dennis Kubes commented on NUTCH-646:


For the final version of this I have removed the arity dependencies and 
computation functionality.  I still think that type of functionality is needed 
but it didn't feel like the right place for it at this time.

 New Indexing Framework for Nutch
 

 Key: NUTCH-646
 URL: https://issues.apache.org/jira/browse/NUTCH-646
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 0.9.0, 1.0.0

 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, 
 NUTCH-646-2-20081126.patch


 New indexing framework for Nutch that provides a more generic field 
 abstraction consistent with Lucene index semantics.  Allows multiple MR jobs 
 to be created for different fields and those fields to be aggregated and 
 indexed in the end.  Overcomes limitations of the current indexer that limits 
 what databases are passed into the indexer.  Creates a new extension point as 
 well for field-filters for manipulation of fields during the indexing process.




[jira] Resolved: (NUTCH-646) New Indexing Framework for Nutch

2008-12-04 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-646.


Resolution: Fixed

Committed with revision 723447

 New Indexing Framework for Nutch
 

 Key: NUTCH-646
 URL: https://issues.apache.org/jira/browse/NUTCH-646
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0, 0.9.0

 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, 
 NUTCH-646-2-20081126.patch


 New indexing framework for Nutch that provides a more generic field 
 abstraction consistent with Lucene index semantics.  Allows multiple MR jobs 
 to be created for different fields and those fields to be aggregated and 
 indexed in the end.  Overcomes limitations of the current indexer that limits 
 what databases are passed into the indexer.  Creates a new extension point as 
 well for field-filters for manipulation of fields during the indexing process.




[jira] Resolved: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-662.


Resolution: Fixed

Committed with revision 722475

 Upgrade Nutch to use Lucene 2.4
 ---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, 
 lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch


 Upgrade nutch to use Lucene 2.4.  This release changes the lucene file 
 format.  New indexes created by this lucene version will NOT be readable by 
 older versions.  Lucene 2.4 can read and update older index formats although 
 updating an older format will convert it to the new format.  There are also 
 some performance and functionality improvements.




[jira] Closed: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-663.
--


 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.




[jira] Closed: (NUTCH-647) Resolve URLs tool

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-647.
--


 Resolve URLs tool
 -

 Key: NUTCH-647
 URL: https://issues.apache.org/jira/browse/NUTCH-647
 Project: Nutch
  Issue Type: New Feature
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch


 A tool that takes a listing of urls and attempts to resolve their IP 
 addresses.  Useful for running after the fetcher has run to determine if DNS 
 problems exist.




[jira] Resolved: (NUTCH-647) Resolve URLs tool

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-647.


   Resolution: Fixed
Fix Version/s: 1.0.0

Committed with revision 722478

 Resolve URLs tool
 -

 Key: NUTCH-647
 URL: https://issues.apache.org/jira/browse/NUTCH-647
 Project: Nutch
  Issue Type: New Feature
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch


 A tool that takes a listing of urls and attempts to resolve their IP 
 addresses.  Useful for running after the fetcher has run to determine if DNS 
 problems exist.




[jira] Resolved: (NUTCH-665) Search Load Testing Tool

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-665.


Resolution: Fixed

Committed with revision 722481

 Search Load Testing Tool
 

 Key: NUTCH-665
 URL: https://issues.apache.org/jira/browse/NUTCH-665
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-665-20081126-1.patch


 A tool which spawns a number of threads and executes searches against 
 configured search servers.  This is used for light load testing of search 
 servers.
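
A minimal sketch of this kind of light load test, assuming the search call is passed in as a plain function (the `search` parameter and the class name are hypothetical stand-ins, not the API from the NUTCH-665 patch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical load-test driver for illustration only. */
final class LoadTestSketch {

  /**
   * Runs queriesPerThread searches on each of the given number of worker
   * threads and returns how many searches threw an exception.
   */
  static int run(int threads, int queriesPerThread, String query,
                 Consumer<String> search) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    AtomicInteger failures = new AtomicInteger();
    List<Future<?>> pending = new ArrayList<>();
    try {
      for (int t = 0; t < threads; t++) {
        pending.add(pool.submit(() -> {
          for (int i = 0; i < queriesPerThread; i++) {
            try {
              search.accept(query);
            } catch (RuntimeException e) {
              failures.incrementAndGet();
            }
          }
        }));
      }
      for (Future<?> f : pending) {
        try {
          f.get(); // wait for every worker to finish
        } catch (InterruptedException | ExecutionException e) {
          throw new IllegalStateException("worker did not finish cleanly", e);
        }
      }
    } finally {
      pool.shutdown();
    }
    return failures.get();
  }
}
```

In a real run the `search` function would call one of the configured search servers; timing each call would be a natural next step.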

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-665) Search Load Testing Tool

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-665.
--


 Search Load Testing Tool
 

 Key: NUTCH-665
 URL: https://issues.apache.org/jira/browse/NUTCH-665
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-665-20081126-1.patch


 A tool which spawns a number of threads and executes searches against 
 configured search servers.  This is used for light load testing of search 
 servers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-667) Input Format for working with Content in Hadoop Streaming

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-667.
--


 Input Format for working with Content in Hadoop Streaming
 -

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContextAsText input format that replaces line endings with spaces 
 so that Nutch content can be used more effectively inside Hadoop streaming 
 jobs, which allow MapReduce jobs to be written in any language that can 
 communicate with stdin and stdout.
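
To show why the line endings matter: a streaming script reads one record per line from stdin, so content with embedded newlines would be split into several records. The sketch below is only the text transformation, under the assumption that each record is emitted as a tab-separated key and value; the class and method names are hypothetical and this is not the input format from the patch.

```java
/** Hypothetical illustration of the flattening step only. */
final class FlattenSketch {

  /** Replaces every line ending (\r\n, \r, or \n) with a single space. */
  static String flatten(String content) {
    return content.replaceAll("\\r\\n|\\r|\\n", " ");
  }

  /** Builds one single-line record: the URL, a tab, then the flattened content. */
  static String toStreamingLine(String url, String content) {
    return url + "\t" + flatten(content);
  }
}
```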

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-667) Input Format for working with Content in Hadoop Streaming

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-667.


Resolution: Fixed

Committed with revision 722483

 Input Format for working with Content in Hadoop Streaming
 -

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContextAsText input format that replaces line endings with spaces 
 so that Nutch content can be used more effectively inside Hadoop streaming 
 jobs, which allow MapReduce jobs to be written in any language that can 
 communicate with stdin and stdout.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-668) Domain URL Filter

2008-12-02 Thread Dennis Kubes (JIRA)
Domain URL Filter
-

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


A URLFilter that adds the ability to filter out URLs by top level domain or by 
hostname.  A configuration file with a listing of URLs is used to denote 
accepted urls.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-668) Domain URL Filter

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-668:
---

Attachment: NUTCH-668-1-20081202.patch

Includes the DomainURLFilter and test files.  Domains can either be filtered by 
top level domains ignoring subdomains, or by hostnames through configuration.  
There is a configuration file where valid domains are placed one per line.  
Those domains are used to create a valid domain set against which we validate 
urls at runtime.  Only urls which match domains in the domain set are 
considered valid.
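
The matching described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the DomainURLFilter from the patch: the class name is hypothetical, the configuration is passed in as a string rather than read from a file, and a URL is accepted when its host or any parent domain of the host appears in the set. Returning the URL for "accepted" and null for "rejected" follows the usual Nutch URLFilter convention.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a domain-set filter; not the class from the patch. */
final class DomainFilterSketch {
  private final Set<String> allowed = new HashSet<>();

  /** configText holds one domain or hostname per line; blank lines are skipped. */
  DomainFilterSketch(String configText) {
    for (String line : configText.split("\\R")) {
      String entry = line.trim().toLowerCase(Locale.ROOT);
      if (!entry.isEmpty()) {
        allowed.add(entry);
      }
    }
  }

  /** Returns the URL unchanged if its host or a parent domain is allowed, else null. */
  String filter(String url) {
    String host;
    try {
      host = URI.create(url).getHost();
    } catch (IllegalArgumentException e) {
      return null;
    }
    if (host == null) {
      return null;
    }
    String candidate = host.toLowerCase(Locale.ROOT);
    while (true) {
      if (allowed.contains(candidate)) {
        return url;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return null;
      }
      candidate = candidate.substring(dot + 1); // drop the leftmost label
    }
  }
}
```

With this scheme an entry such as `apache.org` accepts every subdomain, while an entry such as `www.example.com` accepts only that host and hosts beneath it.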

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Pending Commits for Nutch Issues

2008-11-27 Thread Dennis Kubes



Doğacan Güney wrote:

Hi Dennis,

On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes [EMAIL PROTECTED] wrote:

If nobody has a problem with them I would like to commit the following
issues in the next day or two:

NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
NUTCH-647: Resolve URLs tool
NUTCH-665: Search Load Testing Tool
NUTCH-667: Input Format for working with Content in Hadoop Streaming

And I would like to commit these in  a week:

NUTCH-635: LinkAnalysis Tool for Nutch
NUTCH-646: New Indexing framework for Nutch
NUTCH-594: Serve Nutch search results in XML and JSON
NUTCH-666: Analysis plugins and new language identifier.

There are others too but these are the ones I am trying to get moved into
trunk right now.



I am OK with all but NUTCH-666... Why a new language identifier? (or
if a new one, why keep old one around?)


I haven't got the code pushed out yet.  I do have a production version 
running but I need to make it play nice with the Apache licensing 
requirements.  Current library I am using is under GPL.  The reason I 
switched was because I found that the old one wasn't working correctly 
for me.


I don't know the accuracy levels of the old language identifier but I 
found that with pages that contained both english and another language, 
it would often classify it as english.  The new language identifier I am 
currently using has an accuracy rate of 97% and is trainable as before 
for multiple languages.  Currently we have models for 20-30 languages.


Also the new language identifier works with the new indexing framework 
and with new functionality for custom fields.  The only reason I would 
keep the old one around would be for backwards compatibility for people 
currently using it.


I will push out a patch shortly and we can review.  If we don't want it 
to make it into this release I am ok with that.


Dennis





Dennis







[jira] Updated: (NUTCH-665) Search Load Testing Tool

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-665:
---

Attachment: NUTCH-665-20081126-1.patch

Search load testing tool.

 Search Load Testing Tool
 

 Key: NUTCH-665
 URL: https://issues.apache.org/jira/browse/NUTCH-665
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-665-20081126-1.patch


 A tool which spawns a number of threads and executes searches against 
 configured search servers.  This is used for light load testing of search 
 servers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-647) Resolve URLs tool

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-647:
---

Attachment: NUTCH-647-2-20081126.patch

Updated patch.

 Resolve URLs tool
 -

 Key: NUTCH-647
 URL: https://issues.apache.org/jira/browse/NUTCH-647
 Project: Nutch
  Issue Type: New Feature
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch


 A tool that takes a listing of urls and attempts to resolve their IP 
 addresses.  Useful for running after the fetcher has run to determine if DNS 
 problems exist.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2008-11-26 Thread Dennis Kubes (JIRA)
Analysis plugins for multiple language and new Language Identifier Tool
---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
Russian, and Thai.  Also includes a new Language Identifier tool that uses the 
new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Attachment: NUTCH-666-1-20081126.patch

Part one of patch.  This includes the new analyzers for different languages.  
Part two will include the new language identifier tool.

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Attachment: NUTCH-663-1-20081126.patch

Updates jar and native files

 Upgrade Nutch to use Hadoop 0.18.2
 --

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Attachment: hadoop-0.19.0-core.jar

Hadoop core jar

 Upgrade Nutch to use Hadoop 0.18.2
 --

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

2008-11-26 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650982#action_12650982
 ] 

Dennis Kubes commented on NUTCH-663:


Hadoop 0.19 was released.  I am integrating it and should have a patch 
shortly.

 Upgrade Nutch to use Hadoop 0.18.2
 --

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Summary: Upgrade Nutch to use Hadoop 0.19  (was: Upgrade Nutch to use 
Hadoop 0.18.2)

change to 0.19 instead of 0.18.2

 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Attachment: (was: NUTCH-666-1-20081126.patch)

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Attachment: NUTCH-663-1-20081126.patch

Updated patch to include API changes in Nutch classes.

 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Attachment: (was: NUTCH-663-1-20081126.patch)

 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-635:
---

Attachment: (was: NUTCH-635-8-20080818.patch)

 LinkAnalysis Tool for Nutch
 ---

 Key: NUTCH-635
 URL: https://issues.apache.org/jira/browse/NUTCH-635
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
 NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
 NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, 
 NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


 This is a basic pagerank type link analysis tool for nutch which simulates a 
 sparse matrix using inlinks and outlinks and converges after a given number 
 of iterations.  This tool is meant to replace the current scoring system in 
 nutch with a system that converges instead of exponentially increasing 
 scores.  Also includes a tool to create an outlinkdb.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-635:
---

Attachment: NUTCH-635-9-20081126.patch

Updated final patch for new link analysis framework.  I am also going to write 
up some documentation on the wiki for how this new process works.

 LinkAnalysis Tool for Nutch
 ---

 Key: NUTCH-635
 URL: https://issues.apache.org/jira/browse/NUTCH-635
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
 NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
 NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, 
 NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


 This is a basic pagerank type link analysis tool for nutch which simulates a 
 sparse matrix using inlinks and outlinks and converges after a given number 
 of iterations.  This tool is meant to replace the current scoring system in 
 nutch with a system that converges instead of exponentially increasing 
 scores.  Also includes a tool to create an outlinkdb.
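
For readers unfamiliar with this style of scoring, a minimal in-memory sketch of a PageRank-type iteration is shown below. It is not the MapReduce implementation from the patch: the class name, the damping parameter, and the in-memory map are assumptions made for illustration. Each page starts at 1.0 and, on every pass, hands a damped share of its score to its outlinks, so scores settle instead of growing without bound.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory sketch; the real tool runs as Hadoop jobs. */
final class LinkRankSketch {

  /**
   * outlinks maps each page to the pages it links to. Link targets should
   * also appear as keys; pages with no outlinks pass on no score here.
   */
  static Map<String, Double> rank(Map<String, List<String>> outlinks,
                                  int iterations, double damping) {
    Map<String, Double> scores = new HashMap<>();
    for (String page : outlinks.keySet()) {
      scores.put(page, 1.0);
    }
    for (int i = 0; i < iterations; i++) {
      Map<String, Double> next = new HashMap<>();
      for (String page : outlinks.keySet()) {
        next.put(page, 1.0 - damping); // base score every page receives
      }
      for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
        List<String> targets = e.getValue();
        if (targets.isEmpty()) {
          continue;
        }
        double share = damping * scores.get(e.getKey()) / targets.size();
        for (String target : targets) {
          next.merge(target, share, Double::sum);
        }
      }
      scores = next;
    }
    return scores;
  }
}
```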

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-667) Input Forma for working with Content in Hadoop Streaming

2008-11-26 Thread Dennis Kubes (JIRA)
Input Forma for working with Content in Hadoop Streaming


 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0


This is a ContextAsText input format that replaces line endings with spaces so 
that Nutch content can be used more effectively inside Hadoop streaming jobs, 
which allow MapReduce jobs to be written in any language that can communicate 
with stdin and stdout.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Attachment: NUTCH-666-1-20081126.patch

Fixed patch.  Now includes the changes to AnalyzerFactory to allow multiple 
languages per plugin.

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-667) Input Forma for working with Content in Hadoop Streaming

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-667:
---

Attachment: NUTCH-667-1-20081126.patch

Input format for working with hadoop streaming.

 Input Forma for working with Content in Hadoop Streaming
 

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContextAsText input format that replaces line endings with spaces 
 so that Nutch content can be used more effectively inside Hadoop streaming 
 jobs, which allow MapReduce jobs to be written in any language that can 
 communicate with stdin and stdout.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-667) Input Format for working with Content in Hadoop Streaming

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-667:
---

Summary: Input Format for working with Content in Hadoop Streaming  (was: 
Input Forma for working with Content in Hadoop Streaming)

 Input Format for working with Content in Hadoop Streaming
 -

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContextAsText input format that replaces line endings with spaces 
 so that Nutch content can be used more effectively inside Hadoop streaming 
 jobs, which allow MapReduce jobs to be written in any language that can 
 communicate with stdin and stdout.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-646) New Indexing Framework for Nutch

2008-11-26 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-646:
---

Attachment: NUTCH-646-2-20081126.patch

Updated indexing patch.

 New Indexing Framework for Nutch
 

 Key: NUTCH-646
 URL: https://issues.apache.org/jira/browse/NUTCH-646
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 0.9.0, 1.0.0

 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, 
 NUTCH-646-2-20081126.patch


 New indexing framework for Nutch that provides a more generic field 
 abstraction consistent with Lucene index semantics.  Allows multiple MR jobs 
 to be created for different fields and those fields to be aggregated and 
 indexed in the end.  Overcomes limitations of the current indexer that limit 
 what databases are passed into the indexer.  Creates a new extension point as 
 well for field-filters for manipulation of fields during the indexing process.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

2008-11-25 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650713#action_12650713
 ] 

Dennis Kubes commented on NUTCH-663:


@buddha1021
The 1.0 release for Nutch has some of the features for Nutch 2 but it is not a 
complete Nutch 2 Architecture.  We felt it was best to add some needed 
features into the current version of Nutch and get them deployed to the 
community quickly.  A lot of people have been asking about the development of 
Nutch and releasing.  Truth is we have just been busy adding in needed features 
and patches.  We should have a release out in the next couple of weeks.  That 
will be a 1.0 release for Nutch but will probably contain a 0.18.2 or 0.19 
release of Hadoop.  We aren't waiting for Hadoop to go to 1.0.

@Doğacan Güney
I am not opposed to waiting for 0.19 as long as it will be released soon.  I 
was looking and it seemed they tried to release a little while back and didn't 
finish because of some big errors.

 Upgrade Nutch to use Hadoop 0.18.2
 --

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-11-23 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650009#action_12650009
 ] 

Dennis Kubes commented on NUTCH-662:


We had been running in production for about a month and never saw any issues 
with the indexing processes using 2.4.  Then I was doing some work for 
upgrading the trunk and it popped up in delete duplicates unit testing.  We 
don't do delete duplicates in our JobStream, we do it query side.  

First problem was that the old DfsIndexOutput didn't implement the seek method 
(probably because DFS can't seek), so when that was changed to allow it to 
seek, it was throwing Checksum errors on the index when it was trying to open 
it.  It turns out, as noted above, that 2.4 purposefully writes a bad checksum, 
seeks back, and then writes the correct checksum when closing the index, as a 
pseudo-two-phase commit.  So I don't think it will affect the indexing process 
because as you noted it writes to local first then just transfers to DFS.  In 
changing DfsIndexOutput to allow DeleteDuplicates to work I just took the same 
approach, local first, then put to DFS.
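
The "local first, then put to DFS" approach can be sketched as below. This is a simplified stand-in, not the patched DfsIndexOutput: the class name is hypothetical, and a plain `OutputStream` plays the role of the non-seekable DFS file. All writes and seeks go to a local temporary file, and the bytes are only copied to the destination when `close()` is called.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;

/** Hypothetical seekable output that buffers in a local temp file. */
final class LocalFirstOutputSketch implements Closeable {
  private final File temp;
  private final RandomAccessFile local;
  private final OutputStream destination;

  LocalFirstOutputSketch(OutputStream destination) throws IOException {
    this.temp = File.createTempFile("index-output", ".tmp");
    this.temp.deleteOnExit(); // safety net if close() is never reached
    this.local = new RandomAccessFile(temp, "rw");
    this.destination = destination;
  }

  void write(byte[] bytes) throws IOException {
    local.write(bytes);
  }

  void writeLong(long value) throws IOException {
    local.writeLong(value);
  }

  /** Seeking is possible because the data still lives in a local file. */
  void seek(long position) throws IOException {
    local.seek(position);
  }

  long position() throws IOException {
    return local.getFilePointer();
  }

  /** Copies the finished local file to the destination in one pass. */
  @Override
  public void close() throws IOException {
    local.close();
    try {
      Files.copy(temp.toPath(), destination);
      destination.flush();
    } finally {
      temp.delete();
    }
  }
}
```

The trade-off is extra local disk usage and a copy at close time in exchange for being able to rewrite earlier bytes, such as the checksum slot.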

 Upgrade Nutch to use Lucene 2.4
 ---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, 
 lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch


 Upgrade nutch to use Lucene 2.4.  This release changes the lucene file 
 format.  New indexes created by this lucene version will NOT be readable by 
 older versions.  Lucene 2.4 can read and update older index formats although 
 updating an older format will convert it to the new format.  There are also 
 some performance and functionality improvements.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-11-21 Thread Dennis Kubes (JIRA)
Upgrade Nutch to use Lucene 2.4
---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


Upgrade nutch to use Lucene 2.4.  This release changes the lucene file format.  
New indexes created by this lucene version will NOT be readable by older 
versions.  Lucene 2.4 can read and update older index formats although updating 
an older format will convert it to the new format.  There are also some 
performance and functionality improvements.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

2008-11-21 Thread Dennis Kubes (JIRA)
Upgrade Nutch to use Hadoop 0.18.2
--

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes performance 
improvements, bug fixes, and new functionality.  Changes some current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-11-21 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-662:
---

Attachment: lucene-misc-2.4.0.jar

 Upgrade Nutch to use Lucene 2.4
 ---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar


 Upgrade nutch to use Lucene 2.4.  This release changes the lucene file 
 format.  New indexes created by this lucene version will NOT be readable by 
 older versions.  Lucene 2.4 can read and update older index formats although 
 updating an older format will convert it to the new format.  There are also 
 some performance and functionality improvements.




[jira] Commented: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-11-21 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649679#action_12649679
 ] 

Dennis Kubes commented on NUTCH-662:


The upgrade to Lucene 2.4 causes a weird problem that might need some 
discussion.  The o.a.n.indexer.FsDirectory$DfsIndexOutput class is used to 
interact with an index stored on DFS.  In Lucene 2.4, the 
ChecksumIndexOutput.prepareCommit and finishCommit methods perform a pseudo 
two-phase commit.  prepareCommit writes an intentionally mismatched checksum 
(long = checksum - 1), then flushes and seeks back; finishCommit then writes 
the correct checksum in the same spot.  They say this is to ensure the commit.  
Because DFS files can only be written sequentially (there is no append or 
random-write support), we can't write to one, seek back to a position, and 
write again.  DFS is effectively write-once.
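
As a rough illustration of the write, seek back, and overwrite pattern 
described above, here is a minimal sketch against a local file.  It is not 
Lucene's code: the class name, the helper methods, and the checksum value are 
hypothetical, and a RandomAccessFile stands in for a seekable output.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical illustration of a pseudo two-phase commit on a seekable file.
// This is NOT Lucene's ChecksumIndexOutput; all names here are assumptions.
public class TwoPhaseChecksumSketch {

    // Phase 1: write a deliberately wrong checksum, sync, and seek back to it.
    public static long prepareCommit(RandomAccessFile out, long checksum) throws IOException {
        long pos = out.getFilePointer();
        out.writeLong(checksum - 1); // placeholder that can never validate
        out.getFD().sync();          // push the bytes to the device
        out.seek(pos);               // only possible on a seekable, rewritable file
        return pos;
    }

    // Phase 2: overwrite the placeholder with the real checksum.
    public static void finishCommit(RandomAccessFile out, long checksum) throws IOException {
        out.writeLong(checksum);
        out.getFD().sync();
    }

    // Helper: read the 8-byte value stored at the given offset.
    public static long readLongAt(File f, long pos) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(f, "r")) {
            in.seek(pos);
            return in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("segments", ".tmp");
        f.deleteOnExit();
        try (RandomAccessFile out = new RandomAccessFile(f, "rw")) {
            out.writeBytes("index-data");
            long checksum = 42L; // made-up value for the demo
            long pos = prepareCommit(out, checksum);
            System.out.println("after prepare: " + readLongAt(f, pos));
            finishCommit(out, checksum);
            System.out.println("after finish:  " + readLongAt(f, pos));
        }
    }
}
```

The seek in prepareCommit is the step a sequential-only DFS output stream 
cannot perform.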

To handle this problem in the attached patch, I first write out to a local 
temporary file that is deleted upon exit; then, when close is called on the 
IndexOutput, that file is written out to DFS all at once.  I don't know if this 
is the best way to do this or if there is a better way, but it does handle the 
new write and seek functionality of Lucene 2.4.  The previous implementation of 
DfsIndexOutput simply threw an UnsupportedOperationException when the seek 
method was called.  This was fine before 2.4, as Lucene wasn't calling that 
method while writing to DFS.  In 2.4 it does, and unit tests were failing 
because of it.  What does everybody think about this implementation?
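
For readers without the patch at hand, the general idea can be sketched as 
follows.  This is not the code in NUTCH-662-20081121-1.patch: the class name 
and methods are hypothetical, and a plain java.io.OutputStream stands in for 
the sequential-only DFS stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Hypothetical sketch: buffer all writes in a seekable local temp file and
// copy it to a sequential-only destination on close. Names are assumptions,
// not taken from the actual patch.
public class LocalBufferedOutput implements Closeable {
    private final File tmp;
    private final RandomAccessFile local;
    private final OutputStream remote; // stand-in for a DFS output stream

    public LocalBufferedOutput(OutputStream remote) throws IOException {
        this.tmp = File.createTempFile("index-output", ".tmp");
        this.tmp.deleteOnExit(); // cleanup fallback if close() is never reached
        this.local = new RandomAccessFile(tmp, "rw");
        this.remote = remote;
    }

    public void writeBytes(byte[] b) throws IOException { local.write(b); }
    public void writeLong(long v) throws IOException { local.writeLong(v); }
    public long getFilePointer() throws IOException { return local.getFilePointer(); }

    // Seeking is safe here because it only touches the local file.
    public void seek(long pos) throws IOException { local.seek(pos); }

    @Override
    public void close() throws IOException {
        try {
            // Copy the finished file to the destination in one sequential pass.
            local.seek(0);
            byte[] buf = new byte[8192];
            int n;
            while ((n = local.read(buf)) != -1) {
                remote.write(buf, 0, n);
            }
            remote.flush();
        } finally {
            local.close();
            remote.close();
            tmp.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream dfs = new ByteArrayOutputStream();
        LocalBufferedOutput out = new LocalBufferedOutput(dfs);
        out.writeBytes(new byte[] {1, 2, 3});
        long pos = out.getFilePointer();
        out.writeLong(41L); // mismatched placeholder
        out.seek(pos);
        out.writeLong(42L); // overwritten locally before anything reaches "DFS"
        out.close();
        System.out.println(dfs.size() + " bytes copied");
    }
}
```

The trade-off in a design like this is that every index file is written 
twice (once locally, once to the destination) and needs local disk space 
equal to its full size until close is called.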

Other than that, I don't see any major issues in upgrading to 2.4.  Some people 
have said performance was down in 2.4.  My thought is that this might be the 
case, but those regressions will be fixed, and it would be good to be on the 
most recent Lucene version as we move to a 1.0 release for Nutch.  Also, we have 
been using 2.4 in production for a month now without any issues.

 Upgrade Nutch to use Lucene 2.4
 ---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar, 
 NUTCH-662-20081121-1.patch


 Upgrade nutch to use Lucene 2.4.  This release changes the lucene file 
 format.  New indexes created by this lucene version will NOT be readable by 
 older versions.  Lucene 2.4 can read and update older index formats although 
 updating an older format will convert it to the new format.  There are also 
 some performance and functionality improvements.




[jira] Updated: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-11-21 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-662:
---

Attachment: lucene-analyzers-2.4.0.jar

 Upgrade Nutch to use Lucene 2.4
 ---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, 
 lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch


 Upgrade nutch to use Lucene 2.4.  This release changes the lucene file 
 format.  New indexes created by this lucene version will NOT be readable by 
 older versions.  Lucene 2.4 can read and update older index formats although 
 updating an older format will convert it to the new format.  There are also 
 some performance and functionality improvements.



