from:"Dennis Kubes"


 [ 
https://issues.apache.org/jira/browse/NUTCH-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-291.
--

Resolution: Fixed

The open search servlet has been superseded by formatters for serving results 
in xml and json format.  Closing issue.

 OpenSearchServlet should return date as well as lastModified
 

 Key: NUTCH-291
 URL: https://issues.apache.org/jira/browse/NUTCH-291
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.8
Reporter: Stefan Neufeind
Assignee: Dennis Kubes
 Attachments: NUTCH-291-unfinished.patch


 Currently lastModified is provided by OpenSearchServlet, but only if the 
 lastModified date is known.
 Since you can sort by date (which is lastModified or if not present the 
 fetchdate), it might be useful if OpenSearchServlet could provide date as 
 well.
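
The fallback described above (use lastModified when it is known, otherwise the fetch date) could be sketched roughly as follows. The class and method names are hypothetical and are not taken from OpenSearchServlet; the sketch also assumes that a value of 0 or less means "unknown".

```java
// Hypothetical helper for illustration; not part of the Nutch code base.
final class ResultDate {
    private ResultDate() {}

    /**
     * Returns the date to report for a search hit: lastModified when it is
     * known, otherwise the fetch time.
     * Assumption: a value <= 0 means the lastModified date is unknown.
     */
    static long effectiveDate(long lastModifiedMillis, long fetchTimeMillis) {
        return lastModifiedMillis > 0 ? lastModifiedMillis : fetchTimeMillis;
    }
}
```

With this shape the servlet could always emit a date element and emit lastModified only when it is actually known.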

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist

NPE in FieldIndexer when BasicFields url doesn't exist
--

 Key: NUTCH-729
 URL: https://issues.apache.org/jira/browse/NUTCH-729
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0, 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1


There is a NullPointerException during a logging call in FieldIndexer when 
there isn't a url for a document.  Documents shouldn't be without urls but 
since the FieldIndexer doesn't validate fields it is possible for it to occur.  
Most often this happens when BasicFields is run with the wrong segments 
directory and doesn't complain.  It could also occur if using the FieldIndexer 
to index things other than basic fields.


[jira] Updated: (NUTCH-729) NPE in FieldIndexer when BasicFields url doesn't exist


 [ 
https://issues.apache.org/jira/browse/NUTCH-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-729:
---

Attachment: NUTCH-729-1-20090235.patch

Simple patch.  Changes the logging to use the key (which should be url and 
which should always exist).
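
A minimal sketch of the failure mode and of the fix described above, assuming the log message was built by dereferencing a "url" field that may be missing. The class, the method names, and the message text are hypothetical; the actual FieldIndexer code is not reproduced here.

```java
import java.util.Map;

// Illustrative only: mirrors the bug and the fix in simplified form.
final class IndexLogSketch {
    private IndexLogSketch() {}

    /** Failure mode: throws NullPointerException when no "url" field exists. */
    static String messageFromField(Map<String, String> fields) {
        return "Indexing [" + fields.get("url").trim() + "]";
    }

    /** Fix: log the record key, which is expected to always be present. */
    static String messageFromKey(String key) {
        return "Indexing [" + key + "]";
    }
}
```

Logging the key avoids the dereference entirely, so a document without a url field no longer aborts the job at the logging step.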

 NPE in FieldIndexer when BasicFields url doesn't exist
 --

 Key: NUTCH-729
 URL: https://issues.apache.org/jira/browse/NUTCH-729
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.9.0, 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-729-1-20090235.patch


 There is a NullPointerException during a logging call in FieldIndexer when 
 there isn't a url for a document.  Documents shouldn't be without urls but 
 since the FieldIndexer doesn't validate fields it is possible for it to 
 occur.  Most often this happens when BasicFields is run with the wrong 
 segments directory and doesn't complain.  It could also occur if using the 
 FieldIndexer to index things other than basic fields.


Re: [VOTE] Release Apache Nutch 1.0

2009-03-25 Thread Dennis Kubes


+1, is this binding? :)

Doğacan Güney wrote:

Another non-binding +1 from me.

Hope this one is a keeper :D

On Mon, Mar 23, 2009 at 22:28, Sami Siren ssi...@gmail.com wrote:


Hello,

I have packaged the third release candidate for Apache Nutch 1.0
release at http://people.apache.org/~siren/nutch-1.0/rc2/

See the CHANGES.txt[1] file for details on release contents and
latest changes. The release was made from tag:
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/

The following issues that were discovered during the review of last
rc have been fixed:

https://issues.apache.org/jira/browse/NUTCH-722
https://issues.apache.org/jira/browse/NUTCH-723
https://issues.apache.org/jira/browse/NUTCH-725
https://issues.apache.org/jira/browse/NUTCH-726
https://issues.apache.org/jira/browse/NUTCH-727

Please vote on releasing this package as Apache Nutch 1.0. The vote
is open for the next 72 hours. Only votes from Lucene PMC members
are binding, but everyone is welcome to check the release candidate
and voice their approval or disapproval. The vote  passes if at
least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Here's my +1


Thanks!


[1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511
-- 
Sami Siren





--
Doğacan Güney

[jira] Created: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph

NPE in LinkRank if no nodes with which to create the WebGraph
-

 Key: NUTCH-730
 URL: https://issues.apache.org/jira/browse/NUTCH-730
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0, 1.1


For LinkRank, if there are no nodes to process, then a NullPointerException is 
thrown when trying to count number of nodes.


[jira] Updated: (NUTCH-730) NPE in LinkRank if no nodes with which to create the WebGraph


 [ 
https://issues.apache.org/jira/browse/NUTCH-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-730:
---

Attachment: NUTCH-730-1-20090325.patch

Throws a more detailed error message if there are no nodes to process.  This 
shouldn't happen on large web graphs but may happen on smaller webgraphs or 
webgraphs that are all inside one domain (including subdomains).
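
The kind of guard described above might look like this sketch. It assumes the node count is read back as a single line of text that is null or blank when the counter job produced no output; the class name, method name, and error message are hypothetical and do not reproduce the patch.

```java
import java.io.IOException;

// Illustrative guard; not the actual LinkRank code.
final class NodeCountCheck {
    private NodeCountCheck() {}

    /**
     * Parses the node count, failing with a descriptive message instead of a
     * NullPointerException when there is nothing to count.
     */
    static int parseNodeCount(String line) throws IOException {
        if (line == null || line.trim().isEmpty()) {
            throw new IOException("No nodes to process: the web graph is empty. "
                + "This can happen on small graphs or graphs limited to one domain.");
        }
        return Integer.parseInt(line.trim());
    }
}
```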

 NPE in LinkRank if no nodes with which to create the WebGraph
 -

 Key: NUTCH-730
 URL: https://issues.apache.org/jira/browse/NUTCH-730
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0, 1.1

 Attachments: NUTCH-730-1-20090325.patch


 For LinkRank, if there are no nodes to process, then a NullPointerException 
 is thrown when trying to count number of nodes.


Re: [VOTE] Release Apache Nutch 1.0

2009-03-08 Thread Dennis Kubes


Non-binding +1 too :)

Sami Siren wrote:

Hello,

I have packaged the first release candidate for Apache Nutch 1.0 release at

http://people.apache.org/~siren/nutch-1.0/rc0/

See the included CHANGES.txt file for details on release contents and 
latest changes. The release was made from tag: 
http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 



Please vote on releasing this package as Apache Nutch 1.0. The vote is 
open for the next 72 hours. Only votes from Lucene PMC members are 
binding, but everyone is welcome to check the release candidate and 
voice their approval or disapproval. The vote  passes if at least three 
binding +1 votes are cast.


[ ] +1 Release the packages as Apache Nutch 1.0
[ ] -1 Do not release the packages because...

Thanks!

--
Sami Siren

Re: planning for nutch-1.0-rc1

2009-03-08 Thread Dennis Kubes

Sorry about the docs being sparse on this.  I will write more about the 
process as time permits.  Don't know about the problem below.  What 
platform are you running on, windows, linux?


Dennis

Bartosz Gadzimski wrote:

Hello,

Thanks, Dennis, for updating the wiki; it helped a lot.

You gave an example with indexing but didn't say much about it. Can 
you write some more? :)


Anyway, I have problems at the last step (Nutch from 7 March):

bin/nutch org.apache.nutch.indexer.field.FieldIndexer

It simply stops somewhere:

2009-03-07 16:09:04,432 INFO  field.FieldIndexer - FieldIndexer: starting
2009-03-07 16:09:04,436 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/basicfields
2009-03-07 16:09:04,498 INFO  field.FieldIndexer - FieldIndexer: adding fields db: crawl/fields/anchorfields
2009-03-07 16:09:05,636 INFO  plugin.PluginRepository - Plugins: looking in: /usr/local/nutch/plugins
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Registered Plugins:
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-03-07 16:09:06,437 INFO  plugin.PluginRepository - Basic Query Filter (query-basic)

 plugins

2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IFD [Thread-11]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@1b4a74b
2009-03-07 16:09:07,769 INFO  field.FieldIndexer - IW 0 [Thread-11]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/tmp/hadoop-root/mapred/local/index/_-884655313 autoCommit=true mergepolicy=org.apache.lucene.index.logbytesizemergepol...@15356d5 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@69d02b ramBufferSizeMB=16.0 maxBufferedDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=1 index=

2009-03-07 16:09:07,781 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
   at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:139)
   at org.apache.nutch.indexer.field.FieldIndexer$OutputFormat$1.write(FieldIndexer.java:131)
   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
   at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:239)
   at org.apache.nutch.indexer.field.FieldIndexer.reduce(FieldIndexer.java:69)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
2009-03-07 16:09:08,197 FATAL field.FieldIndexer - FieldIndexer: java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
   at org.apache.nutch.indexer.field.FieldIndexer.index(FieldIndexer.java:267)
   at org.apache.nutch.indexer.field.FieldIndexer.run(FieldIndexer.java:312)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.indexer.field.FieldIndexer.main(FieldIndexer.java:275)





In crawl/indexes there is only a _temporary folder.

I will try to debug this, but I have problems running Nutch in Eclipse.

Thanks,
Bartosz



Dennis Kubes wrote:
I don't know if I would make this primary yet.  I need to check what 
is causing this as it worked fine for me, in fact we currently have it 
in production.  Also we would need to update the shell scripts to 
integrate this more tightly.


Dennis

Bartosz Gadzimski wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

Sami Siren wrote:
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 
in Jira. Neither of the two remaining _bugs_ seems too important 
to me, actually I only count the issues assigned to developers as 
real candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)


There's one Critical issue reported, related to NekoHTML 
(NUTCH-700). I'm not sure what are the feature differences 
(pertinent to Nutch) between 0.9.4 and 1.9.11 - perhaps downgrading 
is the safest course of action.

I will take care of that.



I am also volunteering to push all open issues to 1.1 before 
starting the RC build on Tuesday. Any objections on the proposed 
procedure or timing?


Sounds good.

great!

--
Sami Siren



What about new scoring and new indexing? Will it be integrated as a 
primary scoring algorithm? I have a problem with it on LinkRank:


2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException

Re: planning for nutch-1.0-rc1

2009-03-06 Thread Dennis Kubes

NUTCH-578 was a while back but as I remember it worked fine.  No 
objections to either including or pushing it.


Dennis

Sami Siren wrote:
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 in 
Jira. Neither of the two remaining _bugs_ seems too important to me, 
actually I only count the issues assigned to developers as real 
candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)

I am also volunteering to push all open issues to 1.1 before starting 
the RC build on Tuesday. Any objections on the proposed procedure or 
timing?


--
Sami Siren

Re: planning for nutch-1.0-rc1

2009-03-06 Thread Dennis Kubes

I don't know if I would make this primary yet.  I need to check what is 
causing this as it worked fine for me, in fact we currently have it in 
production.  Also we would need to update the shell scripts to integrate 
this more tightly.


Dennis

Bartosz Gadzimski wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

Sami Siren wrote:
I am planning to build the first rc for nutch 1.0 at Tue 3.3.2009 
morning (EET). There are still some issues marked as fix for 1.0 in 
Jira. Neither of the two remaining _bugs_ seems too important to me, 
actually I only count the issues assigned to developers as real 
candidates to be included in 1.0:


NUTCH-578 (kubes)
NUTCH-477 (ab)
NUTCH-669 (siren)


There's one Critical issue reported, related to NekoHTML (NUTCH-700). 
I'm not sure what are the feature differences (pertinent to Nutch) 
between 0.9.4 and 1.9.11 - perhaps downgrading is the safest course 
of action.

I will take care of that.



I am also volunteering to push all open issues to 1.1 before 
starting the RC build on Tuesday. Any objections on the proposed 
procedure or timing?


Sounds good.

great!

--
Sami Siren



What about new scoring and new indexing? Will it be integrated as a 
primary scoring algorithm? I have a problem with it on LinkRank:


2009-03-02 20:43:45,708 INFO  webgraph.LinkRank - Starting link counter job
2009-03-02 20:43:47,838 INFO  webgraph.LinkRank - Finished link counter job
2009-03-02 20:43:47,839 INFO  webgraph.LinkRank - Reading numlinks temp file
2009-03-02 20:43:47,840 INFO  webgraph.LinkRank - Deleting numlinks temp file
2009-03-02 20:43:47,842 FATAL webgraph.LinkRank - LinkAnalysis: java.lang.NullPointerException
   at org.apache.nutch.scoring.webgraph.LinkRank.runCounter(LinkRank.java:113)
   at org.apache.nutch.scoring.webgraph.LinkRank.analyze(LinkRank.java:582)
   at org.apache.nutch.scoring.webgraph.LinkRank.run(LinkRank.java:657)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.scoring.webgraph.LinkRank.main(LinkRank.java:627)


Another question: what about the indexing framework mentioned here:
http://www.mail-archive.com/nutch-u...@lucene.apache.org/msg11764.html


Having all the new scoring and indexing would be a real step forward.

Thanks,
Bartosz

[jira] Commented: (NUTCH-477) Extend URLFilters to support different filtering chains

2009-02-23 Thread Dennis Kubes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675907#action_12675907
 ] 

Dennis Kubes commented on NUTCH-477:


Same here.  I am not against having extra functionality, but I don't think I 
have ever used the chain options of normalizers either.  I guess the call is do 
we want it in 1.0 or not.  My thinking is we are going to be doing major 
redesign changes post 1.0 so doing lots of code refactoring wouldn't be a big 
deal.

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.
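
To make the two proposed changes concrete, here is a rough sketch of a filter chain that is selected by scope and stops early on a decisive return code. Everything in it is an assumption for illustration: the interface, the three code values, the "default" scope name, and the choice to accept a URL when no filter decides are not taken from the attached patch.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the URLFilters API.
final class FilterChainSketch {
    private FilterChainSketch() {}

    static final int ACCEPT = 0;    // keep the URL and stop the chain
    static final int REJECT = 1;    // drop the URL and stop the chain
    static final int CONTINUE = 2;  // no decision, ask the next filter

    interface CodeFilter {
        int filter(String url);
    }

    /** Runs the chain, terminating at the first ACCEPT or REJECT. */
    static boolean accepts(List<CodeFilter> chain, String url) {
        for (CodeFilter f : chain) {
            int code = f.filter(url);
            if (code == ACCEPT) return true;
            if (code == REJECT) return false;
        }
        return true; // assumption: an undecided URL is accepted
    }

    /** Picks the chain for a scope, falling back to the "default" scope. */
    static boolean acceptsInScope(Map<String, List<CodeFilter>> chainsByScope,
                                  String scope, String url) {
        List<CodeFilter> fallback =
            chainsByScope.getOrDefault("default", Collections.<CodeFilter>emptyList());
        return accepts(chainsByScope.getOrDefault(scope, fallback), url);
    }
}
```

A boolean or null-based return value cannot distinguish "accepted, stop here" from "no opinion", which is why an int code is needed for early termination.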


[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Affects Version/s: (was: 1.0.0)
   1.1
Fix Version/s: (was: 1.0.0)
   1.1

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.


[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Dennis Kubes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666484#action_12666484
 ] 

Dennis Kubes commented on NUTCH-666:


It is ok to move to 1.1.  

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.


Re: Site update

2009-01-05 Thread Dennis Kubes


http://www.mail-archive.com/d...@forrest.apache.org/msg15136.html

This might help.

Dennis

Andrzej Bialecki wrote:

Otis Gospodnetic wrote:
Below is what it spits out.  I'm not sure what the cause is.  I did 
try forrest seed and forrest validate as prescribed at 
https://issues.apache.org/jira/browse/FOR-984?focusedCommentId=12649593#action_12649593 
but forrest validate failed.


validate-sitemap:
/home/otis/apache-forrest/main/webapp/resources/schema/relaxng/sitemap-v06.rng:72:31: 
error: datatype library "http://www.w3.org/2001/XMLSchema-datatypes" not recognized


[...]

No clue. I'd say that until we figure out what happens we can go forward 
- if it generates a consistent and usable output.

[jira] Closed: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2009-01-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-594.
--


 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch, 
 NUTCH-594-4-20081230.patch, NUTCH-594-5-20081231.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.


[jira] Commented: (NUTCH-572) Scoring and redirected Urls

2009-01-02 Thread Dennis Kubes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660394#action_12660394
 ] 

Dennis Kubes commented on NUTCH-572:


I would like to close this issue.  Redirect handling has undergone significant 
changes since this issue was opened and we still need to take a hard look at 
redirects and possibly how scores are represented.  However, the newer scoring 
and indexing frameworks do work around this issue.

 Scoring and redirected Urls
 ---

 Key: NUTCH-572
 URL: https://issues.apache.org/jira/browse/NUTCH-572
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


 When a redirect is found for a given url, the new or end url is stored as the 
 content page and the old CrawlDatum gets one of a few redirect codes.  The 
 page that gets indexed in Nutch is the end page and it gets indexed under the 
 end url.  Many times a site will have a significant number of links pointing 
 to the start page and very few pointing to the redirected end page.  This is 
 especially true for external links.  OPIC scores do not get transferred to the 
 end page but stay with the start page (the one doing the redirecting).  But 
 the start page doesn't get indexed.  Hence the end page will show up in the 
 index but with a usually much reduced score.  A good example of this is 
 cnn.com:
 URL: http://www.cnn.com/
 Version: 6
 Status: 5 (db_redir_perm)
 Fetch time: Tue Dec 04 11:02:09 CST 2007
 Modified time: Wed Dec 31 18:00:00 CST 1969
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 51.19438
 Signature: b5baaf80e9e10aa6205fc39051c362ff
 Metadata: _pst_:success(1), lastModified=0
 which redirects to http://www.cnn.com/?refresh=1
 URL: http://www.cnn.com/?refresh=1
 Version: 6
 Status: 2 (db_fetched)
 Fetch time: Tue Dec 04 11:02:11 CST 2007
 Modified time: Wed Dec 31 18:00:00 CST 1969
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 1.0
 Signature: b5baaf80e9e10aa6205fc39051c362ff
 Metadata: _pst_:success(1), lastModified=0
 Now cnn.com, which should be one of the highest-ranking sites in the index, 
 if not the highest, for keywords such as news, in fact doesn't show up in the 
 index, and its redirected end page appears much farther down in search 
 results.  My proposal is that we somehow make OPIC scores follow redirects.  
 To do this we would most likely need to store a start and end url for 
 redirected urls.
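
One way to picture the proposal is the sketch below, which moves a start page's score onto its redirect target given a map of start url to end url. It is not how Nutch's OPIC scoring is implemented; the class name, the use of plain maps, the decision to add the scores together, and the restriction to single-hop redirects are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the Nutch scoring code.
final class RedirectScoreSketch {
    private RedirectScoreSketch() {}

    /**
     * Returns a copy of the scores in which each redirecting (start) url has
     * handed its score to the end url it redirects to.
     * Assumption: redirects are single-hop (start -> end).
     */
    static Map<String, Float> followRedirects(Map<String, Float> scores,
                                              Map<String, String> redirects) {
        Map<String, Float> out = new HashMap<>(scores);
        for (Map.Entry<String, String> r : redirects.entrySet()) {
            Float startScore = out.remove(r.getKey());
            if (startScore != null) {
                // Add the start page's score to whatever the end page already has.
                out.merge(r.getValue(), startScore, Float::sum);
            }
        }
        return out;
    }
}
```

Applied to the cnn.com example above, the end url http://www.cnn.com/?refresh=1 would carry the combined score of both entries instead of 1.0.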


[jira] Issue Comment Edited: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-30 Thread Dennis Kubes (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659825#action_12659825
]

musepwizard edited comment on NUTCH-594 at 12/30/08 6:56 AM:
--

JSON-Lib and EZMorph are both under Apache. There is an optional Xom library
dependency for JSON-Lib which is not included, that is under LGPL, but
everything else is Apache.

http://json-lib.sourceforge.net/license.html
http://ezmorph.sourceforge.net/license.html

I put comments about these in the plugin.xml file for response-json. Is there
anything else I need to do?

was (Author: musepwizard):
JSON-Lib and EZMorph are both under Apache. There is an optional Xom
library dependency for JSON-Lib which is not included, that is under LGPL, but
everything is Apache.

http://json-lib.sourceforge.net/license.html
http://ezmorph.sourceforge.net/license.html

I put comments about these in the plugin.xml file for response-json. Is there
anything else I need to do?

Serve Nutch search results in multiple formats including XML and JSON
-

Key: NUTCH-594
URL: https://issues.apache.org/jira/browse/NUTCH-594
Project: Nutch
Issue Type: New Feature
Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Attachments: commons-beanutils-1.8.0.jar,
commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar,
NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch

Allow search results to be served in XML, JSON, and other configurable
formats. Right now there is an OpenSearch servlet that returns results in
RSS. I would like something that has more flexibility in terms of the XML
being served and also supports other formats such as JSON or plain text.


[jira] Commented: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-30 Thread Dennis Kubes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659825#action_12659825
 ] 

Dennis Kubes commented on NUTCH-594:


JSON-Lib and EZMorph are both under Apache.  There is an optional Xom library 
dependency for JSON-Lib which is not included, that is under LGPL, but 
everything is Apache.

http://json-lib.sourceforge.net/license.html
http://ezmorph.sourceforge.net/license.html

I put comments about these in the plugin.xml file for response-json.  Is there 
anything else I need to do?

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.


[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON

2008-12-30 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: NUTCH-594-4-20081230.patch

Final patch.  Adds the ability to stop summaries from being returned and to 
only return a given set of fields by name.

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch, 
 NUTCH-594-4-20081230.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.


[jira] Resolved: (NUTCH-668) Domain URL Filter


 [ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-668.


Resolution: Fixed

Committed with revision 729958.

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch, 
 NUTCH-668-3-20081213.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.
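
A minimal sketch of the matching idea, assuming the configuration file has already been loaded into a set of lower-case entries such as "org" or "example.com". The class name is hypothetical, and the suffix-walking match is an assumption about how top-level-domain and hostname entries could share one lookup; it does not reproduce the committed plugin. The method returns the url when accepted and null when rejected.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not the committed plugin.
final class DomainFilterSketch {
    private final Set<String> accepted;

    DomainFilterSketch(Set<String> accepted) {
        this.accepted = accepted;
    }

    /** Returns the url if its host or any parent suffix is listed, else null. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable urls are rejected
        }
        if (host == null) {
            return null;
        }
        // Try www.example.com, then example.com, then com.
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (accepted.contains(candidate)) {
                return url;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            candidate = candidate.substring(dot + 1);
        }
    }
}
```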


[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: ezmorph-1.0.6.jar

ezmorph jar required for framework

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: ezmorph-1.0.6.jar, NUTCH-594-1-20071221.patch, 
 NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.


[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: NUTCH-594-3-20081229.patch

A completely reworked framework with an extension point for serving search 
results in different formats.  Included are plugins for serving results in XML 
and JSON format.  XML is the default.  Uses JSON-Lib to convert the results 
into JSON format.
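
The general shape of such an extension point can be sketched as a registry of writers keyed by format name, with XML as the fallback. This is an illustration under stated assumptions, not the patch itself: the interface, the registry, and the output layout are hypothetical, and the JSON is built by hand with minimal escaping rather than with JSON-Lib.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the response plugins from the patch.
final class ResponseFormatSketch {
    private ResponseFormatSketch() {}

    /** Hypothetical extension point: one writer per response format. */
    interface ResponseWriter {
        String write(String title, String url);
    }

    static final Map<String, ResponseWriter> WRITERS = new LinkedHashMap<>();
    static {
        WRITERS.put("xml", (t, u) -> "<result><title>" + escapeXml(t)
            + "</title><url>" + escapeXml(u) + "</url></result>");
        WRITERS.put("json", (t, u) -> "{\"title\":\"" + escapeJson(t)
            + "\",\"url\":\"" + escapeJson(u) + "\"}");
    }

    /** Renders one hit in the requested format; unknown types fall back to XML. */
    static String render(String type, String title, String url) {
        ResponseWriter w = WRITERS.getOrDefault(type, WRITERS.get("xml"));
        return w.write(title, url);
    }

    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Minimal escaping: handles backslash and double quote only.
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Adding a plain-text or other format then means registering one more writer, without touching the code that runs the search.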

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: ezmorph-1.0.6.jar, NUTCH-594-1-20071221.patch, 
 NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.


[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Summary: Serve Nutch search results in multiple formats including XML and 
JSON  (was: Serve Nutch search results in XML and JSON)

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.

[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: commons-beanutils-1.8.0.jar

commons beanutils

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.

[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: commons-collections-3.2.1.jar

commons collections

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.

[jira] Updated: (NUTCH-594) Serve Nutch search results in XML and JSON


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: json-lib-2.2.2-jdk15.jar

json lib jar

 Serve Nutch search results in XML and JSON
 --

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.

[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: (was: NUTCH-594-3-20081229.patch)

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.

[jira] Updated: (NUTCH-594) Serve Nutch search results in multiple formats including XML and JSON


 [ 
https://issues.apache.org/jira/browse/NUTCH-594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-594:
---

Attachment: NUTCH-594-3-20081229.patch

Fixed some things.  Added the ability to set the mime output type using the 
plugin.xml file.  That way people can have application/json or text/json or 
text/plain, however they want for their application.
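
A minimal sketch of what a configurable output type amounts to, assuming a per-format default that a deployment can override.  The dictionary of defaults and the content_type function are hypothetical; the actual attribute names used in plugin.xml are not shown in this message.

```python
# Hypothetical defaults; in the patch the value comes from each plugin's plugin.xml.
DEFAULT_MIME_TYPES = {"xml": "text/xml", "json": "application/json"}

def content_type(fmt, overrides=None):
    """Return the MIME type for a format, preferring a configured override."""
    configured = (overrides or {}).get(fmt)
    return configured or DEFAULT_MIME_TYPES.get(fmt, "text/plain")
```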

 Serve Nutch search results in multiple formats including XML and JSON
 -

 Key: NUTCH-594
 URL: https://issues.apache.org/jira/browse/NUTCH-594
 Project: Nutch
  Issue Type: New Feature
 Environment: all
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: commons-beanutils-1.8.0.jar, 
 commons-collections-3.2.1.jar, ezmorph-1.0.6.jar, json-lib-2.2.2-jdk15.jar, 
 NUTCH-594-1-20071221.patch, NUTCH-594-3-20081229.patch


 Allow search results to be served in XML, JSON, and other configurable 
 formats.  Right now there is an OpenSearch servlet that returns results in 
 RSS. I would like something that has more flexibility in terms of the XML 
 being served and also supports other formats such as JSON or plain text.

Re: [jira] Commented: (NUTCH-675) Reduce tasks do not report their status and are killed by jobtracker

2008-12-22 Thread Dennis Kubes


This is old.  It has been fixed in more recent versions of hadoop and nutch.

Otis Gospodnetic (JIRA) wrote:
[ https://issues.apache.org/jira/browse/NUTCH-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658610#action_12658610 ] 


Otis Gospodnetic commented on NUTCH-675:


Sha Feng, could you please bring this up on the Nutch mailing list instead of 
JIRA?
It would also be good if you could upgrade your Nutch (including Hadoop) and 
see if it works then.  0.12 is a VERY old version of Hadoop.



Reduce tasks do not report their status and are killed by jobtracker


Key: NUTCH-675
URL: https://issues.apache.org/jira/browse/NUTCH-675
Project: Nutch
 Issue Type: Bug
 Components: fetcher
   Affects Versions: 0.9.0
Environment: OS : Linux
   Reporter: sha feng
Fix For: 0.9.0


We chose Fetcher2 as our fetcher. The Fetcher2 map tasks fetch about 2,000,000 urls, but at the reduce stage none of the reduce tasks report their status, and they are killed by the jobtracker. Raising mapred.task.timeout from 60,000 to 1,800,000 did not help. Can anyone tell us why? By the way, the version of Nutch we use is 0.9 and the version of Hadoop is 0.12. 
Thanks for your help!

[jira] Commented: (NUTCH-668) Domain URL Filter

2008-12-19 Thread Dennis Kubes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658118#action_12658118
 ] 

Dennis Kubes commented on NUTCH-668:


Anybody have a problem if I commit this today or tomorrow?

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch, 
 NUTCH-668-3-20081213.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.

Re: File system

2008-12-16 Thread Dennis Kubes

If you are talking about Nutch Contents which are stored in the segments 
during fetching of pages, then you would need to write a MapReduce job to 
read in the Content objects and do whatever processing you desire.


Dennis

oSilvio wrote:

Very useful information, thanks!
But in order to extract the data inside those files (like html pages) I can
find no algorithm provided by nutch, nor the process used to store the
data. Do you know if it is possible to extract it using lucene?

 


Dennis Kubes-2 wrote:
The nutch databases are either SequenceFile or MapFile formats which 
store key and value pairs.  Their keys and values are Writable 
implementations which translate an object into its byte equivalent and 
vice versa.


Data and index files are MapFile format.  Data is a SequenceFile, index 
is an index used by MapFiles for seeking to a specific key.


Please see the hadoop wiki for more information about Sequence and Map 
files and writable formats.


Dennis

oSilvio wrote:
Does anybody know how the file structure works, briefly? 
It seems that the data is compressed or something; it's not possible to
understand what's recorded in the data or index files.
Thanks
Silvio

Re: File system

2008-12-15 Thread Dennis Kubes

The nutch databases are either SequenceFile or MapFile formats which 
store key and value pairs.  Their keys and values are Writable 
implementations which translate an object into its byte equivalent and 
vice versa.
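
To make the "object to bytes and back" idea concrete, here is a small sketch of a record that can write itself to bytes and be read back.  It only mimics the Writable concept: the UrlScore class and its byte layout are made up for illustration and do not match Hadoop's actual encoding.

```python
import struct

class UrlScore:
    """Hypothetical record; not a real Nutch or Hadoop class."""

    def __init__(self, url, score):
        self.url = url
        self.score = score

    def write(self):
        data = self.url.encode("utf-8")
        # 4-byte big-endian length, the UTF-8 bytes, then a 4-byte float.
        return struct.pack(">I", len(data)) + data + struct.pack(">f", self.score)

    @classmethod
    def read(cls, buf):
        (length,) = struct.unpack_from(">I", buf, 0)
        url = buf[4:4 + length].decode("utf-8")
        (score,) = struct.unpack_from(">f", buf, 4 + length)
        return cls(url, score)
```

Because the stored bytes are a binary encoding rather than text, opening such a file in an editor shows what looks like compressed or garbled data.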


Data and index files are MapFile format.  Data is a SequenceFile, index 
is an index used by MapFiles for seeking to a specific key.
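
The relationship between the data and index files can be sketched in memory as follows.  This is an assumption-laden toy, not the Hadoop implementation: TinyMapFile and its interval parameter are hypothetical, and the "index" simply records every Nth key together with its position in the sorted data.

```python
import bisect

class TinyMapFile:
    """In-memory stand-in: sorted 'data' records plus a sparse 'index'."""

    def __init__(self, records, interval=2):
        self.data = sorted(records)  # list of (key, value), sorted by key
        # The index holds every interval-th key with its position in data.
        self.index = [(self.data[i][0], i)
                      for i in range(0, len(self.data), interval)]

    def get(self, key):
        keys = [k for k, _ in self.index]
        # Find the last indexed key that is <= the wanted key.
        slot = bisect.bisect_right(keys, key) - 1
        if slot < 0:
            return None
        start = self.index[slot][1]
        # Scan forward from the indexed position until the key is found or passed.
        for k, v in self.data[start:]:
            if k == key:
                return v
            if k > key:
                break
        return None
```

The index lets a lookup jump close to the wanted key and then scan a short run of records instead of reading the data from the beginning.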


Please see the hadoop wiki for more information about Sequence and Map 
files and writable formats.


Dennis

oSilvio wrote:
Does anybody know how the file structure works, briefly? 
It seems that the data is compressed or something; it's not possible to
understand what's recorded in the data or index files.
Thanks
Silvio

[jira] Closed: (NUTCH-448) Allow Plugin Includes and Excludes from File

2008-12-09 Thread Dennis Kubes (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dennis Kubes closed NUTCH-448.
--

Resolution: Later

This was some old functionality that seemed good at the time. Not so much now.

Allow Plugin Includes and Excludes from File

Key: NUTCH-448
URL: https://issues.apache.org/jira/browse/NUTCH-448
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: all platforms
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
Fix For: 1.0.0

Attachments: plugin-fromfile.patch

This functionality allows the plugin.includes and plugin.excludes values to
be moved out of the nutch-default.xml and nutch-site.xml files and loaded
from one or more text configuration files found in the classpath. This is a
cleaner implementation than having one big long regular expression in the
configuration file as plugin.includes or plugin.excludes.
Loads plugin configuration from files defined by the plugin.files
configuration variable. Files must be available to be found in the classpath.
The plugin files consist of one regex per line. Plugins starting with a -
will be excluded while lines starting with a # will be ignored. All other
non-blank lines will be included as plugins, one per line. Any plugins
configured through plugin.includes and plugin.excludes in the configuration
are also added. Any plugins that are excluded are removed from the includes.
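
The file rules described above can be sketched roughly as below.  This is not the attached patch: the function name load_plugin_patterns is hypothetical, and for simplicity an exclusion is matched by comparing pattern strings rather than by evaluating the regular expressions.

```python
def load_plugin_patterns(lines, includes=(), excludes=()):
    """Merge patterns from a plugin file with those already configured."""
    included = list(includes)   # patterns from plugin.includes
    excluded = set(excludes)    # patterns from plugin.excludes
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue            # blank lines and comments are ignored
        if line.startswith("-"):
            excluded.add(line[1:].strip())
        elif line not in included:
            included.append(line)
    # Any plugin that is excluded is removed from the includes.
    return [p for p in included if p not in excluded], sorted(excluded)
```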

[jira] Commented: (NUTCH-646) New Indexing Framework for Nutch

2008-12-06 Thread Dennis Kubes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654154#action_12654154
 ] 

Dennis Kubes commented on NUTCH-646:


Not yet.  I need to write up some serious documentation about how to use both 
the new scoring and indexing systems.  I will try to get to that soon.

 New Indexing Framework for Nutch
 

 Key: NUTCH-646
 URL: https://issues.apache.org/jira/browse/NUTCH-646
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 0.9.0, 1.0.0

 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, 
 NUTCH-646-2-20081126.patch


 New indexing framework for Nutch that provides a more generic field 
 abstraction consistent with Lucene index semantics.  Allows multiple MR jobs 
 to be created for different fields and those fields to be aggregated and 
 indexed in the end.  Overcomes limitations of the current indexer that limits 
 what databases are passed into the indexer.  Creates a new extension point as 
 well for field-filters for manipulation of fields during the indexing process.
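
 A rough sketch of the aggregation idea, assuming each job emits
 (document key, field name, value) tuples.  The function names and the
 require_url filter are hypothetical and only illustrate how fields from
 several jobs could be merged per document and then passed through
 field filters before indexing.

```python
from collections import defaultdict

def aggregate_fields(*job_outputs):
    """Group (doc_key, field_name, value) tuples from several jobs by document."""
    docs = defaultdict(dict)
    for output in job_outputs:
        for doc_key, name, value in output:
            docs[doc_key][name] = value
    return dict(docs)

def apply_field_filters(docs, filters):
    """Run each filter over a document's fields; returning None drops the document."""
    result = {}
    for doc_key, fields in docs.items():
        for field_filter in filters:
            fields = field_filter(doc_key, fields)
            if fields is None:
                break
        if fields is not None:
            result[doc_key] = fields
    return result

def require_url(doc_key, fields):
    """Example filter: keep only documents that carry a url field."""
    return fields if fields.get("url") else None
```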

Domain URL filter Commit?

2008-12-05 Thread Dennis Kubes

Anybody have a problem with me committing the domain-urlfilter plugin in 
NUTCH-668?


Dennis

[jira] Commented: (NUTCH-668) Domain URL Filter

2008-12-05 Thread Dennis Kubes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653881#action_12653881
 ] 

Dennis Kubes commented on NUTCH-668:


I agree.  Being able to search for tlds like .com would make it much more 
flexible.  Let me work up the changes and I will post a new patch (without my 
local path :)).  Although I do want to get this in quickly I think the new 
functionality is worth the wait.

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.

Builds are Failing

2008-12-04 Thread Dennis Kubes

After the upgrade to Hadoop, builds are failing because I think we have 
nutch set to build with Java 5 by default but I think Hadoop is built 
with Java 6 (At least the release version that I downloaded and used to 
upgrade Nutch).


I know we aren't requiring Nutch to use Java 6 yet.  This may force the 
point.  I don't know if Hadoop will build with Java 5.  I will test it 
out and post back results.  If it does, then options are:


1) Force Nutch to use Java 6
2) Rebuild Hadoop from source instead of release version using Java 5

Thoughts?

Dennis

Re: Builds are Failing

2008-12-04 Thread Dennis Kubes

I take it back.  Hadoop *requires* java 6 now as of 0.19.  Which means 
we should be making changes to require Nutch to use java 6.


Dennis

Dennis Kubes wrote:
After the upgrade to Hadoop, builds are failing because I think we have 
nutch set to build with Java 5 by default but I think Hadoop is built 
with Java 6 (At least the release version that I downloaded and used to 
upgrade Nutch).


I know we aren't requiring Nutch to use Java 6 yet.  This may force the 
point.  I don't know if Hadoop will build with Java 5.  I will test it 
out and post back results.  If it does, then options are:


1) Force Nutch to use Java 6
2) Rebuild Hadoop from source instead of release version using Java 5

Thoughts?

Dennis

[jira] Updated: (NUTCH-668) Domain URL Filter


 [ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-668:
---

Attachment: NUTCH-668-2-20081204.patch

Updated to include URLUtil methods that were missing.  Sorry.

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch, NUTCH-668-2-20081204.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.

[jira] Commented: (NUTCH-207) Bandwidth target for fetcher rather than a thread count


[ 
https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653404#action_12653404
 ] 

Dennis Kubes commented on NUTCH-207:


I think this would be an interesting addition.  It would also need to be ported 
to fetcher2 as well as fetcher.  If you want to take on the task of porting it 
that would be great.  If you have any questions feel free to ask.

 Bandwidth target for fetcher rather than a thread count
 ---

 Key: NUTCH-207
 URL: https://issues.apache.org/jira/browse/NUTCH-207
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8
Reporter: Rod Taylor
 Attachments: ratelimit.patch


 Increases or decreases the number of threads from the starting value 
 (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve 
 a target bandwidth (fetcher.threads.bandwidth).
 It seems to be able to keep within 10% of the target bandwidth even when 
 large numbers of errors are found or when a number of large pages is run 
 across.
 To achieve more accurate tracking Nutch should keep track of protocol 
 overhead as well as the volume of pages downloaded.
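
 The control loop might look something like the sketch below.  It is not
 taken from ratelimit.patch: the function name, the one-thread step, and
 the assumption that the starting value also acts as the lower bound are
 all illustrative choices.

```python
def adjust_threads(current, measured_bw, target_bw, start, maximum, tolerance=0.10):
    """Return the next thread count, moving one step toward the target bandwidth."""
    if measured_bw < target_bw * (1 - tolerance) and current < maximum:
        return current + 1   # too slow: add a thread, up to the maximum
    if measured_bw > target_bw * (1 + tolerance) and current > start:
        return current - 1   # too fast: drop a thread, down to the start value
    return current           # within tolerance, or already at a bound
```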

[jira] Closed: (NUTCH-635) LinkAnalysis Tool for Nutch


 [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-635.
--


 LinkAnalysis Tool for Nutch
 ---

 Key: NUTCH-635
 URL: https://issues.apache.org/jira/browse/NUTCH-635
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
 NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
 NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, 
 NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


 This is a basic pagerank type link analysis tool for nutch which simulates a 
 sparse matrix using inlinks and outlinks and converges after a given number 
 of iterations.  This tool is meant to replace the current scoring system in 
 nutch with a system that converges instead of exponentially increasing 
 scores.  Also includes a tool to create an outlinkdb.
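
 For readers unfamiliar with this kind of analysis, here is a minimal
 PageRank-style sketch over an outlink mapping.  It is a generic textbook
 iteration, not the algorithm in the patch; the function name, the damping
 value, and the in-memory dictionaries (instead of MapReduce jobs over a
 linkdb) are assumptions made for illustration.

```python
def link_analysis(outlinks, iterations=10, damping=0.85):
    """Iterate a PageRank-style score over an {url: [outlink urls]} mapping."""
    nodes = set(outlinks)
    for targets in outlinks.values():
        nodes.update(targets)
    # Invert the outlinks to get the inlinks of every node.
    inlinks = {n: [] for n in nodes}
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks[dst].append(src)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            # Each inlinking page passes on an equal share of its score.
            incoming = sum(scores[s] / len(outlinks[s]) for s in inlinks[n])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        scores = new
    return scores
```

 Because every page divides its score among its outlinks, the total stays
 bounded from one iteration to the next instead of growing.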

[jira] Resolved: (NUTCH-635) LinkAnalysis Tool for Nutch


 [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-635.


Resolution: Fixed

Committed with revision 723441

 LinkAnalysis Tool for Nutch
 ---

 Key: NUTCH-635
 URL: https://issues.apache.org/jira/browse/NUTCH-635
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
 NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
 NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, 
 NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


 This is a basic pagerank type link analysis tool for nutch which simulates a 
 sparse matrix using inlinks and outlinks and converges after a given number 
 of iterations.  This tool is meant to replace the current scoring system in 
 nutch with a system that converges instead of exponentially increasing 
 scores.  Also includes a tool to create an outlinkdb.

[jira] Commented: (NUTCH-646) New Indexing Framework for Nutch


[ 
https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653489#action_12653489
 ] 

Dennis Kubes commented on NUTCH-646:


For the final version of this I have removed the arity dependencies and 
computation functionality.  I still think that type of functionality is needed 
but it didn't feel like the right place for it at this time.

 New Indexing Framework for Nutch
 

 Key: NUTCH-646
 URL: https://issues.apache.org/jira/browse/NUTCH-646
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 0.9.0, 1.0.0

 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, 
 NUTCH-646-2-20081126.patch


 New indexing framework for Nutch that provides a more generic field 
 abstraction consistent with Lucene index semantics.  Allows multiple MR jobs 
 to be created for different fields and those fields to be aggregated and 
 indexed in the end.  Overcomes limitations of the current indexer that limits 
 what databases are passed into the indexer.  Creates a new extension point as 
 well for field-filters for manipulation of fields during the indexing process.

[jira] Resolved: (NUTCH-646) New Indexing Framework for Nutch


 [ 
https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-646.


Resolution: Fixed

Committed with revision 723447

 New Indexing Framework for Nutch
 

 Key: NUTCH-646
 URL: https://issues.apache.org/jira/browse/NUTCH-646
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0, 0.9.0

 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, 
 NUTCH-646-2-20081126.patch


 New indexing framework for Nutch that provides a more generic field 
 abstraction consistent with Lucene index semantics.  Allows multiple MR jobs 
 to be created for different fields and those fields to be aggregated and 
 indexed in the end.  Overcomes limitations of the current indexer that limits 
 what databases are passed into the indexer.  Creates a new extension point as 
 well for field-filters for manipulation of fields during the indexing process.

[jira] Resolved: (NUTCH-662) Upgrade Nutch to use Lucene 2.4


 [ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-662.


Resolution: Fixed

Committed with revision 722475

 Upgrade Nutch to use Lucene 2.4
 ---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, 
 lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch


 Upgrade nutch to use Lucene 2.4.  This release changes the lucene file 
 format.  New indexes created by this lucene version will NOT be readable by 
 older versions.  Lucene 2.4 can read and update older index formats although 
 updating an older format will convert it to the new format.  There are also 
 some performance and functionality improvements.

[jira] Closed: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19


 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-663.
--


 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

[jira] Closed: (NUTCH-647) Resolve URLs tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-647.
--


 Resolve URLs tool
 -

 Key: NUTCH-647
 URL: https://issues.apache.org/jira/browse/NUTCH-647
 Project: Nutch
  Issue Type: New Feature
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch


 A tool that takes a listing of urls and attempts to resolve their IP 
 addresses.  Useful for running after the fetcher has run to determine if DNS 
 problems exist.
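
 The core of such a tool can be sketched in a few lines.  This is only an
 illustration, not the committed tool: resolve_urls is a hypothetical
 name, and the resolver is passed in as a parameter so the lookup can be
 swapped or stubbed.

```python
import socket
from urllib.parse import urlparse

def resolve_urls(urls, resolver=socket.gethostbyname):
    """Map each URL to its resolved IP address, or None when resolution fails."""
    results = {}
    for url in urls:
        host = urlparse(url).hostname
        try:
            results[url] = resolver(host) if host else None
        except OSError:  # socket.gaierror is a subclass of OSError
            results[url] = None
    return results
```

 URLs that map to None are the ones worth checking for DNS problems.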

[jira] Resolved: (NUTCH-647) Resolve URLs tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-647.


   Resolution: Fixed
Fix Version/s: 1.0.0

Committed with revision 722478

 Resolve URLs tool
 -

 Key: NUTCH-647
 URL: https://issues.apache.org/jira/browse/NUTCH-647
 Project: Nutch
  Issue Type: New Feature
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch


 A tool that takes a listing of urls and attempts to resolve their IP 
 addresses.  Useful for running after the fetcher has run to determine if DNS 
 problems exist.

[jira] Resolved: (NUTCH-665) Search Load Testing Tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-665.


Resolution: Fixed

Committed with revision 722481

 Search Load Testing Tool
 

 Key: NUTCH-665
 URL: https://issues.apache.org/jira/browse/NUTCH-665
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-665-20081126-1.patch


 A tool which spawns a number of threads and executes searches against 
 configured search servers.  This is used for light load testing of search 
 servers.
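
 A bare-bones version of the idea, assuming the search call is supplied
 as a function.  The load_test name and its return value are hypothetical;
 a real run would pass a function that queries the configured search
 servers.

```python
import threading
import time

def load_test(search, queries, num_threads=4):
    """Run every query from num_threads threads; return (search count, total seconds)."""
    timings = []
    lock = threading.Lock()

    def worker():
        for query in queries:
            started = time.perf_counter()
            search(query)
            elapsed = time.perf_counter() - started
            with lock:
                timings.append(elapsed)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(timings), sum(timings)
```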

[jira] Closed: (NUTCH-665) Search Load Testing Tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-665.
--


 Search Load Testing Tool
 

 Key: NUTCH-665
 URL: https://issues.apache.org/jira/browse/NUTCH-665
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-665-20081126-1.patch


 A tool which spawns a number of threads and executes searches against 
 configured search servers.  This is used for light load testing of search 
 servers.

[jira] Closed: (NUTCH-667) Input Format for working with Content in Hadoop Streaming


 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-667.
--


 Input Format for working with Content in Hadoop Streaming
 -

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContentAsText input format that replaces line endings with spaces, 
 which allows Nutch content to be used more effectively inside of Hadoop 
 streaming jobs.  Streaming allows MapReduce jobs to be written in any 
 language that can communicate with stdin and stdout.
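
 Streaming jobs read one record per line, so a page that contains line
 breaks would otherwise be split across several records.  The sketch
 below shows the flattening step on its own; the function name and the
 tab-separated url/content layout are assumptions for illustration.

```python
def content_as_text_line(url, content):
    """Emit one tab-separated record per document, safe for line-based streaming."""
    # Replace every kind of line ending with a single space.
    flat = content.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
    return f"{url}\t{flat}"
```

 A streaming mapper can then split each input line on the first tab to
 recover the url and the flattened content.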

[jira] Resolved: (NUTCH-667) Input Format for working with Content in Hadoop Streaming


 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-667.


Resolution: Fixed

Committed with revision 722483

 Input Format for working with Content in Hadoop Streaming
 -

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContentAsText input format that replaces line endings with spaces, 
 which allows Nutch content to be used more effectively inside of Hadoop 
 streaming jobs.  Streaming allows MapReduce jobs to be written in any 
 language that can communicate with stdin and stdout.

[jira] Created: (NUTCH-668) Domain URL Filter

Domain URL Filter
-

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


A URLFilter that adds the ability to filter out URLs by top level domain or by 
hostname.  A configuration file with a listing of URLs is used to denote 
accepted urls.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-668) Domain URL Filter


 [ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-668:
---

Attachment: NUTCH-668-1-20081202.patch

Includes the DomainURLFilter and test files.  Domains can either be filtered by 
top level domains ignoring subdomains, or by hostnames through configuration.  
There is a configuration file where valid domains are placed one per line.  
Those domains are used to create a valid domain set against which we validate 
urls at runtime.  Only urls which match domains in the domain set are 
considered valid.
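
The description above (one domain per line, a domain set built from the file, URLs accepted only when they match) can be sketched roughly as follows. This is not the code in the attached patch: the class name DomainSetFilter, the "#" comment handling, and the suffix-walking match are all assumptions made for illustration. Returning the URL to accept it and null to reject it follows the usual Nutch URLFilter convention.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a domain-set URL filter; not the patch's implementation.
final class DomainSetFilter {
    private final Set<String> allowed = new HashSet<>();

    // configLines: one domain, hostname, or top level domain per line.
    // Blank lines and lines starting with '#' are skipped (an assumption).
    DomainSetFilter(List<String> configLines) {
        for (String line : configLines) {
            String entry = line.trim().toLowerCase(Locale.ROOT);
            if (!entry.isEmpty() && !entry.startsWith("#")) {
                allowed.add(entry);
            }
        }
    }

    // Returns the URL when its host, or any parent suffix of the host,
    // is in the allowed set; returns null otherwise.
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // not parseable as a URI
        }
        if (host == null) {
            return null;
        }
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (allowed.contains(candidate)) {
                return url;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            // Drop the leftmost label: www.apache.org -> apache.org -> org
            candidate = candidate.substring(dot + 1);
        }
    }
}
```

With the entries "apache.org" and "edu" in the set, a URL on lucene.apache.org or mit.edu is accepted, while a URL on example.com is rejected.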

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: Pending Commits for Nutch Issues

2008-11-27 Thread Dennis Kubes




Doğacan Güney wrote:

Hi Dennis,

On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes [EMAIL PROTECTED] wrote:

If nobody has a problem with them I would like to commit the following
issues in the next day or two:

NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19)
NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4)
NUTCH-647: Resolve URLs tool
NUTCH-665: Search Load Testing Tool
NUTCH-667: Input Format for working with Content in Hadoop Streaming

And I would like to commit these in  a week:

NUTCH-635: LinkAnalysis Tool for Nutch
NUTCH-646: New Indexing framework for Nutch
NUTCH-594: Serve Nutch search results in XML and JSON
NUTCH-666: Analysis plugins and new language identifier.

There are others too but these are the ones I am trying to get moved into
trunk right now.



I am OK with all but NUTCH-666... Why a new language identifier? (or
if a new one, why keep old one around?)


I haven't got the code pushed out yet.  I do have a production version 
running but I need to make it play nice with the Apache licensing 
requirements.  Current library I am using is under GPL.  The reason I 
switched was because I found that the old one wasn't working correctly 
for me.


I don't know the accuracy levels of the old language identifier but I 
found that with pages that contained both English and another language, 
it would often classify them as English.  The new language identifier I am 
currently using has an accuracy rate of 97% and is trainable as before 
for multiple languages.  Currently we have models for 20-30 languages.


Also the new language identifier works with the new indexing framework 
and with new functionality for custom fields.  The only reason I would 
keep the old one around would be for backwards compatibility for people 
currently using it.


I will push out a patch shortly and we can review.  If we don't want it 
to make it into this release I am ok with that.


Dennis


[jira] Updated: (NUTCH-665) Search Load Testing Tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-665:
---

Attachment: NUTCH-665-20081126-1.patch

Search load testing tool.

 Search Load Testing Tool
 

 Key: NUTCH-665
 URL: https://issues.apache.org/jira/browse/NUTCH-665
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-665-20081126-1.patch


 A tool which spawns a number of threads and executes searches against 
 configured search servers.  This is used for light load testing of search 
 servers.
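
A rough sketch of the idea, assuming a fixed thread pool and a caller-supplied search callback. The class name SearchLoadSketch, its parameters, and the Consumer-based callback are hypothetical and stand in for whatever the patch uses to reach the configured search servers.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical light load tester; the real tool's options are not shown here.
final class SearchLoadSketch {

    // Runs searchesPerThread queries on each of the given number of threads and
    // returns how many searches completed without throwing.
    static long run(int threads, int searchesPerThread,
                    List<String> queries, Consumer<String> search) {
        AtomicLong completed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            pool.execute(() -> {
                for (int i = 0; i < searchesPerThread; i++) {
                    // Rotate through the query list, staggered per thread.
                    String query = queries.get((offset + i) % queries.size());
                    try {
                        search.accept(query);
                        completed.incrementAndGet();
                    } catch (RuntimeException e) {
                        // A failed search is simply not counted.
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }
}
```

A caller would pass a callback that issues one query against a search server, for example SearchLoadSketch.run(10, 100, queries, q -> client.search(q)), where client is whatever search client is in use.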

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-647) Resolve URLs tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-647:
---

Attachment: NUTCH-647-2-20081126.patch

Updated patch.

 Resolve URLs tool
 -

 Key: NUTCH-647
 URL: https://issues.apache.org/jira/browse/NUTCH-647
 Project: Nutch
  Issue Type: New Feature
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch


 A tool that takes a listing of urls and attempts to resolve their IP 
 addresses.  Useful for running after the fetcher has run to determine if DNS 
 problems exist.
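
As an illustration only, resolving a list of URLs can be sketched with the standard java.net classes. The class name UrlResolveSketch and the choice to record null for failures are assumptions; the attached patch may work quite differently (for example as a multi-threaded tool).

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: map each URL to its resolved IP address, or null on failure.
final class UrlResolveSketch {

    static Map<String, String> resolveAll(List<String> urls) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String url : urls) {
            String ip = null;
            try {
                String host = URI.create(url).getHost();
                if (host != null) {
                    // Performs a DNS lookup unless the host is already an IP literal.
                    ip = InetAddress.getByName(host).getHostAddress();
                }
            } catch (UnknownHostException | IllegalArgumentException e) {
                // Unresolvable host or unparseable URL: leave ip as null.
            }
            result.put(url, ip);
        }
        return result;
    }

    public static void main(String[] args) {
        resolveAll(List.of(args)).forEach((url, ip) ->
            System.out.println(url + "\t" + (ip == null ? "UNRESOLVED" : ip)));
    }
}
```

URLs that map to null are the ones worth checking for DNS problems.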

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

Analysis plugins for multiple language and new Language Identifier Tool
---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
Russian, and Thai.  Also includes a new Language Identifier tool that uses the 
new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Attachment: NUTCH-666-1-20081126.patch

Part one of patch.  This includes the new analyzers for different languages.  
Part two will include the new language identifier tool.

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2


 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Attachment: NUTCH-663-1-20081126.patch

Updates jar and native files

 Upgrade Nutch to use Hadoop 0.18.2
 --

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2


 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Attachment: hadoop-0.19.0-core.jar

Hadoop core jar

 Upgrade Nutch to use Hadoop 0.18.2
 --

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2


[ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650982#action_12650982
 ] 

Dennis Kubes commented on NUTCH-663:


Hadoop 0.19 was released.  I am integrating it and should have a patch 
shortly.

 Upgrade Nutch to use Hadoop 0.18.2
 --

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19


 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Summary: Upgrade Nutch to use Hadoop 0.19  (was: Upgrade Nutch to use 
Hadoop 0.18.2)

change to 0.19 instead of 0.18.2

 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Attachment: (was: NUTCH-666-1-20081126.patch)

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19


 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Attachment: NUTCH-663-1-20081126.patch

Updated patch to include API changes in Nutch classes.

 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19


 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-663:
---

Attachment: (was: NUTCH-663-1-20081126.patch)

 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch


 [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-635:
---

Attachment: (was: NUTCH-635-8-20080818.patch)

 LinkAnalysis Tool for Nutch
 ---

 Key: NUTCH-635
 URL: https://issues.apache.org/jira/browse/NUTCH-635
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
 NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
 NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, 
 NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


 This is a basic pagerank type link analysis tool for nutch which simulates a 
 sparse matrix using inlinks and outlinks and converges after a given number 
 of iterations.  This tool is meant to replace the current scoring system in 
 nutch with a system that converges instead of exponentially increasing 
 scores.  Also includes a tool to create an outlinkdb.
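
For readers unfamiliar with the approach, a textbook PageRank-style iteration over outlink lists is sketched below. It is not the patch's MapReduce implementation: the class name LinkRankSketch, the in-memory int[][] graph, the damping parameter, and the handling of pages without outlinks are all assumptions. The point it illustrates is that the total score stays constant across iterations instead of growing.

```java
import java.util.Arrays;

// Hypothetical in-memory PageRank-style sketch; the real tool runs as MapReduce jobs.
final class LinkRankSketch {

    // outlinks[i] lists the indexes of the pages that page i links to.
    static double[] rank(int[][] outlinks, int iterations, double damping) {
        int n = outlinks.length;
        double[] score = new double[n];
        Arrays.fill(score, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0;
            for (int i = 0; i < n; i++) {
                if (outlinks[i].length == 0) {
                    // Pages without outlinks spread their score over all pages.
                    dangling += score[i];
                    continue;
                }
                double share = score[i] / outlinks[i].length;
                for (int target : outlinks[i]) {
                    next[target] += share;
                }
            }
            for (int i = 0; i < n; i++) {
                next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
            }
            score = next;
        }
        return score;
    }

    public static void main(String[] args) {
        int[][] graph = { {1, 2}, {2}, {0}, {} };
        System.out.println(Arrays.toString(rank(graph, 20, 0.85)));
    }
}
```

Because every iteration redistributes the same total mass, the scores always sum to 1 regardless of how many iterations are run.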

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-635) LinkAnalysis Tool for Nutch


 [ 
https://issues.apache.org/jira/browse/NUTCH-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-635:
---

Attachment: NUTCH-635-9-20081126.patch

Updated final patch for new link analysis framework.  I am also going to write 
up some documentation on the wiki for how this new process works.

 LinkAnalysis Tool for Nutch
 ---

 Key: NUTCH-635
 URL: https://issues.apache.org/jira/browse/NUTCH-635
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-635-1-20080612.patch, NUTCH-635-2-20080613.patch, 
 NUTCH-635-3-20080614.patch, NUTCH-635-4-20080615.patch, 
 NUTCH-635-5-20080620.patch, NUTCH-635-6-20080725.patch, 
 NUTCH-635-7-20080808.patch, NUTCH-635-9-20081126.patch


 This is a basic pagerank type link analysis tool for nutch which simulates a 
 sparse matrix using inlinks and outlinks and converges after a given number 
 of iterations.  This tool is meant to replace the current scoring system in 
 nutch with a system that converges instead of exponentially increasing 
 scores.  Also includes a tool to create an outlinkdb.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (NUTCH-667) Input Forma for working with Content in Hadoop Streaming

Input Forma for working with Content in Hadoop Streaming


 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0


This is a ContextAsText input format that replaces line endings with spaces so 
that Nutch content can be used more effectively inside Hadoop streaming jobs, 
which allow MapReduce jobs to be written in any language that can communicate 
over stdin and stdout.
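
The reason for flattening is that Hadoop streaming hands records to the external program one line at a time, so content containing raw newlines would be split across several records. A minimal sketch of the flattening step follows; the class name ContentLineFlattener, the UTF-8 decoding, and the url-tab-content layout are assumptions for illustration and are not taken from the patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of turning one page's content into a single streaming record.
final class ContentLineFlattener {

    // Replaces every run of CR, LF, or tab characters with one space so that the
    // whole record fits on one line, then prefixes the URL as a tab-separated key.
    static String flatten(String url, byte[] content) {
        String text = new String(content, StandardCharsets.UTF_8);
        String oneLine = text.replaceAll("[\\r\\n\\t]+", " ").trim();
        return url + "\t" + oneLine;
    }

    public static void main(String[] args) {
        byte[] page = "<html>\n<body>\r\nhello\n</body>\n</html>"
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(flatten("http://example.com/", page));
    }
}
```

A streaming mapper written in any language can then read one page per line from stdin and split on the first tab to recover the URL.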

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool


 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Attachment: NUTCH-666-1-20081126.patch

Fixed patch.  Now includes the changes to AnalyzerFactory to allow multiple 
languages per plugin.

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
 Russian, and Thai.  Also includes a new Language Identifier tool that uses 
 the new indexing framework in NUTCH-646.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-667) Input Forma for working with Content in Hadoop Streaming


 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-667:
---

Attachment: NUTCH-667-1-20081126.patch

Input format for working with hadoop streaming.

 Input Forma for working with Content in Hadoop Streaming
 

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContextAsText input format that replaces line endings with spaces 
 so that Nutch content can be used more effectively inside Hadoop streaming 
 jobs, which allow MapReduce jobs to be written in any language that can 
 communicate over stdin and stdout.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-667) Input Format for working with Content in Hadoop Streaming


 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-667:
---

Summary: Input Format for working with Content in Hadoop Streaming  (was: 
Input Forma for working with Content in Hadoop Streaming)

 Input Format for working with Content in Hadoop Streaming
 -

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContextAsText input format that replaces line endings with spaces 
 so that Nutch content can be used more effectively inside Hadoop streaming 
 jobs, which allow MapReduce jobs to be written in any language that can 
 communicate over stdin and stdout.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-646) New Indexing Framework for Nutch


 [ 
https://issues.apache.org/jira/browse/NUTCH-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-646:
---

Attachment: NUTCH-646-2-20081126.patch

Updated indexing patch.

 New Indexing Framework for Nutch
 

 Key: NUTCH-646
 URL: https://issues.apache.org/jira/browse/NUTCH-646
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 0.9.0, 1.0.0

 Attachments: arity-1.3.2.jar, NUTCH-646-1-20080818.patch, 
 NUTCH-646-2-20081126.patch


 New indexing framework for Nutch that provides a more generic field 
 abstraction consistent with Lucene index semantics.  Allows multiple MR jobs 
 to be created for different fields and those fields to be aggregated and 
 indexed in the end.  Overcomes limitations of the current indexer that limits 
 what databases are passed into the indexer.  Creates a new extension point as 
 well for field-filters for manipulation of fields during the indexing process.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

2008-11-25 Thread Dennis Kubes (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650713#action_12650713
]

Dennis Kubes commented on NUTCH-663:

@buddha1021
The 1.0 release for Nutch has some of the features for Nutch 2 but it is not a
complete Nutch 2 Architecture. We felt it was best to add some needed
features into the current version of Nutch and get them deployed to the
community quickly. A lot of people have been asking about the development of
Nutch and releasing. Truth is we have just been busy adding in needed features
and patches. We should have a release out in the next couple of weeks. That
will be a 1.0 release for Nutch but will probably contain a 0.18.2 or 0.19 release
for Hadoop. We aren't waiting for hadoop to go to 1.0.

@Doğacan Güney
I am not opposed to waiting for 0.19 as long as it will be released soon. I
was looking and it seemed they tried to release a little while back and didn't
finish because of some big errors.

Upgrade Nutch to use Hadoop 0.18.2
--

Key: NUTCH-663
URL: https://issues.apache.org/jira/browse/NUTCH-663
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0

Upgrade Nutch to use a newer hadoop, version 0.18.2. This includes
performance improvements, bug fixes, and new functionality. Changes some
current APIs.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-11-23 Thread Dennis Kubes (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650009#action_12650009
]

Dennis Kubes commented on NUTCH-662:

We had been running in production for about a month and never saw any issues
with the indexing processes using 2.4. Then I was doing some work for
upgrading the trunk and it popped up in delete duplicates unit testing. We
don't do delete duplicates in our JobStream, we do it query side.

First problem was that the old DfsIndexOutput didn't implement the seek method
(probably because DFS can't seek), so when that was changed to allow it to
seek, it was throwing Checksum errors on the index when it was trying to open
it. It turns out, as described above, that 2.4 purposefully writes a bad
checksum, then seeks back, then writes a correct checksum when closing the index
as a pseudo-two-phase commit. So I don't think it will affect the indexing process
because as you noted it writes to local first then just transfers to DFS. In
changing DfsIndexOutput to allow DeleteDuplicates to work I just took the same
approach, local first, then put to DFS.

Upgrade Nutch to use Lucene 2.4
---

Key: NUTCH-662
URL: https://issues.apache.org/jira/browse/NUTCH-662
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0

Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar,
lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch

Upgrade nutch to use Lucene 2.4. This release changes the lucene file
format. New indexes created by this lucene version will NOT be readable by
older versions. Lucene 2.4 can read and update older index formats although
updating an older format will convert it to the new format. There are also
some performance and functionality improvements.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

Upgrade Nutch to use Lucene 2.4
---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


Upgrade nutch to use Lucene 2.4.  This release changes the lucene file format.  
New indexes created by this lucene version will NOT be readable by older 
versions.  Lucene 2.4 can read and update older index formats although updating 
an older format will convert it to the new format.  There are also some 
performance and functionality improvements.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (NUTCH-663) Upgrade Nutch to use Hadoop 0.18.2

Upgrade Nutch to use Hadoop 0.18.2
--

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


Upgrade Nutch to use a newer hadoop, version 0.18.2.  This includes performance 
improvements, bug fixes, and new functionality.  Changes some current APIs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-662) Upgrade Nutch to use Lucene 2.4


 [ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-662:
---

Attachment: lucene-misc-2.4.0.jar

 Upgrade Nutch to use Lucene 2.4
 ---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar


 Upgrade nutch to use Lucene 2.4.  This release changes the lucene file 
 format.  New indexes created by this lucene version will NOT be readable by 
 older versions.  Lucene 2.4 can read and update older index formats although 
 updating an older format will convert it to the new format.  There are also 
 some performance and functionality improvements.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

[
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649679#action_12649679
]

Dennis Kubes commented on NUTCH-662:

The upgrade to Lucene 2.4 causes a weird problem that might need some
discussion. The o.a.n.indexer.FsDirectory$DfsIndexOutput class is used to
interact with an index stored on DFS. In Lucene 2.4, the
ChecksumIndexOutput.prepareCommit and finalizeCommit methods perform a pseudo
two-phase commit. To do this, Lucene writes an intentionally mismatched checksum
(long = checksum - 1), then flushes, seeks back, and writes the correct checksum
in the same spot. They say this is to ensure the commit. Because DFS doesn't
have append functionality we can't write to it, seek back to a position, and
write again. DFS is write-once.

To handle this problem in the attached patch, I first write out to a local
temporary file that is deleted upon exit, then when close is called on the
IndexOutput, that file is written out to DFS all at once. I don't know if this
is the best way to do this or if there is a better way, but it does handle the
new write and seek functionality of lucene 2.4. The previous implementation of
DfsIndexOutput simply threw an UnsupportedOperationException when the seek
method was called. This was fine before 2.4 as lucene wasn't calling that
method during writing to DFS. In 2.4 it does and unit tests were failing
because of it. What does everybody think about this implementation?
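
To make the approach concrete, here is a simplified sketch of buffering writes in a seekable local temporary file and pushing the bytes to the destination only on close. It does not extend Lucene's IndexOutput and is not the code in the attached patch: the class name LocalBufferedOutput and its small method set are hypothetical, and a plain java.io.OutputStream stands in for the DFS output stream.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;

// Hypothetical sketch: seekable local buffer that is copied to a write-once sink on close.
final class LocalBufferedOutput implements AutoCloseable {
    private final File temp;
    private final RandomAccessFile local;
    private final OutputStream remote; // stand-in for the DFS output stream

    LocalBufferedOutput(OutputStream remote) throws IOException {
        this.remote = remote;
        this.temp = File.createTempFile("index-output", ".tmp");
        this.temp.deleteOnExit();
        this.local = new RandomAccessFile(temp, "rw");
    }

    void writeBytes(byte[] bytes) throws IOException {
        local.write(bytes);
    }

    void writeLong(long value) throws IOException {
        local.writeLong(value);
    }

    long position() throws IOException {
        return local.getFilePointer();
    }

    // Seeking is possible because all writes so far are still in the local file.
    void seek(long pos) throws IOException {
        local.seek(pos);
    }

    // Only now is anything sent to the destination, in one sequential pass.
    @Override
    public void close() throws IOException {
        local.close();
        try {
            Files.copy(temp.toPath(), remote);
        } finally {
            remote.close();
            temp.delete();
        }
    }
}
```

A writer can record position(), write the deliberately wrong checksum, then seek back and overwrite it with the correct value; the destination only ever sees the final bytes. The trade-off is that the whole file must fit on local disk before it is transferred.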

Other than that I don't see any major issues in upgrading to 2.4. Some people
have said performance went down in 2.4. My thought is that this might be the
case, but those problems will be fixed, and it would be good to be on the most recent lucene
version as we move to a 1.0 release for Nutch. Also we have been using 2.4 in
production for a month now without any issues.

Upgrade Nutch to use Lucene 2.4
---

Attachments: lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar,
NUTCH-662-20081121-1.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-662) Upgrade Nutch to use Lucene 2.4