[jira] Issue Comment Edited: (NUTCH-664) Possibility to update already stored documents.

2008-12-02 Thread Sergey Khilkov (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651458#action_12651458
 ] 

skhil edited comment on NUTCH-664 at 12/2/08 1:29 AM:
---

Good news! So, I'll wait until 1.0 and prepare project for hbase-solr!

  was (Author: skhil):
Good news! So, I'll wait until 1.0 and prepare project for 
hbase-solr/katta/etc!
  
 Possibility to update already stored documents.
 ---

 Key: NUTCH-664
 URL: https://issues.apache.org/jira/browse/NUTCH-664
 Project: Nutch
  Issue Type: Wish
Reporter: Sergey Khilkov
Priority: Minor

 We have a huge index of stored documents. Fetching the page and merging 
 indexes every time we update some information about a page is costly. The 
 information can change 1-3 times per day. At the moment we have to store the 
 changed info in a database, but this causes lots of problems with sorting, 
 search restrictions and so on. Lucene itself allows deleting a single 
 document and adding a new one to an existing index. But there is a problem 
 with Hadoop... As I understand it, the Hadoop filesystem cannot write at 
 random positions. It would be a great feature if Nutch were able to 
 update an existing index.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Pending Commits for Nutch Issues

2008-12-02 Thread Susam Pal
I agree with John too. Probably you meant $0.02, since 0.02 cents is too
little. It is usually 2 cents. :-P

Regards,
Susam Pal

On Tue, Dec 2, 2008 at 6:09 PM, John Martyniak [EMAIL PROTECTED] wrote:

 Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr
 integration would be huge.

 just my .02 cents.

 -John

 On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:

  And here is a list of issues from me that need more discussion/review:

 NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
 review for people, for now we can just write a SolrIndexer like Sami
 Siren's and deal with 442 after 1.0. I would be happy to provide such
 a patch.

 NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
 don't know how to fix this one but indexing almost always fails with
 index-more enabled.

 NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
 fetch interval correctly: I botched it once so now I am afraid to
 commit it :D

 NUTCH-626 - fetcher2 breaks out the domain with
 db.ignore.external.links set at cross domain redirects: I am going to
 update the patch and commit it if no objections.

 Also, I think NUTCH-658 would be a nice feature for 1.0.

 There are some others but these are the most recent and we really
 should push 1.0 out the door already :D

 Oh and finally we should do a review of all libraries in nutch
 (libraries in plugins included) and update them to latest versions. I
 am going to open an issue with the intention of updating all the
 libraries that do not require code changes.

 --
 Doğacan Güney





Re: Pending Commits for Nutch Issues

2008-12-02 Thread Julien Nioche
I agree with John. NUTCH-442 is by far the most popular/watched item in JIRA
and, I think, has been already used by quite a lot of different people to be
deemed reliable.

Julien


2008/12/2 John Martyniak [EMAIL PROTECTED]

 Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr
 integration would be huge.

 just my .02 cents.

 -John


 On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:

  And here is a list of issues from me that need more discussion/review:

 NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
 review for people, for now we can just write a SolrIndexer like Sami
 Siren's and deal with 442 after 1.0. I would be happy to provide such
 a patch.

 NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
 don't know how to fix this one but indexing almost always fails with
 index-more enabled.

 NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
 fetch interval correctly: I botched it once so now I am afraid to
 commit it :D

 NUTCH-626 - fetcher2 breaks out the domain with
 db.ignore.external.links set at cross domain redirects: I am going to
 update the patch and commit it if no objections.

 Also, I think NUTCH-658 would be a nice feature for 1.0.

 There are some others but these are the most recent and we really
 should push 1.0 out the door already :D

 Oh and finally we should do a review of all libraries in nutch
 (libraries in plugins included) and update them to latest versions. I
 am going to open an issue with the intention of updating all the
 libraries that do not require code changes.

 --
 Doğacan Güney





-- 
DigitalPebble Ltd
http://www.digitalpebble.com


[jira] Resolved: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-662.


Resolution: Fixed

Committed with revision 722475

 Upgrade Nutch to use Lucene 2.4
 ---

 Key: NUTCH-662
 URL: https://issues.apache.org/jira/browse/NUTCH-662
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, 
 lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch


 Upgrade Nutch to use Lucene 2.4.  This release changes the Lucene file 
 format.  New indexes created by this Lucene version will NOT be readable by 
 older versions.  Lucene 2.4 can read and update older index formats, although 
 updating an older format will convert it to the new format.  There are also 
 some performance and functionality improvements.




[jira] Closed: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-663.
--


 Upgrade Nutch to use Hadoop 0.19
 

 Key: NUTCH-663
 URL: https://issues.apache.org/jira/browse/NUTCH-663
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
 NUTCH-663-1-20081126.patch


 Upgrade Nutch to use a newer Hadoop, version 0.19.  This includes 
 performance improvements, bug fixes, and new functionality.  Changes some 
 current APIs.




[jira] Closed: (NUTCH-647) Resolve URLs tool

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-647.
--


 Resolve URLs tool
 -

 Key: NUTCH-647
 URL: https://issues.apache.org/jira/browse/NUTCH-647
 Project: Nutch
  Issue Type: New Feature
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch


 A tool that takes a listing of urls and attempts to resolve their IP 
 addresses.  Useful for running after the fetcher has run to determine if DNS 
 problems exist.
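
A minimal sketch of the idea described above, not the attached patch: take a list of URLs, extract each host, and try to resolve it. The class name ResolveUrlsSketch and its methods are hypothetical; the resolver is passed in as a function (an assumption made here) so that DNS lookups can be replaced by a stub.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch, not the NUTCH-647 patch: resolve the host of each URL. */
final class ResolveUrlsSketch {

    /** Default resolver backed by the JVM's DNS lookup. */
    static Optional<String> dnsLookup(String host) {
        try {
            return Optional.of(InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    /**
     * Maps each URL to its resolved address, or Optional.empty() when the URL
     * is malformed, has no host, or the lookup fails.
     */
    static Map<String, Optional<String>> resolveAll(
            List<String> urls, Function<String, Optional<String>> resolver) {
        Map<String, Optional<String>> results = new LinkedHashMap<>();
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                host = null; // malformed URL
            }
            results.put(url, host == null ? Optional.empty() : resolver.apply(host));
        }
        return results;
    }
}
```

Calling ResolveUrlsSketch.resolveAll(urls, ResolveUrlsSketch::dnsLookup) after a fetch would list the URLs whose hosts did not resolve; every entry holding an empty Optional is a candidate DNS problem.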




[jira] Resolved: (NUTCH-647) Resolve URLs tool

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-647.


   Resolution: Fixed
Fix Version/s: 1.0.0

Committed with revision 722478

 Resolve URLs tool
 -

 Key: NUTCH-647
 URL: https://issues.apache.org/jira/browse/NUTCH-647
 Project: Nutch
  Issue Type: New Feature
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch


 A tool that takes a listing of urls and attempts to resolve their IP 
 addresses.  Useful for running after the fetcher has run to determine if DNS 
 problems exist.




[jira] Resolved: (NUTCH-665) Search Load Testing Tool

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-665.


Resolution: Fixed

Committed with revision 722481

 Search Load Testing Tool
 

 Key: NUTCH-665
 URL: https://issues.apache.org/jira/browse/NUTCH-665
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-665-20081126-1.patch


 A tool which spawns a number of threads and executes searches against 
 configured search servers.  This is used for light load testing of search 
 servers.
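
The thread-spawning approach described above could look roughly like the following. This is a sketch under stated assumptions, not the attached patch: the class SearchLoadSketch is hypothetical, and the search call is modelled as a plain Consumer<String> rather than a real search server client.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch of a light search load test; not the NUTCH-665 patch. */
final class SearchLoadSketch {

    /**
     * Starts numThreads workers; each worker runs every query once against
     * the supplied search function. Returns the number of searches that
     * completed without throwing.
     */
    static int run(int numThreads, List<String> queries, Consumer<String> search) {
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int t = 0; t < numThreads; t++) {
            pool.execute(() -> {
                for (String query : queries) {
                    try {
                        search.accept(query);
                        completed.incrementAndGet();
                    } catch (RuntimeException e) {
                        // A failed search is skipped rather than aborting the worker.
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow(); // give up on stragglers
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return completed.get();
    }
}
```

Comparing the returned count with numThreads * queries.size() gives a quick indication of how many searches failed under load.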




[jira] Closed: (NUTCH-665) Search Load Testing Tool

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-665.
--


 Search Load Testing Tool
 

 Key: NUTCH-665
 URL: https://issues.apache.org/jira/browse/NUTCH-665
 Project: Nutch
  Issue Type: New Feature
  Components: searcher
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-665-20081126-1.patch


 A tool which spawns a number of threads and executes searches against 
 configured search servers.  This is used for light load testing of search 
 servers.




[jira] Closed: (NUTCH-667) Input Format for working with Content in Hadoop Streaming

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-667.
--


 Input Format for working with Content in Hadoop Streaming
 -

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContentAsText input format that replaces line endings with spaces, 
 which allows Nutch content to be used more effectively inside Hadoop 
 streaming jobs.  Streaming allows MapReduce jobs to be written in any 
 language that can communicate with stdin and stdout.
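
To illustrate why the line endings matter: a streaming job reads its input line by line, so a page with embedded newlines would be split into several records. The sketch below shows only the flattening step. The class ContentAsTextSketch is hypothetical and is not the attached patch, and the tab-separated url/content layout is an assumption about how a record would reach the streaming process.

```java
/** Hypothetical sketch of the flattening idea; not the NUTCH-667 patch. */
final class ContentAsTextSketch {

    /** Replaces each line ending (\r\n, \r or \n) with a single space. */
    static String flatten(String content) {
        return content.replaceAll("\\r\\n|\\r|\\n", " ");
    }

    /** Builds one key/value line: the URL, a tab, then the flattened content. */
    static String toStreamingRecord(String url, String content) {
        return url + "\t" + flatten(content);
    }
}
```

Each record then occupies exactly one line on stdin. Tabs inside the content are left untouched in this sketch, so a consumer would split on the first tab only.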




[jira] Resolved: (NUTCH-667) Input Format for working with Content in Hadoop Streaming

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-667.


Resolution: Fixed

Committed with revision 722483

 Input Format for working with Content in Hadoop Streaming
 -

 Key: NUTCH-667
 URL: https://issues.apache.org/jira/browse/NUTCH-667
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-667-1-20081126.patch


 This is a ContentAsText input format that replaces line endings with spaces, 
 which allows Nutch content to be used more effectively inside Hadoop 
 streaming jobs.  Streaming allows MapReduce jobs to be written in any 
 language that can communicate with stdin and stdout.




Re: Pending Commits for Nutch Issues

2008-12-02 Thread John Martyniak
Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr 
integration would be huge.


just my .02 cents.

-John

On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:

And here is a list of issues from me that need more discussion/review:


NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
review for people, for now we can just write a SolrIndexer like Sami
Siren's and deal with 442 after 1.0. I would be happy to provide such
a patch.

NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
don't know how to fix this one but indexing almost always fails with
index-more enabled.

NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
fetch interval correctly: I botched it once so now I am afraid to
commit it :D

NUTCH-626 - fetcher2 breaks out the domain with
db.ignore.external.links set at cross domain redirects: I am going to
update the patch and commit it if no objections.

Also, I think NUTCH-658 would be a nice feature for 1.0.

There are some others but these are the most recent and we really
should push 1.0 out the door already :D

Oh and finally we should do a review of all libraries in nutch
(libraries in plugins included) and update them to latest versions. I
am going to open an issue with the intention of updating all the
libraries that do not require code changes.

--
Doğacan Güney




named parameters in crawl command

2008-12-02 Thread Koch Martina
Hi all,

I've defined a couple of custom parameters for bin/nutch, for example a 
-conf parameter to set the conf dir from the command line.
To be able to use the crawl command, I have to adjust the for-loop and if/else 
statements that process the command line arguments args[] in Crawl.java to 
make my new parameters known to the class. Otherwise it takes the last 
unknown parameter as the URL input directory (last else-if statement). Wouldn't 
it be better to use a named parameter for the URL directory, as for all the 
other parameters? Then one wouldn't have to change Nutch core classes to 
use custom input parameters, because they would simply be discarded if the Java 
program has no use for them.
What do you think? In my opinion the move to version 1.0 would be a good 
point in time to introduce a slightly different usage of the standard crawl 
command.
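
To make the proposal concrete, here is a minimal sketch of such argument handling. It is not the existing Crawl.java code: the class CrawlArgsSketch is hypothetical, and "-urls" is an assumed name for the new URL directory parameter. The other option names are the ones the crawl command already accepts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of named-parameter parsing for the crawl command. */
final class CrawlArgsSketch {

    /** Options this sketch recognises; "-urls" is an assumed name for the URL directory. */
    static final Set<String> KNOWN = Set.of("-urls", "-dir", "-threads", "-depth", "-topN");

    /** Collects known "-name value" pairs and silently discards everything else. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (KNOWN.contains(args[i]) && i + 1 < args.length) {
                options.put(args[i], args[i + 1]);
                i++; // skip the value just consumed
            }
            // Anything else (e.g. a custom -conf option and its value) is ignored.
        }
        return options;
    }
}
```

With this shape, "-conf myconf -urls urls -depth 3" yields only the -urls and -depth entries, and the custom -conf pair is dropped without being mistaken for the URL directory.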

Kind regards,
Martina



[jira] Created: (NUTCH-668) Domain URL Filter

2008-12-02 Thread Dennis Kubes (JIRA)
Domain URL Filter
-

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


A URLFilter that adds the ability to filter out URLs by top level domain or by 
hostname.  A configuration file with a listing of URLs is used to denote 
accepted urls.




[jira] Updated: (NUTCH-668) Domain URL Filter

2008-12-02 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-668:
---

Attachment: NUTCH-668-1-20081202.patch

Includes the DomainURLFilter and test files.  Domains can be filtered either by 
top-level domain, ignoring subdomains, or by hostname, through configuration.  
There is a configuration file where valid domains are placed one per line.  
Those domains are used to create a valid domain set against which we validate 
URLs at runtime.  Only URLs which match domains in the domain set are 
considered valid.
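
The matching described above can be sketched as follows. This is an illustration under assumptions, not the code in the attached patch: the class DomainFilterSketch is hypothetical, '#' comment lines in the configuration are an assumption, and a URL is accepted when its host or any parent suffix of the host appears in the domain set.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of domain-set filtering; not the NUTCH-668 patch. */
final class DomainFilterSketch {

    private final Set<String> allowed = new HashSet<>();

    /** Builds the domain set: one entry per line, blank lines and '#' comments skipped. */
    DomainFilterSketch(List<String> configLines) {
        for (String line : configLines) {
            String entry = line.trim().toLowerCase(Locale.ROOT);
            if (!entry.isEmpty() && !entry.startsWith("#")) {
                allowed.add(entry);
            }
        }
    }

    /** True when the URL's host, or any parent suffix of it, is in the set. */
    boolean accepts(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null) {
            return false;
        }
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (allowed.contains(candidate)) {
                return true;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return false;
            }
            candidate = candidate.substring(dot + 1); // drop the leftmost label
        }
    }
}
```

An entry such as "apache.org" therefore accepts lucene.apache.org as well, an entry such as "www.example.com" accepts only that hostname and its subdomains, and a bare "net" accepts every host under that top-level domain.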

 Domain URL Filter
 -

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-668-1-20081202.patch


 A URLFilter that adds the ability to filter out URLs by top level domain or 
 by hostname.  A configuration file with a listing of URLs is used to denote 
 accepted urls.




Build failed in Hudson: Nutch-trunk #649

2008-12-02 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/649/changes

Changes:

[kubes] NUTCH-667: Input Format for working with Content in Hadoop Streaming

[kubes] NUTCH-665: Search Load Testing Tool

[kubes] NUTCH-647: Resolve URLs tool

[kubes] NUTCH-647: Resolve URLs tool

[kubes] NUTCH-663: Upgrade Nutch to use Hadoop 0.19

[kubes] NUTCH-662: Upgrade Nutch to use Lucene 2.4

--
[...truncated 2151 lines...]
A src/plugin/protocol-http/src/test/org/apache
A src/plugin/protocol-http/src/test/org/apache/nutch
A src/plugin/protocol-http/src/test/org/apache/nutch/protocol
A src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http
A src/plugin/protocol-http/src/java
A src/plugin/protocol-http/src/java/org
A src/plugin/protocol-http/src/java/org/apache
A src/plugin/protocol-http/src/java/org/apache/nutch
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http
AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/package.html
AU src/plugin/protocol-http/plugin.xml
AU src/plugin/protocol-http/build.xml
A bin
AU bin/nutch
A docs
A docs/ms
A docs/ms/search.html
A docs/ms/help.html
A docs/ms/about.html
A docs/zh
A docs/zh/search.html
A docs/zh/help.html
A docs/zh/about.html
A docs/ca
A docs/ca/search.html
A docs/ca/help.html
A docs/ca/about.html
A docs/pt
A docs/pt/search.html
A docs/pt/help.html
A docs/pt/about.html
A docs/sr
AU docs/sr/search.html
AU docs/sr/help.html
AU docs/sr/about.html
A docs/sv
A docs/sv/search.html
A docs/sv/help.html
A docs/sv/about.html
A docs/de
A docs/de/search.html
A docs/de/help.html
A docs/de/about.html
A docs/fi
A docs/fi/search.html
A docs/fi/help.html
A docs/fi/about.html
A docs/en
A docs/en/search.html
A docs/en/help.html
A docs/en/about.html
A docs/es
A docs/es/search.html
A docs/es/help.html
A docs/es/about.html
A docs/fr
A docs/fr/search.html
AU docs/fr/help.html
A docs/fr/about.html
A docs/jp
A docs/jp/search.html
A docs/jp/help.html
A docs/jp/about.html
A docs/nl
A docs/nl/search.html
A docs/nl/help.html
A docs/nl/about.html
A docs/sh
AU docs/sh/search.html
AU docs/sh/help.html
AU docs/sh/about.html
A docs/th
A docs/th/search.html
A docs/th/help.html
A docs/th/about.html
A docs/pl
A docs/pl/search.html
A docs/pl/help.html
A docs/pl/about.html
A docs/it
AU docs/it/search.html
AU docs/it/help.html
AU docs/it/about.html
A docs/img
A docs/img/lang
AU docs/img/lang/romanian.png
AU docs/img/lang/bulgarian.png
AU docs/img/lang/spanish.png
AU docs/img/lang/danish.png
AU docs/img/lang/dutch.png
AU docs/img/lang/icelandic.png
AU docs/img/lang/hungarian.png
AU docs/img/lang/russian.png
AU docs/img/lang/japanese.png
AU docs/img/lang/turkish.png
AU docs/img/lang/suomi.png
AU docs/img/lang/lithuanian.png
AU docs/img/lang/czech.png
AU docs/img/lang/greek.png
AU docs/img/lang/galego.png
AU docs/img/lang/polish.png
AU docs/img/lang/latvian.png
AU docs/img/lang/croatian.png
AU docs/img/lang/portuguese.png
AU docs/img/lang/french.png
AU docs/img/lang/swedish.png
AU docs/img/lang/german.png
AU docs/img/lang/chinese.png
AU docs/img/lang/malaysian.png
AU docs/img/lang/korean.png
AU docs/img/lang/arabic.png
AU docs/img/lang/italian.png
AU docs/img/lang/brazil.png
AU docs/img/lang/catala.png
AU docs/img/lang/thai.png
AU docs/img/lang/indonesian.png
AU docs/img/lang/norwegian.png
AU docs/img/lang/english.png
AU docs/img/poweredbynutch_01.gif
AU docs/img/poweredbynutch_02.gif
A docs/img/reiter
AU docs/img/reiter/reiter_inactive_le.gif
AU docs/img/reiter/_spacer_cc.gif
AU docs/img/reiter/reiter_inactive_le1.gif
AU docs/img/reiter/bg_subnavi.gif
AU docs/img/reiter/002bg_fle.gif
AU docs/img/reiter/spacer_66.gif
AU docs/img/reiter/ul.gif
AU docs/img/reiter/_bg_reiter.gif
AU docs/img/reiter/logo_nutch.gif
AU