[jira] Issue Comment Edited: (NUTCH-664) Possibility to update already stored documents.
[ https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651458#action_12651458 ] skhil edited comment on NUTCH-664 at 12/2/08 1:29 AM: --- Good news! So, I'll wait until 1.0 and prepare project for hbase-solr! was (Author: skhil): Good news! So, I'll wait until 1.0 and prepare project for hbase-solr/katta/etc! Possibility to update already stored documents. --- Key: NUTCH-664 URL: https://issues.apache.org/jira/browse/NUTCH-664 Project: Nutch Issue Type: Wish Reporter: Sergey Khilkov Priority: Minor We have a huge index of stored documents. Fetching the page and merging indexes every time we update some information about a page is a costly procedure. The information can change 1-3 times per day. At the moment we have to store the changed info in a database, but in this case we have lots of problems with sorting, search restrictions and so on. Lucene itself allows deleting a single document and adding a new one to an existing index. But there is a problem with Hadoop... As I understand it, the Hadoop filesystem cannot write at random positions. But it would be a great feature if Nutch were able to update an already created index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Pending Commits for Nutch Issues
I agree with John too. Probably you meant $ 0.02, since 0.02 cents is too little. It is usually 2 cents. :-P Regards, Susam Pal On Tue, Dec 2, 2008 at 6:09 PM, John Martyniak [EMAIL PROTECTED] wrote: Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr integration would be huge. just my .02 cents. -John On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote: And here is a list of issues from me that need more discussion/review: NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for people to review, for now we can just write a SolrIndexer like Sami Siren's and deal with 442 after 1.0. I would be happy to provide such a patch. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I don't know how to fix this one but indexing almost always fails with index-more enabled. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly: I botched it once so now I am afraid to commit it :D NUTCH-626 - fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects: I am going to update the patch and commit it if there are no objections. Also, I think NUTCH-658 would be a nice feature for 1.0. There are some others but these are the most recent and we really should push 1.0 out the door already :D Oh and finally we should do a review of all libraries in Nutch (libraries in plugins included) and update them to the latest versions. I am going to open an issue with the intention of updating all the libraries that do not require code changes. -- Doğacan Güney
Re: Pending Commits for Nutch Issues
I agree with John. NUTCH-442 is by far the most popular/watched item in JIRA and, I think, has already been used by enough different people to be deemed reliable. Julien 2008/12/2 John Martyniak [EMAIL PROTECTED] Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr integration would be huge. just my .02 cents. -John On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote: And here is a list of issues from me that need more discussion/review: NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for people to review, for now we can just write a SolrIndexer like Sami Siren's and deal with 442 after 1.0. I would be happy to provide such a patch. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I don't know how to fix this one but indexing almost always fails with index-more enabled. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly: I botched it once so now I am afraid to commit it :D NUTCH-626 - fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects: I am going to update the patch and commit it if there are no objections. Also, I think NUTCH-658 would be a nice feature for 1.0. There are some others but these are the most recent and we really should push 1.0 out the door already :D Oh and finally we should do a review of all libraries in Nutch (libraries in plugins included) and update them to the latest versions. I am going to open an issue with the intention of updating all the libraries that do not require code changes. -- Doğacan Güney -- DigitalPebble Ltd http://www.digitalpebble.com
[jira] Resolved: (NUTCH-662) Upgrade Nutch to use Lucene 2.4
[ https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-662. Resolution: Fixed Committed with revision 722475 Upgrade Nutch to use Lucene 2.4 --- Key: NUTCH-662 URL: https://issues.apache.org/jira/browse/NUTCH-662 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch Upgrade Nutch to use Lucene 2.4. This release changes the Lucene file format. New indexes created by this Lucene version will NOT be readable by older versions. Lucene 2.4 can read and update older index formats, although updating an older format will convert it to the new format. There are also some performance and functionality improvements. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19
[ https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-663. -- Upgrade Nutch to use Hadoop 0.19 Key: NUTCH-663 URL: https://issues.apache.org/jira/browse/NUTCH-663 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, NUTCH-663-1-20081126.patch Upgrade Nutch to use a newer Hadoop, version 0.19.0. This includes performance improvements, bug fixes, and new functionality. Changes some current APIs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-647) Resolve URLs tool
[ https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-647. -- Resolve URLs tool - Key: NUTCH-647 URL: https://issues.apache.org/jira/browse/NUTCH-647 Project: Nutch Issue Type: New Feature Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch A tool that takes a listing of urls and attempts to resolve their IP addresses. Useful for running after the fetcher has run to determine if DNS problems exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-647) Resolve URLs tool
[ https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-647. Resolution: Fixed Fix Version/s: 1.0.0 Committed with revision 722478 Resolve URLs tool - Key: NUTCH-647 URL: https://issues.apache.org/jira/browse/NUTCH-647 Project: Nutch Issue Type: New Feature Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch A tool that takes a listing of urls and attempts to resolve their IP addresses. Useful for running after the fetcher has run to determine if DNS problems exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
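As a rough illustration of what such a tool does (this is not the code from the NUTCH-647 patch), the sketch below takes a list of URLs, extracts each host and tries to resolve it with the standard java.net.InetAddress API. The class name ResolveUrlsSketch and the tab-separated output format are assumptions made for the example only.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;

/** Hypothetical sketch of a URL resolver; not the tool committed for NUTCH-647. */
final class ResolveUrlsSketch {

    /** Returns the host part of a URL, or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Returns the resolved IP address for the URL's host, or null if resolution fails. */
    static String resolve(String url) {
        String host = hostOf(url);
        if (host == null) {
            return null;
        }
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null; // DNS problem: the host could not be resolved
        }
    }

    public static void main(String[] args) {
        // Each command-line argument is treated as one URL to check.
        for (String url : args) {
            String ip = resolve(url);
            System.out.println(url + "\t" + (ip == null ? "UNRESOLVED" : ip));
        }
    }
}
```

Running this over the URLs of a finished fetch would list the hosts that fail to resolve, which is the kind of DNS check the issue describes.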
[jira] Resolved: (NUTCH-665) Search Load Testing Tool
[ https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-665. Resolution: Fixed Committed with revision 722481 Search Load Testing Tool Key: NUTCH-665 URL: https://issues.apache.org/jira/browse/NUTCH-665 Project: Nutch Issue Type: New Feature Components: searcher Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-665-20081126-1.patch A tool which spawns a number of threads and executes searches against configured search servers. This is used for light load testing of search servers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-665) Search Load Testing Tool
[ https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-665. -- Search Load Testing Tool Key: NUTCH-665 URL: https://issues.apache.org/jira/browse/NUTCH-665 Project: Nutch Issue Type: New Feature Components: searcher Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-665-20081126-1.patch A tool which spawns a number of threads and executes searches against configured search servers. This is used for light load testing of search servers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
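The general shape of a thread-based load test like this can be sketched with the standard java.util.concurrent classes. This is only an illustration under assumptions, not the NUTCH-665 patch: the class SearchLoadSketch is hypothetical, and the "search server" is stood in for by a plain Function so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of a light search load test; not the NUTCH-665 tool. */
final class SearchLoadSketch {

    /**
     * Starts the given number of threads; each thread runs every query once
     * against the search function. Returns the total number of searches that
     * completed without throwing.
     */
    static int run(int threads, List<String> queries, Function<String, Integer> search) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    int ok = 0;
                    for (String query : queries) {
                        try {
                            search.apply(query); // stand-in for a real search call
                            ok++;
                        } catch (RuntimeException e) {
                            // a failed search is simply not counted
                        }
                    }
                    return ok;
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // wait for each worker to finish
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("load test did not complete", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A real tool would also record timings per query; the sketch only counts completed searches to keep the threading pattern visible.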
[jira] Closed: (NUTCH-667) Input Format for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-667. -- Input Format for working with Content in Hadoop Streaming - Key: NUTCH-667 URL: https://issues.apache.org/jira/browse/NUTCH-667 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-667-1-20081126.patch This is a ContentAsText input format that replaces line endings with spaces, which allows Nutch content to be used more effectively inside Hadoop streaming jobs. Streaming jobs allow MapReduce jobs to be written in any language that can communicate via stdin and stdout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-667) Input Format for working with Content in Hadoop Streaming
[ https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-667. Resolution: Fixed Committed with revision 722483 Input Format for working with Content in Hadoop Streaming - Key: NUTCH-667 URL: https://issues.apache.org/jira/browse/NUTCH-667 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-667-1-20081126.patch This is a ContentAsText input format that replaces line endings with spaces, which allows Nutch content to be used more effectively inside Hadoop streaming jobs. Streaming jobs allow MapReduce jobs to be written in any language that can communicate via stdin and stdout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
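The reason line endings matter is that Hadoop streaming hands records to the external program one line at a time, by default with the key and value separated by a tab. The sketch below shows only the flattening step, under the assumption that each record is written as URL, tab, content; the class and method names are hypothetical and this is not the code from the patch.

```java
/** Hypothetical sketch of the line-flattening idea behind the input format described above. */
final class ContentAsTextSketch {

    /** Replaces every line ending (CRLF, CR or LF) with a single space. */
    static String toSingleLine(String content) {
        return content.replaceAll("\\r\\n|\\r|\\n", " ");
    }

    /** Builds one streaming record: key and value on a single line, separated by a tab. */
    static String toRecord(String url, String content) {
        return url + "\t" + toSingleLine(content);
    }
}
```

With the content flattened like this, a streaming mapper written in any language can read one page per line from stdin without having to reassemble multi-line documents.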
Re: Pending Commits for Nutch Issues
Is NUTCH-442 going to be part of the 1.0 release? I hope so, Nutch/Solr integration would be huge. just my .02 cents. -John On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote: And here is a list of issues from me that need more discussion/review: NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex for people to review, for now we can just write a SolrIndexer like Sami Siren's and deal with 442 after 1.0. I would be happy to provide such a patch. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I don't know how to fix this one but indexing almost always fails with index-more enabled. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly: I botched it once so now I am afraid to commit it :D NUTCH-626 - fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects: I am going to update the patch and commit it if there are no objections. Also, I think NUTCH-658 would be a nice feature for 1.0. There are some others but these are the most recent and we really should push 1.0 out the door already :D Oh and finally we should do a review of all libraries in Nutch (libraries in plugins included) and update them to the latest versions. I am going to open an issue with the intention of updating all the libraries that do not require code changes. -- Doğacan Güney
named parameters in crawl command
Hi all, I've defined a couple of custom parameters for bin/nutch, for example a -conf parameter to set the conf dir from the command line. To be able to use the crawl command, I have to adjust the for-loop and the if/else statements that process the command line arguments args[] in Crawl.java to make my new parameters known to the class, because otherwise it takes the last unknown parameter as the URL input directory (the last else-if statement). Wouldn't it be better to use a named parameter for the URL directory, as for all the other parameters? That way, one wouldn't have to change Nutch core classes to use custom input parameters, because they would simply be discarded if the Java program has no use for them. What do you think? In my opinion the change to version 1.0 would be a good point in time to introduce a slightly different usage of the standard crawl command. Kind regards, Martina
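To make the proposal concrete, here is a minimal sketch of what fully named argument handling could look like. It is not the actual Crawl class: the class NamedCrawlArgs and the option name -urls are hypothetical, and -dir, -threads, -depth and -topN are used only as examples of the existing named options. Unknown options such as -conf are skipped together with their value instead of being mistaken for the URL directory.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of named-only argument parsing; not Nutch's Crawl class. */
final class NamedCrawlArgs {

    // Options this sketch understands; "-urls" is an assumed name for the URL directory.
    private static final List<String> KNOWN =
            Arrays.asList("-urls", "-dir", "-threads", "-depth", "-topN");

    /** Returns the known options with their values; everything else is ignored. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (!arg.startsWith("-")) {
                continue; // value of an unknown option, or a stray token: discard
            }
            boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
            if (KNOWN.contains(arg) && hasValue) {
                options.put(arg, args[++i]); // consume the option's value
            }
            // Unknown options fall through; their value is skipped on the next pass.
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse(args));
    }
}
```

One limitation of this simple scheme is that a value beginning with "-" (a negative number, say) would be read as an option name, so a real implementation would need a rule for that case.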
[jira] Created: (NUTCH-668) Domain URL Filter
Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of URLs is used to denote accepted urls. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-668) Domain URL Filter
[ https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-668: --- Attachment: NUTCH-668-1-20081202.patch Includes the DomainURLFilter and test files. Through configuration, URLs can be filtered either by top level domain, ignoring subdomains, or by hostname. There is a configuration file where valid domains are placed one per line. Those domains are used to create a set of valid domains against which we validate URLs at runtime. Only URLs which match domains in the domain set are considered valid. Domain URL Filter - Key: NUTCH-668 URL: https://issues.apache.org/jira/browse/NUTCH-668 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Attachments: NUTCH-668-1-20081202.patch A URLFilter that adds the ability to filter out URLs by top level domain or by hostname. A configuration file with a listing of domains is used to denote accepted URLs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
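The matching rule described in the comment can be sketched as follows. This is an illustration under assumptions rather than the DomainURLFilter from the patch: the class DomainFilterSketch is hypothetical, the configuration is passed in as lines instead of being read from a file, and the filter returns the URL when it is accepted and null when it is rejected. A URL is accepted when its hostname, or any parent domain of that hostname, appears in the configured set.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of domain-based URL filtering; not the NUTCH-668 patch. */
final class DomainFilterSketch {

    private final Set<String> accepted = new HashSet<>();

    /** Builds the accepted set from configuration lines, one domain or hostname per line. */
    DomainFilterSketch(Iterable<String> configLines) {
        for (String line : configLines) {
            String entry = line.trim().toLowerCase(Locale.ROOT);
            if (!entry.isEmpty()) {
                accepted.add(entry);
            }
        }
    }

    /** Returns the URL if its host or one of its parent domains is accepted, otherwise null. */
    String filter(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
        if (host == null) {
            return null;
        }
        // Try the full host first, then strip one label at a time:
        // issues.apache.org -> apache.org -> org
        String candidate = host.toLowerCase(Locale.ROOT);
        while (true) {
            if (accepted.contains(candidate)) {
                return url;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            candidate = candidate.substring(dot + 1);
        }
    }
}
```

With "apache.org" in the configuration, both http://apache.org/ and http://issues.apache.org/ pass, while a hostname entry such as "lucene.example.com" accepts only that host and its own subdomains.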
Build failed in Hudson: Nutch-trunk #649
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/649/changes Changes: [kubes] NUTCH-667: Input Format for working with Content in Hadoop Streaming [kubes] NUTCH-665: Search Load Testing Tool [kubes] NUTCH-647: Resolve URLs tool [kubes] NUTCH-647: Resolve URLs tool [kubes] NUTCH-663: Upgrade Nutch to use Hadoop 0.19 [kubes] NUTCH-662: Upgrade Nutch to use Lucene 2.4 -- [...truncated 2151 lines...] A src/plugin/protocol-http/src/test/org/apache A src/plugin/protocol-http/src/test/org/apache/nutch A src/plugin/protocol-http/src/test/org/apache/nutch/protocol A src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http A src/plugin/protocol-http/src/java A src/plugin/protocol-http/src/java/org A src/plugin/protocol-http/src/java/org/apache A src/plugin/protocol-http/src/java/org/apache/nutch A src/plugin/protocol-http/src/java/org/apache/nutch/protocol A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/package.html AUsrc/plugin/protocol-http/plugin.xml AUsrc/plugin/protocol-http/build.xml A bin AUbin/nutch A docs A docs/ms A docs/ms/search.html A docs/ms/help.html A docs/ms/about.html A docs/zh A docs/zh/search.html A docs/zh/help.html A docs/zh/about.html A docs/ca A docs/ca/search.html A docs/ca/help.html A docs/ca/about.html A docs/pt A docs/pt/search.html A docs/pt/help.html A docs/pt/about.html A docs/sr AUdocs/sr/search.html AUdocs/sr/help.html AUdocs/sr/about.html A docs/sv A docs/sv/search.html A docs/sv/help.html A docs/sv/about.html A docs/de A docs/de/search.html A docs/de/help.html A docs/de/about.html A docs/fi A docs/fi/search.html A docs/fi/help.html A docs/fi/about.html A docs/en A docs/en/search.html A docs/en/help.html A docs/en/about.html A docs/es A docs/es/search.html A docs/es/help.html A 
docs/es/about.html A docs/fr A docs/fr/search.html AUdocs/fr/help.html A docs/fr/about.html A docs/jp A docs/jp/search.html A docs/jp/help.html A docs/jp/about.html A docs/nl A docs/nl/search.html A docs/nl/help.html A docs/nl/about.html A docs/sh AUdocs/sh/search.html AUdocs/sh/help.html AUdocs/sh/about.html A docs/th A docs/th/search.html A docs/th/help.html A docs/th/about.html A docs/pl A docs/pl/search.html A docs/pl/help.html A docs/pl/about.html A docs/it AUdocs/it/search.html AUdocs/it/help.html AUdocs/it/about.html A docs/img A docs/img/lang AUdocs/img/lang/romanian.png AUdocs/img/lang/bulgarian.png AUdocs/img/lang/spanish.png AUdocs/img/lang/danish.png AUdocs/img/lang/dutch.png AUdocs/img/lang/icelandic.png AUdocs/img/lang/hungarian.png AUdocs/img/lang/russian.png AUdocs/img/lang/japanese.png AUdocs/img/lang/turkish.png AUdocs/img/lang/suomi.png AUdocs/img/lang/lithuanian.png AUdocs/img/lang/czech.png AUdocs/img/lang/greek.png AUdocs/img/lang/galego.png AUdocs/img/lang/polish.png AUdocs/img/lang/latvian.png AUdocs/img/lang/croatian.png AUdocs/img/lang/portuguese.png AUdocs/img/lang/french.png AUdocs/img/lang/swedish.png AUdocs/img/lang/german.png AUdocs/img/lang/chinese.png AUdocs/img/lang/malaysian.png AUdocs/img/lang/korean.png AUdocs/img/lang/arabic.png AUdocs/img/lang/italian.png AUdocs/img/lang/brazil.png AUdocs/img/lang/catala.png AUdocs/img/lang/thai.png AUdocs/img/lang/indonesian.png AUdocs/img/lang/norwegian.png AUdocs/img/lang/english.png AUdocs/img/poweredbynutch_01.gif AUdocs/img/poweredbynutch_02.gif A docs/img/reiter AUdocs/img/reiter/reiter_inactive_le.gif AUdocs/img/reiter/_spacer_cc.gif AUdocs/img/reiter/reiter_inactive_le1.gif AUdocs/img/reiter/bg_subnavi.gif AUdocs/img/reiter/002bg_fle.gif AUdocs/img/reiter/spacer_66.gif AUdocs/img/reiter/ul.gif AUdocs/img/reiter/_bg_reiter.gif AUdocs/img/reiter/logo_nutch.gif AU