[jira] [Commented] (NUTCH-1363) Make parsing in FetcherJob actually work.

2012-05-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272150#comment-13272150 ] Ferdy Galema commented on NUTCH-1363: - I'm not sure I follow. What makes this property

[jira] [Created] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-05-10 Thread Ferdy Galema (JIRA)
Ferdy Galema created NUTCH-1365: --- Summary: Fix crawlId functionalilty by making using of new gora configuration Key: NUTCH-1365 URL: https://issues.apache.org/jira/browse/NUTCH-1365 Project: Nutch

[jira] [Updated] (NUTCH-1365) Fix crawlId functionalilty by making using of new gora configuration

2012-05-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1365: Attachment: NUTCH-1365.patch Fix crawlId functionalilty by making using of new gora

Re: store additional information from page at outlinks - topic specific crawl

2012-05-10 Thread Armin Nagel
Hi Markus, thanks for your reply, but that is not what I want. Why store data in Solr that I do not need? I do not want to use Solr. My goal is to crawl terabytes of data, store the data in HBase or another store and do some processing on it, so this unneeded data causes pain. I have to filter the

Re:

2012-05-10 Thread Markus Jelsma
Hi, what do you mean by `the function of learning to rank`? Cheers, On Thu, 10 May 2012 16:37:00 +0800, 柳胜兵 colin.liu1...@gmail.com wrote: Hello all, I want to know whether the Nutch project plans to implement the function of learning to rank.

Re:

2012-05-10 Thread 柳胜兵
That is to say, could we use document-to-query relevance data, generated by experts or taken from logs of users' clickthroughs, to get a more complex but better ranking model through machine learning? 2012/5/10 Markus Jelsma markus.jel...@openindex.io Hi, what do you mean by `the function of learning

Re:

2012-05-10 Thread Markus Jelsma
Ah, I see. Well, no. Nutch is not a search engine anymore. You can do this with Solr and some custom script parsing its log and emitting external file fields, but not with Nutch. On Thu, 10 May 2012 17:02:12 +0800, 柳胜兵 colin.liu1...@gmail.com wrote: That is to say, could we use some data of

Re:

2012-05-10 Thread 柳胜兵
Well, I see. Thanks. 2012/5/10 Markus Jelsma markus.jel...@openindex.io Ah, I see. Well, no. Nutch is not a search engine anymore. You can do this with Solr and some custom script parsing its log and emitting external file fields, but not with Nutch. On Thu, 10 May 2012 17:02:12 +0800, 柳胜兵

[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index

2012-05-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272231#comment-13272231 ] Ferdy Galema commented on NUTCH-1306: - Lewis, Do you suggest to add the commit as

[jira] [Updated] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-05-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1360: Attachment: NUTCH-1360-nutchgora.patch This is a real WIP for nutchgora. It would

[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-05-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272234#comment-13272234 ] Lewis John McGibbney commented on NUTCH-1360: - As all protocol plugins try to

[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2012-05-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1325: - Attachment: NUTCH-1325-1.6-1.patch Initial patch. This introduces a HostDB that keeps track of

[jira] [Closed] (NUTCH-1026) Strip UTF-8 non-character codepoints

2012-05-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema closed NUTCH-1026. --- Resolution: Fixed Fix Version/s: (was: 2.1) nutchgora When indexing a

[jira] [Commented] (NUTCH-1026) Strip UTF-8 non-character codepoints

2012-05-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272328#comment-13272328 ] Markus Jelsma commented on NUTCH-1026: -- Great! Strip UTF-8

[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index

2012-05-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272341#comment-13272341 ] Lewis John McGibbney commented on NUTCH-1306: - This is exactly the viewpoint I

Re: store additional information from page at outlinks - topic specific crawl

2012-05-10 Thread Armin Nagel
Hi all, I found a solution to store metadata at outlinks. The metadata is attached to the CrawlDatum, so the fetcher can read the information stored there. The solution is to implement a custom scoring filter, specifically its distributeScoreToOutlinks method. In this method it is possible to do something like this,

[jira] [Updated] (NUTCH-1306) Commit after finished writing to solr index

2012-05-10 Thread Ferdy Galema (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ferdy Galema updated NUTCH-1306: Attachment: NUTCH-1306-v2.patch NUTCH-1306-trunk.patch Agree with trying to make

[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index

2012-05-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272599#comment-13272599 ] Lewis John McGibbney commented on NUTCH-1306: - I've just stumbled across

[jira] [Updated] (NUTCH-1077) Nutch 2 DbUpdateMapper throws ArrayOutOfBoundsException when running update

2012-05-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1077: Fix Version/s: 2.1 Nutch 2 DbUpdateMapper throws ArrayOutOfBoundsException

[jira] [Updated] (NUTCH-1357) All gora mapreduce functionality should go through StorageUtils

2012-05-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1357: Affects Version/s: nutchgora Fix Version/s: 2.1 All gora mapreduce

[jira] [Commented] (NUTCH-1363) Make parsing in FetcherJob actually work.

2012-05-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272622#comment-13272622 ] Lewis John McGibbney commented on NUTCH-1363: - So just to summarize here... we

[jira] [Commented] (NUTCH-1363) Make parsing in FetcherJob actually work.

2012-05-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272633#comment-13272633 ] Markus Jelsma commented on NUTCH-1363: -- I'm fine with not having a -parse switch for

[jira] [Resolved] (NUTCH-1363) Make parsing in FetcherJob actually work.

2012-05-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1363. - Resolution: Not A Problem Yeah, you guys win :0) Closing as this is not an

[jira] [Commented] (NUTCH-1363) Make parsing in FetcherJob actually work.

2012-05-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272732#comment-13272732 ] Markus Jelsma commented on NUTCH-1363: -- Good work anyway :) I had the same confusing

[jira] [Commented] (NUTCH-1358) Do not accept bogus arguments

2012-05-10 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273026#comment-13273026 ] Hudson commented on NUTCH-1358: --- Integrated in Nutch-nutchgora #249 (See

[jira] [Commented] (NUTCH-1026) Strip UTF-8 non-character codepoints

2012-05-10 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273027#comment-13273027 ] Hudson commented on NUTCH-1026: --- Integrated in Nutch-nutchgora #249 (See