RE: How to get score in search.jsp
I have found a solution: I've added a variable score into Hit.

-----Original Message-----
From: Anton Potekhin [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 14, 2007 10:48 AM
To: nutch-dev@lucene.apache.org
Subject: How to get score in search.jsp
Importance: High

Hi Nutch gurus! I have a small problem. I need to make some changes to search.jsp: I need to get the first 50 results and sort them in a different way. To sort, I will change the score of each result with the formula "new_score = nutch_score + domain_score_from_my_db". But I don't understand how to get nutch_score in search.jsp. For now I use a makeshift: I get the nutch_score using the getValue() method of the org.apache.lucene.search.Explanation class, but I think that is a very slow way. Can anybody help me find a solution to this problem? P.S. I hope that I described my problem clearly. Thanks in advance. Sorry for the duplicated mail; I think I had some problems with my mail account.
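The re-scoring Anton describes can be sketched in a few lines. This is a hypothetical illustration, not the real Nutch Hit/search.jsp API: the Result class, its accessors, and the domain scores (assumed to come from his external DB) are all stand-ins.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: combine a hit's Nutch score with a per-domain
 *  score from an external DB, then re-sort the top results. */
public class RescoreSketch {
    static class Result {
        final String url;
        final float nutchScore;   // score Nutch assigned to this hit (assumed accessor)
        final float domainScore;  // score looked up in the external DB (assumed)
        Result(String url, float nutchScore, float domainScore) {
            this.url = url; this.nutchScore = nutchScore; this.domainScore = domainScore;
        }
        float newScore() { return nutchScore + domainScore; }
    }

    /** Sort the first maxHits results by new_score = nutch_score + domain_score, descending. */
    static Result[] rescore(Result[] hits, int maxHits) {
        Result[] top = Arrays.copyOf(hits, Math.min(maxHits, hits.length));
        Arrays.sort(top, Comparator.comparingDouble((Result r) -> r.newScore()).reversed());
        return top;
    }

    public static void main(String[] args) {
        Result[] hits = {
            new Result("http://a.example/", 0.9f, 0.0f),
            new Result("http://b.example/", 0.5f, 0.8f),
        };
        // b.example wins after re-scoring: 0.5 + 0.8 = 1.3 > 0.9
        System.out.println(rescore(hits, 50)[0].url);
    }
}
```

The point of the sketch is that only the combination and sort are needed per query; pulling the base score directly off each hit avoids building an Explanation (which re-runs the scoring machinery) just to read one float.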
Re: Injector checking for other than STATUS_INJECTED
Gal Nitzan wrote: Hi Andrzej, Does it mean that when you inject a URL that already exists (in crawldb) it changes its status to STATUS_DB_UNFETCHED?

With the current version of Injector - it won't. With previous versions - it might, depending on the order of values received in reduce().

-- Best regards, Andrzej Bialecki (http://www.sigram.com; contact: info at sigram dot com)
RE: Injector checking for other than STATUS_INJECTED
Hi Andrzej, Does it mean that when you inject a URL that already exists (in crawldb) it changes its status to STATUS_DB_UNFETCHED? Gal

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 15, 2007 8:47 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Injector checking for other than STATUS_INJECTED

[EMAIL PROTECTED] wrote:
> Hi All,
> I think I am missing something. In the Injector reduce code we have the following:
>
>   while (values.hasNext()) {
>     CrawlDatum val = (CrawlDatum)values.next();
>     if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
>       injected = val;
>       injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
>     } else {
>       old = val;
>     }
>   }
>
>   CrawlDatum res = null;
>   if (old != null) res = old; // don't overwrite existing value
>   else res = injected;
>
> Basically, if it is not just injected then don't overwrite. But I am not seeing where the input could be such that the CrawlDatum wasn't just injected and could have previous values. Is this just in case someone uses the Injector as a Reducer and not a Mapper, or am I missing how this condition can occur?

This handles an important case, when you inject URLs that already exist in the DB - then you have both the old value and the newly created value under the same key. In previous versions of Injector, CrawlDatum-s for such URLs could be overwritten with new values, and you could lose valuable metadata accumulated in old values.

-- Best regards, Andrzej Bialecki (http://www.sigram.com; contact: info at sigram dot com)
Re: Injector checking for other than STATUS_INJECTED
[EMAIL PROTECTED] wrote: Hi All, I think I am missing something. In the Injector reduce code we have the following:

  while (values.hasNext()) {
    CrawlDatum val = (CrawlDatum)values.next();
    if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
      injected = val;
      injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
    } else {
      old = val;
    }
  }

  CrawlDatum res = null;
  if (old != null) res = old; // don't overwrite existing value
  else res = injected;

Basically, if it is not just injected then don't overwrite. But I am not seeing where the input could be such that the CrawlDatum wasn't just injected and could have previous values. Is this just in case someone uses the Injector as a Reducer and not a Mapper, or am I missing how this condition can occur?

This handles an important case, when you inject URLs that already exist in the DB - then you have both the old value and the newly created value under the same key. In previous versions of Injector, CrawlDatum-s for such URLs could be overwritten with new values, and you could lose valuable metadata accumulated in old values.

-- Best regards, Andrzej Bialecki (http://www.sigram.com; contact: info at sigram dot com)
Injector checking for other than STATUS_INJECTED
Hi All, I think I am missing something. In the Injector reduce code we have the following:

  while (values.hasNext()) {
    CrawlDatum val = (CrawlDatum)values.next();
    if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
      injected = val;
      injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
    } else {
      old = val;
    }
  }

  CrawlDatum res = null;
  if (old != null) res = old; // don't overwrite existing value
  else res = injected;

Basically, if it is not just injected then don't overwrite. But I am not seeing where the input could be such that the CrawlDatum wasn't just injected and could have previous values. Is this just in case someone uses the Injector as a Reducer and not a Mapper, or am I missing how this condition can occur?

Dennis Kubes
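The merge rule in that reduce loop can be boiled down to a tiny self-contained sketch. The Datum class and Status enum below are simplified stand-ins for CrawlDatum, not the real Nutch classes:

```java
/** Minimal stand-in for the Injector reduce logic: a freshly injected
 *  record must never overwrite an existing crawldb entry for the same URL. */
public class InjectorMergeSketch {
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    static class Datum {
        Status status;
        Datum(Status s) { status = s; }
    }

    /** Mirrors the reduce() loop: keep the old value if one is present,
     *  otherwise promote the injected one to DB_UNFETCHED. */
    static Datum merge(Datum[] values) {
        Datum injected = null, old = null;
        for (Datum val : values) {
            if (val.status == Status.INJECTED) {
                injected = val;
                injected.status = Status.DB_UNFETCHED;
            } else {
                old = val;
            }
        }
        return old != null ? old : injected; // don't overwrite existing value
    }

    public static void main(String[] args) {
        // URL already in the db: the existing datum (and its metadata) wins.
        Datum existing = new Datum(Status.DB_FETCHED);
        System.out.println(merge(new Datum[]{ new Datum(Status.INJECTED), existing }) == existing);
        // Brand-new URL: the injected datum is kept, marked unfetched.
        System.out.println(merge(new Datum[]{ new Datum(Status.INJECTED) }).status);
    }
}
```

This makes Andrzej's answer concrete: the "old != null" branch only fires when the same key arrives with both an injected and a pre-existing value, i.e. when you inject a URL that the crawldb already knows about.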
[jira] Commented: (NUTCH-247) robot parser to restrict.
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473295 ]

Dennis Kubes commented on NUTCH-247:

I think the idea here is to NOT allow people to run fetchers for which they haven't configured an agent name and email, etc. There may be a better way to do this than simply logging severe and then stopping. I think it would be best to provide some sort of feedback mechanism to the user, either via the command line or an explicit exception that tells the user to configure the agent name and email in their nutch-*.xml file. If this is the direction that we want to go, I can come up with a patch for this.

> robot parser to restrict.
> Key: NUTCH-247
> URL: https://issues.apache.org/jira/browse/NUTCH-247
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Priority: Minor
> Fix For: 0.9.0
>
> If the agent name and the robots agents are not properly configured, the Robot rules parser uses LOG.severe to log the problem, but it also fixes it. Later on, the fetcher thread checks for severe errors and stops if there is one.
> RobotRulesParser:
>   if (agents.size() == 0) {
>     agents.add(agentName);
>     LOG.severe("No agents listed in 'http.robots.agents' property!");
>   } else if (!((String)agents.get(0)).equalsIgnoreCase(agentName)) {
>     agents.add(0, agentName);
>     LOG.severe("Agent we advertise (" + agentName
>         + ") not listed first in 'http.robots.agents' property!");
>   }
> Fetcher.FetcherThread:
>   if (LogFormatter.hasLoggedSevere()) // something bad happened
>     break;
> I suggest using warn or something similar instead of severe to log this problem.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
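Dennis's "explicit exception" suggestion could look something like the sketch below. This is purely hypothetical, not the actual RobotRulesParser code: the method name and the exact checks are assumptions modeled on the snippet quoted in the issue.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical fail-fast check for the agent configuration, as an
 *  alternative to LOG.severe: throw a descriptive exception so the user
 *  is told exactly what to fix in their nutch-*.xml. */
public class AgentCheckSketch {
    static void checkAgents(List<String> agents, String agentName) {
        if (agents.isEmpty()) {
            throw new IllegalStateException(
                "No agents listed in 'http.robots.agents' property! "
                + "Configure it (and the agent name/email) in your nutch-*.xml file.");
        }
        if (!agents.get(0).equalsIgnoreCase(agentName)) {
            throw new IllegalStateException(
                "Agent we advertise (" + agentName
                + ") not listed first in 'http.robots.agents' property!");
        }
    }

    public static void main(String[] args) {
        try {
            checkAgents(Collections.<String>emptyList(), "MyCrawler");
        } catch (IllegalStateException e) {
            // Startup aborts with a clear message instead of a severe log entry
            // that silently halts the fetcher thread later.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The trade-off versus the current code is that the exception refuses to start at all, whereas the quoted snippet quietly repairs the agents list while still tripping the fetcher's hasLoggedSevere() check - the worst of both behaviors.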
Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue
It may fix the problem, or it may not. There have been many changes in Hadoop since 0.4; I think they are now on 0.11.x. So if you are upgrading existing DFS implementations that currently have content, that is something to take into consideration. That being said, the changes in Hadoop from 0.4 to present may very well have fixed the error you are seeing, and to use the most recent version of Hadoop you will need the NUTCH-437 patch. Looking at your output below, though, my first thought would be that this is something in the PDF parser, not Hadoop, causing the error. Nutch uses the PDFBox software to parse PDF files, so you may want to take the specific file and see if it parses correctly outside of Nutch using PDFBox.

Dennis Kubes

Armel T. Nene wrote: Dennis, I was wondering if this patch could fix my problem which is, if not the same, very similar to this one. I am using Nutch 0.8.2-dev; I made a checkout awhile ago from SVN but never updated again. I was able to crawl 1 xml files before with no error whatsoever.
These are the errors that I get when I'm fetching:

INFO parser.custom: Custom-parse: Parsing content file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher caught:java.lang.NullPointerException

One of the problems is that my hadoop version says the following: hadoop-0.4.0-patched. Now I don't know if that means I am running the 0.4.0 version, but it seems a little bit confusing. Once you clarify that for me, I will be able to apply the patch to my version.

Best Regards, Armel

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: 13 February 2007 21:09
To: nutch-dev@lucene.apache.org
Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

Actually I take it back. I don't think it is the same problem, but I do think it is the right solution. Dennis Kubes

Dennis Kubes wrote: This has to do with HADOOP-964. Replace the jar files in your Nutch version with the most recent versions from Hadoop. You will also need to apply the NUTCH-437 patch to get Nutch to work with the most recent changes to the Hadoop codebase.
Dennis Kubes

Gal Nitzan wrote: Hi, Does anybody use Nutch trunk? I am running Nutch 0.9 and unable to fetch: after 50-60K urls I get an NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time. I was wondering if anyone has a workaround, or maybe something is wrong with my setup. I have opened a new issue in JIRA for this: http://issues.apache.org/jira/browse/hadoop-1008. Any clue? Gal
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-443:

Attachment: NUTCH-443-draft-v7.patch

> allow parsers to return multiple Parse object, this will speed up the rss parser
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
> Priority: Minor
> Fix For: 0.9.0
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
> allow Parser#parse to return a Map. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473184 ]

Doğacan Güney commented on NUTCH-443:

Andrzej: Why does the fetcher need to synchronize? Why does the order in which the fetcher outputs pairs matter?

Sami:
> I opened an issue for this NUTCH-434 and I am now recommending that the patch in this issue doesn't try to take the world in one piece :)

Right. I just realized just how much this patch changes, and most of it is not necessary for the proposed API change. So I am going to post a version that uses ObjectWritable in Fetcher, doesn't remove FetcherOutputFormat, and only changes parse-rss so that it works with the new API (sorry about that Renaud, but parse-rss can be updated after this patch).
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473148 ]

Sami Siren commented on NUTCH-443:

>> Didn't know this, will change this too. (Why is Nutch not using this class in Indexer?)
> Inertia, and lack of committer time ... ;)

IIRC you actually cannot use GenericWritable, because it requires the wrapped objects to be Writables, and Lucene objects obviously aren't. But you can imitate it and make a similar object capable of storing Objects (as those writables are not persisted in the indexer). I opened an issue for this, NUTCH-434, and I am now recommending that the patch in this issue doesn't try to take the world in one piece :)
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473147 ]

Doğacan Güney commented on NUTCH-443:

> Hmm, actually this is an important question. I don't think FetcherOutput is persisted anywhere, it's just an aggregate class to keep things together before they hit the disk. I propose to leave a comment in MapWritable like this: "// code -123 was reserved for FetcherOutput - no longer in use". As for the class itself - again, since it's not persisted we don't have to keep it around, just remove it.

I implemented this approach in one of the earlier patches. The problem is that the code in MapWritable does this:

  addIdEntry((byte) (-128 + CLASS_ID_MAP.size() + ++fIdCount), // ...

Now, I don't claim to understand the code perfectly, but because of the "-128 + CLASS_ID_MAP.size()" part I think CLASS_ID_MAP must always have consecutive values, so not having -123 breaks it. IIRC, removing that line and running TestMapWritable fails.

> Sections in Fetcher.FetcherThread.output() and similar in Fetcher2 that output the data need to be synchronized now - output.collect() is no longer a single atomic operation. Perhaps it's better to leave FetcherOutput after all?

This causes key ordering problems. See my admittedly-could-have-been-clearer 2nd comment. Anyway, I am assuming that you are OK with removing ParseUtil.getFirstParseEntry and just using Map.get?
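Doğacan's point about consecutive class IDs can be seen with a few bytes of arithmetic. The following is a deliberately simplified model of the ID-assignment scheme (assigning ids as -128 plus the number of entries so far), not the real MapWritable code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of a MapWritable-style class-id scheme: each class
 *  gets the id (-128 + number of entries so far), so the id space is
 *  consecutive by construction. Dropping one class (e.g. the one that
 *  held -123) silently shifts every later id, so data written with the
 *  old table no longer maps to the right classes. */
public class ClassIdSketch {
    static Map<Byte, String> assignIds(String[] classes) {
        Map<Byte, String> idMap = new LinkedHashMap<>();
        for (String c : classes) {
            byte id = (byte) (-128 + idMap.size()); // consecutive ids, no gaps possible
            idMap.put(id, c);
        }
        return idMap;
    }

    public static void main(String[] args) {
        Map<Byte, String> withFO = assignIds(
            new String[]{"NullWritable", "FetcherOutput", "ParseData"});
        Map<Byte, String> withoutFO = assignIds(
            new String[]{"NullWritable", "ParseData"});
        // With FetcherOutput present, ParseData gets id -126 ...
        System.out.println(withFO.get((byte) -126));
        // ... remove FetcherOutput and ParseData is reassigned to -127,
        // so old data stamped with -126 now resolves to nothing:
        System.out.println(withoutFO.get((byte) -126));
    }
}
```

This is why "just remove the -123 entry" breaks TestMapWritable: the scheme has no way to represent a hole in the id range, which is exactly the compatibility constraint Andrzej raised.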
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473141 ]

Andrzej Bialecki commented on NUTCH-443:

> Didn't know this, will change this too. (Why is Nutch not using this class in Indexer?)

Inertia, and lack of committer time ... ;)

> Since this patch removes the FetcherOutput class, what to put there instead of it?

Hmm, actually this is an important question. I don't think FetcherOutput is persisted anywhere, it's just an aggregate class to keep things together before they hit the disk. I propose to leave a comment in MapWritable like this: "// code -123 was reserved for FetcherOutput - no longer in use". As for the class itself - again, since it's not persisted we don't have to keep it around, just remove it.

Sections in Fetcher.FetcherThread.output() and similar in Fetcher2 that output the data need to be synchronized now - output.collect() is no longer a single atomic operation. Perhaps it's better to leave FetcherOutput after all?
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473129 ]

Doğacan Güney commented on NUTCH-443:

Andrzej: Thanks for taking the time to review this.

> The contract for ParseUtil.getFirstParseEntry() seems unclear - since in most cases this is a HashMap, there is no predictable way to get the first entry added to the map ... I propose also that we should use a specialized class instead of general-purpose Map; and then we can record in that class which entry was the first.

ParseUtil.getFirstParseEntry is only a convenience method used by plugins to get the first (and only) entry in a map when it knows that it will create a one-entry parse map (with the original url as the key), and it is mostly used in a plugin's main method to get the parse and print it. It is not used in any core part of Nutch. Anyway, it is very incorrectly named. What we meant was ParseUtil.getOnlyParseEntry. Hmm, that doesn't make any sense either :D Instead of creating a specialized class, how about removing the method and just using parseMap.get(key)? Most plugins will use it like parseMap.get(content.getUrl()).

> Also, the naming of some methods seems a bit awkward - why should we insist that we createSingleEntryMap while we create an ordinary Map, and we don't use this special-case knowledge later? I suggest to simply name it createParseMap.

You are right, I will change this in the next patch.

> In recent versions of Hadoop there is a GenericWritable class - it replaces ObjectWritable when classes are known in advance, and provides a more compact representation.

Didn't know this, will change this too. (Why is Nutch not using this class in Indexer?)

> Changes to MapWritable must preserve old code values, at most adding some new ones - otherwise the new code will get confused when working with older data.

I see your point, but I am not sure how to fix this. Since this patch removes the FetcherOutput class, what to put there instead of it? I guess we can just keep FetcherOutput as it is, and update its javadoc to reflect the fact that it is not used anymore.

> CrawlDbReducer, TODO item: this should be the time stored under Nutch.FETCH_TIME_KEY, no?
> If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

I will remove the TODO item and fix the imports in the next patch.
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473114 ]

Andrzej Bialecki commented on NUTCH-443:

The contract for ParseUtil.getFirstParseEntry() seems unclear - since in most cases this is a HashMap, there is no predictable way to get the first entry added to the map ... I propose also that we should use a specialized class instead of a general-purpose Map; then we can record in that class which entry was the first.

Also, the naming of some methods seems a bit awkward - why should we insist that we createSingleEntryMap while we create an ordinary Map, and we don't use this special-case knowledge later? I suggest to simply name it createParseMap.

In recent versions of Hadoop there is a GenericWritable class - it replaces ObjectWritable when classes are known in advance, and provides a more compact representation.

Changes to MapWritable must preserve old code values, at most adding some new ones - otherwise the new code will get confused when working with older data.

CrawlDbReducer, TODO item: this should be the time stored under Nutch.FETCH_TIME_KEY, no?

If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

The new model for returning results from parse plugins allows a much better approach to parsing archives (e.g. zip files) containing multiple documents in supported formats - although this should be a separate patch.
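Andrzej's objection to getFirstParseEntry() is easy to demonstrate in isolation: a plain HashMap has no notion of "first inserted" entry, while a LinkedHashMap preserves insertion order. A generic sketch (the URLs and values are made up for illustration):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Demonstrates why the "first entry" of a HashMap is unreliable:
 *  iteration order follows hash buckets, not insertion order. A
 *  LinkedHashMap preserves insertion order, so "first entry" is
 *  well defined there. */
public class FirstEntrySketch {
    static <K, V> K firstKey(Map<K, V> map) {
        return map.entrySet().iterator().next().getKey();
    }

    public static void main(String[] args) {
        Map<String, String> linked = new LinkedHashMap<>();
        linked.put("http://feed.example/item-2", "parse2");
        linked.put("http://feed.example/item-1", "parse1");
        // LinkedHashMap: the first inserted key comes back first, guaranteed.
        System.out.println(firstKey(linked));

        Map<String, String> hashed = new HashMap<>(linked);
        // HashMap: iteration order is unspecified - whatever prints here
        // is an accident of hashing, and must not be relied on.
        System.out.println(firstKey(hashed));
    }
}
```

So either the parse map needs an order-preserving implementation (or Andrzej's specialized class that records the first entry), or - as Doğacan suggests in his reply - callers should look entries up by key and drop the "first entry" notion entirely.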
[jira] Commented: (NUTCH-437) MapFile in Hadoop Trunk has changed, must update references
[ https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473064 ]

Armel Nene commented on NUTCH-437:

I was wondering if this patch could fix my problem which is, if not the same, very similar to this one. I am using Nutch 0.8.2-dev; I made a checkout awhile ago from SVN but never updated again. I was able to crawl 1 xml files before with no error whatsoever. These are the errors that I get when I'm fetching:

INFO parser.custom: Custom-parse: Parsing content file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher caught:java.lang.NullPointerException

One of the problems is that my hadoop version says the following: hadoop-0.4.0-patched. Now I don't know if that means I am running the 0.4.0 version, but it seems a little bit confusing. Once you clarify that for me, I will be able to apply the patch to my version.

Best Regards, Armel

> MapFile in Hadoop Trunk has changed, must update references
> Key: NUTCH-437
> URL: https://issues.apache.org/jira/browse/NUTCH-437
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8.2, 0.9.0
> Environment: windows xp and java
> Reporter: Dennis Kubes
> Assigned To: Andrzej Bialecki
> Fix For: 0.8.2, 0.9.0
> Attachments: nutch-hadoop-0.10.2-mapfile.patch
>
> The MapFile.Writer signature has changed in hadoop trunk (version 10.x +) to include a Configuration object. Objects in the Nutch codebase that reference MapFile.Writer will need to be updated.