[jira] Closed: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Andrzej Bialecki closed NUTCH-354.
---
Resolution: Fixed

Applied to trunk and branch-0.8 - thanks! It would be good to have a specific JUnit test case for this.

> MapWritable, nextEntry is not reset when Entries are recycled
> --
>
> Key: NUTCH-354
> URL: http://issues.apache.org/jira/browse/NUTCH-354
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.8.1, 0.9.0
>
> Attachments: resetNextEntryInMapWritableV1.patch
>
>
> MapWritable recycles entries from its internal linked list for performance
> reasons. The nextEntry of an entry is not reset when a recyclable entry
> is found. This can cause wrong data in a MapWritable.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Stefan Groschupf updated NUTCH-354:
---
Attachment: resetNextEntryInMapWritableV1.patch

Resets the nextEntry of a recycled entry.

> MapWritable, nextEntry is not reset when Entries are recycled
> --
>
> Key: NUTCH-354
> URL: http://issues.apache.org/jira/browse/NUTCH-354
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.9.0, 0.8.1
>
> Attachments: resetNextEntryInMapWritableV1.patch
>
>
> MapWritable recycles entries from its internal linked list for performance
> reasons. The nextEntry of an entry is not reset when a recyclable entry
> is found. This can cause wrong data in a MapWritable.
[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
MapWritable, nextEntry is not reset when Entries are recycled
---
Key: NUTCH-354
URL: http://issues.apache.org/jira/browse/NUTCH-354
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8.1, 0.9.0

MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset when a recyclable entry is found. This can cause wrong data in a MapWritable.
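The failure mode described in this issue can be illustrated with a small, self-contained sketch. This is not the Nutch MapWritable code and not the attached patch; the class `RecyclingEntryList`, its fields, and the `resetNext` switch are hypothetical names chosen for illustration. It only shows why a recycled entry that keeps its old `next` link drags stale entries back into the live list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a container kept as a singly linked list that recycles its entries. */
class RecyclingEntryList {

    /** One node of the list; 'next' links both the live chain and the recycle chain. */
    static final class Entry {
        String key;
        String value;
        Entry next;
    }

    private final boolean resetNext; // true = cut the stale link when an entry is reused
    private Entry first;             // head of the live list
    private Entry last;              // tail of the live list
    private Entry recycled;          // head of the chain of reusable entries

    RecyclingEntryList(boolean resetNext) {
        this.resetNext = resetNext;
    }

    /** Appends a key/value pair, reusing an old entry when one is available. */
    void put(String key, String value) {
        Entry e = obtainEntry();
        e.key = key;
        e.value = value;
        if (first == null) {
            first = e;
        } else {
            last.next = e;
        }
        last = e;
    }

    /** Empties the live list but keeps its entries around for reuse. */
    void clear() {
        if (first == null) {
            return;
        }
        last.next = recycled;
        recycled = first;
        first = null;
        last = null;
    }

    /** Takes an entry from the recycle chain, or allocates a new one. */
    private Entry obtainEntry() {
        if (recycled == null) {
            return new Entry();
        }
        Entry e = recycled;
        recycled = e.next;
        if (resetNext) {
            e.next = null; // without this, e still points into the old chain
        }
        return e;
    }

    /** Walks the live list from head to tail. */
    List<String> keys() {
        List<String> out = new ArrayList<>();
        for (Entry e = first; e != null; e = e.next) {
            out.add(e.key);
        }
        return out;
    }

    /** Fills the list, clears it, adds one entry, and reports what the list now contains. */
    static List<String> demo(boolean resetNext) {
        RecyclingEntryList list = new RecyclingEntryList(resetNext);
        list.put("a", "1");
        list.put("b", "2");
        list.put("c", "3");
        list.clear();
        list.put("x", "9");
        return list.keys();
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // [x, b, c]  stale entries reappear
        System.out.println(demo(true));  // [x]
    }
}
```

A JUnit test along the lines requested in the closing comment could assert exactly this: after a clear and a single put, only the new key is visible.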
show new data in search result page
Hi there,

I wonder if Nutch has the flexibility to show more parsed information. In my case, I carry extra information in the CrawlDatum, such as a company name, into the parsed segment. I would then like this information to be carried into the Lucene index and shown on the search result page.

For example, I saw that Summarizer is a factory that shows segment text and that Nutch calls its real class in Lucene. Does that mean I have to add or modify code in Lucene instead of Nutch?

Thanks for your suggestions,
Michael
Re: Thoughts on Parser design and dependencies
Jukka Zitting wrote:
> Hi,
>
> On 8/19/06, Sami Siren <[EMAIL PROTECTED]> wrote:
>> So far Nutch has been built to deal mainly with text documents.
>> There is, however, also a need to deal with non-textual objects, e.g. images,
>> movies, and sound, which will provide content only in the form of metadata
>> (OK, perhaps also some text about the context of the object, if applicable),
>> so the metadata names we have today are only a subset of what might be needed.
>> I really would not want to restrict the metadata the interface can carry to
>> a fixed set.
>
> But if it's an open Map, how do you index and search using that, i.e.
> what is the mapping between the Map keys used by a parser component
> and the field names in the resulting Lucene index? How do we enforce
> that an MPEG parser uses the same Map keys as a JPEG parser when
> encountering metadata with the same semantics?
>
> I'm not opposed to using a Map for truly variable metadata, like HTML
> tags with unknown names, but if we want common handling for, say,
> Dublin Core metadata, it would be better to enforce that at the
> interface level.

Well, Nutch already does this in a way, but it's a "soft" endorsement rather than a hard enforcement ;)

We define keys for all common metadata sets (DC, Office, HttpHeaders), and plugin writers are supposed to use them unless they can't find any metadata key with matching semantics. Indexing plugins then expect certain metadata to be available under these keys, and create the appropriate Lucene fields, again using predefined field names.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
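To make the "soft" convention concrete, here is a minimal sketch of two parsers that agree on shared key constants, and an indexing step that maps those keys to predefined field names. All names here (`SharedKeysSketch`, `Keys.DC_TITLE`, the key strings, the field names, `parseJpeg`, `parseMpeg`) are hypothetical and are not the actual Nutch constants or plugin APIs; the index document is modelled as a plain map rather than a Lucene document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SharedKeysSketch {

    /** Hypothetical shared key constants that every parser is expected to use. */
    static final class Keys {
        static final String DC_TITLE = "dc.title";
        static final String DC_CREATOR = "dc.creator";

        private Keys() {
        }
    }

    /** Hypothetical, predefined mapping from metadata key to index field name. */
    static final Map<String, String> FIELD_FOR_KEY = Map.of(
            Keys.DC_TITLE, "title",
            Keys.DC_CREATOR, "author");

    /** Toy JPEG parser: uses the shared keys plus one format-specific key. */
    static Map<String, String> parseJpeg() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(Keys.DC_TITLE, "Sunset");
        metadata.put(Keys.DC_CREATOR, "Jane Doe");
        metadata.put("jpeg.orientation", "landscape");
        return metadata;
    }

    /** Toy MPEG parser: same semantics, so it uses the same shared key. */
    static Map<String, String> parseMpeg() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(Keys.DC_TITLE, "Holiday video");
        metadata.put("mpeg.duration", "93s");
        return metadata;
    }

    /** Indexing step: copies only metadata stored under known keys into index fields. */
    static Map<String, String> toIndexFields(Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            String field = FIELD_FOR_KEY.get(e.getKey());
            if (field != null) {
                fields.put(field, e.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toIndexFields(parseJpeg())); // {title=Sunset, author=Jane Doe}
        System.out.println(toIndexFields(parseMpeg())); // {title=Holiday video}
    }
}
```

Nothing stops a parser from inventing its own key string here, which is the weakness of the soft approach: such metadata is silently skipped at indexing time instead of failing at compile time.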
Re: Thoughts on Parser design and dependencies
Hi,

On 8/19/06, Sami Siren <[EMAIL PROTECTED]> wrote:
> So far Nutch has been built to deal mainly with text documents.
> There is, however, also a need to deal with non-textual objects, e.g. images,
> movies, and sound, which will provide content only in the form of metadata
> (OK, perhaps also some text about the context of the object, if applicable),
> so the metadata names we have today are only a subset of what might be needed.
> I really would not want to restrict the metadata the interface can carry to
> a fixed set.

But if it's an open Map, how do you index and search using that, i.e. what is the mapping between the Map keys used by a parser component and the field names in the resulting Lucene index? How do we enforce that an MPEG parser uses the same Map keys as a JPEG parser when encountering metadata with the same semantics?

I'm not opposed to using a Map for truly variable metadata, like HTML tags with unknown names, but if we want common handling for, say, Dublin Core metadata, it would be better to enforce that at the interface level.

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development
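One way to read the proposal above is a typed interface for the well-bounded part of the metadata, with an open map kept only for the truly variable remainder. The sketch below assumes that reading; the interfaces `DublinCore` and `ParsedMetadata` and the class `SimpleMetadata` are hypothetical and cover only two Dublin Core elements for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TypedMetadataSketch {

    /** Hypothetical compile-time contract for a few Dublin Core elements. */
    interface DublinCore {
        String getTitle();

        String getCreator();
    }

    /** Fixed elements plus an open map for metadata whose names are not known in advance. */
    interface ParsedMetadata extends DublinCore {
        Map<String, String> getOther();
    }

    /** Simple value holder that any parser could return. */
    static final class SimpleMetadata implements ParsedMetadata {
        private final String title;
        private final String creator;
        private final Map<String, String> other = new LinkedHashMap<>();

        SimpleMetadata(String title, String creator) {
            this.title = title;
            this.creator = creator;
        }

        @Override
        public String getTitle() {
            return title;
        }

        @Override
        public String getCreator() {
            return creator;
        }

        @Override
        public Map<String, String> getOther() {
            return other;
        }
    }

    /** A consumer can only ask for elements the interface declares; a misspelled name will not compile. */
    static String describe(DublinCore dc) {
        return dc.getTitle() + " by " + dc.getCreator();
    }

    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata("Sunset", "Jane Doe");
        metadata.getOther().put("x-custom-tag", "42"); // variable metadata stays in the map
        System.out.println(describe(metadata)); // Sunset by Jane Doe
    }
}
```

The trade-off is the one debated in this thread: the typed part is checked by the compiler, but every new metadata element with shared semantics requires an interface change.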
Re: Thoughts on Parser design and dependencies
Jukka Zitting wrote:
> Hi,
>
> On 8/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> A very important aspect of the Parser interface (or actually, the
>> Parse and Content classes) is that they each may contain arbitrary
>> metadata. This is required for discovering and passing around both the
>> original metadata (such as protocol headers, document properties, etc),
>> and other secondary content (such as data from external sources, or
>> derived metadata).
>
> Is there a list of all the different metadata items that get passed in
> or out of the parser components? My hunch is that the list of items is
> relatively short and that even though different parsers might input or
> output different metadata, it still might make sense to come up with a
> general content model that serves the needs of everyone.
>
>> Simply returning a String doesn't cut it. Returning a java.util.Map may
>> be an option, if you use standard Metadata constants as keys - still,
>> Nutch would have to repackage this anyway into a Writable. And we would
>> lose a nice property of the current Metadata class, which is the ability
>> to tolerate minor syntax variations and to store multiple values per key.
>
> The problem I see with a Map or a similar keyed solution is that you
> only get to specify the metadata contract as documented (if ever)
> keys instead of as a compile-time interface. Using a Map is fine if
> the set of managed information truly varies at runtime, but not when
> the set is fixed or at least well bounded.

So far Nutch has been built to deal mainly with text documents. There is, however, also a need to deal with non-textual objects, e.g. images, movies, and sound, which will provide content only in the form of metadata (OK, perhaps also some text about the context of the object, if applicable), so the metadata names we have today are only a subset of what might be needed. I really would not want to restrict the metadata the interface can carry to a fixed set.

--
Sami Siren
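The two properties of the Metadata class mentioned in the quoted text, tolerance for minor syntax variations and multiple values per key, are what a plain java.util.Map does not give for free. The sketch below shows one simple way to get both. It is not the Nutch Metadata implementation: the class name `TolerantMetadataSketch` is hypothetical, and the normalization rule (lower-case, keep only letters and digits) is an assumption chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TolerantMetadataSketch {

    // Normalized name -> all values added under any spelling of that name.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Assumed normalization: lower-case the name and drop everything except letters and digits. */
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Adds a value without replacing earlier ones stored under the same name. */
    void add(String name, String value) {
        values.computeIfAbsent(normalize(name), k -> new ArrayList<>()).add(value);
    }

    /** Returns all values for the name, or an empty list if there are none. */
    List<String> getValues(String name) {
        return values.getOrDefault(normalize(name), List.of());
    }

    public static void main(String[] args) {
        TolerantMetadataSketch metadata = new TolerantMetadataSketch();
        metadata.add("Content-Type", "text/html");
        metadata.add("content_type", "text/html; charset=UTF-8");
        System.out.println(metadata.getValues("CONTENT TYPE"));
        // [text/html, text/html; charset=UTF-8]
    }
}
```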
Re: Thoughts on Parser design and dependencies
Hi,

On 8/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> A very important aspect of the Parser interface (or actually, the
> Parse and Content classes) is that they each may contain arbitrary
> metadata. This is required for discovering and passing around both the
> original metadata (such as protocol headers, document properties, etc),
> and other secondary content (such as data from external sources, or
> derived metadata).

Is there a list of all the different metadata items that get passed in or out of the parser components? My hunch is that the list of items is relatively short and that even though different parsers might input or output different metadata, it still might make sense to come up with a general content model that serves the needs of everyone.

> Simply returning a String doesn't cut it. Returning a java.util.Map may
> be an option, if you use standard Metadata constants as keys - still,
> Nutch would have to repackage this anyway into a Writable. And we would
> lose a nice property of the current Metadata class, which is the ability
> to tolerate minor syntax variations and to store multiple values per key.

The problem I see with a Map or a similar keyed solution is that you only get to specify the metadata contract as documented (if ever) keys instead of as a compile-time interface. Using a Map is fine if the set of managed information truly varies at runtime, but not when the set is fixed or at least well bounded.

Another concern with both the Parse class in Nutch and my TextExtractor interface is that the body content is returned as a single text stream (a String and a Reader respectively). This doesn't allow the parser to pass along extra information like the emphasis of certain parts (think of headings or links in HTML) or the language of the text (e.g. xml:lang).

I'm not familiar enough with Lucene to know whether it could use such information, so this might be a YAGNI, but inversion of control with a Builder interface would be a pretty powerful solution for passing such information.

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development
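The Builder idea mentioned above could look roughly like the following sketch: instead of returning one String, the parser calls back into a builder supplied by the caller and reports headings, links, and the language of each text run as it encounters them. The interface `ContentBuilder`, its method names, and the toy `parse` method are hypothetical and are not part of Nutch or of the TextExtractor interface discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

class ContentBuilderSketch {

    /** Hypothetical callback interface driven by the parser (inversion of control). */
    interface ContentBuilder {
        void heading(String text);

        void text(String text, String language);

        void link(String target, String anchorText);
    }

    /** Toy parser: pushes structured events to the builder instead of returning a String. */
    static void parse(ContentBuilder builder) {
        builder.heading("Release notes");
        builder.text("Entries are now reset when recycled.", "en");
        builder.link("http://example.org/changes", "full change log");
    }

    /** One possible consumer: records events so that a later step could weight them differently. */
    static final class RecordingBuilder implements ContentBuilder {
        final List<String> events = new ArrayList<>();

        @Override
        public void heading(String text) {
            events.add("heading:" + text);
        }

        @Override
        public void text(String text, String language) {
            events.add("text[" + language + "]:" + text);
        }

        @Override
        public void link(String target, String anchorText) {
            events.add("link:" + anchorText + " -> " + target);
        }
    }

    public static void main(String[] args) {
        RecordingBuilder builder = new RecordingBuilder();
        parse(builder);
        builder.events.forEach(System.out::println);
    }
}
```

A consumer that does not care about emphasis or language can simply concatenate the text it receives, so the plain single-stream behaviour remains available as one builder implementation among others.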