Re: Thoughts on Parser design and dependencies
Jukka Zitting wrote:
> Hi,
>
> On 8/19/06, Sami Siren [EMAIL PROTECTED] wrote:
>> So far Nutch has been built to deal mainly with text-type documents. However, there is also a need to deal with non-textual objects, e.g. images, movies, and sound, which will provide content only in the form of metadata (OK, perhaps also some text about the context of the object, if applicable), so the metadata names we have today are only a subset of what might be needed. I really would not want to restrict the metadata the interface can carry to a fixed set.
>
> But if it's an open Map, how do you index and search using that, i.e. what is the mapping between the Map keys used by a parser component and the field names in the resulting Lucene index? How do we enforce that an MPEG parser uses the same Map keys as a JPEG parser when encountering metadata with the same semantics?
>
> I'm not opposed to using a Map for truly variable metadata, like HTML <meta/> tags with unknown names, but if we want common handling, for example for Dublin Core metadata, it would be better to enforce that at the interface level.

Well, Nutch already does this in a way, but it's a soft endorsement rather than a hard enforcement .. ;)

We define keys for all common metadata sets (DC, Office, HttpHeaders), and plugin writers are supposed to use them, unless they can't find any metadata key with matching semantics. Then, other indexing plugins expect certain metadata to be available under these keys, and create appropriate Lucene fields, again using predefined field names.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
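To make the "soft endorsement" described above concrete, here is a minimal Java sketch of the convention: parsers for different media types agree on shared key constants, and an indexing step maps known keys to predefined field names. All class, key, and field names here (`DublinCoreKeys`, `dc.creator`, `author`, and so on) are hypothetical illustrations, not the actual Nutch classes or constants, and a plain `Map` stands in for the Lucene document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataKeysSketch {

    // Hypothetical shared key constants that every parser plugin is expected to reuse.
    static final class DublinCoreKeys {
        static final String TITLE = "dc.title";
        static final String CREATOR = "dc.creator";
        private DublinCoreKeys() {}
    }

    // Hypothetical mapping from shared metadata keys to predefined index field names.
    static final Map<String, String> FIELD_FOR_KEY = new LinkedHashMap<>();
    static {
        FIELD_FOR_KEY.put(DublinCoreKeys.TITLE, "title");
        FIELD_FOR_KEY.put(DublinCoreKeys.CREATOR, "author");
    }

    // A JPEG parser: uses the shared key for the creator, plus one format-specific key.
    static Map<String, String> parseJpeg() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(DublinCoreKeys.CREATOR, "Alice");
        metadata.put("jpeg.exif.iso", "200"); // truly variable metadata, no shared key
        return metadata;
    }

    // An MPEG parser: same semantics, therefore the same shared keys.
    static Map<String, String> parseMpeg() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(DublinCoreKeys.TITLE, "Clip");
        metadata.put(DublinCoreKeys.CREATOR, "Bob");
        return metadata;
    }

    // Indexing step: only metadata stored under a known key becomes an index field.
    static Map<String, String> toIndexFields(Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            String fieldName = FIELD_FOR_KEY.get(e.getKey());
            if (fieldName != null) {
                fields.put(fieldName, e.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toIndexFields(parseJpeg()));
        System.out.println(toIndexFields(parseMpeg()));
    }
}
```

Nothing in this sketch stops a parser from inventing its own key for the creator, which is exactly the gap between a convention and the interface-level enforcement asked for above.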
show new data in search result page
Hi there,

I wonder whether Nutch is flexible enough to show more parsed information. In my case, I will carry extra information, such as a company name, in crawlDatum through to the parsed segment. I would then like this information to be carried into the Lucene index and shown on the search result page.

For example, I saw that Summarizer is a factory that shows segment text, and that Nutch calls its real class in Lucene. Does that mean I have to add or modify code in Lucene instead of Nutch?

Thanks for your suggestions,
Michael
[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
MapWritable, nextEntry is not reset when Entries are recycled
-------------------------------------------------------------

                 Key: NUTCH-354
                 URL: http://issues.apache.org/jira/browse/NUTCH-354
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8
            Reporter: Stefan Groschupf
            Priority: Blocker
             Fix For: 0.8.1, 0.9.0

MapWritable recycles entries from its internal linked list for performance reasons. The nextEntry of an entry is not reset when a recyclable entry is found and reused. This can cause wrong data in a MapWritable.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
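The failure mode described in this issue can be illustrated with a simplified Java model. This is a sketch under stated assumptions, not the actual MapWritable code or the attached patch: the class `RecyclingListSketch`, its fields, and the `resetNext` switch are all hypothetical, and the switch exists only so that both behaviours can be compared side by side.

```java
public class RecyclingListSketch {

    // One node of the internal singly linked list.
    static final class Entry {
        String key;
        String value;
        Entry next;
    }

    private final boolean resetNext; // hypothetical switch: true = fixed, false = buggy
    private Entry head;              // first live entry
    private Entry tail;              // last live entry
    private Entry free;              // chain of recycled entries
    private int size;

    RecyclingListSketch(boolean resetNext) {
        this.resetNext = resetNext;
    }

    void put(String key, String value) {
        Entry e = obtainEntry();
        e.key = key;
        e.value = value;
        if (head == null) {
            head = e;
        } else {
            tail.next = e;
        }
        tail = e;
        size++;
    }

    // Moves the whole live chain onto the free chain instead of discarding it.
    void clear() {
        if (tail != null) {
            tail.next = free;
            free = head;
        }
        head = null;
        tail = null;
        size = 0;
    }

    // Reuses a recycled entry when one is available.
    private Entry obtainEntry() {
        if (free == null) {
            return new Entry();
        }
        Entry e = free;
        free = e.next;
        if (resetNext) {
            e.next = null; // the fix: detach the recycled entry from the old chain
        }
        return e;
    }

    int size() {
        return size;
    }

    // Counts the entries actually reachable from head.
    int countByTraversal() {
        int n = 0;
        for (Entry e = head; e != null; e = e.next) {
            n++;
        }
        return n;
    }

    static RecyclingListSketch demo(boolean resetNext) {
        RecyclingListSketch list = new RecyclingListSketch(resetNext);
        list.put("a", "1");
        list.put("b", "2");
        list.put("c", "3");
        list.clear();
        list.put("x", "9");
        return list;
    }

    public static void main(String[] args) {
        // With the reset, exactly one entry is reachable after the refill.
        System.out.println("fixed: " + demo(true).countByTraversal());
        // Without it, the stale entries are still linked behind the recycled one.
        System.out.println("buggy: " + demo(false).countByTraversal());
    }
}
```

In the buggy variant `size()` reports 1 while a traversal still reaches the stale entries for `b` and `c`, which is one way a recycled entry can produce wrong data.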
[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Stefan Groschupf updated NUTCH-354:
-----------------------------------

    Attachment: resetNextEntryInMapWritableV1.patch

Resets the next Entry of a recycled entry.
[jira] Closed: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Andrzej Bialecki closed NUTCH-354.
----------------------------------

    Resolution: Fixed

Applied to trunk and branch-0.8 - thanks! It would be good to have a specific JUnit test case for this.