Re: Thoughts on Parser design and dependencies

2006-08-19 Thread Andrzej Bialecki

Jukka Zitting wrote:

Hi,

On 8/19/06, Sami Siren [EMAIL PROTECTED] wrote:

So far Nutch has been built to deal mainly with text documents.
There is, however, also a need to deal with non-textual objects, e.g. images,
movies, and sound, which will provide content only in the form of metadata (OK,
perhaps also some text about the context of the object, if applicable), so
the metadata names we have today are only a subset of what might be needed.

I really would not want to restrict the metadata the interface can carry
to a fixed set.


But if it's an open Map, how do you index and search using that, i.e.
what is the mapping between the Map keys used by a parser component
and the field names in the resulting Lucene index? How do we enforce
that an MPEG parser uses the same Map keys as a JPEG parser when
encountering metadata with the same semantics?

I'm not opposed to using a Map for truly variable metadata, like HTML
<meta/> tags with unknown names, but if we want common handling, for
example for Dublin Core metadata, it would be better to enforce that
at the interface level.


Well, Nutch already does this in a way, but it's a soft endorsement 
rather than a hard enforcement .. ;) We define keys for all common 
metadata sets (DC, Office, HttpHeaders), and plugin writers are supposed 
to use them, unless they can't find any metadata key with matching 
semantics.


Then, other indexing plugins expect certain metadata to be available 
under these keys, and create appropriate Lucene fields, again using 
predefined field names.
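
As a rough illustration of the convention described above (a sketch only, not actual Nutch code): the class name, the key constants, the field names, and the two toy parsers below are all hypothetical. It shows two parsers reusing shared metadata keys, and an indexing step that maps those keys to predefined field names while ignoring keys it does not know.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataMappingSketch {

    // Hypothetical shared key constants that every parser is expected to reuse.
    static final String DC_TITLE = "dc.title";
    static final String DC_CREATOR = "dc.creator";

    // Hypothetical mapping from shared metadata keys to index field names.
    static final Map<String, String> FIELD_FOR_KEY = new LinkedHashMap<>();
    static {
        FIELD_FOR_KEY.put(DC_TITLE, "title");
        FIELD_FOR_KEY.put(DC_CREATOR, "author");
    }

    // Toy "JPEG parser": uses the shared keys plus one parser-specific key.
    static Map<String, String> parseJpeg() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(DC_TITLE, "Sunset");
        metadata.put(DC_CREATOR, "Alice");
        metadata.put("x-camera-model", "Example 100"); // no shared key fits
        return metadata;
    }

    // Toy "MPEG parser": same semantics, therefore the same shared keys.
    static Map<String, String> parseMpeg() {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(DC_TITLE, "Holiday clip");
        metadata.put(DC_CREATOR, "Bob");
        return metadata;
    }

    // Indexing step: only metadata stored under a known key becomes a field.
    static Map<String, String> toIndexFields(Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            String field = FIELD_FOR_KEY.get(e.getKey());
            if (field != null) {
                fields.put(field, e.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(toIndexFields(parseJpeg()));
        System.out.println(toIndexFields(parseMpeg()));
    }
}
```

Because both parsers store the title under the same constant, the indexing step needs no parser-specific logic; metadata stored under an ad hoc key simply never reaches the index unless some plugin knows about that key.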


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




show new data in search result page

2006-08-19 Thread Feng Ji

Hi there,

I wonder whether Nutch is flexible enough to show more parsed information.

In my case, I will carry extra information, such as a company name, in
crawlDatum through to the parsed segment. I would then like this information
to be carried into the Lucene index and shown on the search result page.

For example, I saw that Summarizer is a factory that shows segment text, and
that Nutch calls its real class in Lucene. Does that mean I have to add or
modify code in Lucene instead of Nutch?

Thanks for your suggestions,

Michael,


[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
MapWritable,  nextEntry is not reset when Entries are recycled 
---

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1, 0.9.0


MapWritable recycles entries from its internal linked list for performance 
reasons. The nextEntry of an entry is not reset when a recyclable entry is 
found. This can cause wrong data in a MapWritable. 
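
The failure mode described above can be sketched as follows. This is not the Nutch MapWritable source or the attached patch; the class, its fields, and the resetNext switch are hypothetical and exist only to show how a stale next pointer on a recycled entry leaks old entries back into the live list.

```java
public class RecyclingListSketch {

    // One node of the internal linked list.
    static final class Entry {
        String key;
        String value;
        Entry next;
    }

    private Entry first;            // head of the live list
    private Entry last;             // tail of the live list
    private Entry free;             // head of the list of recyclable entries
    private final boolean resetNext; // true = apply the fix

    RecyclingListSketch(boolean resetNext) {
        this.resetNext = resetNext;
    }

    // Reuse a recycled entry if one is available, otherwise create one.
    private Entry allocate() {
        if (free == null) {
            return new Entry();
        }
        Entry e = free;
        free = e.next;
        if (resetNext) {
            e.next = null; // the fix: cut the link to other recycled entries
        }
        return e;
    }

    void put(String key, String value) {
        Entry e = allocate();
        e.key = key;
        e.value = value;
        if (first == null) {
            first = e;
        } else {
            last.next = e;
        }
        last = e;
    }

    // Move every live entry to the free list instead of discarding it.
    void clear() {
        if (last != null) {
            last.next = free;
            free = first;
        }
        first = null;
        last = null;
    }

    int size() {
        int n = 0;
        for (Entry e = first; e != null; e = e.next) {
            n++;
        }
        return n;
    }

    // Fill, clear, then add a single entry; report how many entries are visible.
    static int sizeAfterReuse(boolean resetNext) {
        RecyclingListSketch map = new RecyclingListSketch(resetNext);
        map.put("a", "1");
        map.put("b", "2");
        map.put("c", "3");
        map.clear();
        map.put("x", "9");
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + sizeAfterReuse(false)); // 3
        System.out.println("with reset:    " + sizeAfterReuse(true));  // 1
    }
}
```

Without the reset, the recycled entry for "x" still points at the old "b" and "c" entries, so the list reports three entries after a single put. With the reset it reports one.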


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Stefan Groschupf updated NUTCH-354:
---

Attachment: resetNextEntryInMapWritableV1.patch

Resets the next Entry of a recycled entry.






[jira] Closed: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Andrzej Bialecki  closed NUTCH-354.
---

Resolution: Fixed

Applied to trunk and branch-0.8 - thanks!

It would be good to have a specific junit test case for this.

