[jira] Closed: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Andrzej Bialecki  closed NUTCH-354.
---

Resolution: Fixed

Applied to trunk and branch-0.8 - thanks!

It would be good to have a specific junit test case for this.

> MapWritable,  nextEntry is not reset when Entries are recycled
> --
>
> Key: NUTCH-354
> URL: http://issues.apache.org/jira/browse/NUTCH-354
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.8.1, 0.9.0
>
> Attachments: resetNextEntryInMapWritableV1.patch
>
>
> MapWritable recycles entries from its internal linked list for performance
> reasons. When a recyclable entry is found, its nextEntry is not reset.
> This can cause wrong data in a MapWritable.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Stefan Groschupf updated NUTCH-354:
---

Attachment: resetNextEntryInMapWritableV1.patch

Resets the next Entry of a recycled entry.

> MapWritable,  nextEntry is not reset when Entries are recycled
> --
>
> Key: NUTCH-354
> URL: http://issues.apache.org/jira/browse/NUTCH-354
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.9.0, 0.8.1
>
> Attachments: resetNextEntryInMapWritableV1.patch
>
>
> MapWritable recycles entries from its internal linked list for performance
> reasons. When a recyclable entry is found, its nextEntry is not reset.
> This can cause wrong data in a MapWritable.





[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
MapWritable,  nextEntry is not reset when Entries are recycled 
---

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
 Issue Type: Bug
 Affects Versions: 0.8
 Reporter: Stefan Groschupf
 Priority: Blocker
 Fix For: 0.8.1, 0.9.0


MapWritable recycles entries from its internal linked list for performance
reasons. When a recyclable entry is found, its nextEntry is not reset.
This can cause wrong data in a MapWritable.
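
The failure mode described above can be reproduced in isolation. What follows
is a minimal sketch and not the Nutch MapWritable source: the class
RecyclingList, its fields, and the resetNext flag are hypothetical names chosen
for illustration. It keeps a singly linked list whose entries are reused after
clear(), and shows how a stale next pointer on a recycled entry makes old
values reappear.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a container that recycles its linked-list entries. */
class RecyclingList {
    static final class Entry {
        String value;
        Entry next;
    }

    private final boolean resetNext; // true = apply the fix (assumed flag, for comparison only)
    private Entry first, last;       // live entries
    private Entry old;               // entries kept around for reuse

    RecyclingList(boolean resetNext) {
        this.resetNext = resetNext;
    }

    /** Moves the live chain to the recycle list instead of discarding it. */
    void clear() {
        old = first;
        first = last = null;
    }

    void add(String value) {
        Entry e;
        if (old != null) {
            e = old;              // reuse a previously allocated entry
            old = old.next;
            if (resetNext) {
                e.next = null;    // the fix: cut the link to the old chain
            }
        } else {
            e = new Entry();
        }
        e.value = value;
        if (first == null) {
            first = e;
        } else {
            last.next = e;
        }
        last = e;
    }

    List<String> values() {
        List<String> out = new ArrayList<>();
        for (Entry e = first; e != null; e = e.next) {
            out.add(e.value);
        }
        return out;
    }
}

class RecycleDemo {
    static List<String> run(boolean resetNext) {
        RecyclingList list = new RecyclingList(resetNext);
        list.add("a");
        list.add("b");
        list.add("c");
        list.clear();
        list.add("x");
        return list.values();
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [x, b, c]: stale entries leak back in
        System.out.println(run(true));  // prints [x]
    }
}
```

The same run() method could serve as the body of a regression test that
asserts the list holds only the values added after clear().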






show new data in search result page

2006-08-19 Thread Feng Ji

Hi there,

I wonder whether Nutch is flexible enough to show more parsed information.

In my case, I will carry extra information in the CrawlDatum, such as a
company name, through to the parsed segment. I would then like this
information to be carried into the Lucene index and shown on the search
result page.

For example, I saw that Summarizer is a factory that shows segment text, and
that Nutch calls its real class in Lucene. Does that mean I have to add or
modify code in Lucene instead of Nutch?

Thanks for your suggestions,

Michael


Re: Thoughts on Parser design and dependencies

2006-08-19 Thread Andrzej Bialecki

Jukka Zitting wrote:

Hi,

On 8/19/06, Sami Siren <[EMAIL PROTECTED]> wrote:

So far Nutch has been built to deal mainly with text documents.
There is, however, also a need to deal with non-textual objects, e.g. images,
movies, and sound, which will provide content only in the form of metadata
(ok, perhaps also some text about the context of the object, if applicable),
so the metadata names we have today are only a subset of what might be.

I really would not want to restrict the metadata the interface can carry
to a fixed set.


But if it's an open Map, how do you index and search using that, i.e.
what is the mapping between the Map keys used by a parser component
and the field names in the resulting Lucene index? How do we enforce
that an MPEG parser uses the same Map keys as a JPEG parser when
encountering metadata with the same semantics?

I'm not opposed to using a Map for truly variable metadata, like HTML
<meta> tags with unknown names, but if we want common handling, for
example for Dublin Core metadata, it would be better to enforce that
at the interface level.


Well, Nutch already does this in a way, but it's a "soft" endorsement 
rather than a hard enforcement .. ;) We define keys for all common 
metadata sets (DC, Office, HttpHeaders), and plugin writers are supposed 
to use them, unless they can't find any metadata key with matching 
semantics.


Then, other indexing plugins expect certain metadata to be available 
under these keys, and create appropriate Lucene fields, again using 
predefined field names.
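
The convention described above can be sketched in a few lines. This is an
illustration only: CommonKeys, ConventionIndexer, the key strings, and the
field names are all hypothetical, and plain java.util maps stand in for the
Nutch metadata and Lucene document classes.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical shared key constants that parser plugins are asked to reuse. */
final class CommonKeys {
    static final String DC_TITLE = "dc.title";     // illustrative values only
    static final String DC_CREATOR = "dc.creator";

    private CommonKeys() {
    }
}

/** Hypothetical indexing step: copies well-known metadata into predefined field names. */
final class ConventionIndexer {
    // metadata key -> index field name
    private static final Map<String, String> FIELD_FOR_KEY = new LinkedHashMap<>();

    static {
        FIELD_FOR_KEY.put(CommonKeys.DC_TITLE, "title");
        FIELD_FOR_KEY.put(CommonKeys.DC_CREATOR, "author");
    }

    private ConventionIndexer() {
    }

    static Map<String, String> toFields(Map<String, String> parseMetadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : FIELD_FOR_KEY.entrySet()) {
            String value = parseMetadata.get(e.getKey());
            if (value != null) {
                fields.put(e.getValue(), value);
            }
        }
        return fields;
    }
}

class ConventionDemo {
    public static void main(String[] args) {
        // A parser that follows the convention.
        Map<String, String> conforming = new HashMap<>();
        conforming.put(CommonKeys.DC_TITLE, "Sunset");
        System.out.println(ConventionIndexer.toFields(conforming)); // prints {title=Sunset}

        // A parser that invents its own key: nothing fails, the value is just not indexed.
        Map<String, String> adHoc = new HashMap<>();
        adHoc.put("Title", "Sunset");
        System.out.println(ConventionIndexer.toFields(adHoc)); // prints {}
    }
}
```

The second case is what makes the endorsement "soft": nothing stops a plugin
from using its own key, and the only symptom is a missing field in the index.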


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Thoughts on Parser design and dependencies

2006-08-19 Thread Jukka Zitting

Hi,

On 8/19/06, Sami Siren <[EMAIL PROTECTED]> wrote:

So far Nutch has been built to deal mainly with text documents.
There is, however, also a need to deal with non-textual objects, e.g. images,
movies, and sound, which will provide content only in the form of metadata
(ok, perhaps also some text about the context of the object, if applicable),
so the metadata names we have today are only a subset of what might be.

I really would not want to restrict the metadata the interface can carry
to a fixed set.


But if it's an open Map, how do you index and search using that, i.e.
what is the mapping between the Map keys used by a parser component
and the field names in the resulting Lucene index? How do we enforce
that an MPEG parser uses the same Map keys as a JPEG parser when
encountering metadata with the same semantics?

I'm not opposed to using a Map for truly variable metadata, like HTML
<meta> tags with unknown names, but if we want common handling, for
example for Dublin Core metadata, it would be better to enforce that
at the interface level.

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development


Re: Thoughts on Parser design and dependencies

2006-08-19 Thread Sami Siren

Jukka Zitting wrote:

Hi,

On 8/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

A very important aspect of the Parser interface (or actually, the Parse
and Content classes) is that they each may contain arbitrary metadata.
This is required for discovering and passing around both the original
metadata (such as protocol headers, document properties, etc), and other
secondary content (such as data from external sources, or derived 
metadata).


Is there a list of all the different metadata items that get passed in
or out of the parser components? My hunch is that the list of items is
relatively short and that even though different parsers might input or
output different metadata, it still might make sense to come up with a
general content model that serves the needs of everyone.

Simply returning a String doesn't cut it. Returning a java.util.Map may
be an option, if you use standard Metadata constants as keys - still,
Nutch would have to repackage this anyway into a Writable. And we would
lose a nice property of the current Metadata class, which is the ability
to tolerate minor syntax variations and to store multiple values per key.


The problem I see with a Map or a similar keyed solution is that you
only get to specify the metadata contract as documented (if ever)
keys instead of as a compile-time interface. Using a Map is fine if
the set of managed information truly varies at runtime, but not when
the set is fixed or at least well bounded.


So far Nutch has been built to deal mainly with text documents.
There is, however, also a need to deal with non-textual objects, e.g. images,
movies, and sound, which will provide content only in the form of metadata
(ok, perhaps also some text about the context of the object, if applicable),
so the metadata names we have today are only a subset of what might be.


I really would not want to restrict the metadata the interface can carry 
to a fixed set.


--
 Sami Siren



Re: Thoughts on Parser design and dependencies

2006-08-19 Thread Jukka Zitting

Hi,

On 8/18/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

A very important aspect of the Parser interface (or actually, the Parse
and Content classes) is that they each may contain arbitrary metadata.
This is required for discovering and passing around both the original
metadata (such as protocol headers, document properties, etc), and other
secondary content (such as data from external sources, or derived metadata).


Is there a list of all the different metadata items that get passed in
or out of the parser components? My hunch is that the list of items is
relatively short and that even though different parsers might input or
output different metadata, it still might make sense to come up with a
general content model that serves the needs of everyone.


Simply returning a String doesn't cut it. Returning a java.util.Map may
be an option, if you use standard Metadata constants as keys - still,
Nutch would have to repackage this anyway into a Writable. And we would
lose a nice property of the current Metadata class, which is the ability
to tolerate minor syntax variations and to store multiple values per key.
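
To make the two properties mentioned above concrete, here is a minimal sketch
of a container that stores several values per key and treats small spelling
variations of a key as the same key. It is not the Nutch Metadata class:
TolerantMetadata and its normalize() rule (lower-case, drop everything except
letters and digits) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical multi-valued metadata store with forgiving key lookup. */
final class TolerantMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Assumed normalization rule: case-insensitive, punctuation and spaces ignored. */
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Appends a value; earlier values for the same (normalized) key are kept. */
    void add(String name, String value) {
        values.computeIfAbsent(normalize(name), k -> new ArrayList<>()).add(value);
    }

    /** Returns all values for the key in insertion order, or an empty list. */
    List<String> getValues(String name) {
        List<String> found = values.get(normalize(name));
        if (found == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(found);
    }
}

class TolerantMetadataDemo {
    public static void main(String[] args) {
        TolerantMetadata md = new TolerantMetadata();
        md.add("Content-Type", "text/html");
        md.add("content_type", "text/plain");
        System.out.println(md.getValues("CONTENT TYPE")); // prints [text/html, text/plain]
    }
}
```

A plain java.util.Map<String, String> offers neither behaviour, which is the
loss the paragraph above refers to.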


The problem I see with a Map or a similar keyed solution is that you
only get to specify the metadata contract as documented (if ever)
keys instead of as a compile-time interface. Using a Map is fine if
the set of managed information truly varies at runtime, but not when
the set is fixed or at least well bounded.
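
The difference between the two kinds of contract can be shown with a small
sketch. The names DublinCoreMetadata, SimpleDublinCore, and the key string
"dc.title" are hypothetical and only serve the comparison.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical compile-time contract for a well-bounded metadata set. */
interface DublinCoreMetadata {
    String getTitle();

    String getCreator();
}

final class SimpleDublinCore implements DublinCoreMetadata {
    private final String title;
    private final String creator;

    SimpleDublinCore(String title, String creator) {
        this.title = title;
        this.creator = creator;
    }

    @Override
    public String getTitle() {
        return title;
    }

    @Override
    public String getCreator() {
        return creator;
    }
}

class ContractDemo {
    public static void main(String[] args) {
        // Keyed contract: a misspelled key compiles and silently yields null.
        Map<String, String> open = new HashMap<>();
        open.put("dc.title", "Annual report");
        System.out.println(open.get("dc.titel")); // prints null

        // Typed contract: the same mistake, typed.getTitel(), would not compile.
        DublinCoreMetadata typed = new SimpleDublinCore("Annual report", "unknown");
        System.out.println(typed.getTitle()); // prints Annual report
    }
}
```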

Another concern with both the Parse class in Nutch and my
TextExtractor interface is that the body content is returned as a
single text stream (a String and a Reader respectively). This doesn't
allow the parser to pass along extra information like the emphasis of
certain parts (think of headings or links in html) or the language of
the text (e.g. xml:lang). I'm not too familiar with Lucene to know if
it could use such information, so this might be a YAGNI, but inversion
of control with a Builder interface would be a pretty powerful
solution for passing such information.
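
As a rough sketch of what such a Builder interface might look like: the parser
calls back into an object supplied by the caller instead of returning one flat
String. ContentBuilder, ToyParser, and CollectingBuilder are hypothetical
names, and the sample strings are placeholders, not output of any real parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface: the parser pushes structured events into it. */
interface ContentBuilder {
    void language(String lang);

    void heading(String text);

    void text(String text);
}

/** Hypothetical parser that drives the builder rather than returning a String. */
final class ToyParser {
    void parse(ContentBuilder builder) {
        builder.language("en");
        builder.heading("Release notes");
        builder.text("Entries are now reset when recycled.");
    }
}

/** One possible consumer: keeps headings separate so they could be weighted later. */
final class CollectingBuilder implements ContentBuilder {
    String language;
    final List<String> headings = new ArrayList<>();
    final StringBuilder body = new StringBuilder();

    @Override
    public void language(String lang) {
        language = lang;
    }

    @Override
    public void heading(String text) {
        headings.add(text);
        body.append(text).append('\n');
    }

    @Override
    public void text(String text) {
        body.append(text).append('\n');
    }
}

class BuilderDemo {
    public static void main(String[] args) {
        CollectingBuilder collected = new CollectingBuilder();
        new ToyParser().parse(collected);
        System.out.println(collected.language);  // prints en
        System.out.println(collected.headings);  // prints [Release notes]
    }
}
```

A consumer that only wants plain text can ignore the extra callbacks, so the
richer interface does not force every caller to deal with emphasis or language.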

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development