ParserFactory does not work as expected
---------------------------------------

         Key: NUTCH-133
         URL: http://issues.apache.org/jira/browse/NUTCH-133
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev, 0.7.1, 0.7.2-dev    
    Reporter: Stefan Groschupf
    Priority: Blocker


Marcel Schnippe detect a set of problems until working with different content 
and parser types, we worked together to identify the problem source.
>From our point of view this described problems could be the source for many 
>other problems daily described in the mailing lists.
Find a conclusion of the problems below.

Problem:
Some servers returns mixed case but correct header keys like 'Content-type' or 
'content-Length'  in the http response header.
That's why for example a get("Content-Type") fails and a page is detected as 
zip using the magic content type detection mechanism. 
Also we note that this a common reason why pdf parsing fails since 
Content-Length does return the correct value. 

Sample:
returns "text/HTML" or "application/PDF" or Content-length
or this url:
http://www.lanka.info/dictionary/EnglishToSinhala.jsp

Solution:
First just write only lower case keys into the properties and later convert all 
keys that are used to query the metadata to lower case as well.
e.g.:
HttpResponse.java, line 353:
use lower cases here and for all keys used to query header properties (also 
content-length) change:  String key = line.substring(0, colonIndex); to  String 
key = line.substring(0, colonIndex) .toLowerCase();


Problem:
MimeTypes based discovery (magic and url based) is only done in case the 
content type was not delivered by the web server, this happens not that often, 
mostly this was a problem with mixed case keys in the header.

see:
 public Content toContent() {
    String contentType = getHeader("Content-Type");
    if (contentType == null) {
      MimeType type = null;
      if (MAGIC) {
        type = MIME.getMimeType(orig, content);
      } else {
        type = MIME.getMimeType(orig);
      }
      if (type != null) {
          contentType = type.getName();
      } else {
          contentType = "";
      }
    }
    return new Content(orig, base, content, contentType, headers);
  }
Solution:
Use the content-type information as it is from the webserver and move the 
content type discovering from Protocol plugins to the Component where the 
parsing is done - to the ParseFactory.
Than just create a list of parsers for the content type returned by the server 
and the custom detected content type. In the end we can iterate over all parser 
until we got a successfully parsed status.



Problem:
Content will be parsed also if the protocol reports a exception and has a non 
successful status, in such a case the content is new byte[0] in any case.

Solution:
Fetcher.java, line 243.
Change:   if (!Fetcher.this.parsing ) { .. to 
 if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
       // TODO we may should not write out here emthy parse text and parse 
date, i suggest give outputpage a parameter parsed true / false
          outputPage(new FetcherOutput(fle, hash, protocolStatus),
                content, new ParseText(""),
                new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "", new 
Outlink[0], new Properties()));
        return null;
      }


Problem:
Actually the configuration of parser is done based on plugin id's, but one 
plugin can have several extentions, so  normally a plugin can provide several 
parser, but this is no limited just wrong values are used in the configuration 
process. 

Solution:
Change plugin id to  extension id in the parser configuration file and also 
change this code in the parser factory to use extension id's everywhere.


Problem:
there is not a clear differentiation between content type and mime type. 
I'm notice that some plugins call metaData.get("Content-Type) or 
content.getContentType();
Actually in theory this can return different values, since the content type 
could be detected by the MimesTypes util and is not the same as delivered in 
the http response header.
As mentioned actually content type is only detected by the MimeTypes util in 
case the header does not contains any content type informations or had problems 
with mixed case keys.

Solution:
Take the content type property out of the meta data and clearly restrict the 
access of this meta data into the own getter method.



Problem:
Most protocol plugins  checking if content type is null only in this case the 
MimeTypes util is used. Since my patch move the mime type detection to the 
parser factory - where from my point of view - is the right place, it is now 
unneccary code we can remove from the protocol plugins. I never found a case 
where no content type was returned just mixed case keys was used. 

Solution. 
Remove this detection code, since it is now in the parser factory.
I didn't change this since more code I  change, I guess there is a  less chance 
to get the patch into the sources, I suggest we open a low priority issue and 
once we change the plugins we can remove it.


Problem:
This is not a problem, but a 'code smells' (Martin Fowler) There are empty test 
methods in TestMimeType
  /** Test of <code>getExtensions</code> method. */
    public void testGetExtensions() {
    }
Solution:
Implement these tests or remove the test methods.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply via email to