Hi,

I ran into the same problems and couldn't solve them in a satisfying way.

I also tried Nutch 2.0 on CDH4 with YARN / MRv2 (without MRv1) and
couldn't get it to work either.


But I found a workaround that makes Nutch 1.5.1 work on CDH4.


Since MRv2 it is no longer possible to ship the whole project as a single
*nutch*.job file: the former JobTracker/TaskTracker pair has been replaced
by the ResourceManager and the NodeManager, and the NodeManager seems
unable to handle the packed Nutch job.

(see also:
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
)


One workaround is to unpack the job file on the NodeManager manually
and to load the contained jars into the current classloader from within
the code.
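In isolation, the classloader trick looks roughly like the sketch below. The attached file does it by reflectively invoking the protected URLClassLoader.addURL() on the current classloader; since that reflection is blocked on newer JVMs (17+), this standalone sketch uses a small URLClassLoader subclass that exposes addURL() instead. Class and method names here are illustrative, not from the attached file.

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

/**
 * Standalone sketch of making extra jars visible at runtime, as the
 * attached PluginManifestParser does. Instead of setAccessible(true)
 * reflection on the app classloader, a subclass exposes the protected
 * addURL() method.
 */
public class JarLoaderSketch {

  /** A URLClassLoader whose URL list can be extended after construction. */
  static class MutableClassLoader extends URLClassLoader {
    MutableClassLoader(ClassLoader parent) {
      super(new URL[0], parent);
    }

    void addJar(File jar) {
      try {
        addURL(jar.toURI().toURL()); // protected in URLClassLoader
      } catch (java.net.MalformedURLException e) {
        throw new IllegalArgumentException("Bad jar path: " + jar, e);
      }
    }
  }

  /** Registers all given jars and returns the resulting URL list. */
  public static URL[] load(File[] jars) {
    MutableClassLoader cl =
        new MutableClassLoader(JarLoaderSketch.class.getClassLoader());
    for (File jar : jars) {
      cl.addJar(jar);
    }
    return cl.getURLs();
  }
}
```

Classes loaded through such a loader (via Class.forName with the loader as argument) can then see the unpacked plugin jars.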

I modified org/apache/nutch/plugin/PluginManifestParser.java slightly,
and everything works fine, at least for the moment.


I attached the modified file.


Please note that I don't know yet whether CDH4 removes the application
directories and the unpacked files properly.
You should check whether those directories are still needed after the
crawl has succeeded.
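To spot leftovers, something like the following could list application directories still sitting under a NodeManager local directory. This is a hypothetical helper, not part of the attached patch; the directory path must come from yarn.nodemanager.local-dirs in your yarn-site.xml, it is not detected automatically.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: lists "application_*" directories left under a
 * NodeManager local directory, e.g. to verify that CDH4 cleaned up the
 * unpacked job files after a crawl finished.
 */
public class LeftoverDirCheck {

  /** Returns the names of all "application_*" subdirectories of localDir. */
  public static List<String> findAppDirs(File localDir) {
    List<String> found = new ArrayList<String>();
    File[] children = localDir.listFiles();
    if (children == null) {
      return found; // directory missing or unreadable
    }
    for (File child : children) {
      if (child.isDirectory() && child.getName().startsWith("application_")) {
        found.add(child.getName());
      }
    }
    return found;
  }
}
```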



Hope this helps, cheers, Walter




On 17.09.2012 18:31, Casey McTaggart wrote:
> I would also like to add that I can run the same crawl locally and it's
> successful. So, it's just the distributed mode that's not working. can
> anyone offer any advice? Do you think it might be something with CDH4?
> 
> On Sat, Sep 15, 2012 at 5:22 PM, Casey McTaggart
> <[email protected]>wrote:
> 
>> Hi everyone,
>>
>> I'm using Hadoop as installed by Cloudera (CDH4)... I think it's version
>> 1.0.1. I can run a local filesystem crawl with Nutch, and it returns what
>> I'd expect. However, I need to take advantage of the mapreduce
>> functionality, since I want to crawl a local filesystem with many GB of
>> files. I'm going to put all of these files on an apache server so they can
>> be crawled. First, though, I want to just crawl a simple website, and I
>> can't make it work.
>>
>> My urls/seed.txt is on hdfs and is this:
>> http://lucene.apache.org
>>
>> I run this command:
>> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
>> org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl
>>
>> Sometimes, it fetches the URL, but does not go beyond depth 1... and when
>> I examine the CrawlDatum that's in
>> /user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry: the
>> seed url as the key, and the value of the CrawlDatum is
>> _pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
>> org/apache/tika/mime/MimeTypeException
>>
>> Okay, so I tried running the command again with -libjars nutch1.5.1.jar,
>> and it fails with an ArrayIndexOutOfBoundsException. I tried running it
>> with -libjars /user/hdfs/lib/tika-core-1.1.jar, and that fails with:
>>
>> 12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for
>> fetching, exiting ...
>> 12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to
>> fetch.
>> 12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed
>> list and URL filters.
>> 12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl
>>
>> I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib, and
>> still 0 URLs are fetched.
>>
>> I'm totally at a loss. can someone help?
>>
>> Here's my regex-urlfilter:
>>
>> # skip file: ftp: and mailto: urls
>> -^(file|ftp|mailto):
>> # skip image and other suffixes we can't yet parse
>> # for a more extensive coverage use the urlfilter-suffix plugin
>>
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ
>> |mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>> # skip URLs containing certain characters as probable queries, etc.
>> -[?*!@=]
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>> loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> # accept anything else
>> +.
>>
>>
>> here's my nutch-site.xml:
>>
>> <configuration>
>>   <property>
>>     <name>http.agent.name</name>
>>     <value>nutchtest</value>
>>   </property>
>>   <property>
>>     <name>plugin.folders</name>
>>
>> <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
>>   </property>
>> </configuration>
>>
>>
>> which also does not work if I include this part:
>>
>> <property>
>>     <name>plugin.includes</name>
>>
>> <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)
>> |query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
>>   </property>
>>
>>
> 


-- 

--------------------------------
Walter Tietze
Senior Softwareengineer
Research

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T +49.30 24627 318
F +49.30 24627 120

[email protected]
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung:
Thomas Kitlitschko
--------------------------------

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.plugin;

import java.io.File;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.lang.reflect.Method;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.net.URLDecoder;
import java.util.HashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.slf4j.Logger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.RunJar;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
 * The <code>PluginManifestParser</code> parses the manifest file in all
 * plugin directories.
 * 
 * @author joa23
 */
public class PluginManifestParser {
  private static final String ATTR_NAME = "name";
  private static final String ATTR_CLASS = "class";
  private static final String ATTR_ID = "id";

  public static final Logger LOG = PluginRepository.LOG;

  private static final boolean WINDOWS = System.getProperty("os.name")
      .startsWith("Windows");

  private Configuration conf;

  private PluginRepository pluginRepository;

  public PluginManifestParser(Configuration conf,
      PluginRepository pluginRepository) {
    this.conf = conf;
    this.pluginRepository = pluginRepository;
  }

  /**
   * Returns a list of all found plugin descriptors.
   * 
   * @param pluginFolders
   *          folders to search plugins from
   * @return A {@link Map} of all found {@link PluginDescriptor}s.
   */
  public Map<String, PluginDescriptor> parsePluginFolder(String[] pluginFolders) {
    Map<String, PluginDescriptor> map = new HashMap<String, PluginDescriptor>();

    if (pluginFolders == null) {
      throw new IllegalArgumentException("plugin.folders is not defined");
    }

    for (String name : pluginFolders) {
      File directory = getPluginFolder(name);
      if (directory == null) {
        continue;
      }
      LOG.info("Plugins: looking in: " + directory.getAbsolutePath());
      for (File oneSubFolder : directory.listFiles()) {
        if (oneSubFolder.isDirectory()) {
          String manifestPath = oneSubFolder.getAbsolutePath() + File.separator
              + "plugin.xml";
          try {
            LOG.debug("parsing: " + manifestPath);
            PluginDescriptor p = parseManifestFile(manifestPath);
            map.put(p.getPluginId(), p);
          } catch (MalformedURLException e) {
            LOG.warn(e.toString());
          } catch (SAXException e) {
            LOG.warn(e.toString());
          } catch (IOException e) {
            LOG.warn(e.toString());
          } catch (ParserConfigurationException e) {
            LOG.warn(e.toString());
          }
        }
      }
    }
    return map;
  }

  /**
   * Loads all given jars into the given classloader by reflectively
   * invoking the protected {@link URLClassLoader#addURL(URL)} method.
   * 
   * @param classLoader the classloader to extend (must be a URLClassLoader)
   * @param jars the jar files to add
   */
  private static void addJarsToClassPath(ClassLoader classLoader, File[] jars) {
    if (classLoader instanceof URLClassLoader) {
      try {
        Method addUrlMethod = URLClassLoader.class.getDeclaredMethod("addURL",
            new Class[] { URL.class });
        addUrlMethod.setAccessible(true);
        for (File jar : jars) {
          try {
            LOG.info("Adding jar " + jar.toURI() + " to classloader");
            addUrlMethod.invoke(classLoader, jar.toURI().toURL());
          } catch (Exception e) {
            LOG.warn("Could not add jar " + jar + " to classloader: " + e);
          }
        }
      } catch (Exception e) {
        LOG.warn("Could not access URLClassLoader.addURL: " + e);
      }
    }
  }

  /**
   * Return the named plugin folder. If the name is absolute then it is
   * returned. Otherwise, for relative names, the classpath is scanned.
   */
  public File getPluginFolder(String name) {
    File directory = new File(name);
    if (!directory.isAbsolute()) {
      URL url = PluginManifestParser.class.getClassLoader().getResource(name);
      if (url == null && directory.exists() && directory.isDirectory()
          && directory.listFiles().length > 0) {
        return directory; // relative path that is not in the classpath
      } else if (url == null) {
        LOG.warn("Plugins: directory not found: " + name);
        return null;
        
      } else if ("jar".equals(url.getProtocol())
          && url.getFile().endsWith("job.jar!/classes/plugins")) {
        LOG.info("Path of jar is " + url.getPath() + " with file "
            + url.getFile());
        File currentDir = new File(".");

        try {
          // Strip the leading "file:" and the trailing "!/classes/plugins"
          // to get the path of the job jar itself.
          String jarPath = url.getFile()
              .substring(0, url.getFile().indexOf("!"))
              .substring("file:".length());
          LOG.info("Unpacking jar from " + jarPath + " to "
              + currentDir.getAbsolutePath());
          RunJar.unJar(new File(jarPath), currentDir);

          File jarDir = new File(currentDir.getAbsolutePath() + "/lib");
          if (jarDir.exists()) {
            LOG.info("Loading jar files from lib");
            File[] jars = jarDir.listFiles();
            ClassLoader cl = PluginManifestParser.class.getClassLoader();
            addJarsToClassPath(cl, jars);
          }
        } catch (IOException ioe) {
          LOG.error("Failed to unpack job jar: " + ioe.getMessage());
        }

        return directory;

      } else if (!"file".equals(url.getProtocol())) {
        LOG.warn("Plugins: not a file: url. Can't load plugins from: " + url);
        return null;
      }
      String path = url.getPath();
      if (WINDOWS && path.startsWith("/")) // patch a windows bug
        path = path.substring(1);
      try {
        path = URLDecoder.decode(path, "UTF-8"); // decode the url path
      } catch (UnsupportedEncodingException e) {
      }
      directory = new File(path);
    }
    return directory;
  }

  /**
   * @param pManifestPath
   *          path to the plugin.xml manifest file
   * @throws ParserConfigurationException
   * @throws IOException
   * @throws SAXException
   * @throws MalformedURLException
   */
  private PluginDescriptor parseManifestFile(String pManifestPath)
      throws MalformedURLException, SAXException, IOException,
      ParserConfigurationException {
    Document document = parseXML(new File(pManifestPath).toURI().toURL());
    String pPath = new File(pManifestPath).getParent();
    return parsePlugin(document, pPath);
  }

  /**
   * @param url
   * @return Document
   * @throws IOException
   * @throws SAXException
   * @throws ParserConfigurationException
   */
  private Document parseXML(URL url) throws SAXException, IOException,
      ParserConfigurationException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    return builder.parse(url.openStream());
  }

  /**
   * @param pDocument
   * @throws MalformedURLException
   */
  private PluginDescriptor parsePlugin(Document pDocument, String pPath)
      throws MalformedURLException {
    Element rootElement = pDocument.getDocumentElement();
    String id = rootElement.getAttribute(ATTR_ID);
    String name = rootElement.getAttribute(ATTR_NAME);
    String version = rootElement.getAttribute("version");
    String providerName = rootElement.getAttribute("provider-name");
    String pluginClazz = null;
    if (rootElement.getAttribute(ATTR_CLASS).trim().length() > 0) {
      pluginClazz = rootElement.getAttribute(ATTR_CLASS);
    }
    PluginDescriptor pluginDescriptor = new PluginDescriptor(id, version, name,
        providerName, pluginClazz, pPath, this.conf);
    LOG.debug("plugin: id=" + id + " name=" + name + " version=" + version
          + " provider=" + providerName + " class=" + pluginClazz);
    parseExtension(rootElement, pluginDescriptor);
    parseExtensionPoints(rootElement, pluginDescriptor);
    parseLibraries(rootElement, pluginDescriptor);
    parseRequires(rootElement, pluginDescriptor);
    return pluginDescriptor;
  }

  /**
   * @param pRootElement
   * @param pDescriptor
   * @throws MalformedURLException
   */
  private void parseRequires(Element pRootElement, PluginDescriptor pDescriptor)
      throws MalformedURLException {

    NodeList nodelist = pRootElement.getElementsByTagName("requires");
    if (nodelist.getLength() > 0) {

      Element requires = (Element) nodelist.item(0);

      NodeList imports = requires.getElementsByTagName("import");
      for (int i = 0; i < imports.getLength(); i++) {
        Element anImport = (Element) imports.item(i);
        String plugin = anImport.getAttribute("plugin");
        if (plugin != null) {
          pDescriptor.addDependency(plugin);
        }
      }
    }
  }

  /**
   * @param pRootElement
   * @param pDescriptor
   * @throws MalformedURLException
   */
  private void parseLibraries(Element pRootElement, PluginDescriptor pDescriptor)
      throws MalformedURLException {
    NodeList nodelist = pRootElement.getElementsByTagName("runtime");
    if (nodelist.getLength() > 0) {

      Element runtime = (Element) nodelist.item(0);

      NodeList libraries = runtime.getElementsByTagName("library");
      for (int i = 0; i < libraries.getLength(); i++) {
        Element library = (Element) libraries.item(i);
        String libName = library.getAttribute(ATTR_NAME);
        NodeList list = library.getElementsByTagName("export");
        Element exportElement = (Element) list.item(0);
        if (exportElement != null)
          pDescriptor.addExportedLibRelative(libName);
        else
          pDescriptor.addNotExportedLibRelative(libName);
      }
    }
  }

  /**
   * @param rootElement
   * @param pluginDescriptor
   */
  private void parseExtensionPoints(Element pRootElement,
      PluginDescriptor pPluginDescriptor) {
    NodeList list = pRootElement.getElementsByTagName("extension-point");
    if (list != null) {
      for (int i = 0; i < list.getLength(); i++) {
        Element oneExtensionPoint = (Element) list.item(i);
        String id = oneExtensionPoint.getAttribute(ATTR_ID);
        String name = oneExtensionPoint.getAttribute(ATTR_NAME);
        String schema = oneExtensionPoint.getAttribute("schema");
        ExtensionPoint extensionPoint = new ExtensionPoint(id, name, schema);
        pPluginDescriptor.addExtensionPoint(extensionPoint);
      }
    }
  }

  /**
   * @param rootElement
   * @param pluginDescriptor
   */
  private void parseExtension(Element pRootElement,
      PluginDescriptor pPluginDescriptor) {
    NodeList extensions = pRootElement.getElementsByTagName("extension");
    if (extensions != null) {
      for (int i = 0; i < extensions.getLength(); i++) {
        Element oneExtension = (Element) extensions.item(i);
        String pointId = oneExtension.getAttribute("point");

        NodeList extensionImplementations = oneExtension.getChildNodes();
        if (extensionImplementations != null) {
          for (int j = 0; j < extensionImplementations.getLength(); j++) {
            Node node = extensionImplementations.item(j);
            if (!node.getNodeName().equals("implementation")) {
              continue;
            }
            Element oneImplementation = (Element) node;
            String id = oneImplementation.getAttribute(ATTR_ID);
            String extensionClass = oneImplementation.getAttribute(ATTR_CLASS);
            LOG.debug("impl: point=" + pointId + " class=" + extensionClass);
            Extension extension = new Extension(pPluginDescriptor, pointId, id,
                extensionClass, this.conf, this.pluginRepository);
            NodeList parameters = oneImplementation
                .getElementsByTagName("parameter");
            if (parameters != null) {
              for (int k = 0; k < parameters.getLength(); k++) {
                Element param = (Element) parameters.item(k);
                extension.addAttribute(param.getAttribute(ATTR_NAME), param
                    .getAttribute("value"));
              }
            }
            pPluginDescriptor.addExtension(extension);
          }
        }
      }
    }
  }
}
