Re: Apache nutch 1.9 error - Input path does not exist

gsamsa Sat, 27 Sep 2014 15:28:18 -0700

Thanks a lot, that works like a charm.
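
For reference, the invocation that works takes four positional arguments (seed directory, crawl directory, Solr URL, number of rounds), as explained in the reply quoted below. A minimal shell sketch; the crawl directory name `crawl/` and the helper `build_crawl_cmd` are assumptions for illustration, not part of Nutch:

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): assemble the bin/crawl command line
# from four positional arguments and reject a crawl directory that looks like
# an option, which is the mistake that led to the "-solr/segments" path.
build_crawl_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: build_crawl_cmd <seed_dir> <crawl_dir> <solr_url> <rounds>" >&2
        return 1
    fi
    case "$2" in
        -*)
            echo "crawl directory '$2' looks like an option; bin/crawl takes positional arguments" >&2
            return 1
            ;;
    esac
    printf 'bin/crawl %s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# "crawl/" is an assumed directory name; any writable path should do.
build_crawl_cmd urls/ crawl/ http://127.0.0.1:8983/solr/ 1
```

The helper only prints the command; run the printed line from the Nutch runtime directory to start the crawl.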

However, my current problem is that I cannot see anything in Solr. Any
recommendations on what I am doing wrong? I have done it exactly as
described on the wiki page. This is my schema.xml:


<?xml version="1.0" encoding="UTF-8" ?>
<schema name="nutch" version="1.5">
    <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true"
            omitNorms="true"/>
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>
        <fieldType name="url" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"/>
            </analyzer>
        </fieldType>
    </types>
    <fields>
        <field name="id" type="string" stored="true" indexed="true"
            required="true"/>

        <!-- core fields -->
        <field name="segment" type="string" stored="true" indexed="false"/>
        <field name="digest" type="string" stored="true" indexed="false"/>
        <field name="boost" type="float" stored="true" indexed="false"/>

        <!-- fields for index-basic plugin -->
        <field name="host" type="string" stored="false" indexed="true"/>
        <field name="url" type="url" stored="true" indexed="true"/>
        <field name="content" type="text" stored="false" indexed="true"/>
        <field name="title" type="text" stored="true" indexed="true"/>
        <field name="cache" type="string" stored="true" indexed="false"/>
        <field name="tstamp" type="date" stored="true" indexed="false"/>

        <!-- fields for index-anchor plugin -->
        <field name="anchor" type="string" stored="true" indexed="true"
            multiValued="true"/>

        <!-- fields for index-more plugin -->
        <field name="type" type="string" stored="true" indexed="true"
            multiValued="true"/>
        <field name="contentLength" type="long" stored="true"
            indexed="false"/>
        <field name="lastModified" type="date" stored="true"
            indexed="false"/>
        <field name="date" type="date" stored="true" indexed="true"/>

        <!-- fields for languageidentifier plugin -->
        <field name="lang" type="string" stored="true" indexed="true"/>

        <!-- fields for subcollection plugin -->
        <field name="subcollection" type="string" stored="true"
            indexed="true" multiValued="true"/>

        <!-- fields for feed plugin (tag is also used by microformats-reltag) -->
        <field name="author" type="string" stored="true" indexed="true"/>
        <field name="tag" type="string" stored="true" indexed="true"
            multiValued="true"/>
        <field name="feed" type="string" stored="true" indexed="true"/>
        <field name="publishedDate" type="date" stored="true"
            indexed="true"/>
        <field name="updatedDate" type="date" stored="true"
            indexed="true"/>

        <!-- fields for creativecommons plugin -->
        <field name="cc" type="string" stored="true" indexed="true"
            multiValued="true"/>

        <!-- fields for tld plugin -->
        <field name="tld" type="string" stored="false" indexed="false"/>

        <field name="_version_" type="long" stored="true" indexed="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>content</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
</schema>
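
To check whether any documents reached the index, one option is a match-all query with `rows=0` and reading `numFound` from the response. A rough sketch, assuming the stock single-core Solr example at `http://127.0.0.1:8983/solr/` with the default `/select` handler and JSON output; both helper functions are hypothetical names used only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: build a match-all query URL for a Solr base URL.
# rows=0 asks for the hit count only, wt=json selects the JSON response writer.
solr_count_url() {
    base="${1%/}"    # drop one trailing slash, if present
    printf '%s/select?q=*:*&rows=0&wt=json\n' "$base"
}

# Hypothetical helper: read a Solr JSON response on stdin and print numFound.
# Assumes the compact (non-indented) form, e.g. "numFound":0
extract_num_found() {
    sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p'
}

# Usage against a running Solr (left commented out because it needs the server):
# curl -s "$(solr_count_url http://127.0.0.1:8983/solr/)" | extract_num_found
```

A count of 0 would suggest the indexing step never sent documents to Solr, while a non-zero count would point at the query or the schema instead.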


Furthermore, would it be better to use a later Solr version, such as 4.10?

I appreciate your reply!

On Wed, Sep 24, 2014 at 4:35 PM, Jonathan Cooper-Ellis [via Lucene] <
[email protected]> wrote:

> Hello,
>
> It looks like you're confusing the usage of bin/crawl with the old
> bin/nutch crawl command. You want to start the crawl like this:
>
> bin/crawl <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>
>
> So, the script thinks "-solr" is your crawl directory (which does not
> exist):
>
> 2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
>
> Hope that helps!
>
> -jce
>
>
>
> On Wed, Sep 24, 2014 at 9:36 AM, gsamsa <[hidden email]> wrote:
>
> > Hello guys,
> >
> > I have installed *apache nutch 1.9* and *solr 3.6.2*, which run on an ubuntu
> > virtual machine in virtualbox.
> >
> > *Description of error*
> >
> >
> > I start a crawl like that:
> >
> > *./bin/crawl urls/ -solr http://127.0.0.1:8983/solr/ 1*
> >
> > However, I get the following error(that is my log from
> > `nutch/logs/hadoop.logs`):
> >
> >
> >
> >     2014-09-24 14:39:46,252 INFO  crawl.Injector - Injector: starting at 2014-09-24 14:39:46
> >     2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: crawlDb: -solr/crawldb
> >     2014-09-24 14:39:46,259 INFO  crawl.Injector - Injector: urlDir: urls
> >     2014-09-24 14:39:46,260 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
> >     2014-09-24 14:39:47,263 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >     2014-09-24 14:39:47,375 WARN  snappy.LoadSnappy - Snappy native library not loaded
> >     2014-09-24 14:39:49,076 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> >     2014-09-24 14:39:49,132 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> >     2014-09-24 14:39:50,001 INFO  crawl.Injector - Injector: Total number of urls rejected by filters: 0
> >     2014-09-24 14:39:50,002 INFO  crawl.Injector - Injector: Total number of urls after normalization: 2
> >     2014-09-24 14:39:50,003 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
> >     2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: overwrite: false
> >     2014-09-24 14:39:51,046 INFO  crawl.Injector - Injector: update: false
> >     2014-09-24 14:39:52,116 INFO  crawl.Injector - Injector: URLs merged: 2
> >     2014-09-24 14:39:52,136 INFO  crawl.Injector - Injector: Total new urls injected: 0
> >     2014-09-24 14:39:52,139 INFO  crawl.Injector - Injector: finished at 2014-09-24 14:39:52, elapsed: 00:00:05
> >     2014-09-24 14:39:55,557 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >     2014-09-24 14:39:55,571 INFO  crawl.Generator - Generator: starting at 2014-09-24 14:39:55
> >     2014-09-24 14:39:55,574 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> >     2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: filtering: false
> >     2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: normalizing: true
> >     2014-09-24 14:39:55,575 INFO  crawl.Generator - Generator: topN: 50000
> >     2014-09-24 14:39:58,013 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >     2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> >     2014-09-24 14:39:58,014 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> >     2014-09-24 14:39:58,044 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> >     2014-09-24 14:39:58,291 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >     2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> >     2014-09-24 14:39:58,292 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> >     2014-09-24 14:39:58,370 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> >     2014-09-24 14:39:58,782 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
> >     2014-09-24 14:39:59,785 INFO  crawl.Generator - Generator: segment: -solr/segments/20140924143959
> >     2014-09-24 14:40:00,313 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> >     2014-09-24 14:40:01,032 INFO  crawl.Generator - Generator: finished at 2014-09-24 14:40:01, elapsed: 00:00:05
> >     2014-09-24 14:40:03,462 INFO  fetcher.Fetcher - Fetcher: starting at 2014-09-24 14:40:03
> >     2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher: segment: -solr/segments
> >     2014-09-24 14:40:03,467 INFO  fetcher.Fetcher - Fetcher Timelimit set for : 1411573203467
> >     2014-09-24 14:40:04,207 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >     2014-09-24 14:40:04,301 ERROR security.UserGroupInformation - PriviledgedActionException as:testUser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
> >     2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
> >         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
> >         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> >         at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:106)
> >         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> >         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> >         at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> >         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> >         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:415)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> >         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
> >         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)
> >
> > I basically have configured my solr like in the tutorial on apache wiki
> > <http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch>:
> >
> >     mv ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.org
> >     cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
> >     vi ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
> >
> >     Copy exactly in 351 line: <field name="_version_" type="long" indexed="true" stored="true"/>
> >
> > This is what I get when I start solr:
> >
> > <http://lucene.472066.n3.nabble.com/file/n4160918/solr.jpg>
> >
> > *What I tried:*
> >
> >
> > According to this thread
> > <http://lucene.472066.n3.nabble.com/Exception-org-apache-hadoop-mapred-InvalidInputException-Input-path-does-not-exist-file-home-nutch-1a-td3572303.html>
> > the issue should be fixed by deleting all segments files in
> > *-solr/segments*, however, that does not resolve the issue.
> >
> > Any recommendations where this error can come from and what I can do to
> > fix it?
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>
>




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918p4160959.html
Sent from the Nutch - User mailing list archive at Nabble.com.
