Thanks a lot, that works like a charm.
However, my current problem is that I cannot see anything in Solr. Any
recommendations on what I am doing wrong? I have done it exactly as
described on the wiki page. This is my schema.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="nutch" version="1.5">
<types>
<fieldType name="string" class="solr.StrField"
sortMissingLast="true"
omitNorms="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0"
omitNorms="true" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField"
precisionStep="0"
omitNorms="true" positionIncrementGap="0"/>
<fieldType name="date" class="solr.TrieDateField" precisionStep="0"
omitNorms="true" positionIncrementGap="0"/>
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="id" type="string" stored="true" indexed="true"
required="true"/>
<!-- core fields -->
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<!-- fields for index-basic plugin -->
<field name="host" type="string" stored="false" indexed="true"/>
<field name="url" type="url" stored="true" indexed="true"/>
<field name="content" type="text" stored="false" indexed="true"/>
<field name="title" type="text" stored="true" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<!-- fields for index-anchor plugin -->
<field name="anchor" type="string" stored="true" indexed="true"
multiValued="true"/>
<!-- fields for index-more plugin -->
<field name="type" type="string" stored="true" indexed="true"
multiValued="true"/>
<field name="contentLength" type="long" stored="true"
indexed="false"/>
<field name="lastModified" type="date" stored="true"
indexed="false"/>
<field name="date" type="date" stored="true" indexed="true"/>
<!-- fields for languageidentifier plugin -->
<field name="lang" type="string" stored="true" indexed="true"/>
<!-- fields for subcollection plugin -->
<field name="subcollection" type="string" stored="true"
indexed="true" multiValued="true"/>
<!-- fields for feed plugin (tag is also used by
microformats-reltag)-->
<field name="author" type="string" stored="true" indexed="true"/>
<field name="tag" type="string" stored="true" indexed="true"
multiValued="true"/>
<field name="feed" type="string" stored="true" indexed="true"/>
<field name="publishedDate" type="date" stored="true"
indexed="true"/>
<field name="updatedDate" type="date" stored="true"
indexed="true"/>
<!-- fields for creativecommons plugin -->
<field name="cc" type="string" stored="true" indexed="true"
multiValued="true"/>
<!-- fields for tld plugin -->
<field name="tld" type="string" stored="false" indexed="false"/>
<field name="_version_" type="long" stored="true" indexed="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
</schema>
Furthermore, would it be better to use a later Solr version such as 4.10?
I appreciate your reply!
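
For reference, here is a minimal sketch of the argument mix-up explained in the quoted reply below. The function show_crawl_args is a hypothetical helper, not part of Nutch; it only echoes four positional arguments in the order given by the quoted usage line, and crawl/ is an assumed crawl directory name.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): prints how four positional
# arguments map onto the usage line from the quoted reply:
#   bin/crawl <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>
show_crawl_args() {
  printf 'seed_directory=%s\n' "$1"
  printf 'crawl_directory=%s\n' "$2"
  printf 'solr_url=%s\n' "$3"
  printf 'number_of_rounds=%s\n' "$4"
}

# Original call: "-solr" lands in the crawl_directory slot.
show_crawl_args urls/ -solr http://127.0.0.1:8983/solr/ 1

# Corrected call; "crawl/" is an assumed directory name.
show_crawl_args urls/ crawl/ http://127.0.0.1:8983/solr/ 1
```

The first call matches the "crawlDb: -solr/crawldb" line in the quoted log; the second shows each value in its intended slot.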
On Wed, Sep 24, 2014 at 4:35 PM, Jonathan Cooper-Ellis [via Lucene] <
[email protected]> wrote:
> Hello,
>
> It looks like you're confusing the usage of bin/crawl with the old
> bin/nutch crawl command. You want to start the crawl like this:
>
> bin/crawl <seed_directory> <crawl_directory> <solr_url> <number_of_rounds>
>
> So, the script thinks "-solr" is your crawl directory (which does not
> exist):
>
> 2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
>
> Hope that helps!
>
> -jce
>
>
>
> On Wed, Sep 24, 2014 at 9:36 AM, gsamsa <[hidden email]> wrote:
>
> > Hello guys,
> >
> > I have installed *Apache Nutch 1.9* and *Solr 3.6.2*, which run on an
> > Ubuntu virtual machine in VirtualBox.
> >
> > *Description of error*
> >
> >
> > I start a crawl like this:
> >
> > *./bin/crawl urls/ -solr http://127.0.0.1:8983/solr/ 1*
> >
> > However, I get the following error (this is my log from
> > `nutch/logs/hadoop.logs`):
> >
> >
> >
> > / 2014-09-24 14:39:46,252 INFO crawl.Injector - Injector: starting at 2014-09-24 14:39:46
> > 2014-09-24 14:39:46,259 INFO crawl.Injector - Injector: crawlDb: -solr/crawldb
> > 2014-09-24 14:39:46,259 INFO crawl.Injector - Injector: urlDir: urls
> > 2014-09-24 14:39:46,260 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> > 2014-09-24 14:39:47,263 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-09-24 14:39:47,375 WARN snappy.LoadSnappy - Snappy native library not loaded
> > 2014-09-24 14:39:49,076 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> > 2014-09-24 14:39:49,132 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> > 2014-09-24 14:39:50,001 INFO crawl.Injector - Injector: Total number of urls rejected by filters: 0
> > 2014-09-24 14:39:50,002 INFO crawl.Injector - Injector: Total number of urls after normalization: 2
> > 2014-09-24 14:39:50,003 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
> > 2014-09-24 14:39:51,046 INFO crawl.Injector - Injector: overwrite: false
> > 2014-09-24 14:39:51,046 INFO crawl.Injector - Injector: update: false
> > 2014-09-24 14:39:52,116 INFO crawl.Injector - Injector: URLs merged: 2
> > 2014-09-24 14:39:52,136 INFO crawl.Injector - Injector: Total new urls injected: 0
> > 2014-09-24 14:39:52,139 INFO crawl.Injector - Injector: finished at 2014-09-24 14:39:52, elapsed: 00:00:05
> > 2014-09-24 14:39:55,557 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-09-24 14:39:55,571 INFO crawl.Generator - Generator: starting at 2014-09-24 14:39:55
> > 2014-09-24 14:39:55,574 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> > 2014-09-24 14:39:55,575 INFO crawl.Generator - Generator: filtering: false
> > 2014-09-24 14:39:55,575 INFO crawl.Generator - Generator: normalizing: true
> > 2014-09-24 14:39:55,575 INFO crawl.Generator - Generator: topN: 50000
> > 2014-09-24 14:39:58,013 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > 2014-09-24 14:39:58,014 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> > 2014-09-24 14:39:58,014 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> > 2014-09-24 14:39:58,044 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> > 2014-09-24 14:39:58,291 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > 2014-09-24 14:39:58,292 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> > 2014-09-24 14:39:58,292 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> > 2014-09-24 14:39:58,370 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> > 2014-09-24 14:39:58,782 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> > 2014-09-24 14:39:59,785 INFO crawl.Generator - Generator: segment: -solr/segments/20140924143959
> > 2014-09-24 14:40:00,313 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> > 2014-09-24 14:40:01,032 INFO crawl.Generator - Generator: finished at 2014-09-24 14:40:01, elapsed: 00:00:05
> > 2014-09-24 14:40:03,462 INFO fetcher.Fetcher - Fetcher: starting at 2014-09-24 14:40:03
> > 2014-09-24 14:40:03,467 INFO fetcher.Fetcher - Fetcher: segment: -solr/segments
> > 2014-09-24 14:40:03,467 INFO fetcher.Fetcher - Fetcher Timelimit set for : 1411573203467
> > 2014-09-24 14:40:04,207 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-09-24 14:40:04,301 ERROR security.UserGroupInformation - PriviledgedActionException as:testUser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
> > 2014-09-24 14:40:04,302 ERROR fetcher.Fetcher - Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/testUser/Desktop/nutch-solr-example/apache-nutch-1.9/-solr/segments/crawl_generate
> >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
> >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> >     at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:106)
> >     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> >     at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> >     at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> >     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> >     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> >     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> >     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1432)
> >     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)/
> >
> > I basically configured Solr as described in the tutorial on the Apache wiki
> > <http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch>:
> >
> > / mv ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.org
> >
> > cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
> > vi ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
> >
> > Copy exactly at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>
> > /
> > This is what I get when I start solr:
> >
> > <http://lucene.472066.n3.nabble.com/file/n4160918/solr.jpg>
> >
> > *What I tried:*
> >
> >
> > According to this thread
> > <http://lucene.472066.n3.nabble.com/Exception-org-apache-hadoop-mapred-InvalidInputException-Input-path-does-not-exist-file-home-nutch-1a-td3572303.html>
> > the issue should be fixed by deleting all segment files in
> > *-solr/segments*; however, that does not resolve the issue.
> >
> > Any recommendations on where this error comes from and what I can do to
> > fix it?
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918.html
> > Sent from the Nutch - User mailing list archive at Nabble.com.
> >
>
>
--
View this message in context:
http://lucene.472066.n3.nabble.com/Apache-nutch-1-9-error-Input-path-does-not-exist-tp4160918p4160959.html
Sent from the Nutch - User mailing list archive at Nabble.com.