I take it that the solrindex reduce task is single-threaded and can't be run
in parallel? The map task ran fine, but I got these errors as soon as the
reduce started:
ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201010271034_0088_r_000000"
TASK_ATTEMPT_ID="attempt_201010271034_0088_r_000000_0"
START_TIME="1288230817350"
TRACKER_NAME="tracker_search1:localhost/127.0.0.1:38875" HTTP_PORT="50060" .
ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201010271034_0088_r_000000"
TASK_ATTEMPT_ID="attempt_201010271034_0088_r_000000_0" TASK_STATUS="FAILED"
FINISH_TIME="1288233261168" HOSTNAME="search1" ERROR="java.io.IOException
    at org.apache.nutch.indexer.solr.SolrWriter.makeIOException(SolrWriter.java:85)
    at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:66)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:159)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:472)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
    at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:64)
    ... 8 more
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at java.net.Socket.connect(Socket.java:478)
    at java.net.Socket.<init>(Socket.java:375)
    at java.net.Socket.<init>(Socket.java:249)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
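The root cause at the bottom of the trace is a plain TCP "Connection
refused": nothing was accepting connections on the Solr host and port the
reduce task tried to reach. Before rerunning, it can help to check whether
anything is actually listening there. A small sketch (the Solr URL below is a
placeholder; substitute whatever you pass to the solrindex command):

```python
import socket
from urllib.parse import urlparse

def solr_reachable(url, timeout=5.0):
    """Return True if a TCP connection to the URL's host:port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    # 8983 is the port used by Solr's example Jetty setup; yours may differ.
    port = parsed.port or 8983
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # "Connection refused", timeouts, unresolvable hosts, etc.
        return False

if __name__ == "__main__":
    # Hypothetical URL; use the one from your solrindex invocation.
    print(solr_reachable("http://localhost:8983/solr"))
```

If this returns False from the machine running the reducers, the failure has
nothing to do with reduce parallelism; Solr simply wasn't reachable.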
I am running in pseudo-distributed mode and I have the following set in
mapred-site.xml:
<property>
  <name>mapred.map.tasks</name>
  <value>7</value>
  <description>The default number of map tasks per job. Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts. Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
If needed, I can always set up two conf directories: use the current settings
up to the solrindex step, stop the hadoop daemons right before solrindex
starts, and then restart them with settings that set the number of reduce
tasks to 1. However, I would prefer to run the reduce tasks in parallel to
take advantage of the multicore/multithreaded sparc processor, if there is a
way to do so.
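For reference, the second conf directory in that workaround would only need
to differ in the one property; a minimal mapred-site.xml override for it
might look like this (a sketch, keeping everything else unchanged):

```xml
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>Single reducer for the solrindex run.</description>
</property>
```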
Thanks,
Steve Cohen
On Wed, Oct 27, 2010 at 5:27 AM, Markus Jelsma
<[email protected]>wrote:
> Only overwrite the jars if you use the Solr 3.x branch or trunk. If you're
> using 1.3 or 1.4.1 then you're fine with the jars that come with Nutch 1.2.
>
> The schema.xml and solrmapping.xml files are just examples that come with
> Nutch. Nutch won't index your Solr database; it will only send data to your
> Solr index, using solrmapping.xml as a mapping between Nutch's fields and
> Solr's fields.
>
> You need the solrmapping.xml in your Nutch configuration and schema.xml in
> your Solr configuration; then you're good to go with an example crawl. Fire
> the solrindex command and, if it fails, send the output from Nutch's
> logs/hadoop.log and the output of your Solr servlet container.
>
>
> On Tuesday 26 October 2010 23:46:29 Steve Cohen wrote:
> > Thanks for the response.
> >
> > Let me see if I understand properly.
> >
> > The solr jar files in lib
> >
> > ./lib/apache-solr-core-1.4.0.jar
> > ./lib/apache-solr-solrj-1.4.0.jar
> >
> > and solrindex-mapping.xml/schema.xml file in conf is so that nutch can
> > index the solr database, while the separate solr instance is used to
> > search the result?
> >
> > Thanks,
> > Steve
> >
> > On Tue, Oct 26, 2010 at 3:43 PM, Markus Jelsma
> >
> > <[email protected]>wrote:
> > > Hi,
> > >
> > > You'll need a 1.3 or 1.4.x version (I don't know if the bin format
> > > changed between 1.3 and 1.4, but I think it didn't). You can also use
> > > the Solr 3.x branch, but you'll have to copy the solrj and solr-core
> > > JARs from your Solr build to Nutch's lib dir.
> > >
> > > Make sure you check Solr's log if Nutch fails to send data, and make
> > > sure your Solr schema.xml is correct for what Nutch sends to it.
> > >
> > > Cheers,
> > >
> > >
> > > On Tue, 26 Oct 2010 15:05:33 -0400, Steve Cohen <[email protected]>
> > >
> > > wrote:
> > >> Hello,
> > >>
> > >> I am looking at the wiki page for running nutch and solr.
> > >>
> > >> http://wiki.apache.org/nutch/RunningNutchAndSolr
> > >>
> > >> I see this step:
> > >>
> > >> *1.* Download Solr version 1.3.0 or LucidWorks for Solr
> > >> <http://wiki.apache.org/nutch/LucidWorks> from the Download page
> > >>
> > >> and this step:
> > >>
> > >> *5.* Configure Solr For the sake of simplicity we are going to use the
> > >> example configuration of Solr as a base.
> > >>
> > >> Do we still download a version of solr (presumably version 1.4 since
> > >> that is
> > >> what nutch 1.2 is using) and configure it?
> > >>
> > >> Thanks,
> > >> Steve
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536600 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>