Hello friends,
I configured Nutch 2.2.1 to crawl the web page http://intranet.uclv.edu.cu.
I got the results shown below when I ran this command:
./bin/crawl urls crawlId http://localhost:8983/solr/ 3
I need to know whether I did something wrong. I feel that something is not
working properly: the fetcher reports 0 records and 0 pages. I have included
the config files below as well.
Please write to me; this is my third mail and I have not received any answers
or suggestions from this mailing list.
Thanks in advance,
Luis Armando



root@solr1:/opt/apache-nutch-2.2.1/runtime/local# ./bin/crawl urls crawlId http://localhost:8983/solr/ 3
InjectorJob: starting at 2013-10-17 18:43:13
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2013-10-17 18:43:15, elapsed: 00:00:02
Thu Oct 17 18:43:15 UTC 2013 : Iteration 1 of 3
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2013-10-17 18:43:16
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2013-10-17 18:43:19, time elapsed: 00:00:02
GeneratorJob: generated batch id: 1382035395-32147
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1382035395-32147
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1382046200181
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread7, activeThreads=0
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=0
-finishing thread FetcherThread10, activeThreads=0
-finishing thread FetcherThread11, activeThreads=0
-finishing thread FetcherThread12, activeThreads=0
-finishing thread FetcherThread13, activeThreads=0
-finishing thread FetcherThread15, activeThreads=0
-finishing thread FetcherThread14, activeThreads=0
-finishing thread FetcherThread16, activeThreads=0
-finishing thread FetcherThread17, activeThreads=0
-finishing thread FetcherThread18, activeThreads=0
-finishing thread FetcherThread19, activeThreads=0
-finishing thread FetcherThread20, activeThreads=0
-finishing thread FetcherThread21, activeThreads=0
-finishing thread FetcherThread23, activeThreads=0
-finishing thread FetcherThread22, activeThreads=0
-finishing thread FetcherThread24, activeThreads=0
-finishing thread FetcherThread26, activeThreads=0
-finishing thread FetcherThread25, activeThreads=0
-finishing thread FetcherThread27, activeThreads=0
-finishing thread FetcherThread28, activeThreads=0
-finishing thread FetcherThread29, activeThreads=0
-finishing thread FetcherThread30, activeThreads=0
-finishing thread FetcherThread31, activeThreads=0
-finishing thread FetcherThread32, activeThreads=0
-finishing thread FetcherThread33, activeThreads=0
-finishing thread FetcherThread34, activeThreads=0
-finishing thread FetcherThread35, activeThreads=0
-finishing thread FetcherThread36, activeThreads=0
-finishing thread FetcherThread38, activeThreads=0
-finishing thread FetcherThread37, activeThreads=0
-finishing thread FetcherThread39, activeThreads=0
-finishing thread FetcherThread40, activeThreads=0
-finishing thread FetcherThread41, activeThreads=0
-finishing thread FetcherThread42, activeThreads=0
-finishing thread FetcherThread43, activeThreads=0
-finishing thread FetcherThread44, activeThreads=0
-finishing thread FetcherThread45, activeThreads=0
-finishing thread FetcherThread46, activeThreads=0
-finishing thread FetcherThread47, activeThreads=0
-finishing thread FetcherThread48, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing :
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: batchId:     1382035395-32147
ParserJob: success
CrawlDB update for crawlId
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing crawlId on SOLR index -> http://localhost:8983/solr/
SolrIndexerJob: starting
SolrIndexerJob: done.
SOLR dedup -> http://localhost:8983/solr/



<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
      include.  Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
    </description>
  </property>
</configuration>
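
A minimal sketch of how the plugin.includes value above could be checked against plugin directory names. It assumes that Nutch accepts a plugin only when its id matches the whole expression (this is my understanding of the behaviour, not something confirmed in this thread). The helper name `enabled` and the list of candidate ids are only for illustration.

```python
import re

# Value of plugin.includes copied from the configuration above.
PLUGIN_INCLUDES = (
    r"protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    r"|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression.

    Assumption: a full match is required, not a substring match.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Illustrative list of plugin ids to test against the expression.
    for plugin_id in ["protocol-http", "protocol-httpclient", "protocol-file",
                      "parse-html", "index-basic"]:
        state = "included" if enabled(plugin_id) else "excluded"
        print(f"{plugin_id}: {state}")
```

Under that assumption, protocol-file is included while neither protocol-http nor protocol-httpclient matches, which may be worth checking because the seed is an http:// URL.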
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
#-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://([a-z0-9]*\.)*intranet.uclv.edu.cu/
+^file://srv/samba/files/
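
The first-match rule described in the comments of the filter file above can be sketched as follows. This is a rough model, not the Nutch implementation: it assumes each expression is searched anywhere in the URL (not anchored unless the pattern says so), and the helper names `load_rules` and `accepts` are hypothetical.

```python
import re

# Active (non-comment) rules copied from the filter file above, in file order.
RULES = r"""
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
-[?*!@=]
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://([a-z0-9]*\.)*intranet.uclv.edu.cu/
+^file://srv/samba/files/
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides.

    If no rule matches, the URL is ignored.
    """
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


rules = load_rules(RULES)

if __name__ == "__main__":
    for url in ["http://intranet.uclv.edu.cu/",
                "http://intranet.uclv.edu.cu",
                "http://intranet.uclv.edu.cu/index.php?id=1",
                "file://srv/samba/files/doc.txt"]:
        print(url, "->", "accepted" if accepts(rules, url) else "ignored")
```

In this model the seed URL is accepted only with a trailing slash, because the '+' pattern ends in '/', and any URL containing '?' or '=' is rejected by the second rule before the '+' rules are reached.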
