There is usually a URL filter in your way. Use
bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
to check whether the URLs in question are being filtered.

Markus
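To make the filtering idea concrete, here is a minimal Python sketch of how rules in the style of Nutch's conf/regex-urlfilter.txt are evaluated: each rule is a "+" (accept) or "-" (reject) followed by a regex, and the first rule whose regex is found in the URL decides. This only imitates that behaviour for illustration and is not Nutch's implementation. The rule set is hypothetical, built from the mysite.com/subfolder example in this thread, and the names parse_rules and is_accepted are made up for the sketch.

```python
import re

# Hypothetical rules written like conf/regex-urlfilter.txt entries.
# Assumption: the goal is to keep only mysite.com/subfolder/ and drop the rest.
RULES = r"""
# accept only URLs below the subfolder
+^https?://(www\.)?mysite\.com/subfolder/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_accepted(url, rules):
    """First rule whose regex is found in the URL decides; no match means rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.mysite.com/subfolder/page.html",
        "http://www.mysite.com/es/biographies",
    ):
        # Print '+' for accepted and '-' for rejected URLs.
        print(("+" if is_accepted(url, rules) else "-"), url)
```

Because the first matching rule wins, the order of the rules matters: a broad accept rule placed above the subfolder rule would let pages from the parent directory through. The real check on an actual installation is still the URLFilterChecker command above.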
-----Original message-----
> From:Néstor <[email protected]>
> Sent: Tuesday 4th October 2016 18:57
> To: [email protected]
> Subject: Re: why the results have diff number of fields
>
> Maybe because I am trying to just crawl a subfolder mysite.com/subfolder and
> I am having problems configuring it to do this and is going and crawling
> other pages from the parent directory.
>
> Thanks!
>
> On Tue, Oct 4, 2016 at 4:00 AM, Markus Jelsma <[email protected]> wrote:
>
> > Well, probably because you or something indexes different stuff to the
> > Solr index. The first doesn't come from Nutch, the second does.
> > Markus
> >
> > -----Original message-----
> > > From:Nestor <[email protected]>
> > > Sent: Tuesday 4th October 2016 2:07
> > > To: [email protected]
> > > Subject: why the results have diff number of fields
> > >
> > > In my solr query result for "url:*" number of returned fields vary compare
> > > to my second query(see bottom)
> > > <result name="response" numFound="4861" start="0">
> > > <doc>
> > > <str name="body">...</str>
> > > <str name="changed">2010-10-13T18:58:28</str>
> > > <str name="created">2010-10-13T18:58:28</str>
> > > <str name="entity">file</str>
> > > <str name="hash">hvvzxf</str>
> > > <str name="id">hvvzxf/file/53-623</str>
> > > <arr name="im_vid_9">...</arr>
> > > <str name="language">und</str>
> > > <str name="name"/>
> > > <str name="nid">623</str>
> > > <str name="path">sites/default/files/HomePage.pdf</str>
> > > <str name="promote">F</str>
> > > <str name="site">http://www.mysite.com/</str>
> > > <str name="sm_facetbuilder_solr_type">solr_type:facet_3</str>
> > > <arr name="sm_vid_Project_Type">...</arr>
> > > <arr name="spell">...</arr>
> > > <str name="ss_file_node_title">Training Test 2</str>
> > > <str name="ss_file_node_url">http://www.mysite.com/training-test-2</str>
> > > <str name="ss_filemime">application/pdf</str>
> > > <str name="status">T</str>
> > > <str name="sticky">F</str>
> > > <str name="teaser">...</str>
> > > <arr name="tid">...</arr>
> > > <str name="timestamp">2012-11-28T05:05:52.623</str>
> > > <str name="title">HomePage.pdf</str>
> > > <str name="ts_vid_9_names">Construction Professional Services</str>
> > > <str name="uid">0</str>
> > > <str name="url">...</str>
> > > <arr name="vid">...</arr>
> > > </doc>
> > >
> > > When I do a solr query as "content:water" I get less fields in the results:
> > > <result name="response" numFound="177" start="0">
> > > <doc>
> > > <float name="boost">0.027676692</float>
> > > <str name="digest">4872e938706f9bee4d928330e5713623</str>
> > > <str name="id">http://www.mysite.com/es/biographies</str>
> > > <str name="segment">20161003150513</str>
> > > <str name="title">Biographies</str>
> > > <date name="tstamp">2016-10-03T15:21:45.346Z</date>
> > > <str name="url">http://www.mysite.com/es/biographies</str>
> > > </doc>
> > >
> > > Why is that?
> > >
> > > Thanks,
> > >
> > > --
> > > View this message in context: http://lucene.472066.n3.nabble.com/why-the-results-have-diff-number-of-fields-tp4299378.html
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> Né§t☼r *Authority gone to one's head is the greatest enemy of Truth*

