Re: Changing Host Name for Solr Index

Sebastian Nagel Sat, 07 Sep 2013 15:09:37 -0700

Hi Iain,

I set up a similar project to check normalization in the indexer.
In general it works (on current trunk; I didn't check 1.6),
but there is a problem if a document is the target of a redirect
(there was only one such document in my test).
In this case there is a second, "representative" URL (e.g.,
if the source of a temporary redirect is simpler than the target,
it is taken as the more representative URL).
NUTCH-1636 has been opened to address this bug.


Are all 4 URLs you tried targets of redirects or, put differently,
is the repr URL set for them in CrawlDb?
(You can use "bin/nutch readdb" and look for _repr_.)

Sebastian

On 09/07/2013 04:42 AM, Iain Lopata wrote:
> Sebastian,
> 
> I have been calling Solrindex with -normalize (and have modified the crawl 
> script to use -normalize at the index stage also).  I have also been deleting 
> the documents from Solr before reindexing.
> 
> The problem is occurring with all documents to be normalized -- although 
> there are only four so far in my testing.
> 
> I am using Nutch 1.6 on Ubuntu and Solr 1.4.1
> 
> I am using the url as the id field in solr if that makes a difference.
> 
> The following is a slightly longer log extract
> 
> 2013-09-06 21:29:58,927 DEBUG regex.RegexURLNormalizer - resource for scope 
> 'indexer': regex-normalize-indexer.xml
> 2013-09-06 21:30:01,263 DEBUG util.ObjectCache - No object cache found for 
> conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, 
> mapred-site.xml, 
> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0001.xml, 
> instantiating a new object cache
> 2013-09-06 21:30:01,286 INFO  indexer.IndexingFilters - Adding 
> com.atlantbh.nutch.filter.index.omit.OmitIndexingFilter
> 2013-09-06 21:30:01,337 INFO  indexer.IndexingFilters - Adding 
> com.atlantbh.nutch.filter.xpath.XPathIndexingFilter
> 2013-09-06 21:30:01,338 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.staticfield.StaticFieldIndexer
> 2013-09-06 21:30:01,394 INFO  anchor.AnchorIndexingFilter - Anchor 
> deduplication is: off
> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: subcollection 
> dest: company
> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: name dest: name
> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: address dest: 
> address
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: email dest: 
> email
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: education 
> dest: education
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: practice dest: 
> practice
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: vcflink dest: 
> vcflink
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: imageurl dest: 
> imageurl
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: role dest: role
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: profile dest: 
> profile
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: title dest: 
> title
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: host dest: host
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: segment dest: 
> segment
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: boost dest: 
> boost
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: digest dest: 
> digest
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: tstamp dest: 
> tstamp
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: url dest: id
> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: url dest: url
> 2013-09-06 21:30:01,845 INFO  collection.CollectionManager - Instantiating 
> CollectionManager
> 2013-09-06 21:30:01,845 INFO  collection.CollectionManager - initializing 
> CollectionManager
> 2013-09-06 21:30:01,884 INFO  collection.CollectionManager - file has 16 
> elements
> 2013-09-06 21:30:02,065 INFO  solr.SolrWriter - Indexing 4 documents
> 2013-09-06 21:30:22,411 INFO  solr.SolrIndexer - SolrIndexer: finished at 
> 2013-09-06 21:30:22, elapsed: 00:01:09
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]] 
> Sent: Friday, September 06, 2013 3:03 PM
> To: [email protected]
> Subject: Re: Changing Host Name for Solr Index
> 
> Hi Iain,
> 
> I assume that the indexer is called with -normalize
> 
> % bin/nutch solrindex ... -normalize
> 
> and that the Solr index is emptied beforehand: adding or changing 
> normalization rules does not update or delete "old" documents, because 
> their id field still holds the unnormalized URLs.
> 
> Does this happen for some or for all documents with URLs to be normalized?
> 
> Can you provide more information (Nutch version, more logs)?
> 
> Thanks,
> Sebastian
> 
> On 09/06/2013 03:34 AM, Iain Lopata wrote:
>> I am attempting to crawl the mobile subdomain of a site: m.example.com. 
>>
>> Instead of indexing these pages in Solr as m.example.com/page1.html,  I want
>> to add them as www.example.com/page1.html   
>>
>> I have used the regex-urlnormalizer at the indexing phase and specified:
>>
>> <regex>
>>   <pattern>m\.example\.com</pattern>
>>   <substitution>www\.example\.com</substitution>
>> </regex>
>>
>> I can see in the hadoop log that the configuration file for the 
>> indexer scope is being correctly read:
>>
>>        2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - 
>> resource for scope 'indexer': regex-normalize-indexer.xml
>>        
>> I have also confirmed that the substitution is working with 
>> URLNormalizerChecker.
>>
>> However, when the indexing to Solr completes, both the url and host 
>> fields contain the m.example.com host name.
>>
>> Any ideas on how I can correct this?
>>
>> Thanks
>>
> 
>
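
The indexer-scope rule quoted at the start of the thread can be tried in
isolation. The following is a minimal sketch, not Nutch code: it assumes the
regex normalizer applies each rule with java.util.regex's Matcher.replaceAll,
and the class and method names (IndexerNormalizeSketch, normalize,
normalizeAnchored) are made up for illustration. It also shows that the
unanchored pattern from the thread rewrites any host that merely ends in
"m.example.com"; the anchored variant is a suggestion, not something proposed
in the thread.

```java
import java.util.regex.Pattern;

public class IndexerNormalizeSketch {
    // Pattern and substitution copied verbatim from the rule in the thread.
    private static final Pattern RULE = Pattern.compile("m\\.example\\.com");
    private static final String SUBSTITUTION = "www\\.example\\.com";

    // Hypothetical stricter variant: only match the host right after the scheme.
    private static final Pattern ANCHORED =
            Pattern.compile("^(https?://)m\\.example\\.com");
    private static final String ANCHORED_SUBSTITUTION = "$1www.example.com";

    // Applies the rule as written; "\." in a Java replacement is a literal dot.
    static String normalize(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    // Applies the anchored variant, keeping the scheme via group 1.
    static String normalizeAnchored(String url) {
        return ANCHORED.matcher(url).replaceAll(ANCHORED_SUBSTITUTION);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/page1.html
        System.out.println(normalize("http://m.example.com/page1.html"));
        // prints http://foruwww.example.com/ (unintended match inside "forum")
        System.out.println(normalize("http://forum.example.com/"));
        // prints http://forum.example.com/ (left untouched)
        System.out.println(normalizeAnchored("http://forum.example.com/"));
    }
}
```

If the substitution behaves as expected here and in URLNormalizerChecker but
the indexed url and host fields are still unchanged, the rule itself is
probably not the cause, which is consistent with the redirect/repr URL
question raised above.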
