Hi Iain,

That's good news. Thank you for reporting back!

Sebastian

On 09/08/2013 02:10 AM, Iain Lopata wrote:
> Sebastian,
> 
> That was exactly the problem -- and the fix you have posted fixed it.  Thank 
> you so much!
> 
> FYI for others -- The patch file doesn’t apply to 1.6, but it is very easy to 
> see where the changes need to be made.
> 
> Iain
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]] 
> Sent: Saturday, September 07, 2013 5:08 PM
> To: [email protected]
> Subject: Re: Changing Host Name for Solr Index
> 
> Hi Iain,
> 
> I set up a similar project to check normalization in the indexer.
> In general, it works (for current trunk; I didn't check 1.6), but there is a 
> problem if a document is the target of a redirect (in my test there was only 
> one such document).
> In this case, there is a second "representation" URL (e.g., 
> if the source of a temporary redirect is simpler than the target, it is taken 
> as the more representative URL).
> NUTCH-1636 has been opened to address this bug.
> 
> Are all 4 URLs you tried targets of redirects, or rather, is the repr URL 
> set for them in the CrawlDb?
> (You can use "bin/nutch readdb" and look for _repr_.)
> 
> Sebastian
> 
> On 09/07/2013 04:42 AM, Iain Lopata wrote:
>> Sebastian,
>>
>> I have been calling Solrindex with -normalize (and have modified the crawl 
>> script to use -normalize at the index stage also).  I have also been 
>> deleting the documents from Solr before reindexing.
>>
>> The problem is occurring with all documents to be normalized -- although 
>> there are only four so far in my testing.
>>
>> I am using Nutch 1.6 on Ubuntu and Solr 1.4.1.
>>
>> I am using the url as the id field in Solr, if that makes a difference.
>>
>> The following is a slightly longer log extract:
>>
>> 2013-09-06 21:29:58,927 DEBUG regex.RegexURLNormalizer - resource for 
>> scope 'indexer': regex-normalize-indexer.xml
>> 2013-09-06 21:30:01,263 DEBUG util.ObjectCache - No object cache found 
>> for conf=Configuration: core-default.xml, core-site.xml, 
>> mapred-default.xml, mapred-site.xml, 
>> file:/tmp/hadoop-root/mapred/local/localRunner/job_local_0001.xml, 
>> instantiating a new object cache
>> 2013-09-06 21:30:01,286 INFO  indexer.IndexingFilters - Adding 
>> com.atlantbh.nutch.filter.index.omit.OmitIndexingFilter
>> 2013-09-06 21:30:01,337 INFO  indexer.IndexingFilters - Adding 
>> com.atlantbh.nutch.filter.xpath.XPathIndexingFilter
>> 2013-09-06 21:30:01,338 INFO  indexer.IndexingFilters - Adding 
>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
>> org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
>> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
>> org.apache.nutch.indexer.staticfield.StaticFieldIndexer
>> 2013-09-06 21:30:01,394 INFO  anchor.AnchorIndexingFilter - Anchor 
>> deduplication is: off
>> 2013-09-06 21:30:01,394 INFO  indexer.IndexingFilters - Adding 
>> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: 
>> subcollection dest: company
>> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: name 
>> dest: name
>> 2013-09-06 21:30:01,804 INFO  solr.SolrMappingReader - source: address 
>> dest: address
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: email 
>> dest: email
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: 
>> education dest: education
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: 
>> practice dest: practice
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: vcflink 
>> dest: vcflink
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: 
>> imageurl dest: imageurl
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: role 
>> dest: role
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: profile 
>> dest: profile
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: title 
>> dest: title
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: host 
>> dest: host
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: segment 
>> dest: segment
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: boost 
>> dest: boost
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: digest 
>> dest: digest
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: tstamp 
>> dest: tstamp
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: url 
>> dest: id
>> 2013-09-06 21:30:01,805 INFO  solr.SolrMappingReader - source: url 
>> dest: url
>> 2013-09-06 21:30:01,845 INFO  collection.CollectionManager - 
>> Instantiating CollectionManager
>> 2013-09-06 21:30:01,845 INFO  collection.CollectionManager - 
>> initializing CollectionManager
>> 2013-09-06 21:30:01,884 INFO  collection.CollectionManager - file has 
>> 16 elements
>> 2013-09-06 21:30:02,065 INFO  solr.SolrWriter - Indexing 4 documents
>> 2013-09-06 21:30:22,411 INFO  solr.SolrIndexer - SolrIndexer: finished 
>> at 2013-09-06 21:30:22, elapsed: 00:01:09
>>
>> -----Original Message-----
>> From: Sebastian Nagel [mailto:[email protected]]
>> Sent: Friday, September 06, 2013 3:03 PM
>> To: [email protected]
>> Subject: Re: Changing Host Name for Solr Index
>>
>> Hi Iain,
>>
>> I assume that the indexer is called with -normalize
>>
>> % bin/nutch solrindex ... -normalize
>>
>> and that the Solr index is emptied beforehand. Adding or changing 
>> normalization rules will not cause "old" documents to be updated or deleted, 
>> because their id field is still filled with unnormalized URLs.
>>
>> Does this happen for some or for all documents with URLs to be normalized?
>>
>> Can you provide more information (Nutch version, more logs)?
>>
>> Thanks,
>> Sebastian
>>
>> On 09/06/2013 03:34 AM, Iain Lopata wrote:
>>> I am attempting to crawl the mobile subdomain of a site: m.example.com. 
>>>
>>> Instead of indexing these pages in Solr as m.example.com/page1.html,  I want
>>> to add them as www.example.com/page1.html   
>>>
>>> I have used the regex-urlnormalizer at the indexing phase and specified:
>>>
>>> <regex>
>>>   <pattern>m\.example\.com</pattern>
>>>   <substitution>www\.example\.com</substitution>
>>> </regex>
>>>
>>> I can see in the hadoop log that the configuration file for the 
>>> indexer scope is being correctly read:
>>>
>>>        2013-09-05 18:24:06,685 DEBUG regex.RegexURLNormalizer - 
>>> resource for scope 'indexer': regex-normalize-indexer.xml
>>>        
>>> I have also confirmed that the substitution is working with 
>>> URLNormalizerChecker.
>>>
>>> However, when the indexing to Solr completes, both the url and host 
>>> fields contain the m.example.com host name.
>>>
>>> Any ideas on how I can correct this?
>>>
>>> Thanks
>>>
>>
>>
> 
> 
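
For readers who want to try the substitution from the first message outside
of Nutch, here is a minimal sketch in Python. It is only an illustration, not
Nutch code: the names normalize and index_fields are made up for this example,
and Python's re module stands in for the Java regex engine that the
regex-urlnormalizer uses, so details may differ. It assumes the id, url and
host fields should all be derived from the normalized URL, which is the
outcome asked for in the thread.

```python
import re
from urllib.parse import urlsplit

# Same pattern as in the <regex> rule quoted in the thread.
PATTERN = re.compile(r"m\.example\.com")

# Written without backslashes: Python would keep the backslashes of
# "www\.example\.com" literally in the replacement text.
SUBSTITUTION = "www.example.com"


def normalize(url: str) -> str:
    """Apply the host-rewriting rule to a single URL (hypothetical helper)."""
    return PATTERN.sub(SUBSTITUTION, url)


def index_fields(url: str) -> dict:
    """Build id, url and host from the normalized URL (assumed behaviour)."""
    normalized = normalize(url)
    return {
        "id": normalized,
        "url": normalized,
        "host": urlsplit(normalized).hostname,
    }


fields = index_fields("http://m.example.com/page1.html")
print(fields)
```

Note that the pattern is not anchored, so in this sketch it also rewrites any
other host name that merely ends in m.example.com (forum.example.com, for
instance). Anchoring the pattern on the scheme, e.g. ^(https?://)m\.example\.com
with a matching group reference in the substitution, would avoid that.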
