RE: URL in crawldb not appearing in Solr after indexing.

Os Tyler Thu, 01 Aug 2013 15:12:09 -0700

Thanks again for your time, Sebastian.

The output relevant to this URL is quite lengthy and, since it is a PDF, 
contains a lot of binary content. I'm pasting the output up to the binary 
content at the bottom of this email. It looks to me like everything required is 
there. Is there a way to debug solrindex that would help show why this entry is 
not making it from the segment to Solr? (Please see the segread output below.)


BTW, I came up with a workaround: pull the record from the stage Solr 
instance, build an XML add statement from it, and push it to our production 
Solr environment using curl. I would still like to understand how to diagnose 
why this file, which shows up in both the crawldb and the segment, is not 
making it to Solr after running solrindex.

For reference, here is the command for manually adding to Solr, followed by the 
output from segread.

Adding a record to Solr index manually:
 curl http://<solrhost>:<port>/solr/update -H "Content-Type: text/xml" \
   --data-binary '<add> <doc>
   <field name="content">Full page content goes here</field>
   <field name="digest">fc57093cf1d347d1bf94bf4950deb738</field>
   <field name="id">http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf</field>
   <field name="title">Vacation Policy</field>
   <field name="tstamp">2013-07-23T16:13:25.431Z</field>
   <field name="url">http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf</field>
   </doc> </add>'
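
For anyone repeating this manual workaround: hand-writing the XML inside a shell string is fragile, because page text containing &, < or a single quote will break it. Below is a minimal sketch in Python that builds the same <add> message with the escaping left to the standard library. The function names build_add_xml and post_to_solr are hypothetical helpers, not part of Nutch or Solr, and the field values are the ones from the curl command above.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return a Solr XML <add> message for a single document.

    `fields` maps field names to string values. ElementTree escapes
    characters such as & and < in the values when serializing.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


def post_to_solr(update_url, xml_body):
    """POST the message to a Solr update handler (hypothetical helper)."""
    request = urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


pdf_url = "http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf"
record = {
    "content": "Full page content goes here",
    "digest": "fc57093cf1d347d1bf94bf4950deb738",
    "id": pdf_url,
    "title": "Vacation Policy",
    "tstamp": "2013-07-23T16:13:25.431Z",
    "url": pdf_url,
}
xml_body = build_add_xml(record)
# Not executed here; substitute the real host and port first:
# post_to_solr("http://<solrhost>:<port>/solr/update", xml_body)
```

Depending on the Solr autocommit settings, a separate commit may be needed before the added document becomes searchable.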

Here is the segread output:
Recno:: 3252
URL:: http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf

CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Tue Jul 30 18:27:26 EDT 2013
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 100000 seconds (1 days)
Score: 6.5826543E-4
Signature: null
Metadata: _ngt_: 1375222322923Content-Type: application/pdf_pst_: success(1), 
lastModified=0

CrawlDatum::
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Jul 31 01:22:44 EDT 2013
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 100000 seconds (1 days)
Score: 6.5826543E-4
Signature: null
Metadata: _ngt_: 1375222322923Content-Type: application/pdf_pst_: success(1), 
lastModified=0

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Jul 30 18:48:38 EDT 2013
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 80000 seconds (0 days)
Score: 5.418439E-4
Signature: null
Metadata: 

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Jul 30 18:48:14 EDT 2013
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 80000 seconds (0 days)
Score: 8.344418E-6
Signature: null
Metadata: 

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Jul 30 18:48:35 EDT 2013
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 80000 seconds (0 days)
Score: 4.298202E-5
Signature: null
Metadata: 

CrawlDatum::
Version: 7
Status: 67 (linked)
Fetch time: Tue Jul 30 18:48:14 EDT 2013
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 80000 seconds (0 days)
Score: 1.7928971E-7
Signature: null
Metadata: 

Content::
Version: -1
url: http://intranet.ur.com/files/ppb/ppb_3j_002_vacation_policy.pdf
base: http://intranet.ur.com/files/ppb/ppb_3j_002_vacation_policy.pdf
contentType: application/pdf
metadata: Date=Tue, 30 Jul 2013 22:27:26 GMT Content-Length=282669 
Expires=Thu, 29 Aug 2013 22:27:26 GMT Last-Modified=Wed, 24 Jul 2013 17:28:43 GMT 
nutch.crawl.score=6.5826543E-4 _fst_=33 nutch.segment.name=20130730181550 
Accept-Ranges=bytes Connection=close Content-Type=application/pdf 
Server=Apache/2.2.12 (Linux/SUSE) Cache-Control=public, max-age=2419200, 
no-transform 
Content:
%PDF-1.5
(data content from here on)


________________________________________
From: Sebastian Nagel [[email protected]]
Sent: Thursday, August 01, 2013 1:52 PM
To: [email protected]
Subject: Re: URL in crawldb not appearing in Solr after indexing.

> But after I run solrindex against the specific segment,
> the URL is still not visible in the Solr search results.
There should be other data related to this URL in the same segment.
What about parse data (including meta data), parsed text, and signature?
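
One way to answer that question is to scan the readseg dump for the section headers recorded under the URL. The sketch below, in Python, groups a dump by record and counts the sections it finds. It relies on the `URL::`, `CrawlDatum::` and `Content::` markers visible in the output quoted in this thread, and it additionally assumes that parse output appears under `ParseData::` and `ParseText::` headers. The function names are hypothetical.

```python
import re
from collections import Counter

# Section headers; ParseData/ParseText are assumed names (see lead-in).
SECTION_RE = re.compile(r"^(CrawlDatum|Content|ParseData|ParseText)::\s*$")
URL_RE = re.compile(r"^URL::\s*(\S+)")


def sections_by_url(dump_lines):
    """Map each URL in a readseg dump to a Counter of its section headers."""
    found = {}
    current_url = None
    for line in dump_lines:
        url_match = URL_RE.match(line)
        if url_match:
            current_url = url_match.group(1)
            found.setdefault(current_url, Counter())
            continue
        section_match = SECTION_RE.match(line)
        if section_match and current_url is not None:
            found[current_url][section_match.group(1)] += 1
    return found


def missing_parse_sections(sections):
    """List the parse-related sections that never appeared for a record."""
    return [name for name in ("ParseData", "ParseText") if name not in sections]


# Abbreviated sample in the shape of the segread output from this thread.
sample = """Recno:: 3252
URL:: http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf

CrawlDatum::
Version: 7
Status: 33 (fetch_success)

CrawlDatum::
Version: 7
Status: 2 (db_fetched)

Content::
Version: -1
contentType: application/pdf
""".splitlines()

for url, sections in sections_by_url(sample).items():
    print(url, dict(sections), "missing:", missing_parse_sections(sections))
```

The sample is cut off where the binary content would begin, just like the pasted excerpt, so it cannot show whether parse sections followed in the real segment. Running the same check over the full dump file (opened with `errors="replace"` because of the PDF bytes) would settle that.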

On 07/31/2013 02:55 AM, Os Tyler wrote:
> A little progress. I edited 'crawl' and added "-adddays 2", prompting the 
> crawl to include the URL in question, which had a fetch time of:
>> Fetch time: Wed Jul 31 01:22:44 EDT 2013
>
> Now the URL I am after is in a segment (I ran segread against the segment and 
> can see the URL there):
> Recno:: 3252
> URL:: http://redacted.com/files/ppb/ppb_3j_002_vacation_policy.pdf
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
>
> But after I run solrindex against the specific segment, the URL is still 
> not visible in the Solr search results.
>
> Any further thoughts or suggestions?
>
> ________________________________________
> From: Os Tyler [[email protected]]
> Sent: Tuesday, July 30, 2013 2:54 PM
> To: [email protected]
> Subject: RE: URL in crawldb not appearing in Solr after indexing.
>
> Thanks, I appreciate your help.
>
> None of the existing segments contains the relevant URL. Is there a way to 
> ensure a specific URL makes it to a segment?
>
> ________________________________________
> From: Sebastian Nagel [[email protected]]
> Sent: Tuesday, July 30, 2013 2:13 PM
> To: [email protected]
> Subject: Re: URL in crawldb not appearing in Solr after indexing.
>
> Hi,
>
> the signature of the document is null in CrawlDb.
> The signature is calculated when parsing the document, so:
> - has parsing taken place?
> - truncated content?
> - parse failure?
> etc.
>
>> How do I specifically request that an entry in crawldb gets pushed to Solr?
> You have to run solrindex on the segment which contains the fetched and 
> parsed data.
>
> To check whether the segment contains all required data, you can use
> % bin/nutch readseg ...
>
> Sebastian
>
> On 07/30/2013 06:48 PM, Os Tyler wrote:
>> Hello,
>>
>> I have successfully deployed Solr on our development environment and our 
>> stage environment, but I am running into an anomaly the third time around.
>>
>> I have a specific URL that appears in the crawldb but is not showing up 
>> when I search from the Solr interface. How do I specifically request that an 
>> entry in crawldb gets pushed to Solr?
>>
>> I have run solrindex multiple times and it does not produce any errors. 
>> readdb, parsechecker and indexchecker all return positive results for this 
>> URL. Configuration is identical on the to-be-production machine as it is on 
>> dev and stage where it's correctly appearing in Solr.
>>
>> /usr/local/apache-nutch/bin/nutch readdb 
>> /usr/local/apache-nutch/intranet/crawldb/ -url 
>> http://redacted.com/ppb/ppb_3j_002_vacation_policy.pdf
>>
>> URL: http://redacted.com/ppb/ppb_3j_002_vacation_policy.pdf
>> Version: 7
>> Status: 2 (db_fetched)
>> Fetch time: Wed Jul 31 01:22:44 EDT 2013
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 100000 seconds (1 days)
>> Score: 6.5826543E-4
>> Signature: null
>> Metadata: Content-Type: application/pdf_pst_: success(1), lastModified=0
>>
>>
>>
>
