Re: Metadata adjuster

Karl Wright Wed, 22 Feb 2017 07:26:57 -0800

Hi Marisol,

Some observations.
(1) It makes no sense to have "Keep all incoming metadata" set to false,
since that will filter out everything that your Tika extractor extracts.  I
doubt that is what you intended.
(2) I can't see the Solr output configuration -- it looks like it got
truncated?
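
To illustrate point (1), here is a minimal Python sketch of how the "Keep
all incoming metadata" switch can be expected to behave. It is a simplified
model, not ManifoldCF's actual code: the function name adjust_metadata, the
sample Tika fields, and the single-valued treatment of fields are all
assumptions made for illustration.

```python
import re

# Simplified model of a metadata-adjuster stage; this is NOT ManifoldCF's code.
_FIELD_REF = re.compile(r"\$\{([^}]+)\}")


def adjust_metadata(incoming, expressions, keep_all_incoming):
    """Return the metadata that would be handed to the next pipeline stage.

    incoming: dict of field name -> value from earlier stages (e.g. Tika)
    expressions: dict of field name -> expression; "${name}" references a field
    keep_all_incoming: models the "Keep all incoming metadata" setting
    """
    # With the flag off we start from nothing, so only explicit expressions survive.
    result = dict(incoming) if keep_all_incoming else {}
    for name, expression in expressions.items():
        # Substitute each ${field} reference with the incoming value (empty if absent).
        result[name] = _FIELD_REF.sub(
            lambda match: str(incoming.get(match.group(1), "")), expression
        )
    return result


# Hypothetical sample values, invented for this example.
tika_fields = {"title": "Introduction", "content-type": "application/pdf"}
rules = {"facetContentType": "site.ie"}

print(adjust_metadata(tika_fields, rules, keep_all_incoming=False))
print(adjust_metadata(tika_fields, rules, keep_all_incoming=True))
```

In this model, with the flag off only facetContentType is passed on; with
it on, the Tika fields are passed on as well.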


Thanks,
Karl


On Wed, Feb 22, 2017 at 10:12 AM, Marisol Redondo <
[email protected]> wrote:

> Here you are:
>
> View a Job
>
> ------------------------------
>
> Name:
>
> revenueToSites
> ------------------------------
>
> Pipeline:
>
> Stage  Type            Precedent  Description                        Connection name
> 1.     Repository                                                    Revenue Website
> 2.     Transformation  1.                                            Tikka Metadata Extractor
> 3.     Transformation  2.         Set mimeType and facetContentType  customField
> 4.     Output          3.                                            sites solr dev
>
> Notifications:
>
> Stage
>
> Description
>
> Connection name
>
> No notification connections
> ------------------------------
>
> Priority:
>
> 5
>
> Start method:
>
> Don't automatically start
> ------------------------------
>
> Schedule type:
>
> Scan every document once
>
> Minimum recrawl interval:
>
> Not applicable
>
> Maximum recrawl interval:
>
> Not applicable
>
> Expiration interval:
>
> Not applicable
>
> Reseed interval:
>
> Not applicable
> ------------------------------
>
> No scheduled run times
> ------------------------------
>
> Maximum hop count for link type 'link':
>
> Unlimited
>
> Maximum hop count for link type 'redirect':
>
> Unlimited
> ------------------------------
>
> Hop count mode:
>
> Delete unreachable documents
> ------------------------------
>
> 1.
>
> Seeds:
>
> https://xxxxxx/index.aspx
> <https://preview.revenuedomain.ie/en/press-office/index.aspx>
> ------------------------------
>
> No canonicalization specified - all URLs will be reordered and have all
> sessions removed
> ------------------------------
>
> No mappings specified; will accept all URLs
> ------------------------------
>
> Include only hosts matching seeds?
>
> yes
> ------------------------------
>
> Include in crawl:
>
> .*
> ------------------------------
>
> Include in index:
>
> .*
> ------------------------------
>
> Exclude from crawl:
>
> \.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|
> EPS|wmf|WMF|zip|ZIP|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|
> exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|<script>|</script>|<script
> type="text/javascript">)
> [?*!@=].*
> ------------------------------
>
> Exclude from index:
> ------------------------------
>
> Exclude content from index:
> ------------------------------
>
> No access tokens specified
> ------------------------------
>
> Excluded headers:
>
> last-modified
> ------------------------------
>
> 2.
>
> Field mappings:
>
> Metadata field name
>
> Final field name
>
> No field mapping specified
> ------------------------------
>
> Keep all metadata:
>
> true
> ------------------------------
>
> Lower names:
>
> false
> ------------------------------
>
> Write limit:
> ------------------------------
>
> Ignore Tika exceptions:
>
> true
> ------------------------------
>
> Boilerplate extractor:
>
> -- No extraction selected --
> ------------------------------
>
> 3.
>
> Metadata expressions:
>
> Parameter name    Remove this parameter?  Expression ("${fieldname}" references a field)
> facetContentType  false                   site.ie
> ------------------------------
>
> Keep all incoming metadata
>
> false
>
> Remove empty metadata values
>
> false
> ------------------------------
>
> 4.
>
>
>
>     Marisol Redondo
>
>     Email: [email protected]
>
>     Phone: 35428
>
>
>
> On 22 February 2017 at 14:53, Karl Wright <[email protected]> wrote:
>
>> Hi Marisol,
>>
>> The [INFO] log entries indicate that your document has almost no metadata
>> at all.  But the Metadata Adjuster transformation connector is designed to
>> do exactly what you want.
>>
>> Can you view your job, and cut and paste the View Job page into an email,
>> so I can see how your metadata adjuster transformation connection and your
>> solr output connections are configured?  Thanks!
>>
>> Karl
>>
>>
>>
>>
>> On Wed, Feb 22, 2017 at 8:57 AM, Marisol Redondo <
>> [email protected]> wrote:
>>
>>> Hi Karl, and thank you for the quick answer.
>>>
>>> I was reading the documentation for MCF 1.10 but I'm using MCF 2.5, sorry
>>> for the confusion, and I think this version is compatible with Solr 6.
>>> The PDF doesn't have any metadata or field called facetContentType, which
>>> is why I have been trying to use the Metadata Adjuster to add a new
>>> metadata/property to the doc, so that Solr can index by this field when I'm
>>> injecting the doc.
>>> Should I use another transformation, or is there any other way of doing it?
>>> I am migrating from Nutch to ManifoldCF. In Nutch we can do it with
>>> plugins, and I was thinking that the plugins in Nutch are the same as the
>>> transformation connectors in MCF.
>>>
>>> The complete error in Solr is:
>>>
>>> 2017-02-21 13:19:32.108 INFO  (qtp1854778591-18) [   x:sites] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}
>>>
>>> 2017-02-21 13:19:32.454 INFO  (qtp1854778591-18) [   x:sites] o.a.s.u.p.LogUpdateProcessorFactory [sites]  webapp=/solr path=/update/extract params={resource.name=introduction.pdf&literal.id=https://...../introduction.pdf&wt=xml&version=2.2}{} 0 347
>>>
>>> 2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [   x:sites] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: [doc=https://....../introduction.pdf] missing required field: facetContentType
>>>         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
>>>         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
>>>         at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
>>>         at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>> On 21 February 2017 at 14:52, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Marisol,
>>>>
>>>> Can you find the [INFO] entry in the Solr log for this document?  That
>>>> should help clear up any confusion.
>>>>
>>>> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up to
>>>> date with Solr 6.x.  That could be the source of the problem.  Is there any
>>>> reason you are using a 1.x version of MCF?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> I'm trying to use the Metadata Adjuster to add one field to the Solr index,
>>>>> but it doesn't inject the field into a Solr field.
>>>>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>>>>> read in the documentation
>>>>> (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html)
>>>>> that I can add metadata to the document that is going to be indexed into
>>>>> Solr, but the Solr instance gave me the error "missing required field:
>>>>> facetContentType".
>>>>>
>>>>> ManifoldCF Job pipeline:
>>>>> 1. Repository (type web repository)
>>>>> 2. Transformation (Tikka Metadata Extractor)
>>>>> 3. Transformation (type Metadata Adjuster)
>>>>> 4. Output (Solr 6)
>>>>>
>>>>> ManifoldCF Job Metadata Expressions tab:
>>>>>   Parameter name: "facetContentType"
>>>>>   Remove this parameter: false
>>>>>   Expression: xxxx  (the literal text value I want in facetContentType)
>>>>>
>>>>> Solr schema:
>>>>>   .....
>>>>>   <field name="facetContentType" type="string" indexed="true"
>>>>> stored="true" required="true"/>
>>>>>  ....
>>>>>
>>>>> The error logged in ManifoldCF is:
>>>>>       Error from server at http://solrServer:port/solr/core
>>>>> <http://revnetsolrdev:8983/solr/sites>: [doc=https://....../index.aspx]
>>>>> missing required field: facetContentType.
>>>>>
>>>>> Thanks for your help
>>>>>
>>>>
>>>>
>>>
>>
>
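
For reference, the "missing required field" error in the original report can
be mimicked offline with a short Python sketch. This is only a simplified
model of a required-field check, not Solr's implementation: the <schema>
wrapper element, the helper names, and the example.invalid document id are
assumptions added for illustration, while the <field> definition is the one
quoted in the thread.

```python
import xml.etree.ElementTree as ET

# The <field> element is taken from the thread; the <schema> wrapper is an
# assumption added so the fragment parses as a standalone XML document.
SCHEMA_FRAGMENT = """
<schema>
  <field name="facetContentType" type="string" indexed="true"
         stored="true" required="true"/>
</schema>
"""


def required_fields(schema_xml):
    """List the names of all fields declared with required="true"."""
    root = ET.fromstring(schema_xml)
    return [f.get("name") for f in root.iter("field") if f.get("required") == "true"]


def missing_required(document, schema_xml):
    """Return required field names that are absent or empty in the document."""
    return [name for name in required_fields(schema_xml) if not document.get(name)]


# Hypothetical document carrying only an id, similar to the request in the
# Solr log, where no facetContentType value was sent.
sent_by_job = {"id": "https://example.invalid/introduction.pdf"}
with_field = dict(sent_by_job, facetContentType="site.ie")

print(missing_required(sent_by_job, SCHEMA_FRAGMENT))
print(missing_required(with_field, SCHEMA_FRAGMENT))
```

The first call reports facetContentType as missing; the second reports
nothing, which is the state the Metadata Adjuster stage is meant to produce
before the document reaches the Solr output.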
