Re: Metadata adjuster

Marisol Redondo Wed, 22 Feb 2017 07:45:09 -0800

I tried with "Keep all incoming metadata" set to false and also to true,
but I'll take your advice and leave it set to true.


I don't know why you can't see it, but it's the 4th stage.
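
As a rough illustration of the "Keep all incoming metadata" setting discussed here, the sketch below models the Metadata expressions tab in plain Python. It is a toy model under stated assumptions, not ManifoldCF's implementation: adjust_metadata, its arguments, and the sample Tika fields are hypothetical names introduced only for this example.

```python
import re

# Toy model of the settings shown on the Metadata expressions tab.
# This is NOT ManifoldCF code; all names here are hypothetical.
_FIELD_REF = re.compile(r"\$\{([^}]+)\}")


def adjust_metadata(incoming, expressions, keep_all_incoming, remove_empty=False):
    """Return the metadata a downstream stage would see under this model.

    incoming:    dict of field name -> list of values (e.g. from the Tika stage)
    expressions: dict of parameter name -> expression string, where
                 "${fieldname}" is replaced by the first incoming value
    """
    def expand(expression):
        # Replace each ${fieldname} with the first incoming value, or "".
        return _FIELD_REF.sub(
            lambda m: (incoming.get(m.group(1)) or [""])[0], expression
        )

    # With keep_all_incoming=False only the explicitly listed parameters survive.
    result = {k: list(v) for k, v in incoming.items()} if keep_all_incoming else {}
    for name, expression in expressions.items():
        result[name] = [expand(expression)]
    if remove_empty:
        result = {k: [x for x in v if x] for k, v in result.items()}
        result = {k: v for k, v in result.items() if v}
    return result


# Hypothetical fields standing in for whatever the Tika stage extracts.
tika_fields = {"title": ["Introduction"], "content-type": ["application/pdf"]}
rules = {"facetContentType": "site.ie"}

print(adjust_metadata(tika_fields, rules, keep_all_incoming=False))
print(adjust_metadata(tika_fields, rules, keep_all_incoming=True))
```

In this model the first call keeps only facetContentType, while the second keeps the extracted fields and adds facetContentType alongside them.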

On 22 February 2017 at 15:26, Karl Wright <[email protected]> wrote:

> Hi Marisol,
>
> Some observations.
> (1) It makes no sense to have "Keep all incoming metadata" set to false,
> since that will filter out everything that your tika extractor extracts.  I
> doubt that is what you have intended.
> (2) I can't see the Solr output configuration -- looks like it got
> truncated?
>
> Thanks,
> Karl
>
>
> On Wed, Feb 22, 2017 at 10:12 AM, Marisol Redondo <
> [email protected]> wrote:
>
>> Here you are:
>>
>> View a Job
>>
>>
>> ------------------------------
>>
>> Name:
>>
>> revenueToSites
>> ------------------------------
>>
>> Pipeline:
>>
>> Stage  Type            Precedent  Description                        Connection name
>> 1.     Repository                                                    Revenue Website
>> 2.     Transformation  1.                                            Tikka Metadata Extractor
>> 3.     Transformation  2.         Set mimeType and facetContentType  customField
>> 4.     Output          3.                                            sites solr dev
>>
>> Notifications:
>>
>> Stage
>>
>> Description
>>
>> Connection name
>>
>> No notification connections
>> ------------------------------
>>
>> Priority:
>>
>> 5
>>
>> Start method:
>>
>> Don't automatically start
>> ------------------------------
>>
>> Schedule type:
>>
>> Scan every document once
>>
>> Minimum recrawl interval:
>>
>> Not applicable
>>
>> Maximum recrawl interval:
>>
>> Not applicable
>>
>> Expiration interval:
>>
>> Not applicable
>>
>> Reseed interval:
>>
>> Not applicable
>> ------------------------------
>>
>> No scheduled run times
>> ------------------------------
>>
>> Maximum hop count for link type 'link':
>>
>> Unlimited
>>
>> Maximum hop count for link type 'redirect':
>>
>> Unlimited
>> ------------------------------
>>
>> Hop count mode:
>>
>> Delete unreachable documents
>> ------------------------------
>>
>> 1.
>>
>> Seeds:
>>
>> https://xxxxxx/index.aspx
>> <https://preview.revenuedomain.ie/en/press-office/index.aspx>
>> ------------------------------
>>
>> No canonicalization specified - all URLs will be reordered and have all
>> sessions removed
>> ------------------------------
>>
>> No mappings specified; will accept all URLs
>> ------------------------------
>>
>> Include only hosts matching seeds?
>>
>> yes
>> ------------------------------
>>
>> Include in crawl:
>>
>> .*
>> ------------------------------
>>
>> Include in index:
>>
>> .*
>> ------------------------------
>>
>> Exclude from crawl:
>>
>> \.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|<script>|</script>|<script type="text/javascript">)
>> [?*!@=].*
>> ------------------------------
>>
>> Exclude from index:
>> ------------------------------
>>
>> Exclude content from index:
>> ------------------------------
>>
>> No access tokens specified
>> ------------------------------
>>
>> Excluded headers:
>>
>> last-modified
>> ------------------------------
>>
>> 2.
>>
>> Field mappings:
>>
>> Metadata field name
>>
>> Final field name
>>
>> No field mapping specified
>> ------------------------------
>>
>> Keep all metadata:
>>
>> true
>> ------------------------------
>>
>> Lower names:
>>
>> false
>> ------------------------------
>>
>> Write limit:
>> ------------------------------
>>
>> Ignore Tika exceptions:
>>
>> true
>> ------------------------------
>>
>> Boilerplate extractor:
>>
>> -- No extraction selected --
>> ------------------------------
>>
>> 3.
>>
>> Metadata expressions:
>>
>> Parameter name    Remove this parameter?  Expression ("${fieldname}" references a field)
>> facetContentType  false                   site.ie
>> ------------------------------
>>
>> Keep all incoming metadata
>>
>> false
>>
>> Remove empty metadata values
>>
>> false
>> ------------------------------
>>
>> 4.
>>
>>
>>
>>     Marisol Redondo
>>
>>     Email: [email protected]
>>
>>     Phone: 35428
>>
>>
>>
>> Please note that Revenue cannot guarantee that any personal and sensitive 
>> data, sent in plain text via standard email, is fully secure. Customers who 
>> choose to use this channel are deemed to have accepted any risk involved. 
>> The alternative communication methods offered by Revenue include standard 
>> post and the option to use our (encrypted) MyEnquiries service which is 
>> available within myAccount and ROS. You can register for either myAccount or 
>> ROS on the Revenue website.
>>
>>
>>
>>
>> On 22 February 2017 at 14:53, Karl Wright <[email protected]> wrote:
>>
>>> Hi Marisol,
>>>
>>> The [INFO] log entries indicate that your document has almost no
>>> metadata at all.  But the Metadata Adjuster transformation connector is
>>> designed to do exactly what you want.
>>>
>>> Can you view your job, and cut and paste the View Job page into an
>>> email, so I can see how your metadata adjuster transformation connection
>>> and your solr output connections are configured?  Thanks!
>>>
>>> Karl
>>>
>>>
>>>
>>>
>>> On Wed, Feb 22, 2017 at 8:57 AM, Marisol Redondo <
>>> [email protected]> wrote:
>>>
>>>> Hi Karl, and thank you for the quick answer.
>>>>
>>>> I was reading the documentation for MCF 1.10, but I'm actually using MCF 2.5
>>>> (sorry for the confusion), and I think this version is compatible with Solr 6.
>>>> The PDF doesn't have any metadata or field called facetContentType, which
>>>> is why I have been trying to use the Metadata Adjuster: to add a new
>>>> metadata property to the document so that Solr can index by this field when
>>>> I inject the document.
>>>> Should I use another transformation, or is there some other way of doing it?
>>>> I am migrating from Nutch to ManifoldCF. In Nutch we can do this with
>>>> plugins, and I assumed that Nutch plugins are the equivalent of the
>>>> transformation connectors in MCF.
>>>>
>>>> The complete error in Solr is:
>>>>
>>>> 2017-02-21 13:19:32.108 INFO  (qtp1854778591-18) [   x:sites] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}
>>>>
>>>> 2017-02-21 13:19:32.454 INFO  (qtp1854778591-18) [   x:sites] o.a.s.u.p.LogUpdateProcessorFactory [sites]  webapp=/solr path=/update/extract params={resource.name=introduction.pdf&literal.id=https://...../introduction.pdf&wt=xml&version=2.2}{} 0 347
>>>>
>>>> 2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [   x:sites] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: [doc=https://....../introduction.pdf] missing required field: facetContentType
>>>>         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
>>>>         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
>>>>         at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
>>>>         at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On 21 February 2017 at 14:52, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Marisol,
>>>>>
>>>>> Can you find the [INFO] entry in the Solr log for this document?  That
>>>>> should help clear up any confusion.
>>>>>
>>>>> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up
>>>>> to date with Solr 6.x.  That could be the source of the problem.  Is there
>>>>> any reason you are using a 1.x version of MCF?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I'm trying to use the Metadata Adjuster to add one field to the Solr
>>>>>> index, but it doesn't inject the value into the Solr field.
>>>>>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>>>>>> read in the documentation
>>>>>> (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html)
>>>>>> that I can add metadata to the document that is going to be indexed into
>>>>>> Solr. However, the Solr instance gave me the error "missing required field:
>>>>>> facetContentType".
>>>>>>
>>>>>> ManifoldCF Job pipeline:
>>>>>> 1. Repository (type web repository)
>>>>>> 2. Transformation (Tikka Metadata Extractor)
>>>>>> 3. Transformation (type Metadata Adjuster)
>>>>>> 4. Output (Solr 6)
>>>>>>
>>>>>> ManifoldCF Job Metadata Expressions tab:
>>>>>>   Parameter name: "facetContentType"
>>>>>>   Remove this parameter: false
>>>>>>   Expression: xxxx  (the literal text value I want in facetContentType)
>>>>>>
>>>>>> Solr schema:
>>>>>>   .....
>>>>>>   <field name="facetContentType" type="string" indexed="true"
>>>>>> stored="true" required="true"/>
>>>>>>  ....
>>>>>>
>>>>>> The error logged in ManifoldCF is:
>>>>>>       Error from server at http://solrServer:port/solr/core
>>>>>> <http://revnetsolrdev:8983/solr/sites>: [doc=https://....../index.aspx]
>>>>>> missing required field: facetContentType.
>>>>>>
>>>>>> Thanks for your help
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
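
One way to read the Solr log quoted in this thread is to look at which literal.<field> parameters the /update/extract request actually carried. The sketch below does that for the logged parameter string. It assumes that pipeline metadata would reach Solr as literal.<field> parameters, the way literal.id does in the logged request; literal_fields is a hypothetical helper name, and the elided URL is kept exactly as posted.

```python
from urllib.parse import parse_qs

# Parameter string copied from the LogUpdateProcessorFactory line in the thread
# (the elided URL is left exactly as it was posted).
logged_params = (
    "resource.name=introduction.pdf"
    "&literal.id=https://...../introduction.pdf"
    "&wt=xml&version=2.2"
)


def literal_fields(query_string):
    """Return {field name: values} for every literal.<field> parameter."""
    parsed = parse_qs(query_string, keep_blank_values=True)
    prefix = "literal."
    return {k[len(prefix):]: v for k, v in parsed.items() if k.startswith(prefix)}


fields = literal_fields(logged_params)
print(sorted(fields))                # names of the fields sent as literals
print("facetContentType" in fields)  # whether the required field was sent
```

For the logged request this lists only id, so under the stated assumption facetContentType never reached Solr, which would be consistent with the "missing required field" error.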
