Here you are:

View a Job

------------------------------
Name: revenueToSites
------------------------------
Pipeline:
  Stage  Type            Precedent  Description  Connection name
  1.     Repository                              Revenue Website
  2.     Transformation  1.                      Tikka Metadata Extractor
  3.     Transformation  2.                      Set mimeType and facetContentType customField
  4.     Output          3.                      sites solr dev

Notifications:
  Stage  Description  Connection name
  No notification connections
------------------------------
Priority: 5
Start method: Don't automatically start
------------------------------
Schedule type: Scan every document once
Minimum recrawl interval: Not applicable
Maximum recrawl interval: Not applicable
Expiration interval: Not applicable
Reseed interval: Not applicable
------------------------------
No scheduled run times
------------------------------
Maximum hop count for link type 'link': Unlimited
Maximum hop count for link type 'redirect': Unlimited
------------------------------
Hop count mode: Delete unreachable documents
------------------------------

1.
Seeds: https://xxxxxx/index.aspx <https://preview.revenuedomain.ie/en/press-office/index.aspx>
------------------------------
No canonicalization specified - all URLs will be reordered and have all sessions removed
------------------------------
No mappings specified; will accept all URLs
------------------------------
Include only hosts matching seeds? yes
------------------------------
Include in crawl: .*
------------------------------
Include in index: .*
------------------------------
Exclude from crawl:
  \.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|<script>|</script>|<script type="text/javascript">)
  [?*!@=].*
------------------------------
Exclude from index:
------------------------------
Exclude content from index:
------------------------------
No access tokens specified
------------------------------
Excluded headers: last-modified
------------------------------

2.
Field mappings:
  Metadata field name  Final field name
  No field mapping specified
------------------------------
Keep all metadata: true
------------------------------
Lower names: false
------------------------------
Write limit:
------------------------------
Ignore Tika exceptions: true
------------------------------
Boilerplate extractor: -- No extraction selected --
------------------------------

3.
Metadata expressions:
  Parameter name    Remove this parameter?  Expression ("${fieldname}" references a field)
  facetContentType  false                   site.ie
------------------------------
Keep all incoming metadata: false
Remove empty metadata values: false
------------------------------

4.

Marisol Redondo
Email: [email protected]
Phone: 35428

On 22 February 2017 at 14:53, Karl Wright <[email protected]> wrote:

> Hi Marisol,
>
> The [INFO] log entries indicate that your document has almost no metadata
> at all.
> But the Metadata Adjuster transformation connector is designed to
> do exactly what you want.
>
> Can you view your job, and cut and paste the View Job page into an email,
> so I can see how your metadata adjuster transformation connection and your
> solr output connections are configured? Thanks!
>
> Karl
>
>
> On Wed, Feb 22, 2017 at 8:57 AM, Marisol Redondo <
> [email protected]> wrote:
>
>> Hi Karl, and thank you for this quick answer.
>>
>> I was reading the documentation of MCF 1.10 but I'm using MCF 2.5, sorry
>> for the confusion, and I think this version is compatible with Solr 6.
>> The pdf doesn't have any metadata or field called facetContentType; this
>> is why I've been trying to use the Metadata Adjuster, to add a new
>> metadata/property to the doc so Solr can index by this field when I'm
>> injecting the doc.
>> Should I use another transformation, or is there any other way of doing it?
>> I am migrating from Nutch to ManifoldCF and in Nutch we can do it with
>> plugins, and I was thinking that the plugins in Nutch are the same as the
>> transformation connectors in MCF.
>>
>> The complete error in Solr is:
>>
>> 2017-02-21 13:19:32.108 INFO (qtp1854778591-18) [ x:sites] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}
>>
>> 2017-02-21 13:19:32.454 INFO (qtp1854778591-18) [ x:sites] o.a.s.u.p.LogUpdateProcessorFactory [sites] webapp=/solr path=/update/extract params={resource.name=introduction.pdf&literal.id=https://...../introduction.pdf&wt=xml&version=2.2}{} 0 347
>>
>> 2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [ x:sites] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: [doc=https://....../introduction.pdf] missing required field: facetContentType
>>     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
>>     at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
>>     at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
>>     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>>
>> Thanks
>>
>>
>> On 21 February 2017 at 14:52, Karl Wright <[email protected]> wrote:
>>
>>> Hi Marisol,
>>>
>>> Can you find the [INFO] entry in the Solr log for this document? That
>>> should help clear up any confusion.
>>>
>>> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up to
>>> date with Solr 6.x. That could be the source of the problem. Is there any
>>> reason you are using a 1.x version of MCF?
>>>
>>> Karl
>>>
>>>
>>> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <
>>> [email protected]> wrote:
>>>
>>>> Hi.
>>>>
>>>> I'm trying to use the Metadata Adjuster to add one field to the Solr
>>>> index, but it doesn't inject the field into the Solr field.
>>>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>>>> read in the documentation
>>>> (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html)
>>>> that I can add metadata to the document that is going to be indexed
>>>> into Solr, but the Solr instance gave me the error "missing required
>>>> field: facetContentType".
>>>>
>>>> ManifoldCF Job pipeline:
>>>> 1. Repository (type web repository)
>>>> 2. Transformation (Tikka Metadata Extractor)
>>>> 3. Transformation (type Metadata Adjuster)
>>>> 4. Output (Solr 6)
>>>>
>>>> ManifoldCF Job Metadata Expressions tab:
>>>> Parameter name: "facetContentType"
>>>> Remove this parameter: false
>>>> Expression: xxxx (the literal text value I want in facetContentType)
>>>>
>>>> Solr schema:
>>>> .....
>>>> <field name="facetContentType" type="string" indexed="true"
>>>> stored="true" required="true"/>
>>>> ....
>>>>
>>>> The error logged in ManifoldCF is:
>>>> Error from server at http://solrServer:port/solr/core
>>>> <http://revnetsolrdev:8983/solr/sites>: [doc=https://....../index.aspx]
>>>> missing required field: facetContentType.
>>>>
>>>> Thanks for your help
>>>>
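
A note on the Solr [INFO] entry quoted in this thread: the request to /update/extract carries document fields as literal.<fieldname> parameters (literal.id is visible in the log). The sketch below is only an illustration, assuming the Solr output connection passes all document metadata to Solr that way. It parses the logged params string and lists which literal fields actually arrived. The helper name literal_fields is hypothetical and is not part of ManifoldCF or Solr; the elided URL is kept exactly as it appears in the log.

```python
from urllib.parse import parse_qsl


def literal_fields(params: str) -> dict:
    """Return the literal.* fields found in a Solr /update/extract params string.

    Hypothetical helper for reading Solr log entries; not a ManifoldCF or Solr API.
    """
    prefix = "literal."
    pairs = parse_qsl(params, keep_blank_values=True)
    # Keep only the literal.<fieldname> parameters and strip the prefix.
    return {key[len(prefix):]: value for key, value in pairs if key.startswith(prefix)}


# Params copied from the [INFO] log entry in this thread (URL elided as in the original).
logged = (
    "resource.name=introduction.pdf"
    "&literal.id=https://...../introduction.pdf"
    "&wt=xml&version=2.2"
)

fields = literal_fields(logged)
print(sorted(fields))                # names of the literal fields Solr received
print("facetContentType" in fields)  # whether the required field was sent
```

Under that assumption, only the id field reached Solr for this document, which matches the observation that the document arrived with almost no metadata and explains the "missing required field: facetContentType" error.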
