I tried with "Keep all incoming metadata" set to both false and true, but I'll take your advice and set it to true.
I don't know why you can't see it, but it's stage 4.

On 22 February 2017 at 15:26, Karl Wright <[email protected]> wrote:

> Hi Marisol,
>
> Some observations.
> (1) It makes no sense to have "Keep all incoming metadata" set to false,
> since that will filter out everything that your tika extractor extracts. I
> doubt that is what you have intended.
> (2) I can't see the Solr output configuration -- looks like it got
> truncated?
>
> Thanks,
> Karl
>
>
> On Wed, Feb 22, 2017 at 10:12 AM, Marisol Redondo <
> [email protected]> wrote:
>
>> Here you are:
>>
>> View a Job
>>
>> ------------------------------
>> Name: revenueToSites
>> ------------------------------
>> Pipeline:
>>
>> (Stage / Type / Precedent / Description / Connection name)
>>
>> 1. Repository: Revenue Website
>> 2. Transformation (precedent 1.): Tikka Metadata Extractor
>> 3. Transformation (precedent 2.): Set mimeType and facetContentType; customField
>> 4. Output (precedent 3.): sites solr dev
>>
>> Notifications:
>>
>> (Stage / Description / Connection name)
>>
>> No notification connections
>> ------------------------------
>> Priority: 5
>>
>> Start method: Don't automatically start
>> ------------------------------
>> Schedule type: Scan every document once
>>
>> Minimum recrawl interval: Not applicable
>>
>> Maximum recrawl interval: Not applicable
>>
>> Expiration interval: Not applicable
>>
>> Reseed interval: Not applicable
>> ------------------------------
>> No scheduled run times
>> ------------------------------
>> Maximum hop count for link type 'link': Unlimited
>>
>> Maximum hop count for link type 'redirect': Unlimited
>> ------------------------------
>> Hop count mode: Delete unreachable documents
>> ------------------------------
>>
>> 1.
>>
>> Seeds:
>>
>> https://xxxxxx/index.aspx
>> <https://preview.revenuedomain.ie/en/press-office/index.aspx>
>> ------------------------------
>> No canonicalization specified - all URLs will be reordered and have all
>> sessions removed
>> ------------------------------
>> No mappings specified; will accept all URLs
>> ------------------------------
>> Include only hosts matching seeds? yes
>> ------------------------------
>> Include in crawl:
>>
>> .*
>> ------------------------------
>> Include in index:
>>
>> .*
>> ------------------------------
>> Exclude from crawl:
>>
>> \.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|
>> wmf|WMF|zip|ZIP|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|
>> EXE|jpeg|JPEG|bmp|BMP|js|JS|<script>|</script>|<script
>> type="text/javascript">)
>> [?*!@=].*
>> ------------------------------
>> Exclude from index:
>> ------------------------------
>> Exclude content from index:
>> ------------------------------
>> No access tokens specified
>> ------------------------------
>> Excluded headers:
>>
>> last-modified
>> ------------------------------
>>
>> 2.
>>
>> Field mappings:
>>
>> (Metadata field name / Final field name)
>>
>> No field mapping specified
>> ------------------------------
>> Keep all metadata: true
>> ------------------------------
>> Lower names: false
>> ------------------------------
>> Write limit:
>> ------------------------------
>> Ignore Tika exceptions: true
>> ------------------------------
>> Boilerplate extractor: -- No extraction selected --
>> ------------------------------
>>
>> 3.
>>
>> Metadata expressions:
>>
>> Parameter name
>>
>> Remove this parameter?
>>
>> Expression ("${fieldname}" references a field)
>>
>> facetContentType
>>
>> false
>>
>> site.ie
>> ------------------------------
>> Keep all incoming metadata: false
>>
>> Remove empty metadata values: false
>> ------------------------------
>>
>> 4.
>>
>>
>> Marisol Redondo
>>
>> Email: [email protected]
>>
>> Phone: 35428
>>
>>
>> On 22 February 2017 at 14:53, Karl Wright <[email protected]> wrote:
>>
>>> Hi Marisol,
>>>
>>> The [INFO] log entries indicate that your document has almost no
>>> metadata at all. But the Metadata Adjuster transformation connector is
>>> designed to do exactly what you want.
>>>
>>> Can you view your job, and cut and paste the View Job page into an
>>> email, so I can see how your metadata adjuster transformation connection
>>> and your solr output connections are configured? Thanks!
>>>
>>> Karl
>>>
>>>
>>> On Wed, Feb 22, 2017 at 8:57 AM, Marisol Redondo <
>>> [email protected]> wrote:
>>>
>>>> Hi Karl, and thank you for the quick answer.
>>>>
>>>> I was reading the documentation for MCF 1.10, but I'm using MCF 2.5,
>>>> sorry for the confusion, and I think this version is compatible with Solr 6.
>>>> The PDF doesn't have any metadata or field called facetContentType;
>>>> that is why I have been trying to use the Metadata Adjuster to add a new
>>>> metadata/property to the doc, so Solr can index by this field when I'm
>>>> injecting the doc.
>>>> Should I use another transformation, or is there any other way of doing it?
>>>> I am migrating from Nutch to ManifoldCF. In Nutch we can do it with
>>>> plugins, and I was thinking that the plugins in Nutch are the same as the
>>>> transformation connectors in MCF.
>>>>
>>>> The complete error in Solr is:
>>>>
>>>> 017-02-21 13:19:32.108 INFO (qtp1854778591-18) [ x:sites] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}
>>>>
>>>> 2017-02-21 13:19:32.454 INFO (qtp1854778591-18) [ x:sites] o.a.s.u.p.LogUpdateProcessorFactory [sites] webapp=/solr path=/update/extract params={resource.name=introduction.pdf&literal.id=https://...../introduction.pdf&wt=xml&version=2.2}{} 0 347
>>>>
>>>> 2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [ x:sites] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: [doc=https://....../introduction.pdf] missing required field: facetContentType
>>>>
>>>> at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
>>>>
>>>> at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
>>>>
>>>> at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
>>>>
>>>> at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On 21 February 2017 at 14:52, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Marisol,
>>>>>
>>>>> Can you find the [INFO] entry in the Solr log for this document? That
>>>>> should help clear up any confusion.
>>>>>
>>>>> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up
>>>>> to date with Solr 6.x. That could be the source of the problem. Is there
>>>>> any reason you are using a 1.x version of MCF?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> I'm trying to use the Metadata Adjuster to add one field to the Solr
>>>>>> index, but it doesn't inject the field into a Solr field.
>>>>>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>>>>>> read in the documentation
>>>>>> (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html)
>>>>>> that I can add metadata to the document that is going to be indexed
>>>>>> into Solr, but the Solr instance gave me the error "missing required
>>>>>> field: facetContentType".
>>>>>>
>>>>>> ManifoldCF Job pipeline:
>>>>>> 1. Repository (type web repository)
>>>>>> 2. Transformation (Tikka Metadata Extractor)
>>>>>> 3. Transformation (type Metadata Adjuster)
>>>>>> 4. Output (Solr 6)
>>>>>>
>>>>>> ManifoldCF Job Metadata Expressions tab:
>>>>>> Parameter name: "facetContentType"
>>>>>> Remove this parameter: false
>>>>>> Expression: xxxx (the literal text value I want in facetContentType)
>>>>>>
>>>>>> Solr schema:
>>>>>> .....
>>>>>> <field name="facetContentType" type="string" indexed="true"
>>>>>> stored="true" required="true"/>
>>>>>> ....
>>>>>>
>>>>>> The error logged in ManifoldCF is:
>>>>>> Error from server at http://solrServer:port/solr/core
>>>>>> <http://revnetsolrdev:8983/solr/sites>: [doc=https://....../index.aspx]
>>>>>> missing required field: facetContentType.
>>>>>>
>>>>>> Thanks for your help
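
For anyone following the thread, here is a rough sketch in Python of the behaviour discussed above: the "Keep all incoming metadata" flag, expressions that may reference "${fieldname}", and a Solr field marked required="true". It is a toy model only, not ManifoldCF or Solr code. The names adjust_metadata and missing_required and the sample field values are made up for illustration, and the handling of references to absent fields (replaced by an empty string) is an assumption.

```python
import re

# Matches "${fieldname}" references inside an expression.
FIELD_REF = re.compile(r"\$\{([^}]+)\}")


def adjust_metadata(incoming, expressions, keep_all_incoming):
    """Toy model of a metadata-adjusting stage (hypothetical, not MCF code).

    incoming: dict of metadata produced by earlier stages.
    expressions: dict mapping a parameter name to its expression string.
    keep_all_incoming: when False, only the expression fields survive.
    """
    result = dict(incoming) if keep_all_incoming else {}
    for name, expression in expressions.items():
        # Assumption: a reference to an absent field becomes an empty string.
        result[name] = FIELD_REF.sub(
            lambda m: str(incoming.get(m.group(1), "")), expression
        )
    return result


def missing_required(document, required_fields):
    """Return the required field names that the document does not carry."""
    return [field for field in required_fields if field not in document]


# Hypothetical metadata as an extractor stage might produce it.
incoming = {"title": "Introduction", "Content-Type": "application/pdf"}
expressions = {"facetContentType": "site.ie"}

dropped = adjust_metadata(incoming, expressions, keep_all_incoming=False)
kept = adjust_metadata(incoming, expressions, keep_all_incoming=True)

print(dropped)  # only the expression field remains
print(kept)     # extractor fields plus the expression field
print(missing_required(incoming, ["facetContentType"]))
```

In this model, setting the flag to false discards everything the extractor produced, which matches Karl's observation (1). A document that reaches the output stage without facetContentType would be reported by missing_required, in the same way Solr rejects it with "missing required field". The sketch does not explain why the field was absent in the actual job.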
