Hi Marisol,

Some observations.

(1) It makes no sense to have "Keep all incoming metadata" set to false, since that will filter out everything that your Tika extractor extracts. I doubt that is what you intended.

(2) I can't see the Solr output configuration -- looks like it got truncated?

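To make point (1) concrete, here is a toy model of a "keep all incoming metadata" switch. It is only a sketch, not ManifoldCF's code: the function name `adjust_metadata`, the sample Tika fields, and the assumption that configured expressions are still applied when the switch is off are all illustrative.

```python
def adjust_metadata(incoming, expressions, keep_all_incoming):
    """Toy model of a metadata-adjuster stage (hypothetical, not ManifoldCF code)."""
    # Start from the upstream fields only when the keep-all switch is on.
    outgoing = dict(incoming) if keep_all_incoming else {}
    # Fields defined by expressions are then added or overwritten (assumption).
    outgoing.update(expressions)
    return outgoing


# Made-up sample values standing in for whatever the Tika stage extracted.
tika_fields = {"title": "Introduction", "content-type": "application/pdf"}
expressions = {"facetContentType": "site.ie"}

dropped = adjust_metadata(tika_fields, expressions, keep_all_incoming=False)
kept = adjust_metadata(tika_fields, expressions, keep_all_incoming=True)

print(sorted(dropped))  # only the expression field survives
print(sorted(kept))     # Tika fields plus the expression field
```

In this model, switching the flag off discards every upstream field and leaves only the fields named in the expressions.
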
Thanks,
Karl

On Wed, Feb 22, 2017 at 10:12 AM, Marisol Redondo <[email protected]> wrote:

> Here you are:
>
> View a Job
>
> Name: revenueToSites
> ------------------------------
> Pipeline:
>
> Stage  Type            Precedent  Description                        Connection name
> 1.     Repository                                                    Revenue Website
> 2.     Transformation  1.                                            Tikka Metadata Extractor
> 3.     Transformation  2.         Set mimeType and facetContentType  customField
> 4.     Output          3.                                            sites solr dev
>
> Notifications:
>
> Stage  Description  Connection name
> No notification connections
> ------------------------------
> Priority: 5
> Start method: Don't automatically start
> ------------------------------
> Schedule type: Scan every document once
> Minimum recrawl interval: Not applicable
> Maximum recrawl interval: Not applicable
> Expiration interval: Not applicable
> Reseed interval: Not applicable
> ------------------------------
> No scheduled run times
> ------------------------------
> Maximum hop count for link type 'link': Unlimited
> Maximum hop count for link type 'redirect': Unlimited
> ------------------------------
> Hop count mode: Delete unreachable documents
> ------------------------------
> 1.
>
> Seeds:
> https://xxxxxx/index.aspx <https://preview.revenuedomain.ie/en/press-office/index.aspx>
> ------------------------------
> No canonicalization specified - all URLs will be reordered and have all
> sessions removed
> ------------------------------
> No mappings specified; will accept all URLs
> ------------------------------
> Include only hosts matching seeds?
> yes
> ------------------------------
> Include in crawl: .*
> ------------------------------
> Include in index: .*
> ------------------------------
> Exclude from crawl:
> \.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|<script>|</script>|<script type="text/javascript">)
> [?*!@=].*
> ------------------------------
> Exclude from index:
> ------------------------------
> Exclude content from index:
> ------------------------------
> No access tokens specified
> ------------------------------
> Excluded headers: last-modified
> ------------------------------
> 2.
>
> Field mappings:
> Metadata field name  Final field name
> No field mapping specified
> ------------------------------
> Keep all metadata: true
> ------------------------------
> Lower names: false
> ------------------------------
> Write limit:
> ------------------------------
> Ignore Tika exceptions: true
> ------------------------------
> Boilerplate extractor: -- No extraction selected --
> ------------------------------
> 3.
>
> Metadata expressions:
> Parameter name    Remove this parameter?  Expression ("${fieldname}" references a field)
> facetContentType  false                   site.ie
> ------------------------------
> Keep all incoming metadata: false
> Remove empty metadata values: false
> ------------------------------
> 4.
>
> Marisol Redondo
> Email: [email protected]
> Phone: 35428
>
> Please note that Revenue cannot guarantee that any personal and sensitive
> data, sent in plain text via standard email, is fully secure. Customers who
> choose to use this channel are deemed to have accepted any risk involved.
>
> On 22 February 2017 at 14:53, Karl Wright <[email protected]> wrote:
>
>> Hi Marisol,
>>
>> The [INFO] log entries indicate that your document has almost no metadata
>> at all. But the Metadata Adjuster transformation connector is designed to
>> do exactly what you want.
>>
>> Can you view your job, and cut and paste the View Job page into an email,
>> so I can see how your Metadata Adjuster transformation connection and your
>> Solr output connections are configured? Thanks!
>>
>> Karl
>>
>> On Wed, Feb 22, 2017 at 8:57 AM, Marisol Redondo <[email protected]> wrote:
>>
>>> Hi Karl, and thank you for this quick answer.
>>>
>>> I was reading the documentation of MCF 1.10 but I'm using MCF 2.5, sorry
>>> for the confusion, and I think this version is compatible with Solr 6.
>>> The PDF doesn't have any metadata or field called facetContentType, which
>>> is why I have been trying to use the Metadata Adjuster: to add a new
>>> metadata/property to the doc so Solr can index by this field when I'm
>>> injecting the doc.
>>> Should I use another transformation, or is there any other way of doing it?
>>> I am migrating from Nutch to ManifoldCF. In Nutch we can do it with
>>> plugins, and I was thinking that the plugins in Nutch are the same as the
>>> transformation connectors in MCF.
>>>
>>> The complete error in Solr is:
>>>
>>> 017-02-21 13:19:32.108 INFO (qtp1854778591-18) [ x:sites] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}
>>>
>>> 2017-02-21 13:19:32.454 INFO (qtp1854778591-18) [ x:sites] o.a.s.u.p.LogUpdateProcessorFactory [sites] webapp=/solr path=/update/extract params={resource.name=introduction.pdf&literal.id=https://...../introduction.pdf&wt=xml&version=2.2}{} 0 347
>>>
>>> 2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [ x:sites] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: [doc=https://....../introduction.pdf] missing required field: facetContentType
>>>     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
>>>     at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
>>>     at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
>>>     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>>>
>>> Thanks
>>>
>>> On 21 February 2017 at 14:52, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Marisol,
>>>>
>>>> Can you find the [INFO] entry in the Solr log for this document? That
>>>> should help clear up any confusion.
>>>>
>>>> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up to
>>>> date with Solr 6.x. That could be the source of the problem. Is there any
>>>> reason you are using a 1.x version of MCF?
>>>>
>>>> Karl
>>>>
>>>> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <[email protected]> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> I'm trying to use the Metadata Adjuster to add one field to the Solr
>>>>> index, but it doesn't inject the field into a Solr field.
>>>>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>>>>> read in the documentation
>>>>> (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html)
>>>>> that I can add metadata to the document that is going to be indexed
>>>>> into Solr, but the Solr instance gave me the error "missing required
>>>>> field: facetContentType".
>>>>>
>>>>> ManifoldCF Job pipeline:
>>>>> 1. Repository (type web repository)
>>>>> 2. Transformation (Tikka Metadata Extractor)
>>>>> 3. Transformation (type Metadata Adjuster)
>>>>> 4. Output (Solr 6)
>>>>>
>>>>> ManifoldCF Job Metadata Expressions tab:
>>>>> Parameter name: "facetContentType"
>>>>> Remove this parameter: false
>>>>> Expression: xxxx (the literal text value I want in facetContentType)
>>>>>
>>>>> Solr schema:
>>>>> .....
>>>>> <field name="facetContentType" type="string" indexed="true" stored="true" required="true"/>
>>>>> ....
>>>>>
>>>>> The error logged in ManifoldCF is:
>>>>> Error from server at http://solrServer:port/solr/core <http://revnetsolrdev:8983/solr/sites>: [doc=https://....../index.aspx] missing required field: facetContentType.
>>>>>
>>>>> Thanks for your help
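
The `/update/extract` request quoted in the Solr log above can be checked mechanically. Solr's ExtractingRequestHandler takes field values as `literal.<fieldname>` request parameters, so a minimal sketch, assuming the pipeline's metadata would arrive the same way `literal.id` does, is to list the `literal.*` parameters in the logged query string. The helper name `literal_fields` is made up for illustration.

```python
from urllib.parse import parse_qs

# Query string copied from the LogUpdateProcessorFactory line quoted in the thread
# (the URL is elided in the original log and is left elided here).
logged_params = (
    "resource.name=introduction.pdf"
    "&literal.id=https://...../introduction.pdf"
    "&wt=xml&version=2.2"
)


def literal_fields(query):
    """Return the field names passed to Solr as literal.<field> parameters."""
    prefix = "literal."
    return sorted(
        key[len(prefix):] for key in parse_qs(query) if key.startswith(prefix)
    )


fields = literal_fields(logged_params)
print(fields)                        # ['id']
print("facetContentType" in fields)  # False
```

Only `id` shows up, which is consistent with Solr rejecting the document for the missing required field: under the assumption above, the value never reached Solr in this request.
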
