Ah, never mind -- what I need instead is for you to view the Solr connection and paste that into an email. Basically, I want to be sure you are not inadvertently disabling metadata to Solr.
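
A rough way to cross-check this from the Solr side is to look at the /update/extract entry in the Solr log and see which literal.* parameters arrived with the document. The sketch below is only an illustration under assumptions: it assumes metadata fields reach Solr as literal.<name> request parameters (as the literal.id in your log suggests), the helper name literal_fields is made up for this example, the document URL is a placeholder, and the parsing is naive (it stops at the first closing brace after "params={").

```python
from urllib.parse import parse_qsl


def literal_fields(log_line: str) -> dict:
    """Return the literal.* parameters found in one Solr update log entry.

    Hypothetical helper: it takes the text between "params={" and the next
    "}" and parses it as a query string, so values containing "}" or "&"
    would be cut short or split.
    """
    marker = "params={"
    start = log_line.find(marker)
    if start < 0:
        return {}
    start += len(marker)
    end = log_line.find("}", start)
    if end < 0:
        return {}
    pairs = parse_qsl(log_line[start:end], keep_blank_values=True)
    prefix = "literal."
    return {k[len(prefix):]: v for k, v in pairs if k.startswith(prefix)}


# Log entry adapted from the thread; the document URL is a placeholder.
entry = (
    "o.a.s.u.p.LogUpdateProcessorFactory [sites] webapp=/solr "
    "path=/update/extract params={resource.name=introduction.pdf"
    "&literal.id=https://example.invalid/introduction.pdf"
    "&wt=xml&version=2.2}{} 0 347"
)

fields = literal_fields(entry)
print(sorted(fields))                # field names that arrived as literals
print("facetContentType" in fields)  # whether the required field was sent
```

If facetContentType does not show up among the literal fields, the value is being dropped somewhere before Solr, which would point at the job or connection configuration rather than at the Solr schema.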
Thanks,
Karl

On Wed, Feb 22, 2017 at 10:39 AM, Karl Wright <[email protected]> wrote:

> This is how the email appears here:
>
> >>>>>>
> > 4.
> >
> > Bottom of Form
> >
> > Marisol Redondo
> > Email: [email protected]
> > Phone: 35428
> >
> > Please note that Revenue cannot guarantee that any personal and sensitive
> > data, sent in plain text via standard email, is fully secure. Customers who
> > choose to use this channel are deemed to have accepted any risk involved. The
> > alternative communication methods offered by Revenue include standard post
> > and the option to use our (encrypted) MyEnquiries service which is available
> > within myAccount and ROS. You can register for either myAccount or ROS on the
> > Revenue website.
> >
> > Tabhair faoi deara nach féidir leis na Coimisinéirí Ioncaim ráthaíocht a
> > thabhairt go bhfuil aon sonraí pearsanta agus íogair a gcuirtear isteach i
> > ngnáth-théacs trí r-phost caighdeánach go huile is go hiomlán slán. Meastar
> > go nglacann custaiméirí a úsáideann an cainéal seo le haon riosca bainteach.
> > I measc na modhanna cumarsáide eile atá ag na Coimisinéirí ná post
> > caighdeánach agus an rogha ár seirbhís (criptithe) M'Fhiosruithe a úsáid, tá
> > sí ar fáil laistigh de MoChúrsaí agus ROS. Is féidir leat clárú le haghaidh
> > ceachtar MoChúrsaí nó ROS ar shuíomh gréasáin na gCoimisinéirí.
> >
> <<<<<<
>
> In other words I cannot see anything from the 4. stage.
>
> Thanks,
> Karl
>
> On Wed, Feb 22, 2017 at 10:37 AM, Marisol Redondo <[email protected]> wrote:
>
>> I was trying with "Keep all incoming metadata" set to false and also to true,
>> but I'll take your advice and set it to true.
>>
>> I don't know why you can't see it, but it's the 4th stage.
>>
>> On 22 February 2017 at 15:26, Karl Wright <[email protected]> wrote:
>>
>>> Hi Marisol,
>>>
>>> Some observations.
>>> (1) It makes no sense to have "Keep all incoming metadata" set to false,
>>> since that will filter out everything that your Tika extractor extracts.
>>> I doubt that is what you have intended.
>>> (2) I can't see the Solr output configuration -- looks like it got
>>> truncated?
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Wed, Feb 22, 2017 at 10:12 AM, Marisol Redondo <[email protected]> wrote:
>>>
>>>> Here you are:
>>>>
>>>> View a Job
>>>>
>>>> Top of Form
>>>> ------------------------------
>>>> Name: revenueToSites
>>>> ------------------------------
>>>> Pipeline:
>>>>
>>>> Stage
>>>> Type
>>>> Precedent
>>>> Description
>>>> Connection name
>>>>
>>>> 1.
>>>> Repository
>>>> Revenue Website
>>>>
>>>> 2.
>>>> Transformation
>>>> 1.
>>>> Tikka Metadata Extractor
>>>>
>>>> 3.
>>>> Transformation
>>>> 2.
>>>> Set mimeType and facetContentType
>>>> customField
>>>>
>>>> 4.
>>>> Output
>>>> 3.
>>>> sites solr dev
>>>>
>>>> Notifications:
>>>>
>>>> Stage
>>>> Description
>>>> Connection name
>>>>
>>>> No notification connections
>>>> ------------------------------
>>>> Priority: 5
>>>> Start method: Don't automatically start
>>>> ------------------------------
>>>> Schedule type: Scan every document once
>>>> Minimum recrawl interval: Not applicable
>>>> Maximum recrawl interval: Not applicable
>>>> Expiration interval: Not applicable
>>>> Reseed interval: Not applicable
>>>> ------------------------------
>>>> No scheduled run times
>>>> ------------------------------
>>>> Maximum hop count for link type 'link': Unlimited
>>>> Maximum hop count for link type 'redirect': Unlimited
>>>> ------------------------------
>>>> Hop count mode: Delete unreachable documents
>>>> ------------------------------
>>>>
>>>> 1.
>>>>
>>>> Seeds:
>>>> https://xxxxxx/index.aspx
>>>> <https://preview.revenuedomain.ie/en/press-office/index.aspx>
>>>> ------------------------------
>>>> No canonicalization specified - all URLs will be reordered and have all
>>>> sessions removed
>>>> ------------------------------
>>>> No mappings specified; will accept all URLs
>>>> ------------------------------
>>>> Include only hosts matching seeds? yes
>>>> ------------------------------
>>>> Include in crawl: .*
>>>> ------------------------------
>>>> Include in index: .*
>>>> ------------------------------
>>>> Exclude from crawl:
>>>> \.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|<script>|</script>|<script type="text/javascript">)
>>>> [?*!@=].*
>>>> ------------------------------
>>>> Exclude from index:
>>>> ------------------------------
>>>> Exclude content from index:
>>>> ------------------------------
>>>> No access tokens specified
>>>> ------------------------------
>>>> Excluded headers: last-modified
>>>> ------------------------------
>>>>
>>>> 2.
>>>>
>>>> Field mappings:
>>>> Metadata field name
>>>> Final field name
>>>>
>>>> No field mapping specified
>>>> ------------------------------
>>>> Keep all metadata: true
>>>> ------------------------------
>>>> Lower names: false
>>>> ------------------------------
>>>> Write limit:
>>>> ------------------------------
>>>> Ignore Tika exceptions: true
>>>> ------------------------------
>>>> Boilerplate extractor: -- No extraction selected --
>>>> ------------------------------
>>>>
>>>> 3.
>>>>
>>>> Metadata expressions:
>>>> Parameter name
>>>> Remove this parameter?
>>>> Expression ("${fieldname}" references a field)
>>>>
>>>> facetContentType
>>>> false
>>>> site.ie
>>>> ------------------------------
>>>> Keep all incoming metadata: false
>>>> Remove empty metadata values: false
>>>> ------------------------------
>>>>
>>>> 4.
>>>>
>>>> Bottom of Form
>>>>
>>>> Marisol Redondo
>>>> Email: [email protected]
>>>> Phone: 35428
>>>>
>>>> On 22 February 2017 at 14:53, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Marisol,
>>>>>
>>>>> The [INFO] log entries indicate that your document has almost no
>>>>> metadata at all. But the Metadata Adjuster transformation connector is
>>>>> designed to do exactly what you want.
>>>>>
>>>>> Can you view your job, and cut and paste the View Job page into an
>>>>> email, so I can see how your metadata adjuster transformation connection
>>>>> and your Solr output connections are configured? Thanks!
>>>>>
>>>>> Karl
>>>>>
>>>>> On Wed, Feb 22, 2017 at 8:57 AM, Marisol Redondo <[email protected]> wrote:
>>>>>
>>>>>> Hi Karl, and thank you for the quick answer.
>>>>>>
>>>>>> I was reading the documentation for MCF 1.10, but I'm using MCF 2.5,
>>>>>> sorry for the confusion, and I think this version is compatible with
>>>>>> Solr 6.
>>>>>> The PDF doesn't have any metadata or field called facetContentType,
>>>>>> which is why I have been trying to use the Metadata Adjuster to add a
>>>>>> new metadata/property to the doc, so that Solr can index by this field
>>>>>> when I'm injecting the doc.
>>>>>> Should I use another transformation, or is there any other way of doing
>>>>>> it?
>>>>>> I am migrating from Nutch to ManifoldCF. In Nutch we can do this with
>>>>>> plugins, and I was thinking that the plugins in Nutch are the same as
>>>>>> the transformation connectors in MCF.
>>>>>>
>>>>>> The complete error in Solr is:
>>>>>>
>>>>>> 017-02-21 13:19:32.108 INFO (qtp1854778591-18) [ x:sites] o.a.s.c.PluginBag Going to create a new requestHandler with {type = requestHandler,name = /update/extract,class = solr.extraction.ExtractingRequestHandler,args = {defaults={lowernames=true,fmap.meta=ignored_,fmap.content=_text_,update.chain=add-unknown-fields-to-the-schema,df=_text_}}}
>>>>>>
>>>>>> 2017-02-21 13:19:32.454 INFO (qtp1854778591-18) [ x:sites] o.a.s.u.p.LogUpdateProcessorFactory [sites] webapp=/solr path=/update/extract params={resource.name=introduction.pdf&literal.id=https://...../introduction.pdf&wt=xml&version=2.2}{} 0 347
>>>>>>
>>>>>> 2017-02-21 13:19:32.455 ERROR (qtp1854778591-18) [ x:sites] o.a.s.h.RequestHandlerBase
>>>>>> org.apache.solr.common.SolrException: [doc=https://....../introduction.pdf] missing required field: facetContentType
>>>>>>     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:197)
>>>>>>     at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:82)
>>>>>>     at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:277)
>>>>>>     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:211)
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 21 February 2017 at 14:52, Karl Wright <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Marisol,
>>>>>>>
>>>>>>> Can you find the [INFO] entry in the Solr log for this document?
>>>>>>> That should help clear up any confusion.
>>>>>>>
>>>>>>> Also, for what it is worth, MCF 1.10 is not using a SolrJ that is up
>>>>>>> to date with Solr 6.x. That could be the source of the problem. Is
>>>>>>> there any reason you are using a 1.x version of MCF?
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Feb 21, 2017 at 8:42 AM, Marisol Redondo <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi.
>>>>>>>>
>>>>>>>> I'm trying to use the Metadata Adjuster to add one field to the Solr
>>>>>>>> index, but it doesn't inject the field into a Solr field.
>>>>>>>> Maybe I'm misunderstanding the use of the Metadata Adjuster, but I have
>>>>>>>> read in the documentation
>>>>>>>> (https://manifoldcf.apache.org/release/release-1.10/en_US/end-user-documentation.html)
>>>>>>>> that I can add metadata to the document that is going to be indexed
>>>>>>>> into Solr, but the Solr instance gave me the error "missing required
>>>>>>>> field: facetContentType".
>>>>>>>>
>>>>>>>> ManifoldCF Job pipeline:
>>>>>>>> 1. Repository (type web repository)
>>>>>>>> 2. Transformation (Tika Metadata Extractor)
>>>>>>>> 3. Transformation (type Metadata Adjuster)
>>>>>>>> 4. Output (Solr 6)
>>>>>>>>
>>>>>>>> ManifoldCF Job Metadata Expressions tab:
>>>>>>>> Parameter name: "facetContentType"
>>>>>>>> Remove this parameter: false
>>>>>>>> Expression: xxxx (the literal text value I want in facetContentType)
>>>>>>>>
>>>>>>>> Solr schema:
>>>>>>>> .....
>>>>>>>> <field name="facetContentType" type="string" indexed="true" stored="true" required="true"/>
>>>>>>>> ....
>>>>>>>>
>>>>>>>> The error logged in ManifoldCF is:
>>>>>>>> Error from server at http://solrServer:port/solr/core
>>>>>>>> <http://revnetsolrdev:8983/solr/sites>: [doc=https://....../index.aspx]
>>>>>>>> missing required field: facetContentType.
>>>>>>>>
>>>>>>>> Thanks for your help
