Re: Store complete XML record (DIH XPathEntityProcessor)
Hi g, hi Chantal,

I had the same problem. You can use XPathEntityProcessor, but you have to plug in an XSL. The drawback is the performance cost. See
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards,
Karsten

Original message
Date: Mon, 1 Aug 2011 12:17:45 +0200
From: Chantal Ackermann chantal.ackerm...@btelligent.de
To: solr-user@lucene.apache.org
Subject: Re: Store complete XML record (DIH XPathEntityProcessor)

> Hi g,
>
> OK, I understand your problem now. (Sorry for answering so late.)
>
> I don't think PlainTextEntityProcessor can help you. It does not take a regex. LineEntityProcessor does, but your record elements probably do not each come on their own line, and you wouldn't want to depend on that anyway.
>
> I guess you would be best off writing your own entity processor - maybe by extending the XPathEntityProcessor, if that gives you some advantage. You can of course also implement your own importer using SolrJ and your favourite XML parser framework - or any other programming language.
>
> If you are looking for a config-only solution - I'm not sure that there is one. Someone else might be able to comment on that?
>
> Cheers,
> Chantal
Re: Store complete XML record (DIH XPathEntityProcessor)
Hi g,

OK, I understand your problem now. (Sorry for answering so late.)

I don't think PlainTextEntityProcessor can help you. It does not take a regex. LineEntityProcessor does, but your record elements probably do not each come on their own line, and you wouldn't want to depend on that anyway.

I guess you would be best off writing your own entity processor - maybe by extending the XPathEntityProcessor, if that gives you some advantage. You can of course also implement your own importer using SolrJ and your favourite XML parser framework - or any other programming language.

If you are looking for a config-only solution - I'm not sure that there is one. Someone else might be able to comment on that?

Cheers,
Chantal

On Thu, 2011-07-28 at 19:17 +0200, solruser@9913 wrote:
> Thanks, Chantal.
>
> I am OK with the second call, and I already tried that. Unfortunately, it reads the whole file into a field. My file looks like the example below:
>
>     <xml>
>       <record> ... </record>
>       <record> ... </record>
>       <record> ... </record>
>     </xml>
>
> Now the XPath does the 'for each /record' part. For each record I also need to store the raw log. If I use the PlainTextEntityProcessor, it gives me the whole file (from <xml> to </xml>) and not each individual <record> ... </record>.
>
> Am I using the PlainTextEntityProcessor wrong?
>
> Thanks,
> g
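The "own importer" route mentioned above can be sketched in a few lines. This is only an illustration in Python using the standard library parser, not SolrJ and not anything DIH provides; the sample data, the `id` and `msg` names, and the `records_with_raw_xml` helper are made up for the example. Only the `fullxmlrecord` field name comes from the thread. Sending the resulting documents to Solr is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file content; element and attribute names are invented.
SAMPLE = """<xml>
  <record id="1"><msg>first</msg></record>
  <record id="2"><msg>second</msg></record>
</xml>"""


def records_with_raw_xml(xml_text, record_tag="record"):
    """Yield one dict per record: mapped fields plus the record's own XML."""
    root = ET.fromstring(xml_text)
    for rec in root.iter(record_tag):
        rec.tail = None  # drop the whitespace that follows the closing tag
        yield {
            "id": rec.get("id"),
            "msg": rec.findtext("msg"),
            # Serialized XML of this record only, not of the whole file.
            "fullxmlrecord": ET.tostring(rec, encoding="unicode"),
        }


docs = list(records_with_raw_xml(SAMPLE))
for doc in docs:
    print(doc)
```

Note that the record is re-serialized by the parser, so the stored string is equivalent XML but not necessarily byte-for-byte identical to the source file.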
Re: Store complete XML record (DIH XPathEntityProcessor)
On 8/1/2011 6:17 AM, Chantal Ackermann wrote:
> If you are looking for a config-only solution - I'm not sure that there is one. Someone else might be able to comment on that?

You might want to take a look at SOLR-2597; it has a patch for XmlStripCharFilter, which strips tags from XML for indexing (like HTMLStripCharFilter) and also lets you specify XML element names to include or exclude. It is not full XPath, but it might work for you. Since the patch has not been committed, you would have to compile the two Java files yourself and place them in your Solr classpath.

-Mike
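To make the idea of tag stripping with excluded element names concrete, here is a rough conceptual sketch in Python. It is not the code from the SOLR-2597 patch (which is a Java char filter) and it only covers the exclude case; the `strip_xml_tags` function, its `exclude` parameter, and the sample record are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET


def strip_xml_tags(xml_text, exclude=()):
    """Return the text content of xml_text, skipping excluded elements."""
    def walk(elem):
        if elem.tag in exclude:
            return  # skip this element and everything inside it
        if elem.text:
            yield elem.text
        for child in elem:
            yield from walk(child)
            if child.tail:
                yield child.tail  # text after a child belongs to the parent

    pieces = walk(ET.fromstring(xml_text))
    # Join the text pieces and collapse runs of whitespace.
    return " ".join(" ".join(pieces).split())


sample = "<record><msg>disk full</msg><debug>trace 42</debug><host>web1</host></record>"
print(strip_xml_tags(sample))
print(strip_xml_tags(sample, exclude={"debug"}))
```

With nothing excluded, all element text is kept; with `debug` excluded, the text inside that element is dropped while the rest of the record is still indexed as plain text.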
Re: Store complete XML record (DIH XPathEntityProcessor)
Hi g,

have a look at the PlainTextEntityProcessor:
http://wiki.apache.org/solr/DataImportHandler#PlainTextEntityProcessor

You will have to call the URL twice that way, but I don't think you can get the complete document (the root element with all its structure) via XPath, so the XPathEntityProcessor cannot help you.

If calling the URL twice slows your indexer down in unacceptable ways, you can always subclass XPathEntityProcessor (knowing Java is helpful, though). There surely is a way to make it return what you need. Or maybe write an entity processor that caches the content and uses the XPath and PlainText entity processors to accomplish what you need (I am not sure whether the API allows for that).

Cheers,
Chantal

On Thu, 2011-07-28 at 05:53 +0200, solruser@9913 wrote:
> I am trying to use DIH to import an XML-based file with multiple XML records in it. Each record corresponds to one document in Lucene. I am using the DIH FileListEntityProcessor (to get the file list) followed by the XPathEntityProcessor to create the entities.
>
> It works perfectly and I am able to map XML elements to fields. However, I also need to store the entire XML record as a separate 'full text' field.
>
> Does the XPathEntityProcessor provide a variable like 'rawLine' or 'plainText' that I can map to a field?
>
> I tried to use the PlainTextEntityProcessor after this, but it does not recognize the XML boundaries and just gives me the whole XML file.
>
>     <entity name="x" rootEntity="true" dataSource="logfilereader"
>             processor="XPathEntityProcessor"
>             url="${logfile.fileAbsolutePath}"
>             stream="false" forEach="/xml/myrecord" transformer="">
>       <field column="mycol1" xpath="/xml/myrecord/@something" />
>       and so on ...
>
> This works perfectly. However, I also need something like ...
>
>     <field column="fullxmlrecord" name="plainText" />
>
> Any help is much appreciated. I am a newbie and may be missing something obvious here.
>
> -g
Re: Store complete XML record (DIH XPathEntityProcessor)
Thanks, Chantal.

I am OK with the second call, and I already tried that. Unfortunately, it reads the whole file into a field. My file looks like the example below:

    <xml>
      <record> ... </record>
      <record> ... </record>
      <record> ... </record>
    </xml>

Now the XPath does the 'for each /record' part. For each record I also need to store the raw log. If I use the PlainTextEntityProcessor, it gives me the whole file (from <xml> to </xml>) and not each individual <record> ... </record>.

Am I using the PlainTextEntityProcessor wrong?

Thanks,
g

--
View this message in context: http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3207203.html
Sent from the Solr - User mailing list archive at Nabble.com.
Store complete XML record (DIH XPathEntityProcessor)
I am trying to use DIH to import an XML-based file with multiple XML records in it. Each record corresponds to one document in Lucene. I am using the DIH FileListEntityProcessor (to get the file list) followed by the XPathEntityProcessor to create the entities.

It works perfectly and I am able to map XML elements to fields. However, I also need to store the entire XML record as a separate 'full text' field.

Does the XPathEntityProcessor provide a variable like 'rawLine' or 'plainText' that I can map to a field?

I tried to use the PlainTextEntityProcessor after this, but it does not recognize the XML boundaries and just gives me the whole XML file.

    <entity name="x" rootEntity="true" dataSource="logfilereader"
            processor="XPathEntityProcessor"
            url="${logfile.fileAbsolutePath}"
            stream="false" forEach="/xml/myrecord" transformer="">
      <field column="mycol1" xpath="/xml/myrecord/@something" />
      and so on ...

This works perfectly. However, I also need something like ...

    <field column="fullxmlrecord" name="plainText" />

Any help is much appreciated. I am a newbie and may be missing something obvious here.

-g

--
View this message in context: http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3205524.html
Sent from the Solr - User mailing list archive at Nabble.com.