Re: Store complete XML record (DIH XPathEntityProcessor)

2011-08-02 Thread karsten-solr
Hi g, Hi Chantal

I had the same problem.
You can use XPathEntityProcessor, but you have to add an XSL stylesheet. The
drawback is the performance cost; see:
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html

Best regards
  Karsten



Re: Store complete XML record (DIH XPathEntityProcessor)

2011-08-01 Thread Chantal Ackermann
Hi g,

OK, I understand your problem now. (Sorry for answering so late.)

I don't think PlainTextEntityProcessor can help you. It does not take a
regex. LineEntityProcessor does, but your record elements probably do not
each come on their own line, and you wouldn't want to depend on that
anyway.

I guess you would be best off writing your own entity processor, maybe
by extending XPathEntityProcessor if that gives you some advantage. You can,
of course, also implement your own importer using SolrJ and your favourite
XML parser framework, or any other programming language.
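
A minimal sketch of the standalone-importer route, in Python rather than
SolrJ. It assumes the file layout from the original post (an xml root with
myrecord children carrying a "something" attribute); the sample data and the
function name records_to_docs are made up for illustration, and sending the
resulting documents to Solr is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the log file described in the thread.
SAMPLE = """<xml>
  <myrecord something="a1"><msg>first</msg></myrecord>
  <myrecord something="b2"><msg>second</msg></myrecord>
</xml>"""


def records_to_docs(xml_text, record_tag="myrecord"):
    """Build one document per record: a mapped field plus the record's raw XML."""
    root = ET.fromstring(xml_text)
    docs = []
    # findall() matches direct children only, like forEach="/xml/myrecord".
    for rec in root.findall(record_tag):
        docs.append({
            # Field mapped from an attribute, as in the DIH config.
            "mycol1": rec.get("something"),
            # Re-serialized XML of just this record; strip() drops the
            # whitespace that followed the element in the source file.
            "fullxmlrecord": ET.tostring(rec, encoding="unicode").strip(),
        })
    return docs


docs = records_to_docs(SAMPLE)
for doc in docs:
    print(doc["mycol1"], doc["fullxmlrecord"])
```

Note that the stored text is a re-serialization by the parser, so it may
differ from the bytes in the file in details such as whitespace inside tags.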

If you are looking for a config-only solution, I'm not sure that there
is one. Someone else might be able to comment on that?

Cheers,
Chantal





Re: Store complete XML record (DIH XPathEntityProcessor)

2011-08-01 Thread Michael Sokolov

On 8/1/2011 6:17 AM, Chantal Ackermann wrote:

> If you are looking for a config-only solution, I'm not sure that there
> is one. Someone else might be able to comment on that?

You might want to take a look at SOLR-2597; it has a patch for
XmlStripCharFilter, which strips tags from XML for indexing (like
HtmlStripCharFilter) and also lets you specify XML element names
to include or exclude. Not full XPath, but it might work for you. You would
have to compile the two Java files and place them in your Solr classpath,
since the patch has not been committed.
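
To illustrate the idea only (this is not the code from the SOLR-2597 patch,
and the function name strip_tags and the sample record are made up), here is
a small Python sketch of stripping tags from a record while skipping the
text of excluded elements:

```python
import xml.etree.ElementTree as ET


def strip_tags(xml_text, exclude=()):
    """Return the text content of xml_text, omitting excluded elements."""
    root = ET.fromstring(xml_text)
    parts = []

    def walk(el):
        if el.tag in exclude:
            return  # skip this element and everything inside it
        if el.text:
            parts.append(el.text)
        for child in el:
            walk(child)
            # Tail text belongs to the parent, so keep it even when the
            # child itself was excluded.
            if child.tail:
                parts.append(child.tail)

    walk(root)
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(" ".join(parts).split())


record = "<record><id>42</id><msg>disk <b>full</b> on sda</msg></record>"
print(strip_tags(record, exclude=("id",)))
```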


-Mike


Re: Store complete XML record (DIH XPathEntityProcessor)

2011-07-28 Thread Chantal Ackermann

Hi g,

Have a look at the PlainTextEntityProcessor:
http://wiki.apache.org/solr/DataImportHandler#PlainTextEntityProcessor

You will have to call the URL twice that way, but I don't think you can
get the complete document (the root element with all its structure) via
XPath, so the XPathEntityProcessor cannot help you.

If calling the URL twice slows your indexer down unacceptably,
you can always subclass XPathEntityProcessor (knowing Java is helpful,
though). There surely is a way to make it return what you need. Or
maybe write an entity processor that caches the content and uses
XPathEntityProcessor and PlainTextEntityProcessor to accomplish what you
need (I'm not sure whether the API allows for that).



Cheers,
Chantal






Re: Store complete XML record (DIH XPathEntityProcessor)

2011-07-28 Thread solruser@9913
Thanks Chantal
I am OK with the second call and I already tried using that. Unfortunately,
it reads the whole file into a field. My file looks like the example below:

<xml>
  <record>
  ...
  </record>

  <record>
  ...
  </record>

  <record>
  ...
  </record>
</xml>

Now the XPath does the 'for each /record' part. For each record I also need
to store the raw log in there. If I use the PlainTextEntityProcessor, then
it gives me the whole file (from <xml> .. </xml>) and not each individual
<record> </record>.

Am I using the PlainTextEntityProcessor wrong?

Thanks
g


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3207203.html
Sent from the Solr - User mailing list archive at Nabble.com.


Store complete XML record (DIH XPathEntityProcessor)

2011-07-27 Thread solruser@9913
I am trying to use DIH to import an XML based file with multiple XML records
in it.  Each record corresponds to one document in Lucene.  I am using the
DIH FileListEntityProcessor (to get file list) followed by the
XPathEntityProcessor to create the entities.  

It works perfectly and I am able to map XML elements to fields. However,
I also need to store the entire XML record as a separate 'full text' field.
Is there any way the XPathEntityProcessor provides a variable like 'rawLine'
or 'plainText' that I can map to a field?

I tried to use the PlainTextEntityProcessor after this, but that does not
recognize the XML record boundaries and just gives the whole XML file.


<entity name="x" rootEntity="true" dataSource="logfilereader"
        processor="XPathEntityProcessor"
        url="${logfile.fileAbsolutePath}" stream="false"
        forEach="/xml/myrecord"
        transformer="">
  <field column="mycol1"
         xpath="/xml/myrecord/@something"
  />

and so on ...
This works perfectly.  However I also need something like ...

  <field column="fullxmlrecord" name="plainText" />

Any help is much appreciated. I am a newbie and may be missing something
obvious here

-g



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Store-complete-XML-record-DIH-XPathEntityProcessor-tp3205524p3205524.html
Sent from the Solr - User mailing list archive at Nabble.com.