RE: full-text indexing XML files

2009-12-11 Thread Feroze Daud
CDATA didn't work either. It still complained that the input doc was not in
the correct format.



RE: full-text indexing XML files

2009-12-11 Thread Feroze Daud
Yeah, XML tags as well. Essentially we want to full-text index the file,
without the need for stemming the tokens.

Will the Solr analyzer be able to tokenize the document correctly if it
does not have any whitespace (besides what XML syntax requires)?
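
To make the whitespace concern concrete, here is a rough Python sketch. It is not Solr's analyzer; the two helper functions are hypothetical stand-ins that only show how the choice of token boundary changes the result on markup with no spaces between tags.

```python
import re

# Sample markup with no whitespace between tags (taken from the thread's example).
sample = "<listing><id>1001</id><name>NASA Advanced Research Labs</name></listing>"

def whitespace_tokens(text):
    """Split on whitespace only, so tags stay glued to neighbouring text."""
    return text.split()

def word_tokens(text):
    """Split on every non-alphanumeric character, so tag names become tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

for token in whitespace_tokens(sample):
    print("whitespace:", token)
for token in word_tokens(sample):
    print("word:", token)
```

Splitting on whitespace alone leaves tags fused to the text around them, while splitting on punctuation yields tag names and values as separate, unstemmed tokens. Which behaviour Solr gives depends on the tokenizer configured for the field type in the schema.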

 



Re: full-text indexing XML files

2009-12-11 Thread Walter Underwood
If you really want to do XML-sensitive search, it could be a lot of work in
Solr. Lucene is a flat data model, so hierarchy requires a lot of mapping to
the schema or fancy, slow queries.
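
As a rough illustration of what "mapping to the schema" can mean, the Python sketch below flattens an XML hierarchy into path-named values that could each become a flat field. The function name flatten and the slash-separated naming are assumptions made for this example, not a Solr or Lucene convention.

```python
import xml.etree.ElementTree as ET

# Sample document from the thread, written out as well-formed XML.
listing_xml = (
    "<listing><id>1001</id><name>NASA Advanced Research Labs</name>"
    "<address>1010 main street, chattanooga, FL 32212</address></listing>"
)

def flatten(element, prefix=""):
    """Return (path, text) pairs for every leaf element below `element`."""
    path = prefix + "/" + element.tag if prefix else element.tag
    children = list(element)
    if not children:
        return [(path, (element.text or "").strip())]
    pairs = []
    for child in children:
        pairs.extend(flatten(child, path))
    return pairs

flat_fields = flatten(ET.fromstring(listing_xml))
for path, value in flat_fields:
    print(path, "=", value)
```

Every distinct path needs a field in the schema (or a dynamic field pattern), which is where the mapping effort comes from once documents get deeper or more varied.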

There are engines that are designed for XML indexing and search, using XQuery, 
so consider whether that might be less work overall.

XML engines are less mature than Lucene and Solr, so there is a big performance 
and scalability gap between the best free engines (eXist) and the best 
commercial engines (Mark Logic, where I work).

wunder
Walter Underwood
Lead Engineer, Mark Logic

 
 



Re: full-text indexing XML files

2009-12-11 Thread Lance Norskog
Please post a small sample file that has this problem with CDATA.





-- 
Lance Norskog
goks...@gmail.com


Re: full-text indexing XML files

2009-12-10 Thread Lance Norskog
Or CDATA (much easier to work with).
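
A minimal sketch of the CDATA approach, assuming the update message is generated with Python. The helper name cdata is made up for illustration. The one thing to watch is that the text must not contain the sequence "]]>", so the helper splits that sequence across two CDATA sections.

```python
import xml.etree.ElementTree as ET

def cdata(text):
    """Wrap text in a CDATA section, splitting any embedded ']]>' terminator."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# The raw XML that should be indexed as plain text, tags included.
listing_xml = (
    "<listing><id>1001</id><name>NASA Advanced Research Labs</name>"
    "<address>1010 main street, chattanooga, FL 32212</address></listing>"
)

message = (
    '<add><doc><field name="id">1001</field>'
    '<field name="content">' + cdata(listing_xml) + "</field></doc></add>"
)

# Sanity check: a standard XML parser should see the listing as text, not as child elements.
parsed = ET.fromstring(message)
content_text = parsed.find("doc/field[@name='content']").text
print(content_text)
```

The section has to open with "<![CDATA[" and close with "]]>" exactly; anything else is parsed as ordinary markup.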





-- 
Lance Norskog
goks...@gmail.com


Re: full-text indexing XML files

2009-12-10 Thread Walter Underwood
What kind of searches do you want to do? Do you want to do searches that match 
the XML tags?

wunder

 



full-text indexing XML files

2009-12-09 Thread Feroze Daud
Hi!



I am trying to full text index an XML file. For various reasons, I
cannot use Tika or other technology to parse the XML file. The
requirement is to full-text index the XML file, including Tags and
everything.

 

So, I created an input index spec like this:

<add>
  <doc>
    <field name="id">1001</field>
    <field name="name">NASA Advanced Research Labs</field>
    <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
    <field name="content"><listing><id>1001</id> <name> NASA Advanced
    Research Labs </name> <address>1010 main street, chattanooga, FL
    32212</address></listing></field>
  </doc>
</add>

 

When I try to pump this into Solr with java -jar post.jar I get an
exception saying:

 

SimplePostTool: version 1.2

SimplePostTool: WARNING: Make sure your XML documents are encoded in
UTF-8, other encodings are not currently supported

SimplePostTool: POSTing files to http://localhost:8983/solr/update..

SimplePostTool: POSTing file index.doc

SimplePostTool: FATAL: Solr returned an error:
unexpected_XML_tag_doclisting

 

Any idea what I am doing wrong? Does the Solr index generator support
inner XML content in its field tags? I tried enclosing the inner XML in
<![CDATA[ ]]> but that didn't work either.

 

Any help appreciated.

 

Thanks

 

Feroze.



Re: full-text indexing XML files

2009-12-09 Thread Shalin Shekhar Mangar



You need to XML encode the value of the content field.
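
A minimal sketch of what XML-encoding the field value could look like, assuming the update message is generated with Python. The helper name build_add_message is hypothetical and not part of Solr; the field names come from the original post.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# The raw XML that should be indexed as plain text, tags included.
listing_xml = (
    "<listing><id>1001</id><name>NASA Advanced Research Labs</name>"
    "<address>1010 main street, chattanooga, FL 32212</address></listing>"
)

def build_add_message(doc_id, name, address, content):
    """Return an <add> message with every field value XML-escaped."""
    fields = [("id", doc_id), ("name", name), ("address", address), ("content", content)]
    body = "".join(
        '<field name="%s">%s</field>' % (field_name, escape(value))
        for field_name, value in fields
    )
    return "<add><doc>" + body + "</doc></add>"

message = build_add_message(
    "1001",
    "NASA Advanced Research Labs",
    "1010 Main Street, Chattanooga, FL 32212",
    listing_xml,
)

# Sanity check: the content field now holds text only, with no child elements.
root = ET.fromstring(message)
content_field = root.find("doc/field[@name='content']")
print(message)
```

After escaping, the content value starts with &lt;listing&gt; rather than a real tag, so the update parser no longer sees an unexpected element inside the field, and the parsed field text is the original XML string again.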

-- 
Regards,
Shalin Shekhar Mangar.