Getting error while indexing XML files on Hadoop

2015-01-13 Thread celebis


Hi to all from Istanbul, Turkey,

I can say that I'm a newbie in Solr and Hadoop.

I’m trying to index XML files (ipod_other.xml from Lucidworks’ example
files, converted into sequence file format) using the SolrXMLIngestMapper jars.
I’ve modified the schema.xml file by adding the fields used in the
ipod_other.xml file.

*Here’s my command:*
hadoop jar jobjar com.lucidworks.hadoop.ingest.IngestJob
-Dlww.commit.on.close=true -cls
com.lucidworks.hadoop.ingest.SolrXMLIngestMapper -c hdp1  -i
/user/hadoop/output/1420812982906sfu/part-r-0 -of
com.lucidworks.hadoop.io.LWMapRedOutputFormat -s
http://dc2vmhadappt01:8983/solr


In the end I constantly get a "Didn’t ingest any documents, failing" error.

Can anybody help me out with this problem? Any help is
appreciated.

Thanks

*Here are the additions to the schema.xml:*

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="name" multiValued="true" stored="true" type="text_en" indexed="true"/>
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />

<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true" />
<field name="inStock" type="boolean" indexed="true" stored="true" />

<field name="store" type="location" indexed="true" stored="true"/>

<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>

<field name="data_source" stored="false" type="text_en" indexed="true"/>


*And here is the ipod_other.xml file:*

<add>

<doc>
  <field name="id">F8V7067-APL-KIT</field>
  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter, white</field>
  <field name="weight">4</field>
  <field name="price">19.95</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>

  <field name="store">45.17614,-93.87341</field>
  <field name="manufacturedate_dt">2005-08-01T16:30:25Z</field>
</doc>

<doc>
  <field name="id">IW-02</field>
  <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter for iPod, white</field>
  <field name="weight">2</field>
  <field name="price">11.50</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>

  <field name="store">37.7752,-122.4232</field>
  <field name="manufacturedate_dt">2006-02-14T23:55:59Z</field>
</doc>

</add>






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-error-while-indexing-XML-files-on-Hadoop-tp4179168.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: [Indexing XML files in Solr with DataImportHandler]

2013-10-16 Thread kujta1
it is not indexing, it is saying there are no files indexed





Re: [Indexing XML files in Solr with DataImportHandler]

2013-10-16 Thread Gora Mohanty
On 16 October 2013 13:06, kujta1 <kujtim.rahm...@gmail.com> wrote:
> it is not indexing, it is saying there are no files indexed

If you expect answers on the mailing list it might be best to provide
details here. From a quick glance at Stackoverflow, it looks like you
need a FileListEntityProcessor.

Searching Google turns up many examples of using a FileDataSource,
e.g., see:
http://java.dzone.com/news/data-import-handler-%E2%80%93-import

Regards,
Gora


[Indexing XML files in Solr with DataImportHandler]

2013-10-15 Thread kujta1
Hello, I have problems with indexing the XML file format. My solrconfig, data-config
and Solr files are here:
http://stackoverflow.com/questions/19337979/indexing-xml-files-in-solr-with-dataimporthandler
Can somebody help me understand why this is not working? Thank you.




Re: [Indexing XML files in Solr with DataImportHandler]

2013-10-15 Thread Shalin Shekhar Mangar
What is not working? Are you seeing any exceptions in the logs?






-- 
Regards,
Shalin Shekhar Mangar.


RE: full-text indexing XML files

2009-12-11 Thread Feroze Daud
CDATA didn’t work either. It still complained about the input doc not being in
the correct format.

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Thursday, December 10, 2009 7:43 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

Or CDATA (much easier to work with).



RE: full-text indexing XML files

2009-12-11 Thread Feroze Daud
Yeah, XML tags as well. Essentially we want to full-text index the file,
without the need for stemming the tokens.

Will the Solr analyzer be able to tokenize the document correctly if it
does not have any whitespace (besides that required by XML syntax)?

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Thursday, December 10, 2009 8:00 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

What kind of searches do you want to do? Do you want to do searches that
match the XML tags?

wunder




Re: full-text indexing XML files

2009-12-11 Thread Walter Underwood
If you really want to do XML-sensitive search, it could be a lot of work in
Solr. Lucene has a flat data model, so hierarchy requires a lot of mapping to
the schema or fancy, slow queries.

There are engines that are designed for XML indexing and search, using XQuery, 
so consider whether that might be less work overall.

XML engines are less mature than Lucene and Solr, so there is a big performance 
and scalability gap between the best free engines (eXist) and the best 
commercial engines (Mark Logic, where I work).

wunder
Walter Underwood
Lead Engineer, Mark Logic

 
 



Re: full-text indexing XML files

2009-12-11 Thread Lance Norskog
Please post a small sample file that has this problem with CDATA.





-- 
Lance Norskog
goks...@gmail.com


Re: full-text indexing XML files

2009-12-10 Thread Lance Norskog
Or CDATA (much easier to work with).

On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar
shalinman...@gmail.com wrote:



 You need to XML encode the value of the content field.

 --
 Regards,
 Shalin Shekhar Mangar.




-- 
Lance Norskog
goks...@gmail.com


Re: full-text indexing XML files

2009-12-10 Thread Walter Underwood
What kind of searches do you want to do? Do you want to do searches that match 
the XML tags?

wunder

 



full-text indexing XML files

2009-12-09 Thread Feroze Daud
Hi!



I am trying to full text index an XML file. For various reasons, I
cannot use Tika or other technology to parse the XML file. The
requirement is to full-text index the XML file, including Tags and
everything.

 

So, I created a input index spec like this:

 

<add>
<doc>
<field name="id">1001</field>
<field name="name">NASA Advanced Research Labs</field>
<field name="address">1010 Main Street, Chattanooga, FL 32212</field>
<field name="content"><listing><id>1001</id> <name>  NASA Advanced
Research Labs </name> <address>1010 main street, chattanooga, FL
32212</address></listing></field>
</doc>
</add>

 

When I try to pump this into Solr with java -jar post.jar, I get an
exception saying:

 

SimplePostTool: version 1.2

SimplePostTool: WARNING: Make sure your XML documents are encoded in
UTF-8, other encodings are not currently supported

SimplePostTool: POSTing files to http://localhost:8983/solr/update..

SimplePostTool: POSTing file index.doc

SimplePostTool: FATAL: Solr returned an error:
unexpected_XML_tag_doclisting

 

Any idea what I am doing wrong? Does the Solr index generator support
inner XML content in its field tags? I tried enclosing the inner XML in
<![CDATA[ ]]> but that didn't work either.

 

Any help appreciated.

 

Thanks

 

Feroze.



Re: full-text indexing XML files

2009-12-09 Thread Shalin Shekhar Mangar



You need to XML encode the value of the content field.

-- 
Regards,
Shalin Shekhar Mangar.
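
For readers following the advice above, here is a minimal Python sketch of what "XML encode the value" can look like when the update message is generated by a script. It is an illustration, not what anyone in the thread ran: the helper name build_add_doc and the shortened listing string are hypothetical. It uses the standard library's escape and quoteattr so that the inner markup reaches Solr as text rather than as unexpected child elements.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_add_doc(fields):
    """Build a Solr <add><doc> message, XML-escaping every field value.

    `fields` is a list of (name, value) pairs. This helper is hypothetical.
    """
    lines = ["<add>", "<doc>"]
    for name, value in fields:
        # quoteattr() adds the surrounding quotes and escapes the attribute;
        # escape() turns <, > and & in the value into entities.
        lines.append("<field name=%s>%s</field>" % (quoteattr(name), escape(value)))
    lines.extend(["</doc>", "</add>"])
    return "\n".join(lines)


# Shortened stand-in for the raw listing XML that should be indexed as text.
raw_listing = "<listing><id>1001</id><name>NASA Advanced Research Labs</name></listing>"
payload = build_add_doc([("id", "1001"), ("content", raw_listing)])
print(payload)

# Round trip: an XML parser sees two flat fields and hands back the raw markup.
parsed_fields = ET.fromstring(payload).findall("doc/field")
print(parsed_fields[1].text == raw_listing)
```

The round trip at the end shows the point of escaping: the parser no longer sees a nested listing element, only a text value that happens to contain markup characters.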


Re: Error when indexing XML files

2009-10-16 Thread Fergus McMenemie
>Hi,
>
>Please find the schema file attached. Please let me know what I am doing wrong.
>
>Regards
>Chaitali



   <field name="facility" type="string" indexed="true" stored="true"/>
   <field name="name" type="text" indexed="true" stored="true"/>

One or other, perhaps both, of your fields need to be

   <field name="facility" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="name" type="text" indexed="true" stored="true" multiValued="true"/>



-- 
Fergus.
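
One way to find out which fields need multiValued="true" is to check which field names repeat inside a single doc before posting. The following is a minimal Python sketch, assuming the documents are in Solr's add/doc/field format. The function name repeated_fields and the sample values are hypothetical; the doubled "trackless" value is only a guess based on the error text in the original question.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def repeated_fields(add_xml):
    """Return the sorted field names that occur more than once in any <doc>."""
    repeated = set()
    for doc in ET.fromstring(add_xml).iter("doc"):
        counts = Counter(field.get("name") for field in doc.iter("field"))
        repeated.update(name for name, n in counts.items() if n > 1)
    return sorted(repeated)


# Hypothetical sample: "facility" appears twice in the same document.
sample = """<add><doc>
<field name="name">SOMETHING</field>
<field name="facility">trackless</field>
<field name="facility">trackless</field>
</doc></add>"""

print(repeated_fields(sample))
```

Any name this reports either needs multiValued="true" in the schema or needs its values merged before the document is sent.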




Re: Error when indexing XML files

2009-10-14 Thread Fergus McMenemie

Can you send us the Schema.xml file you created? I suspect that 
one of the fields should be multivalued.

-- 
Fergus.


Re: Error when indexing XML files

2009-10-14 Thread Chaitali Gupta
Hi, 

Please find the schema file attached. Please let me know what I am doing wrong. 

Regards
Chaitali 




<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!--
 This is the Solr schema file. This file should be named "schema.xml" and
 should be in the conf directory under the solr home
 (i.e. ./solr/conf/schema.xml by default)
 or located where the classloader for the Solr webapp can find it.

 This example schema is the recommended starting point for users.
 It should be kept correct and concise, usable out-of-the-box.

 For more information, on how to customize this file, please see
 http://wiki.apache.org/solr/SchemaXml
-->

<schema name="example" version="1.1">
  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       Applications should change this to reflect the nature of the search collection.
       version="1.1" is Solr's version number for the schema syntax and semantics.  It should
       not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by nature
       1.1: multiValued attribute introduced, false by default -->

  <types>
    <!-- field type definitions. The "name" attribute is
         just a label to be used by field definitions.  The "class"
         attribute and any other attributes determine the real
         behavior of the fieldType.
         Class names starting with "solr" refer to java classes in the
         org.apache.solr.analysis package.
    -->

    <!-- The StrField type is not analyzed, but indexed/stored verbatim.
         - StrField and TextField support an optional compressThreshold which
         limits compression (if enabled in the derived fields) to values which
         exceed a certain size (in characters).
    -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

    <!-- The optional sortMissingLast and sortMissingFirst attributes are
         currently supported on types that are sorted internally as strings.
         - If sortMissingLast="true", then a sort on this field will cause documents
           without the field to come after documents with the field,
           regardless of the requested sort order (asc or desc).
         - If sortMissingFirst="true", then a sort on this field will cause documents
           without the field to come before documents with the field,
           regardless of the requested sort order.
         - If sortMissingLast="false" and sortMissingFirst="false" (the default),
           then default lucene sorting will be used which places docs without the
           field first in an ascending sort and last in a descending sort.
    -->


    <!-- numeric field types that store and index the text
         value verbatim (and hence don't support range queries, since the
         lexicographic ordering isn't equal to the numeric ordering) -->
    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>


    <!-- Numeric field types that manipulate the value into
         a string value that isn't human-readable in its internal form,
         but with a lexicographic ordering the same as the numeric

Error when indexing XML files

2009-10-13 Thread Chaitali Gupta
Hi, 

I am trying to index XML files using SolrJ. The original XML file contains 
nested elements. For example, the following is the snippet of the XML file. 

<entry>
  <name>SOMETHING </name>
  <facility>SOME_OTHER_THING</facility>
</entry>

I have added the elements name and facility in Schema.xml file to make
these elements indexable. I have changed the XML document above to look like -

<add>
<doc>
 ..
 <field name="name">SOMETHING</field>
 ..
</doc>
</add>

 I am getting the following error when I start Jetty - 

org.apache.solr.common.SolrException: 
ERROR_5457843_multiple_values_encountered_for_non_multiValued_field_facility___tracklesstrackless_

Can anyone please let me know if there is something I am doing wrong ? 

How can I maintain the parent-child relationship of the original XML file in 
the modified XML file?  Can I not use the original XML file as it is for 
indexing purposes? 

Thanks in advance. 

- Chaitali 



  

Re: Question on modifying solr behavior on indexing xml files..

2009-10-02 Thread Shalin Shekhar Mangar
On Thu, Oct 1, 2009 at 3:10 PM, Thung, Peter C CIV SPAWARSYSCEN-PACIFIC,
56340 <peter.th...@navy.mil> wrote:

> 1.  In my playing around with
> sending in an XML document within an XML CDATA tag,
> with termVectors="true"
>
> I noticed the following behavior:
> <person>peter</person>
> collapses to the term
> personpeterperson
> instead of
> person
> and
> peter separately.
>
> I realize I could try and do a search and replace of characters like
> =  to a space so that the default parser/indexer can preserve element
> names.
> However, I'm wondering if someone could point me to where one might do
> this within
> the Solr or Apache Lucene code as a proper plug-in, with maybe an example
> that I could use
> as a template.  Also where in the solrconfig.xml file I would want to
> change to reference the new parser.


Solr is agnostic of the content in a schema field. It does not know that it
is XML and hence it will do blind tokenization/filtering as defined for the
field type in schema.xml

If all you want is to do a full-text search on words found somewhere in that
XML, then your approach of replacing = to a space will work fine. You can
use the PatternReplaceFilter and specify a regex which matches these special
characters and replaces them by a space.

<filter class="solr.PatternReplaceFilterFactory" pattern="([=])"
replacement=" " replace="all"/>

Or you can use the MappingCharFilter (solr 1.4 feature) and specify a
mapping file which has these special characters mapped to a space.

<charFilter class="solr.MappingCharFilterFactory"
mapping="special-xml-symbols.txt"/>

The file should be in the format:
"characterToBeReplaced" => "replacementChar"
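
To see the effect of the replacement outside Solr, here is a small Python sketch. It only imitates the idea (replace markup characters with spaces, then split on whitespace) and is not Solr's analysis chain. The character set `<`, `>`, `/` and `=` is an assumption, since the exact characters in the original message were lost in the archive.

```python
import re

# Assumed set of markup characters to blank out before tokenizing.
MARKUP_CHARS = re.compile(r"[<>/=]")


def whitespace_tokens(text):
    """Replace markup characters with spaces, then split on whitespace."""
    return MARKUP_CHARS.sub(" ", text).split()


tokens = whitespace_tokens("<person>peter</person>")
print(tokens)  # ['person', 'peter', 'person']
```

With the markup characters turned into spaces, the element name and the value come out as separate terms instead of the single run personpeterperson.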

However, if you want to preserve the structure of the XML document, it is
best to parse it out yourself and put contents into Solr fields before
sending it to Solr. You may also want to look at DataImportHandler and
XPathEntityProcessor which is commonly used for importing XML files.

http://wiki.apache.org/solr/DataImportHandler
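
As a rough illustration of "parse it out yourself", the Python sketch below flattens a nested XML document into (field name, value) pairs before anything is sent to Solr. Everything in it is an assumption made for the example: the function name flatten, the sample message, and the naming scheme that joins element names with underscores. Each resulting name would still have to exist in schema.xml, for instance through a dynamicField.

```python
import xml.etree.ElementTree as ET


def flatten(element, prefix=""):
    """Yield (field_name, text) pairs, using the element path as the name."""
    path = f"{prefix}_{element.tag}" if prefix else element.tag
    text = (element.text or "").strip()
    if text:
        yield path, text
    for child in element:
        # Recurse so that nesting is kept in the field name.
        yield from flatten(child, path)


# Hypothetical nested message used only for this sketch.
source = "<message><person>peter</person><unit><name>ops</name></unit></message>"
fields = list(flatten(ET.fromstring(source)))
print(fields)
```

Keeping the parent path in the field name is one simple way to preserve some of the hierarchy in a flat index; DataImportHandler with XPathEntityProcessor does a comparable mapping through XPath expressions in its configuration.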


> 2.  My other question would also be if this technique would work for XML
> type messages embedded
> in Microsoft Excel or PowerPoint presentations, where I would like to
> preserve knowing XML element term frequencies
> where I would try and leverage the component that automatically indexes
> Microsoft documents.
> Would I need to modify that component and customize it?


Perhaps somebody who knows about Solr Cell can answer this but I think it
should work.

-- 
Regards,
Shalin Shekhar Mangar.


Question on modifying solr behavior on indexing xml files..

2009-10-01 Thread Thung, Peter C CIV SPAWARSYSCEN-PACIFIC, 56340
1.  In my playing around with
sending in an XML document within an XML CDATA tag,
with termVectors="true"

I noticed the following behavior:
<person>peter</person>
collapses to the term
personpeterperson
instead of
person
and
peter separately.

I realize I could try and do a search and replace of characters like
=  to a space so that the default parser/indexer can preserve element
names.
However, I'm wondering if someone could point me to where one might do
this within
the Solr or Apache Lucene code as a proper plug-in, with maybe an example
that I could use
as a template.  Also where in the solrconfig.xml file I would want to
change to reference the new parser.

2.  My other question would also be if this technique would work for XML
type messages embedded
in Microsoft Excel or PowerPoint presentations, where I would like to
preserve knowing XML element term frequencies
where I would try and leverage the component that automatically indexes
Microsoft documents.
Would I need to modify that component and customize it?
 
-Peter
 
 



Re: query regarding Indexing xml files -db-data-config.xml

2009-05-18 Thread jayakeerthi s
Hi  Noble,

Thanks for the reply,

As advised, I have changed the db-data-config.xml as below, but I still get
<str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0
documents.</str>

<dataConfig>
<dataSource type="FileDataSource" name="xmlindex"/>
<document name="products">
 <entity name="xmlfile" processor="FileListEntityProcessor"
fileName="c:\\test\\ipod_other.xml" recursive="true" rootEntity="false"
dataSource="null" baseDir="${dataimporter.request.xmlDataDir}"
useSolrAddSchema="true">
<entity name="data" processor="XPathEntityProcessor"
url="${xmlfile.fileAbsolutePath}">
  <field column="manu" name="manu"/>
  </entity>
   </entity>
   </document>
</dataConfig>


Got error as below when baseDir is removed

INFO: last commit = 1242683454570
May 18, 2009 2:55:15 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1
        at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:76)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:299)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:324)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:382)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:363)
May 18, 2009 2:55:15 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback

Please advise.

Thanks and regards,
Jay

2009/5/17 Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@corp.aol.com>

> Hi,
> You may not need that enclosing entity if you only wish to index one file.
>
> baseDir is not required if you give an absolute path in the fileName.
>
> No need to mention forEach or fields if you set useSolrAddSchema="true"
>
> On Sat, May 16, 2009 at 1:23 AM, jayakeerthi s <mail2keer...@gmail.com>
> wrote:
> > Hi All,
> >
> > I am trying to index the fields from the XML files; here is the
> > configuration that I am using.
> >
> > db-data-config.xml
> >
> > <dataConfig>
> >    <dataSource type="FileDataSource" name="xmlindex"/>
> >    <document name="products">
> >     <entity name="xmlfile" processor="FileListEntityProcessor"
> > fileName="c:\test\ipod_other.xml" recursive="true" rootEntity="false"
> > dataSource="null" baseDir="${dataimporter.request.xmlDataDir}">
> >     <entity name="data" processor="XPathEntityProcessor" forEach="/record |
> > /the/record/xpath" url="${xmlfile.fileAbsolutePath}">
> >            <field column="manu" name="manu"/>
> >     </entity>
> >    </entity>
> >  </document>
> > </dataConfig>
> >
> > Schema.xml has the field manu
> >
> > The input XML file used to import the field is
> >
> > <doc>
> >  <field name="id">F8V7067-APL-KIT</field>
> >  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
> >  <field name="manu">Belkin</field>
> >  <field name="cat">electronics</field>
> >  <field name="cat">connector</field>
> >  <field name="features">car power adapter, white</field>
> >  <field name="weight">4</field>
> >  <field name="price">19.95</field>
> >  <field name="popularity">1</field>
> >  <field name="inStock">false</field>
> > </doc>
> >
> > When doing the full-import, this is the response I am getting
> >
> > - <lst name="statusMessages">
> >  <str name="Total Requests made to DataSource">0</str>
> >  <str name="Total Rows Fetched">0</str>
> >  <str name="Total Documents Skipped">0</str>
> >  <str name="Full Dump Started">2009-05-15 11:58:00</str>
> >  <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0
> > documents.</str>
> >  <str name="Committed">2009-05-15 11:58:00</str>
> >  <str name="Optimized">2009-05-15 11:58:00</str>
> >  <str name="Time taken">0:0:0.172</str>
> >  </lst>
> >  <str name="WARNING">This response format is experimental. It is likely
> > to change in the future.</str>
> >  </response>
> >
> > Am I missing anything here, or is there a required format for the input XML?
> > Please help resolving this.
> >
> > Thanks and regards,
> > Jay
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com


Re: query regarding Indexing xml files -db-data-config.xml

2009-05-17 Thread Noble Paul നോബിള്‍ नोब्ळ्
Hi,
You may not need that enclosing entity if you only wish to index one file.

baseDir is not required if you give an absolute path in fileName.

There is no need to specify forEach or fields if you set useSolrAddSchema="true".





-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: query regarding Indexing xml files -db-data-config.xml

2009-05-16 Thread Fergus McMenemie
Hmmm, 

I thought that if you were using the XPathEntityProcessor, you
had to specify an xpath for each of the fields you want to
populate. Unless you are using XPathEntityProcessor's
useSolrAddSchema mode?

Fergus.


-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


query regarding Indexing xml files -db-data-config.xml

2009-05-15 Thread jayakeerthi s
Hi All,

I am trying to index the fields from XML files; here is the
configuration that I am using.


db-data-config.xml

<dataConfig>
  <dataSource type="FileDataSource" name="xmlindex"/>
  <document name="products">
    <entity name="xmlfile" processor="FileListEntityProcessor"
            fileName="c:\test\ipod_other.xml" recursive="true" rootEntity="false"
            dataSource="null" baseDir="${dataimporter.request.xmlDataDir}">
      <entity name="data" processor="XPathEntityProcessor"
              forEach="/record | /the/record/xpath" url="${xmlfile.fileAbsolutePath}">
        <field column="manu" name="manu"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Schema.xml has the field manu

The input xml file used to import the field is

<doc>
  <field name="id">F8V7067-APL-KIT</field>
  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter, white</field>
  <field name="weight">4</field>
  <field name="price">19.95</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>


When doing the full-import, this is the response I am getting:

<lst name="statusMessages">
  <str name="Total Requests made to DataSource">0</str>
  <str name="Total Rows Fetched">0</str>
  <str name="Total Documents Skipped">0</str>
  <str name="Full Dump Started">2009-05-15 11:58:00</str>
  <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
  <str name="Committed">2009-05-15 11:58:00</str>
  <str name="Optimized">2009-05-15 11:58:00</str>
  <str name="Time taken">0:0:0.172</str>
</lst>
<str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>


Am I missing anything here, or is there a required format for the input XML?
Please help me resolve this.

Thanks and regards,
Jay


Re: query regarding Indexing xml files -db-data-config.xml

2009-05-15 Thread Jay Hill
If that is your complete input file then it looks like you are missing the
wrapping <add></add> element:

<add>
<doc>
  <field name="id">F8V7067-APL-KIT</field>
  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter, white</field>
  <field name="weight">4</field>
  <field name="price">19.95</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>
</add>

Is it possible you just forgot to include the <add>?

-Jay





Re: query regarding Indexing xml files -db-data-config.xml

2009-05-15 Thread jayakeerthi s
Many thanks for the reply

The complete input XML file is below; I forgot to include it earlier.


<add>
<doc>
  <field name="id">F8V7067-APL-KIT</field>
  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter, white</field>
  <field name="weight">4</field>
  <field name="price">19.95</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>
<doc>
  <field name="id">IW-02</field>
  <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter for iPod, white</field>
  <field name="weight">2</field>
  <field name="price">11.50</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>
</add>

regards,
Jay



Re: Indexing XML files

2006-12-07 Thread mirko
Thank you all for the quick responses.  They were very helpful.

My XML is well-formed, so I ended up implementing my own FieldType:

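import java.io.IOException;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.request.XMLWriter;
import org.apache.solr.schema.TextField;

public class XMLField extends TextField {
  // Write the stored value as an <xml> element; the final "false" turns escaping off.
  public void write(XMLWriter xmlWriter, String name, Fieldable f) throws IOException {
    xmlWriter.writePrim("xml", name, f.stringValue(), false);
  }
}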

I looked at the XSD and there is one thing I don't understand:

If the desired way is to conform to the XSD (and hence the types used in the XSD),
then how would it be possible to use user-defined fieldtypes as plugins?  Wouldn't
they violate the same principle?

thanks,
mirko


Quoting Chris Hostetter [EMAIL PROTECTED]:
...
 I think Walter's got the right idea ... as a general rule, we want to make
 the XmlResponseWriter bullet proof so that no matter what data you put
 into your index, it is guaranteed to produce a well formed XML document
 that conforms to a specified DTD, or XSD (see SOLR-17 for one we already
 have but we haven't figured out what to do with yet)

...

 if you're interested in writing a bit of custom java code you could in
 fact write a new FieldType (which could easily subclass TextField) with a
 custom write method that just outputs the raw value directly, and then
 load your field type as a plugin...

   http://wiki.apache.org/solr/SolrPlugins

 -Hoss





Re: Indexing XML files

2006-12-07 Thread Chris Hostetter

: I looked at the XSD and there is one thing I don't understand:
:
: If the desired way is to conform to the XSD (and hence the types used in XSD),
: then how would it possible to use user-defined fieldtypes as plugins?  Wouldn't
: they violate the same principle?

The XSD is intended to match the behavior of the XmlResponseWriter and the
core solr code base ... if you write a new ResponseWriter (or use one of
the other built in ResponseWriters like JSON or Ruby) then all bets are
off.  if you are writing a new FieldType, then you might still be able to
use the XSD as is if your data can easily be represented using one of the
'primitive' types (ie: i might add a new LonLatFieldType class for
efficiently storing/searching geographic coordinates, but when writing as
XML the syntax <str>+37.774395-122.422156</str> might work fine)

In a case like yours, where you genuinely need to extend the list of valid
tags, XMLSchema has a mechanism for that by letting you define your
own XSD which can reuse the elements defined in the main XSD. (the same
way DTDs can reuse elements from other DTDs)

all of this being a somewhat theoretical issue: since Solr doesn't
currently do anything with that XSD ... I assume if/when it does, it will
be voluntary (ie: there might be a config option to have it include an XSD
of your choice in the XML header of the responses so you can validate if you
choose to)



-Hoss



Re: Indexing XML files

2006-12-06 Thread Graham O'Regan

couldn't you use a cdata section?

Chris Hostetter wrote:

Since XML is the transport for sending data to Solr, you need to make sure
all field values are XML escaped.

If you wanted to index a plain text title and that title contained an
ampersand character

Sense & Sensibility

...you would need to XML escape that as...

Sense &amp; Sensibility

...Solr internally will treat that consistently as the Java string "Sense
& Sensibility" and when it comes time to return that string back to your
query clients, will output it in whatever form is appropriate for your
ResponseWriter -- if that's XML, then it will be XML escaped again, if
it's JSON or something like it, it can probably be left alone.

The same holds true for any other characters you want to include in your
field values: Solr doesn't care that the *value* itself is an XML string,
just that you properly escape the value in your XML <add><doc> message to
Solr...

 <add>
  <doc>
   <field name="title">As You Like it</field>
   <field name="author">Shakespeare, William</field>
   <field name="record">&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;</field>
  </doc>
 </add>

...does that make sense?

: Ideally, I would like to store the xml as is, and index only the content
: removing the xml-tags (I believe there is HTMLStripWhitespaceAnalyzer for
: that).
: And output the result as an xml (so, simple escaping does not work for me).

the escaping is just to send the data to Solr -- once sent, Solr will
process the unescaped string when dealing with analyzers, etc exactly as
you'd expect.


-Hoss
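
The escaping step described above can be sketched in a few lines of Java. This is only an illustration: XmlEscapeSketch, escapeXml and field are hypothetical names, not part of Solr, and the sketch covers just the characters that matter inside element text and double-quoted attributes.

```java
// Hypothetical helper (not a Solr class): escapes a value before it is
// placed inside a <field> element of an XML update message.
final class XmlEscapeSketch {

    // Replace the characters that are special in XML text and attributes.
    static String escapeXml(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Build one <field> element with both the name and the value escaped.
    static String field(String name, String value) {
        return "<field name=\"" + escapeXml(name) + "\">" + escapeXml(value) + "</field>";
    }

    public static void main(String[] args) {
        System.out.println(field("title", "Sense & Sensibility"));
        System.out.println(field("record", "<myxml>here goes the xml...</myxml>"));
    }
}
```

The second call shows the case from this thread: the whole XML record is escaped once on the way in, and Solr stores and analyzes the unescaped string.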


  


Re: Indexing XML files

2006-12-06 Thread Yonik Seeley

On 12/6/06, Graham O'Regan [EMAIL PROTECTED] wrote:

couldn't you use a cdata section?


That's just another form of escaping.  Mirko actually wants the XML
field value to be part of the XML of Solr's response, not encapsulated
by it.

-Yonik
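
To see why a CDATA section is only another way of writing the same escaped value, here is a small sketch using the JDK's standard DOM parser. The class and method names (CdataVsEntitySketch, fieldText) are made up for the illustration; only the javax.xml and org.w3c.dom calls are real APIs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: parses a single element and returns its text value.
final class CdataVsEntitySketch {

    static String fieldText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates text and CDATA children alike.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<field name=\"record\">&lt;myxml&gt;text&lt;/myxml&gt;</field>";
        String cdata = "<field name=\"record\"><![CDATA[<myxml>text</myxml>]]></field>";
        // Both forms hand the parser's caller the same string value.
        System.out.println(fieldText(escaped));
        System.out.println(fieldText(escaped).equals(fieldText(cdata)));
    }
}
```

Either way the receiver gets the string <myxml>text</myxml> as character data, not as child elements, which is why CDATA does not help when the value should become part of the response markup.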


Indexing XML files

2006-12-05 Thread mirko
Hi,

I am trying to index an xml file as a field in lucene, see example below:

<add>
 <doc>
  <field name="title">As You Like it</field>
  <field name="author">Shakespeare, William</field>
  <field name="record"><myxml>here goes the xml...</myxml></field>
 </doc>
</add>

I can index the title and author fields because they are strings, but the
record field is an xml itself and I bump into some problems as I cannot
directly input an xml file using the post.sh script (solr complains).


I wonder what would be the correct (and relatively simple) way of doing it. 
Ideally, I would like to store the xml as is, and index only the content
removing the xml-tags (I believe there is HTMLStripWhitespaceAnalyzer for
that).
And output the result as an xml (so, simple escaping does not work for me).


So far, I had the idea of escaping the xml record and then unescaping it for
inner storage and using the analyzer for indexing (which would possibly
require creating a class like XMLField or such).

thanks,
mirko


Re: Indexing XML files

2006-12-05 Thread Mike Klaas

On 12/5/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

You are right, it is escaped.  But my question is: (how) can I
make it unescaped?


I don't think solr will support such functionality.  The xml that solr
uses to return data is completely orthogonal to the xml embedded in
the data, and mixing the two would have utterly unpredictable results.
What if a document contained a str ... element?  That could crash
the parsing code, or leave it vulnerable to injection attacks.

Try using the JSON output format if you absolutely have no way of
unescaping the resulting data (though I'd expect that any
self-respecting xml parser would do that for you).

-MIke


Re: Indexing XML files

2006-12-05 Thread mirko
Hi,

Thanks for the quick response.  Now, I have one more question.
Is it possible to get the result for a query back in the following form
(considering the input is the escaped xml, what you mentioned before):

<response>
 <responseHeader>
  <status>0</status>
  <QTime>0</QTime>
 </responseHeader>

 <result numFound="1" start="0">
  <doc>
   <str name="label">As You Like It (Promptbook of McVicars 1860)</str>
   <str name="author">Shakespeare, William,</str>
   <str name="record"><myxml>...</myxml></str>
  </doc>
 </result>
</response>

Note that here the XML data is not escaped.  If yes, what do I have to do
to get such results back?  Would <str> need to be replaced with a type, say,
<xml>, which has a different write method?  Or will I only be able to display
escaped XML within <str> (and any other types)?  If so, why?

thanks,
mirko







Re: Indexing XML files

2006-12-05 Thread Yonik Seeley

On 12/5/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

Thanks for the quick response.  Now, I have one more question.
Is it possible to get the result for a query back in the following form
(considering the input is the escaped xml, what you mentioned before):

<response>
 <responseHeader>
  <status>0</status>
  <QTime>0</QTime>
 </responseHeader>

 <result numFound="1" start="0">
  <doc>
   <str name="label">As You Like It (Promptbook of McVicars 1860)</str>
   <str name="author">Shakespeare, William,</str>
   <str name="record"><myxml>...</myxml></str>
  </doc>
 </result>
</response>

Note that here the XML data is not escaped.


I bet it is escaped, but your browser has helpfully displayed it as unescaped.
Try doing CTRL-U in firefox to see the real source for the reply.


-Yonik


Re: Indexing XML files

2006-12-05 Thread mirko
Hi,

the idea is to apply XSLT transformation on the result.  But it seems that
I would have to apply two transformations in a row, one which unescapes the
escaped node and a second which performs the actual transformation...

mirko


Quoting Yonik Seeley [EMAIL PROTECTED]:

 On 12/5/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  You are right, it is escaped.  But my question is: (how) can I
  make it unescaped?

 For what purpose?
 If you use an XML parser, the values it gives back to you will be unescaped.

 -Yonik





Re: Indexing XML files

2006-12-05 Thread Walter Underwood
At some point, it would be simpler to write a custom response handler
and generate the output in your desired XML format.

wunder




Re: Indexing XML files

2006-12-05 Thread Chris Hostetter

: At some point, it would be simpler to write a custom response handler
: and generate the output in your desired XML format.

I think Walter's got the right idea ... as a general rule, we want to make
the XmlResponseWriter bullet proof so that no matter what data you put
into your index, it is guaranteed to produce a well formed XML document
that conforms to a specified DTD, or XSD (see SOLR-17 for one we already
have but we haven't figured out what to do with yet)

But I can certainly understand your use case: you know you have
well-formed XML values in some fields, and want to be able to apply
a simple XSL transform on the whole response, and use XPath selectors to
pull data out of your response fields.

the best approach i can think of that should work for you out of the box
is what you already said: two XSL transforms ... one can be applied
on the Solr server using the wt=xslt response -- just create an XSL that
generates XML and unescapes the fields you know will contain well-formed
XML data -- then apply your second transform client side (or using a
proxy)

if you're interested in writing a bit of custom java code you could in
fact write a new FieldType (which could easily subclass TextField) with a
custom write method that just outputs the raw value directly, and then
load your field type as a plugin...

http://wiki.apache.org/solr/SolrPlugins

-Hoss
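
As a rough client-side illustration of the two-step idea in this thread, the sketch below parses a response, reads the escaped record value, and parses that value a second time to get real elements. It uses a plain DOM re-parse rather than XSLT, it assumes the stored value is well-formed XML, and every name in it (UnescapeRecordSketch, embeddedRoot, the sample response) is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical client-side helper, not part of Solr.
final class UnescapeRecordSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }

    // Step 1: the parser unescapes the <str> value while reading the response.
    // Step 2: that value is itself XML, so parse it again to get elements.
    static Element embeddedRoot(String responseXml, String fieldName) {
        NodeList strs = parse(responseXml).getElementsByTagName("str");
        for (int i = 0; i < strs.getLength(); i++) {
            Element str = (Element) strs.item(i);
            if (fieldName.equals(str.getAttribute("name"))) {
                return parse(str.getTextContent()).getDocumentElement();
            }
        }
        throw new IllegalArgumentException("no str element named " + fieldName);
    }

    public static void main(String[] args) {
        String response = "<response><result numFound=\"1\" start=\"0\"><doc>"
                + "<str name=\"author\">Shakespeare, William</str>"
                + "<str name=\"record\">&lt;myxml&gt;&lt;act&gt;1&lt;/act&gt;&lt;/myxml&gt;</str>"
                + "</doc></result></response>";
        Element root = embeddedRoot(response, "record");
        System.out.println(root.getTagName());
    }
}
```

Once the embedded root element is available, a single XSL transform or XPath query can be run against it on the client, which is the second step described above.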