Getting error while indexing XML files on Hadoop
Hi to all from Istanbul, Turkey.

I'm a newbie with Solr on Hadoop. I'm trying to index XML files (ipod_other.xml from the LucidWorks example files, converted into sequence file format) using the SolrXMLIngestMapper jars. I've modified schema.xml by adding the fields used in ipod_other.xml.

Here's my command:

    hadoop jar jobjar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SolrXMLIngestMapper -c hdp1 -i /user/hadoop/output/1420812982906sfu/part-r-0 -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://dc2vmhadappt01:8983/solr

In the end I constantly get a "Didn't ingest any documents, failing" error. Can anybody help me with this problem? Any help is appreciated.

Thanks

Here are the additions to schema.xml:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="name" multiValued="true" stored="true" type="text_en" indexed="true"/>
    <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
    <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
    <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
    <field name="weight" type="float" indexed="true" stored="true"/>
    <field name="price" type="float" indexed="true" stored="true"/>
    <field name="popularity" type="int" indexed="true" stored="true" />
    <field name="inStock" type="boolean" indexed="true" stored="true" />
    <field name="store" type="location" indexed="true" stored="true"/>
    <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
    <field name="data_source" stored="false" type="text_en" indexed="true"/>

And here is the ipod_other.xml file:

    <add>
      <doc>
        <field name="id">F8V7067-APL-KIT</field>
        <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
        <field name="manu">Belkin</field>
        <field name="cat">electronics</field>
        <field name="cat">connector</field>
        <field name="features">car power adapter, white</field>
        <field name="weight">4</field>
        <field name="price">19.95</field>
        <field name="popularity">1</field>
        <field name="inStock">false</field>
        <field name="store">45.17614,-93.87341</field>
        <field name="manufacturedate_dt">2005-08-01T16:30:25Z</field>
      </doc>
      <doc>
        <field name="id">IW-02</field>
        <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
        <field name="manu">Belkin</field>
        <field name="cat">electronics</field>
        <field name="cat">connector</field>
        <field name="features">car power adapter for iPod, white</field>
        <field name="weight">2</field>
        <field name="price">11.50</field>
        <field name="popularity">1</field>
        <field name="inStock">false</field>
        <field name="store">37.7752,-122.4232</field>
        <field name="manufacturedate_dt">2006-02-14T23:55:59Z</field>
      </doc>
    </add>

--
View this message in context: http://lucene.472066.n3.nabble.com/Getting-error-while-indexing-XML-files-on-Hadoop-tp4179168.html
Sent from the Solr - User mailing list archive at Nabble.com.
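One thing that can be ruled out before rerunning the job is a mismatch between the field names used in the add file and the fields declared in the schema. The following is only a sketch of such a check, not a diagnosis of the "Didn't ingest any documents" error: the function name undeclared_fields, the sample document and the declared set are made up for illustration, and fnmatch is used as an approximation of Solr's dynamicField patterns (Solr only supports a leading or trailing `*`).

```python
import fnmatch
import xml.etree.ElementTree as ET


def undeclared_fields(add_xml, schema_fields, dynamic_patterns=()):
    """Return the field names used in a Solr <add> document that are
    neither declared explicitly nor matched by a dynamicField pattern."""
    root = ET.fromstring(add_xml)
    used = {field.get("name") for field in root.iter("field")}
    missing = set()
    for name in used:
        if name in schema_fields:
            continue
        # Approximation: treat dynamicField names such as "*_dt" as globs.
        if any(fnmatch.fnmatchcase(name, pat) for pat in dynamic_patterns):
            continue
        missing.add(name)
    return sorted(missing)


# Hypothetical, shortened sample in the same shape as ipod_other.xml.
SAMPLE = """<add><doc>
  <field name="id">IW-02</field>
  <field name="manu">Belkin</field>
  <field name="manufacturedate_dt">2006-02-14T23:55:59Z</field>
</doc></add>"""

declared = {"id", "name", "manu"}
print(undeclared_fields(SAMPLE, declared, dynamic_patterns=["*_dt"]))
```

If the list comes back empty, the schema additions cover the file, and the cause is more likely in the job itself (input path, sequence file contents, or mapper configuration).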
Re: [Indexing XML files in Solr with DataImportHandler]
It is not indexing; it says that no files were indexed.

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-XML-files-in-Solr-with-DataImportHandler-tp4095628p4095811.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: [Indexing XML files in Solr with DataImportHandler]
On 16 October 2013 13:06, kujta1 kujtim.rahm...@gmail.com wrote:

it is not indexing, it is saying there are no files indexed

If you expect answers on the mailing list, it might be best to provide details here. From a quick glance at Stackoverflow, it looks like you need a FileListEntityProcessor. Searching Google turns up many examples of using a FileDataSource, e.g., see: http://java.dzone.com/news/data-import-handler-%E2%80%93-import

Regards,
Gora
[Indexing XML files in Solr with DataImportHandler]
Hello, I have problems with indexing XML files. My solrconfig, data-config and Solr files are here: http://stackoverflow.com/questions/19337979/indexing-xml-files-in-solr-with-dataimporthandler

Can somebody help me understand why this is not working? Thank you.

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-XML-files-in-Solr-with-DataImportHandler-tp4095628.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: [Indexing XML files in Solr with DataImportHandler]
What is not working? Are you seeing any exceptions in the logs?

On Tue, Oct 15, 2013 at 3:53 PM, kujta1 kujtim.rahm...@gmail.com wrote:

Hello, I have problems with indexing XML files. My solrconfig, data-config and Solr files are here: http://stackoverflow.com/questions/19337979/indexing-xml-files-in-solr-with-dataimporthandler

Can somebody help me understand why this is not working? Thank you.

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-XML-files-in-Solr-with-DataImportHandler-tp4095628.html
Sent from the Solr - User mailing list archive at Nabble.com.

--
Regards,
Shalin Shekhar Mangar.
RE: full-text indexing XML files
CDATA didn't work either. It still complained about the input doc not being in the correct format.

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, December 10, 2009 7:43 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

Or CDATA (much easier to work with).

On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud fero...@zillow.com wrote:

Hi! I am trying to full text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including Tags and everything. So, I created an input index spec like this:

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="name">NASA Advanced Research Labs</field>
        <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
        <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
      </doc>
    </add>

You need to XML encode the value of the content field.

--
Regards,
Shalin Shekhar Mangar.

--
Lance Norskog
goks...@gmail.com
RE: full-text indexing XML files
Yeah, XML tags as well. Essentially we want to full-text index the file, without the need for stemming the tokens. Will the Solr analyzer be able to tokenize the document correctly if it does not have any whitespace (besides that required by XML syntax)?

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday, December 10, 2009 8:00 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

What kind of searches do you want to do? Do you want to do searches that match the XML tags?

wunder

On Dec 10, 2009, at 7:43 PM, Lance Norskog wrote:

Or CDATA (much easier to work with).

On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud fero...@zillow.com wrote:

Hi! I am trying to full text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including Tags and everything. So, I created an input index spec like this:

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="name">NASA Advanced Research Labs</field>
        <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
        <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
      </doc>
    </add>

You need to XML encode the value of the content field.

--
Regards,
Shalin Shekhar Mangar.

--
Lance Norskog
goks...@gmail.com
Re: full-text indexing XML files
If you really want to do XML-sensitive search, it could be a lot of work in Solr. Lucene has a flat data model, so hierarchy requires a lot of mapping to the schema or fancy, slow queries.

There are engines that are designed for XML indexing and search, using XQuery, so consider whether that might be less work overall. XML engines are less mature than Lucene and Solr, so there is a big performance and scalability gap between the best free engines (eXist) and the best commercial engines (Mark Logic, where I work).

wunder
Walter Underwood
Lead Engineer, Mark Logic

On Dec 11, 2009, at 9:42 AM, Feroze Daud wrote:

Yeah, XML tags as well. Essentially we want to full-text index the file, without the need for stemming the tokens. Will the Solr analyzer be able to tokenize the document correctly if it does not have any whitespace (besides that required by XML syntax)?

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday, December 10, 2009 8:00 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

What kind of searches do you want to do? Do you want to do searches that match the XML tags?

wunder

On Dec 10, 2009, at 7:43 PM, Lance Norskog wrote:

Or CDATA (much easier to work with).

On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud fero...@zillow.com wrote:

Hi! I am trying to full text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including Tags and everything. So, I created an input index spec like this:

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="name">NASA Advanced Research Labs</field>
        <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
        <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
      </doc>
    </add>

You need to XML encode the value of the content field.

--
Regards,
Shalin Shekhar Mangar.

--
Lance Norskog
goks...@gmail.com
Re: full-text indexing XML files
Please post a small sample file that has this problem with CDATA.

On Fri, Dec 11, 2009 at 9:41 AM, Feroze Daud fero...@zillow.com wrote:

CDATA didn't work either. It still complained about the input doc not being in the correct format.

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, December 10, 2009 7:43 PM
To: solr-user@lucene.apache.org
Subject: Re: full-text indexing XML files

Or CDATA (much easier to work with).

On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud fero...@zillow.com wrote:

Hi! I am trying to full text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including Tags and everything. So, I created an input index spec like this:

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="name">NASA Advanced Research Labs</field>
        <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
        <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
      </doc>
    </add>

You need to XML encode the value of the content field.

--
Regards,
Shalin Shekhar Mangar.

--
Lance Norskog
goks...@gmail.com

--
Lance Norskog
goks...@gmail.com
Re: full-text indexing XML files
Or CDATA (much easier to work with).

On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud fero...@zillow.com wrote:

Hi! I am trying to full text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including Tags and everything. So, I created an input index spec like this:

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="name">NASA Advanced Research Labs</field>
        <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
        <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
      </doc>
    </add>

You need to XML encode the value of the content field.

--
Regards,
Shalin Shekhar Mangar.

--
Lance Norskog
goks...@gmail.com
Re: full-text indexing XML files
What kind of searches do you want to do? Do you want to do searches that match the XML tags?

wunder

On Dec 10, 2009, at 7:43 PM, Lance Norskog wrote:

Or CDATA (much easier to work with).

On Wed, Dec 9, 2009 at 10:37 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud fero...@zillow.com wrote:

Hi! I am trying to full text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including Tags and everything. So, I created an input index spec like this:

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="name">NASA Advanced Research Labs</field>
        <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
        <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
      </doc>
    </add>

You need to XML encode the value of the content field.

--
Regards,
Shalin Shekhar Mangar.

--
Lance Norskog
goks...@gmail.com
full-text indexing XML files
Hi!

I am trying to full text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including Tags and everything. So, I created an input index spec like this:

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="name">NASA Advanced Research Labs</field>
        <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
        <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
      </doc>
    </add>

When I try to pump this into Solr with java -jar post.jar, I get an exception saying:

    SimplePostTool: version 1.2
    SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
    SimplePostTool: POSTing files to http://localhost:8983/solr/update..
    SimplePostTool: POSTing file index.doc
    SimplePostTool: FATAL: Solr returned an error: unexpected_XML_tag_doclisting

Any idea what I am doing wrong? Does the Solr index generator support inner XML content in its field tags? I tried enclosing the inner XML in <![CDATA[ ]]> but that didn't work either.

Any help appreciated.

Thanks
Feroze.
Re: full-text indexing XML files
On Thu, Dec 10, 2009 at 5:13 AM, Feroze Daud fero...@zillow.com wrote:

Hi! I am trying to full text index an XML file. For various reasons, I cannot use Tika or other technology to parse the XML file. The requirement is to full-text index the XML file, including Tags and everything. So, I created an input index spec like this:

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="name">NASA Advanced Research Labs</field>
        <field name="address">1010 Main Street, Chattanooga, FL 32212</field>
        <field name="content"><listing><id>1001</id><name> NASA Advanced Research Labs </name><address>1010 main street, chattanooga, FL 32212</address></listing></field>
      </doc>
    </add>

You need to XML encode the value of the content field.

--
Regards,
Shalin Shekhar Mangar.
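The advice to XML encode the content value can be sketched as follows. This is a minimal illustration in Python using the standard library's xml.sax.saxutils.escape; the helper name content_field and the shortened listing string are made up for the example, and the thread itself does not say which tool was used to build the input file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def content_field(raw_xml):
    """Wrap raw XML text in a Solr <field> element, escaping &, < and >
    so the update handler sees character data instead of nested tags."""
    return '<field name="content">{}</field>'.format(escape(raw_xml))


# Hypothetical, shortened version of the listing from the post.
listing = "<listing><id>1001</id><name>NASA Advanced Research Labs</name></listing>"
encoded = content_field(listing)
print(encoded)

# Parsing the field back yields the original markup as plain text.
print(ET.fromstring(encoded).text == listing)
```

A CDATA section is the other option mentioned in the thread; it needs care only if the wrapped text can itself contain the sequence `]]>`.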
Re: Error when indexing XML files
Hi,

Please find the schema file attached. Please let me know what I am doing wrong.

Regards
Chaitali

--- On Wed, 10/14/09, Fergus McMenemie fer...@twig.me.uk wrote:

From: Fergus McMenemie fer...@twig.me.uk
Subject: Re: Error when indexing XML files
To: solr-user@lucene.apache.org
Date: Wednesday, October 14, 2009, 2:25 AM

Hi, I am trying to index XML files using SolrJ. The original XML file contains nested elements. For example, the following is the snippet of the XML file.

    <entry>
      <name>SOMETHING </name>
      <facility>SOME_OTHER_THING</facility>
    </entry>

I have added the elements name and facility in Schema.xml file to make these elements indexable. I have changed the XML document above to look like -

    <add>
      <doc>
        ..
        <field name="name">SOMETHING</field>
        ..
      </doc>
    </add>

Can you send us the Schema.xml file you created? I suspect that one of the fields should be multivalued.

    <field name="facility" type="string" indexed="true" stored="true"/>
    <field name="name" type="text" indexed="true" stored="true"/>

One or other, perhaps both, of your fields need to be

    <field name="facility" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="name" type="text" indexed="true" stored="true" multiValued="true"/>

--
Fergus.
Re: Error when indexing XML files
Hi, I am trying to index XML files using SolrJ. The original XML file contains nested elements. For example, the following is the snippet of the XML file.

    <entry>
      <name>SOMETHING </name>
      <facility>SOME_OTHER_THING</facility>
    </entry>

I have added the elements name and facility in Schema.xml file to make these elements indexable. I have changed the XML document above to look like -

    <add>
      <doc>
        ..
        <field name="name">SOMETHING</field>
        ..
      </doc>
    </add>

Can you send us the Schema.xml file you created? I suspect that one of the fields should be multivalued.

--
Fergus.
Re: Error when indexing XML files
Hi,

Please find the schema file attached. Please let me know what I am doing wrong.

Regards
Chaitali

--- On Wed, 10/14/09, Fergus McMenemie fer...@twig.me.uk wrote:

From: Fergus McMenemie fer...@twig.me.uk
Subject: Re: Error when indexing XML files
To: solr-user@lucene.apache.org
Date: Wednesday, October 14, 2009, 2:25 AM

Hi, I am trying to index XML files using SolrJ. The original XML file contains nested elements. For example, the following is the snippet of the XML file.

    <entry>
      <name>SOMETHING </name>
      <facility>SOME_OTHER_THING</facility>
    </entry>

I have added the elements name and facility in Schema.xml file to make these elements indexable. I have changed the XML document above to look like -

    <add>
      <doc>
        ..
        <field name="name">SOMETHING</field>
        ..
      </doc>
    </add>

Can you send us the Schema.xml file you created? I suspect that one of the fields should be multivalued.

--
Fergus.

    <?xml version="1.0" encoding="UTF-8" ?>
    <!--
     Licensed to the Apache Software Foundation (ASF) under one or more
     contributor license agreements. See the NOTICE file distributed with
     this work for additional information regarding copyright ownership.
     The ASF licenses this file to You under the Apache License, Version 2.0
     (the "License"); you may not use this file except in compliance with
     the License. You may obtain a copy of the License at

         http://www.apache.org/licenses/LICENSE-2.0

     Unless required by applicable law or agreed to in writing, software
     distributed under the License is distributed on an "AS IS" BASIS,
     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     See the License for the specific language governing permissions and
     limitations under the License.
    -->

    <!--
     This is the Solr schema file. This file should be named "schema.xml" and
     should be in the conf directory under the solr home
     (i.e. ./solr/conf/schema.xml by default)
     or located where the classloader for the Solr webapp can find it.

     This example schema is the recommended starting point for users.
     It should be kept correct and concise, usable out-of-the-box.

     For more information, on how to customize this file, please see
     http://wiki.apache.org/solr/SchemaXml
    -->

    <schema name="example" version="1.1">
      <!-- attribute "name" is the name of this schema and is only used for display purposes.
           Applications should change this to reflect the nature of the search collection.
           version="1.1" is Solr's version number for the schema syntax and semantics. It should
           not normally be changed by applications.
           1.0: multiValued attribute did not exist, all fields are multiValued by nature
           1.1: multiValued attribute introduced, false by default -->

      <types>
        <!-- field type definitions. The "name" attribute is
           just a label to be used by field definitions. The "class"
           attribute and any other attributes determine the real
           behavior of the fieldType.
             Class names starting with "solr" refer to java classes in the
           org.apache.solr.analysis package.
        -->

        <!-- The StrField type is not analyzed, but indexed/stored verbatim.
           - StrField and TextField support an optional compressThreshold which
           limits compression (if enabled in the derived fields) to values which
           exceed a certain size (in characters).
        -->
        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

        <!-- boolean type: "true" or "false" -->
        <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

        <!-- The optional sortMissingLast and sortMissingFirst attributes are
             currently supported on types that are sorted internally as strings.
           - If sortMissingLast="true", then a sort on this field will cause documents
             without the field to come after documents with the field,
             regardless of the requested sort order (asc or desc).
           - If sortMissingFirst="true", then a sort on this field will cause documents
             without the field to come before documents with the field,
             regardless of the requested sort order.
           - If sortMissingLast="false" and sortMissingFirst="false" (the default),
             then default lucene sorting will be used which places docs without the
             field first in an ascending sort and last in a descending sort.
        -->

        <!-- numeric field types that store and index the text
             value verbatim (and hence don't support range queries, since the
             lexicographic ordering isn't equal to the numeric ordering) -->
        <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
        <fieldType name="long" class="solr.LongField" omitNorms="true"/>
        <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
        <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>

        <!-- Numeric field types that manipulate the value into
             a string value that isn't human-readable in its internal form,
             but with a lexicographic ordering the same as the numeric
Error when indexing XML files
Hi,

I am trying to index XML files using SolrJ. The original XML file contains nested elements. For example, the following is a snippet of the XML file.

    <entry>
      <name>SOMETHING </name>
      <facility>SOME_OTHER_THING</facility>
    </entry>

I have added the elements name and facility to the Schema.xml file to make these elements indexable. I have changed the XML document above to look like this:

    <add>
      <doc>
        ..
        <field name="name">SOMETHING</field>
        ..
      </doc>
    </add>

I am getting the following error when I start Jetty:

    org.apache.solr.common.SolrException: ERROR_5457843_multiple_values_encountered_for_non_multiValued_field_facility___tracklesstrackless_

Can anyone please let me know if there is something I am doing wrong? How can I maintain the parent-child relationship of the original XML file in the modified XML file? Can I not use the original XML file as it is for indexing purposes?

Thanks in advance.
- Chaitali
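The conversion described in this post, and the reason the multiValued error can appear, can be sketched like this. It is an illustration only, written in Python rather than SolrJ: the function name entry_to_add_doc and the sample entry are assumptions, and it only handles the direct children of a single entry element. An element name that occurs more than once becomes a repeated field, which Solr rejects unless that field is declared multiValued="true".

```python
import xml.etree.ElementTree as ET
from collections import Counter


def entry_to_add_doc(entry_xml):
    """Flatten the direct children of one <entry> into a Solr <add><doc>.

    Returns the add document as a string plus the names of fields that
    occur more than once (candidates for multiValued="true")."""
    entry = ET.fromstring(entry_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    counts = Counter()
    for child in entry:
        field = ET.SubElement(doc, "field", name=child.tag)
        field.text = (child.text or "").strip()
        counts[child.tag] += 1
    repeated = sorted(tag for tag, n in counts.items() if n > 1)
    return ET.tostring(add, encoding="unicode"), repeated


# Hypothetical entry with a repeated <facility> element.
sample = (
    "<entry><name>SOMETHING </name>"
    "<facility>trackless</facility><facility>trackless</facility></entry>"
)
add_xml, repeated = entry_to_add_doc(sample)
print(add_xml)
print(repeated)
```

Flattening like this discards the parent-child structure; keeping it would need a naming convention for fields or a separate document per child, which this sketch does not attempt.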
Re: Question on modifying solr behavior on indexing xml files..
On Thu, Oct 1, 2009 at 3:10 PM, Thung, Peter C CIV SPAWARSYSCEN-PACIFIC, 56340 peter.th...@navy.mil wrote:

1. In my playing around with sending in an XML document within an XML CDATA tag, with termVectors="true", I noticed the following behavior: <person>peter</person> collapses to the term personpeterperson instead of person and peter separately. I realize I could try to search and replace characters like <, > and = with a space so that the default parser/indexer can preserve element names. However, I'm wondering if someone could point me to where one might do this within the Solr or Apache Lucene code as a proper plug-in, with maybe an example that I could use as a template. Also, where in the solrconfig.xml file would I want to change to reference the new parser?

Solr is agnostic of the content in a schema field. It does not know that it is XML and hence it will do blind tokenization/filtering as defined for the field type in schema.xml.

If all you want is to do a full-text search on words found somewhere in that XML, then your approach of replacing <, > and = with a space will work fine. You can use the PatternReplaceFilter and specify a regex which matches these special characters and replaces them with a space.

    <filter class="solr.PatternReplaceFilterFactory" pattern="([&lt;&gt;=])" replacement=" " replace="all"/>

Or you can use the MappingCharFilter (Solr 1.4 feature) and specify a mapping file which has these special characters mapped to a space.

    <charFilter class="solr.MappingCharFilterFactory" mapping="special-xml-symbols.txt"/>

The file should be in the format:

    "characterToBeReplaced" => "replacementChar"

However, if you want to preserve the structure of the XML document, it is best to parse it out yourself and put the contents into Solr fields before sending it to Solr. You may also want to look at DataImportHandler and XPathEntityProcessor, which is commonly used for importing XML files.

http://wiki.apache.org/solr/DataImportHandler

2. My other question would also be if this technique would work for XML type messages embedded in Microsoft Excel or PowerPoint presentations, where I would like to preserve knowing XML element term frequencies and would try to leverage the component that automatically indexes Microsoft documents. Would I need to modify that component and customize it?

Perhaps somebody who knows about Solr Cell can answer this, but I think it should work.

--
Regards,
Shalin Shekhar Mangar.
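The effect of replacing markup characters with a space before tokenizing can be illustrated outside Solr. The sketch below is plain Python, not Solr's analysis chain: the character class `[<>/=]` (including the slash of closing tags) and the whitespace split are assumptions chosen to mimic a character replacement followed by a whitespace tokenizer.

```python
import re

# Assumed set of markup characters to blank out; "/" is included so that
# closing tags such as </person> yield the bare element name.
SPECIAL = re.compile(r"[<>/=]")


def tokens_after_replace(text):
    """Replace markup characters with spaces, then split on whitespace."""
    return SPECIAL.sub(" ", text).split()


print(tokens_after_replace("<person>peter</person>"))
# Without the replacement, a whitespace split keeps the markup glued together.
print("<person>peter</person>".split())
```

With the replacement, the element name and the value come out as separate tokens, so term frequencies for element names remain available; the same idea applies whether it is done by a char filter in schema.xml or by preprocessing before the document is sent.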
Question on modifying solr behavior on indexing xml files..
1. In my playing around with sending in an XML document within an XML CDATA tag, with termVectors="true", I noticed the following behavior: <person>peter</person> collapses to the term personpeterperson instead of person and peter separately. I realize I could try to search and replace characters like <, > and = with a space so that the default parser/indexer can preserve element names. However, I'm wondering if someone could point me to where one might do this within the Solr or Apache Lucene code as a proper plug-in, with maybe an example that I could use as a template. Also, where in the solrconfig.xml file would I want to change to reference the new parser?

2. My other question would also be if this technique would work for XML type messages embedded in Microsoft Excel or PowerPoint presentations, where I would like to preserve knowing XML element term frequencies and would try to leverage the component that automatically indexes Microsoft documents. Would I need to modify that component and customize it?

-Peter
Re: query regarding Indexing xml files -db-data-config.xml
Hi Noble,

Thanks for the reply. As advised, I have changed db-data-config.xml as below, but I still get:

    <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>

    <dataConfig>
      <dataSource type="FileDataSource" name="xmlindex"/>
      <document name="products">
        <entity name="xmlfile" processor="FileListEntityProcessor" fileName="c:\\test\\ipod_other.xml" recursive="true" rootEntity="false" dataSource="null" baseDir="${dataimporter.request.xmlDataDir}" useSolrAddSchema="true">
          <entity name="data" processor="XPathEntityProcessor" url="${xmlfile.fileAbsolutePath}">
            <field column="manu" name="manu"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

I got the error below when baseDir is removed:

    INFO: last commit = 1242683454570
    May 18, 2009 2:55:15 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
    SEVERE: Full Import failed
    org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is a required attribute Processing Document # 1
        at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:76)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:299)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:324)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:382)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:363)
    May 18, 2009 2:55:15 PM org.apache.solr.update.DirectUpdateHandler2 rollback
    INFO: start rollback

Please advise.

Thanks and regards,
Jay

2009/5/17 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com

Hi, you may not need that enclosing entity if you only wish to index one file. baseDir is not required if you give an absolute path in fileName. There is no need to mention forEach or fields if you set useSolrAddSchema="true".

On Sat, May 16, 2009 at 1:23 AM, jayakeerthi s mail2keer...@gmail.com wrote:

Hi All,

I am trying to index the fields from the XML files; here is the configuration that I am using.

db-data-config.xml

    <dataConfig>
      <dataSource type="FileDataSource" name="xmlindex"/>
      <document name="products">
        <entity name="xmlfile" processor="FileListEntityProcessor" fileName="c:\test\ipod_other.xml" recursive="true" rootEntity="false" dataSource="null" baseDir="${dataimporter.request.xmlDataDir}">
          <entity name="data" processor="XPathEntityProcessor" forEach="/record | /the/record/xpath" url="${xmlfile.fileAbsolutePath}">
            <field column="manu" name="manu"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

Schema.xml has the field manu.

The input XML file used to import the field is

    <doc>
      <field name="id">F8V7067-APL-KIT</field>
      <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
      <field name="manu">Belkin</field>
      <field name="cat">electronics</field>
      <field name="cat">connector</field>
      <field name="features">car power adapter, white</field>
      <field name="weight">4</field>
      <field name="price">19.95</field>
      <field name="popularity">1</field>
      <field name="inStock">false</field>
    </doc>

Doing the full-import, this is the response I am getting:

    - <lst name="statusMessages">
        <str name="Total Requests made to DataSource">0</str>
        <str name="Total Rows Fetched">0</str>
        <str name="Total Documents Skipped">0</str>
        <str name="Full Dump Started">2009-05-15 11:58:00</str>
        <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
        <str name="Committed">2009-05-15 11:58:00</str>
        <str name="Optimized">2009-05-15 11:58:00</str>
        <str name="Time taken">0:0:0.172</str>
      </lst>
      <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
      </response>

Am I missing anything here, or is there a required format for the input XML? Please help resolve this.

Thanks and regards,
Jay

--
Noble Paul | Principal Engineer | AOL | http://aol.com
Re: query regarding Indexing xml files -db-data-config.xml
Hi, you may not need that enclosing entity if you only wish to index one file. baseDir is not required if you give an absolute path in fileName. There is no need to mention forEach or fields if you set useSolrAddSchema="true".

On Sat, May 16, 2009 at 1:23 AM, jayakeerthi s mail2keer...@gmail.com wrote:

Hi All,

I am trying to index the fields from the XML files; here is the configuration that I am using.

db-data-config.xml

    <dataConfig>
      <dataSource type="FileDataSource" name="xmlindex"/>
      <document name="products">
        <entity name="xmlfile" processor="FileListEntityProcessor" fileName="c:\test\ipod_other.xml" recursive="true" rootEntity="false" dataSource="null" baseDir="${dataimporter.request.xmlDataDir}">
          <entity name="data" processor="XPathEntityProcessor" forEach="/record | /the/record/xpath" url="${xmlfile.fileAbsolutePath}">
            <field column="manu" name="manu"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

Schema.xml has the field manu.

The input XML file used to import the field is

    <doc>
      <field name="id">F8V7067-APL-KIT</field>
      <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
      <field name="manu">Belkin</field>
      <field name="cat">electronics</field>
      <field name="cat">connector</field>
      <field name="features">car power adapter, white</field>
      <field name="weight">4</field>
      <field name="price">19.95</field>
      <field name="popularity">1</field>
      <field name="inStock">false</field>
    </doc>

Doing the full-import, this is the response I am getting:

    - <lst name="statusMessages">
        <str name="Total Requests made to DataSource">0</str>
        <str name="Total Rows Fetched">0</str>
        <str name="Total Documents Skipped">0</str>
        <str name="Full Dump Started">2009-05-15 11:58:00</str>
        <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
        <str name="Committed">2009-05-15 11:58:00</str>
        <str name="Optimized">2009-05-15 11:58:00</str>
        <str name="Time taken">0:0:0.172</str>
      </lst>
      <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
      </response>

Am I missing anything here, or is there a required format for the input XML? Please help resolve this.

Thanks and regards,
Jay

--
Noble Paul | Principal Engineer | AOL | http://aol.com
Re: query regarding Indexing xml files -db-data-config.xml
Hmmm, I thought that if you were using the XPathEntityProcessor you had to specify an xpath for each of the fields you want to populate. Unless you are using XPathEntityProcessor's useSolrAddSchema mode?

Fergus.

> If that is your complete input file then it looks like you are missing the wrapping <add></add> element: [...]
> Is it possible you just forgot to include the <add>?
>
> -Jay
>
> On Fri, May 15, 2009 at 12:53 PM, jayakeerthi s mail2keer...@gmail.com wrote:
> > Hi All, I am trying to index the fields from the XML files, here is the configuration that I am using. [...]

-- 
Fergus McMenemie      Email: fer...@twig.me.uk
Techmore Ltd          Phone: (UK) 07721 376021
Unix/Mac/Intranets    Analyst Programmer
query regarding Indexing xml files -db-data-config.xml
Hi All,

I am trying to index the fields from the XML files; here is the configuration that I am using.

db-data-config.xml:

    <dataConfig>
      <dataSource type="FileDataSource" name="xmlindex"/>
      <document name="products">
        <entity name="xmlfile" processor="FileListEntityProcessor"
                fileName="c:\test\ipod_other.xml" recursive="true"
                rootEntity="false" dataSource="null"
                baseDir="${dataimporter.request.xmlDataDir}">
          <entity name="data" processor="XPathEntityProcessor"
                  forEach="/record | /the/record/xpath"
                  url="${xmlfile.fileAbsolutePath}">
            <field column="manu" name="manu"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

schema.xml has the field manu.

The input XML file used to import the field is:

    <doc>
      <field name="id">F8V7067-APL-KIT</field>
      <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
      <field name="manu">Belkin</field>
      <field name="cat">electronics</field>
      <field name="cat">connector</field>
      <field name="features">car power adapter, white</field>
      <field name="weight">4</field>
      <field name="price">19.95</field>
      <field name="popularity">1</field>
      <field name="inStock">false</field>
    </doc>

When doing the full-import, this is the response I am getting:

    <lst name="statusMessages">
      <str name="Total Requests made to DataSource">0</str>
      <str name="Total Rows Fetched">0</str>
      <str name="Total Documents Skipped">0</str>
      <str name="Full Dump Started">2009-05-15 11:58:00</str>
      <str name="">Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.</str>
      <str name="Committed">2009-05-15 11:58:00</str>
      <str name="Optimized">2009-05-15 11:58:00</str>
      <str name="Time taken">0:0:0.172</str>
    </lst>
    <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
    </response>

Am I missing anything here, or is there a required format for the input XML? Please help me resolve this.

Thanks and regards,
Jay
Re: query regarding Indexing xml files -db-data-config.xml
If that is your complete input file then it looks like you are missing the wrapping <add></add> element:

    <add>
      <doc>
        <field name="id">F8V7067-APL-KIT</field>
        <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
        <field name="manu">Belkin</field>
        <field name="cat">electronics</field>
        <field name="cat">connector</field>
        <field name="features">car power adapter, white</field>
        <field name="weight">4</field>
        <field name="price">19.95</field>
        <field name="popularity">1</field>
        <field name="inStock">false</field>
      </doc>
    </add>

Is it possible you just forgot to include the <add>?

-Jay

On Fri, May 15, 2009 at 12:53 PM, jayakeerthi s mail2keer...@gmail.com wrote:
> Hi All, I am trying to index the fields from the XML files, here is the configuration that I am using. [...]
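The check Jay describes (is the file wrapped in a single <add> root?) can be done mechanically before running an import. What follows is only a rough sketch using the JDK's DOM parser; the class and method names are made up for illustration and are not part of Solr or the DataImportHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper (not a Solr class): verifies that an update message has
// <add> as its root element and reports how many <doc> elements it contains.
class AddMessageCheck {

    static int countDocs(String xml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Not well-formed at all, e.g. several <doc> roots with no wrapper.
            throw new IllegalStateException("input is not well-formed XML", e);
        }
        String root = dom.getDocumentElement().getTagName();
        if (!"add".equals(root)) {
            throw new IllegalArgumentException(
                    "root element is <" + root + ">, expected <add>");
        }
        return dom.getElementsByTagName("doc").getLength();
    }

    public static void main(String[] args) {
        String wrapped =
                "<add>"
              + "<doc><field name=\"id\">F8V7067-APL-KIT</field></doc>"
              + "<doc><field name=\"id\">IW-02</field></doc>"
              + "</add>";
        System.out.println(countDocs(wrapped)); // prints 2
    }
}
```

A file that starts directly with <doc> is rejected by the root check, which is the situation suspected above.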
Re: query regarding Indexing xml files -db-data-config.xml
Many thanks for the reply. The complete input XML file is below; I missed including it earlier.

    <add>
      <doc>
        <field name="id">F8V7067-APL-KIT</field>
        <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
        <field name="manu">Belkin</field>
        <field name="cat">electronics</field>
        <field name="cat">connector</field>
        <field name="features">car power adapter, white</field>
        <field name="weight">4</field>
        <field name="price">19.95</field>
        <field name="popularity">1</field>
        <field name="inStock">false</field>
      </doc>
      <doc>
        <field name="id">IW-02</field>
        <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
        <field name="manu">Belkin</field>
        <field name="cat">electronics</field>
        <field name="cat">connector</field>
        <field name="features">car power adapter for iPod, white</field>
        <field name="weight">2</field>
        <field name="price">11.50</field>
        <field name="popularity">1</field>
        <field name="inStock">false</field>
      </doc>
    </add>

regards,
Jay

On Fri, May 15, 2009 at 1:14 PM, Jay Hill jayallenh...@gmail.com wrote:
> If that is your complete input file then it looks like you are missing the wrapping <add></add> element: [...]
> Is it possible you just forgot to include the <add>?
>
> On Fri, May 15, 2009 at 12:53 PM, jayakeerthi s mail2keer...@gmail.com wrote:
> > Hi All, I am trying to index the fields from the XML files, here is the configuration that I am using. [...]
Re: Indexing XML files
Thank you all for the quick responses. They were very helpful. My XML is well-formed, so I ended up implementing my own FieldType:

    public class XMLField extends TextField {
        // Write the stored value inside an <xml> element; the final
        // argument (false) asks the writer not to escape the value.
        public void write(XMLWriter xmlWriter, String name, Fieldable f) throws IOException {
            xmlWriter.writePrim("xml", name, f.stringValue(), false);
        }
    }

I looked at the XSD and there is one thing I don't understand:

If the desired way is to conform to the XSD (and hence the types used in the XSD), then how would it be possible to use user-defined field types as plugins? Wouldn't they violate the same principle?

thanks,
mirko

Quoting Chris Hostetter [EMAIL PROTECTED]:
> ... I think Walter's got the right idea ... as a general rule, we want to make the XmlResponseWriter bulletproof so that no matter what data you put into your index, it is guaranteed to produce a well-formed XML document that conforms to a specified DTD, or XSD (see SOLR-17 for one we already have but we haven't figured out what to do with yet) ...
>
> if you're interested in writing a bit of custom Java code you could in fact write a new FieldType (which could easily subclass TextField) with a custom write method that just outputs the raw value directly, and then load your field type as a plugin...
>
> http://wiki.apache.org/solr/SolrPlugins
>
> -Hoss
Re: Indexing XML files
: I looked at the XSD and there is one thing I don't understand:
:
: If the desired way is to conform to the XSD (and hence the types used in XSD),
: then how would it possible to use user-defined fieldtypes as plugins? Wouldn't
: they violate the same principle?

The XSD is intended to match the behavior of the XmlResponseWriter and the core Solr code base ... if you write a new ResponseWriter (or use one of the other built-in ResponseWriters like JSON or Ruby) then all bets are off.

If you are writing a new FieldType, then you might still be able to use the XSD as is if your data can easily be represented using one of the "primitive" types (i.e. I might add a new LonLatFieldType class for efficiently storing/searching geographic coordinates, but when writing as XML the syntax <str>+37.774395-122.422156</str> might work fine).

In a case like yours, where you genuinely need to extend the list of valid tags, XML Schema has a mechanism for that by letting you define your own XSD which can reuse the elements defined in the main XSD (the same way DTDs can reuse elements from other DTDs).

All of this is a somewhat theoretical issue, since Solr doesn't currently do anything with that XSD ... I assume if/when it does, it will be voluntary (i.e. there might be a config option to have it include an XSD of your choice in the XML header of the responses so you can validate if you choose to).

-Hoss
Re: Indexing XML files
couldn't you use a CDATA section?

Chris Hostetter wrote:
> Since XML is the transport for sending data to Solr, you need to make sure all field values are XML escaped. If you wanted to index a plain text title and that title contained an ampersand character:
>
>     Sense & Sensability
>
> ...you would need to XML escape that as...
>
>     Sense &amp; Sensability
>
> ...Solr internally will treat that consistently as the Java string "Sense & Sensability" and when it comes time to return that string back to your query clients, will output it in whatever form is appropriate for your ResponseWriter -- if that's XML, then it will be XML escaped again, if it's JSON or something like it, it can probably be left alone.
>
> The same holds true for any other characters you want to include in your field values: Solr doesn't care that the *value* itself is an XML string, just that you properly escape the value in your XML <add><doc> message to Solr...
>
>     <add>
>       <doc>
>         <field name="title">As You Like it</field>
>         <field name="author">Shakespeare, William</field>
>         <field name="record">&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;</field>
>       </doc>
>     </add>
>
> ...does that make sense?
>
> : Ideally, I would like to store the xml as is, and index only the content
> : removing the xml-tags (I believe there is HTMLStripWhitespaceAnalyzer for
> : that).
> : And output the result as an xml (so, simple escaping does not work for me).
>
> The escaping is just to send the data to Solr -- once sent, Solr will process the unescaped string when dealing with analyzers, etc. exactly as you'd expect.
>
> -Hoss
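The escaping step described in the quoted message can be sketched in a few lines of plain Java. This is an illustration only: the class name and both helper methods are invented here, and a real client library would normally do this escaping for you.

```java
// Hypothetical helper (not part of Solr): escapes a field value so it can be
// placed inside a <field> element of an <add><doc> update message.
class XmlEscapeSketch {

    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds one <field> element with both the name and the value escaped.
    static String field(String name, String value) {
        return "<field name=\"" + escapeXml(name) + "\">" + escapeXml(value) + "</field>";
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("Sense & Sensability"));
        // prints: Sense &amp; Sensability
        System.out.println(field("record", "<myxml>here goes the xml...</myxml>"));
        // prints: <field name="record">&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;</field>
    }
}
```

The escaped form exists only on the wire; once parsed, the value is the original string again.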
Re: Indexing XML files
On 12/6/06, Graham O'Regan [EMAIL PROTECTED] wrote:
> couldn't you use a CDATA section?

That's just another form of escaping. Mirko actually wants the XML field value to be part of the XML of Solr's response, not encapsulated by it.

-Yonik
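To illustrate why CDATA is "just another form of escaping": after parsing, an entity-escaped value and a CDATA-wrapped value yield the same text. The small sketch below uses the JDK's DOM parser; the class name and sample strings are illustrative assumptions, not anything from Solr.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative only: compares the parsed text of an entity-escaped field
// with the parsed text of the same field written as a CDATA section.
class CdataVsEntities {

    // Returns the text content of the root element of the given XML string.
    static String textOf(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return dom.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<field name=\"record\">&lt;myxml&gt;text&lt;/myxml&gt;</field>";
        String cdata   = "<field name=\"record\"><![CDATA[<myxml>text</myxml>]]></field>";
        System.out.println(textOf(escaped));                       // prints <myxml>text</myxml>
        System.out.println(textOf(escaped).equals(textOf(cdata))); // prints true
    }
}
```

Either way the receiver sees a string, not nested elements, which is why CDATA does not give Mirko what he is after.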
Indexing XML files
Hi,

I am trying to index an XML file as a field in Lucene, see the example below:

    <add>
      <doc>
        <field name="title">As You Like it</field>
        <field name="author">Shakespeare, William</field>
        <field name="record"><myxml>here goes the xml...</myxml></field>
      </doc>
    </add>

I can index the title and author fields because they are strings, but the record field is XML itself, and I run into problems because I cannot directly input an XML file using the post.sh script (Solr complains).

I wonder what would be the correct (and relatively simple) way of doing it. Ideally, I would like to store the XML as is, and index only the content with the XML tags removed (I believe there is HTMLStripWhitespaceAnalyzer for that), and output the result as XML (so simple escaping does not work for me).

So far, I had the idea of escaping the XML record and then unescaping it for inner storage and using the analyzer for indexing (which would possibly require creating a class like XMLField or such).

thanks,
mirko
Re: Indexing XML files
On 12/5/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> You are right, it is escaped. But my question is: (how) can I make it unescaped?

I don't think Solr will support such functionality. The XML that Solr uses to return data is completely orthogonal to the XML embedded in the data, and mixing the two would have utterly unpredictable results. What if a document contained a <str> ... element? That could crash the parsing code, or leave it vulnerable to injection attacks.

Try using the JSON output format if you absolutely have no way of unescaping the resulting data (though I'd expect that any self-respecting XML parser would do that for you).

-Mike
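The risk Mike describes can be shown with a toy response. The sketch below is not Solr code: the response shape is simplified and the class, the helper names, and the hostile value are invented for the example. It embeds the same stored value once raw and once escaped, then counts the <str> elements a client would see.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustrative only: shows how a stored value that contains response markup
// changes the structure of the response when it is written without escaping.
class RawEmbeddingRisk {

    // Counts the <str> elements in a (simplified) response document.
    static int countStr(String responseXml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(responseXml)));
            return dom.getElementsByTagName("str").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Minimal escaping for element content; '&' must be replaced first.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // A stored value that happens to contain response markup.
        String value = "x</str><str name=\"author\">forged";

        String raw  = "<doc><str name=\"record\">" + value + "</str></doc>";
        String safe = "<doc><str name=\"record\">" + escape(value) + "</str></doc>";

        System.out.println(countStr(raw));  // prints 2: the value injected a second field
        System.out.println(countStr(safe)); // prints 1: the value stays inside its field
    }
}
```

A value with unbalanced tags would be worse still: the raw variant would not parse at all.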
Re: Indexing XML files
Hi,

Thanks for the quick response. Now, I have one more question. Is it possible to get the result for a query back in the following form (considering the input is the escaped XML you mentioned before):

    <response>
      <responseHeader>
        <status>0</status>
        <QTime>0</QTime>
      </responseHeader>
      <result numFound="1" start="0">
        <doc>
          <str name="label">As You Like It (Promptbook of McVicars 1860)</str>
          <str name="author">Shakespeare, William,</str>
          <str name="record"><myxml>...</myxml></str>
        </doc>
      </result>
    </response>

Note that here the XML data is not escaped. If yes, what do I have to do to get such results back? Would <str> need to be replaced with a type, say <xml>, which has a different write method? Or will I only be able to display escaped XML within <str> (and any other types)? If so, why?

thanks,
mirko

Quoting Chris Hostetter [EMAIL PROTECTED]:
> Since XML is the transport for sending data to Solr, you need to make sure all field values are XML escaped. [...]
Re: Indexing XML files
On 12/5/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> Thanks for the quick response. Now, I have one more question. Is it possible to get the result for a query back in the following form (considering the input is the escaped xml, what you mentioned before): [...]
> Note that here the XML data is not escaped.

I bet it is escaped, but your browser has helpfully displayed it as unescaped. Try doing CTRL-U in Firefox to see the real source for the reply.

-Yonik
Re: Indexing XML files
Hi,

the idea is to apply an XSLT transformation on the result. But it seems that I would have to apply two transformations in a row: one which unescapes the escaped node and a second which performs the actual transformation...

mirko

Quoting Yonik Seeley [EMAIL PROTECTED]:
> On 12/5/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> > You are right, it is escaped. But my question is: (how) can I make it unescaped?
>
> For what purpose? If you use an XML parser, the values it gives back to you will be unescaped.
>
> -Yonik
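Yonik's point, that a parser hands back the unescaped value, can be sketched as follows. The response string is a cut-down version of the one discussed in this thread, the field name record is taken from Mirko's example, and the class and method names are assumptions made for illustration. Once the value is back as a plain string it can itself be parsed as XML, which is the client-side counterpart of the "two steps" Mirko mentions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: reads the "record" field out of a Solr-style XML
// response and then parses that field value as an XML document of its own.
class ParsedValueSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the text of <str name="record">, or null if there is none.
    static String recordValue(String responseXml) {
        NodeList strs = parse(responseXml).getElementsByTagName("str");
        for (int i = 0; i < strs.getLength(); i++) {
            Element e = (Element) strs.item(i);
            if ("record".equals(e.getAttribute("name"))) {
                return e.getTextContent(); // the parser has already unescaped it
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String response =
                "<response><result numFound=\"1\" start=\"0\"><doc>"
              + "<str name=\"record\">&lt;myxml&gt;here goes the xml...&lt;/myxml&gt;</str>"
              + "</doc></result></response>";

        String value = recordValue(response);
        System.out.println(value); // prints <myxml>here goes the xml...</myxml>

        // Second step: treat the field value as XML in its own right.
        System.out.println(parse(value).getDocumentElement().getTagName()); // prints myxml
    }
}
```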
Re: Indexing XML files
At some point, it would be simpler to write a custom response handler and generate the output in your desired XML format.

wunder

On 12/5/06 1:52 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> Hi, the idea is to apply XSLT transformation on the result. But it seems that I would have to apply two transformations in a row, one which unescapes the escaped node and a second which performs the actual transformation... [...]
Re: Indexing XML files
: At some point, it would be simpler to write a custom response handler
: and generate the output in your desired XML format.

I think Walter's got the right idea ... as a general rule, we want to make the XmlResponseWriter bulletproof so that no matter what data you put into your index, it is guaranteed to produce a well-formed XML document that conforms to a specified DTD, or XSD (see SOLR-17 for one we already have but we haven't figured out what to do with yet).

But I can certainly understand your use case: you know you have well-formed XML values in some fields, and want to be able to apply a simple XSL transform on the whole response, and use XPath selectors to pull data out of your response fields.

The best approach I can think of that should work for you out of the box is what you already said: two XSL transforms ... one can be applied on the Solr server using the wt=xslt response writer -- just create an XSL that generates XML and unescapes the fields you know will contain well-formed XML data -- then apply your second transform client side (or using a proxy).

If you're interested in writing a bit of custom Java code you could in fact write a new FieldType (which could easily subclass TextField) with a custom write method that just outputs the raw value directly, and then load your field type as a plugin...

http://wiki.apache.org/solr/SolrPlugins

-Hoss
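One possible shape for the first of the two transforms (the one that unescapes a field known to hold well-formed XML) is sketched below. It is run here through the JDK's javax.xml.transform API rather than inside Solr, purely so the example is self-contained. The stylesheet, the class name, and the field name record are assumptions for illustration. It relies on disable-output-escaping, which XSLT 1.0 lets a processor ignore, so treat it as something to try rather than a guaranteed recipe.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative only: an identity transform that copies the response as is,
// except that the text of <str name="record"> is written without escaping.
class UnescapeTransformSketch {

    static final String XSL =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
          // Identity template: copy every node and attribute unchanged.
          + "<xsl:template match=\"@*|node()\">"
          + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
          + "</xsl:template>"
          // Exception: emit the record field's text as raw markup.
          + "<xsl:template match=\"str[@name='record']/text()\">"
          + "<xsl:value-of select=\".\" disable-output-escaping=\"yes\"/>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    static String unescapeRecord(String responseXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(responseXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String response =
                "<doc><str name=\"record\">&lt;myxml&gt;hi&lt;/myxml&gt;</str></doc>";
        // With a processor that honours disable-output-escaping, the record
        // field now contains a real <myxml> element instead of escaped text.
        System.out.println(unescapeRecord(response));
    }
}
```

The output of this step is only well-formed if the field value is, which is the assumption stated above; the second, "real" transform can then address the embedded elements with ordinary XPath.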