[Parsing] Date Fields
All,

I am ingesting a lot of RSS feeds as part of my application and I keep getting the same error.

WARNING: Could not parse a Date field
java.text.ParseException: Unparseable date: Mon, 06 Dec 2010 23:31:38 +
    at java.text.DateFormat.parse(Unknown Source)
    at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
    at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Dec 11, 2010 6:25:47 PM org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully
Dec 11, 2010 6:25:47 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)

Are there any tips or tricks to getting standard RSS update fields to import correctly?
An example for a DIH config XML file is as follows:

entity name=CBS pk=link datasource=filedatasource url=http://feeds.cbsnews.com/CBSNewsMain?format=xml; processor=XPathEntityProcessor forEach=/rss/channel | /rss/channel/item transformer=DateFormatTransformer,HTMLStripTransformer
field column=source xpath=/rss/channel/title commonField=true /
field column=source-link xpath=/rss/channel/link commonField=true /
field column=subject xpath=/rss/channel/description commonField=true /
field column=title xpath=/rss/channel/item/title /
field column=link xpath=/rss/channel/item/link /
field column=description xpath=/rss/channel/item/description stripHTML=true /
field column=creator xpath=/rss/channel/item/creator /
field column=item-subject xpath=/rss/channel/item/subject /
field column=author xpath=/rss/channel/item/author /
field column=comments xpath=/rss/channel/item/comments /
field column=pubdate xpath=/rss/channel/item/pubDate dateTimeFormat=-MM-dd'T'hh:mm:ss'Z' /
/entity

Any tips on this would be really appreciated as I need to query based on the date the article was published.

Thanks,
Adam
Re: [Parsing] Date Fields
Dates in Solr have a very specific format, see: http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html

Best
Erick
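To make the "very specific format" concrete: Solr's canonical date form is an ISO 8601 timestamp in UTC with a trailing Z, for example 1995-12-31T23:59:59Z. The following is only a minimal sketch of producing that form with the JDK; the class and method names are hypothetical, and the epoch value is simply the feed timestamp from the error message, assuming its truncated offset was +0000.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper, not part of Solr: formats a java.util.Date
// in the canonical form that Solr date fields expect.
class SolrDateFormatSketch {

    // 24-hour clock (HH) and a literal 'Z'; the formatter is pinned to UTC
    // so that the literal 'Z' is actually true.
    static String toSolrDate(Date date) {
        SimpleDateFormat fmt =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // Assumed instant: Mon, 06 Dec 2010 23:31:38 UTC as epoch milliseconds.
        System.out.println(toSolrDate(new Date(1291678298000L)));
    }
}
```

Note the uppercase HH here. A pattern with lowercase hh uses a 12-hour clock, which is easy to miss when the pattern is copied around.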
Re: [Parsing] Date Fields
Here's the problem, at the end of the DIH file:

field column=pubdate xpath=/rss/channel/item/pubDate dateTimeFormat=-MM-dd'T'hh:mm:ss'Z' /
/entity

The dateTimeFormat attribute tells DateFormatTransformer to parse the incoming timestamp into a Java Date object using the given date-time pattern. The pattern you have describes the UTC timestamp format that Solr itself reads, not the format of the dates in your feed, which look like "Mon, 06 Dec 2010 23:31:38 +". You need to change this pattern to match your incoming timestamps. The pattern syntax is that of the JDK's java.text.SimpleDateFormat class, and innumerable tutorials for it are online.

Cheers,
Lance Norskog
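Lance's point can be checked outside Solr with plain SimpleDateFormat. The sketch below rests on several assumptions: the archive truncated the feed's offset, so the sample string assumes it was +0000; the year part of the configured pattern is also missing from the archive, so yyyy is assumed; and the RFC 822 style pattern "EEE, dd MMM yyyy HH:mm:ss Z" is a suggestion for typical RSS pubDate values, not something confirmed in the thread. The class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical stand-alone check, not DIH code: compares two patterns
// against a sample RSS pubDate string.
class RssDateParseSketch {

    // Assumed reconstruction of the pattern in the posted config.
    static final String CONFIGURED_PATTERN = "yyyy-MM-dd'T'hh:mm:ss'Z'";

    // Suggested pattern for RFC 822 style dates as used in RSS pubDate.
    static final String RSS_PATTERN = "EEE, dd MMM yyyy HH:mm:ss Z";

    // Returns the parsed Date, or null if the text does not match the pattern.
    static Date tryParse(String pattern, String text) {
        try {
            // English locale so that "Mon" and "Dec" are recognized
            // regardless of the JVM's default locale.
            return new SimpleDateFormat(pattern, Locale.ENGLISH).parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Assumed sample: the date from the error message with a +0000 offset.
        String sample = "Mon, 06 Dec 2010 23:31:38 +0000";
        System.out.println("configured pattern: " + tryParse(CONFIGURED_PATTERN, sample));
        System.out.println("RSS pattern:        " + tryParse(RSS_PATTERN, sample));
    }
}
```

If the suggested pattern fits the feed, the same string would go into the dateTimeFormat attribute of the pubdate field. One caveat: this sketch forces an English locale, so on a server whose default locale is not English the day and month names may still fail to parse unless the transformer is told which locale to use.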