[Parsing] Date Fields

2010-12-11 Thread Adam Estrada
All,

I am ingesting a lot of RSS feeds as part of my application and I keep
getting the same error.

WARNING: Could not parse a Date field
java.text.ParseException: Unparseable date: Mon, 06 Dec 2010 23:31:38 +
        at java.text.DateFormat.parse(Unknown Source)
        at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
        at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
Dec 11, 2010 6:25:47 PM org.apache.solr.handler.dataimport.DocBuilder finish
INFO: Import completed successfully
Dec 11, 2010 6:25:47 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)

Are there any tips or tricks to getting standard RSS update fields to
import correctly?

An example entity from one of my DIH config XML files is as follows:

  <entity name="CBS"
          pk="link"
          datasource="filedatasource"
          url="http://feeds.cbsnews.com/CBSNewsMain?format=xml;"
          processor="XPathEntityProcessor"
          forEach="/rss/channel | /rss/channel/item"
          transformer="DateFormatTransformer,HTMLStripTransformer">
    <field column="source"       xpath="/rss/channel/title"            commonField="true" />
    <field column="source-link"  xpath="/rss/channel/link"             commonField="true" />
    <field column="subject"      xpath="/rss/channel/description"      commonField="true" />
    <field column="title"        xpath="/rss/channel/item/title" />
    <field column="link"         xpath="/rss/channel/item/link" />
    <field column="description"  xpath="/rss/channel/item/description" stripHTML="true" />
    <field column="creator"      xpath="/rss/channel/item/creator" />
    <field column="item-subject" xpath="/rss/channel/item/subject" />
    <field column="author"       xpath="/rss/channel/item/author" />
    <field column="comments"     xpath="/rss/channel/item/comments" />
    <field column="pubdate"      xpath="/rss/channel/item/pubDate"
           dateTimeFormat="-MM-dd'T'hh:mm:ss'Z'" />
  </entity>
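
The warning in the log can probably be reproduced outside Solr with a few lines of Java. This is only a sketch, not DIH code. It assumes the pattern is handed to java.text.SimpleDateFormat (the stack trace passes through java.text.DateFormat.parse), that the configured pattern originally began with "yyyy" (its leading characters are missing above), and that the feed timestamp ended in "+0000" (the offset is cut off in the log). The class and method names are made up for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical helper, not part of Solr: reports whether a pattern can parse a sample value.
final class PatternCheck {

    // Returns true if SimpleDateFormat accepts the text under the given pattern.
    static boolean matches(String pattern, String text) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        try {
            format.parse(text);
            return true;
        } catch (ParseException e) {
            // Same exception type as the one reported in the DIH warning.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed values: the "yyyy" prefix and the "+0000" offset are guesses.
        String configured = "yyyy-MM-dd'T'hh:mm:ss'Z'";
        String fromFeed = "Mon, 06 Dec 2010 23:31:38 +0000";

        // The RSS-style value does not fit the configured pattern.
        System.out.println(matches(configured, fromFeed));
        // A value already laid out like the pattern does.
        System.out.println(matches(configured, "2010-12-06T11:31:38Z"));
    }
}
```

A check like this makes it quick to try candidate patterns against a sample timestamp before running a full import.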

Any tips on this would be really appreciated as I need to query based on the
date the article was published.

Thanks,
Adam


Re: [Parsing] Date Fields

2010-12-11 Thread Erick Erickson
Dates in Solr have a very specific format, see:
http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html
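
For reference, the format described on that page is an ISO 8601 timestamp in UTC, for example 1995-12-31T23:59:59Z. A small sketch of producing such a string from a java.util.Date with plain JDK classes follows. The class and method names are illustrative, and nothing here is Solr API.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Illustrative helper: renders a Date as yyyy-MM-dd'T'HH:mm:ss'Z' in UTC.
final class SolrDateText {

    static String format(Date date) {
        // The trailing 'Z' is a literal, so the formatter must be pinned to UTC.
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        return out.format(date);
    }

    public static void main(String[] args) {
        // The Unix epoch, rendered in the UTC layout.
        System.out.println(format(new Date(0L)));
    }
}
```

Note the capital HH (hour 0-23). A lowercase hh is the 12-hour clock and would need an AM/PM marker to be unambiguous.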

Best
Erick



Re: [Parsing] Date Fields

2010-12-11 Thread Lance Norskog
Here's the problem, at the end of the DIH file:
    <field column="pubdate" xpath="/rss/channel/item/pubDate"
           dateTimeFormat="-MM-dd'T'hh:mm:ss'Z'" />
  </entity>

This says "parse this timestamp into a Java Date object using this
date-time spec". The string you have given is the UTC timestamp format
that Solr itself reads, not the format used by the feed. You need to
change this date-format string to match the format of your incoming
timestamps. The JDK date-formatting classes and innumerable tutorials
for them are online.
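
As a sketch of what a matching pattern could look like: RSS pubDate values follow the RFC 822 layout, which in SimpleDateFormat syntax is commonly written "EEE, dd MMM yyyy HH:mm:ss Z". The example below only exercises the JDK. The "+0000" offset is assumed, because the logged value is cut off after the "+", and the class name is hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical example class: parses an RFC 822 style timestamp as found in RSS pubDate.
final class RssPubDate {

    // Day name, day, month name, year, 24-hour time, numeric offset such as +0000.
    static final String PATTERN = "EEE, dd MMM yyyy HH:mm:ss Z";

    static Date parse(String text) {
        // Locale.US so that English day and month names parse on any machine.
        SimpleDateFormat in = new SimpleDateFormat(PATTERN, Locale.US);
        try {
            return in.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not an RFC 822 style date: " + text, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample value; the real offset is truncated in the log.
        Date published = parse("Mon, 06 Dec 2010 23:31:38 +0000");
        System.out.println(published.getTime());
    }
}
```

The same pattern string would be the natural candidate for the dateTimeFormat attribute, though that is untested against this feed.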

Cheers,

Lance Norskog

-- 
Lance Norskog
goks...@gmail.com