Hey ho,
I have a problem with a URL that seems to point to a vcf document.
Let me explain:
When I try to build a Solr index, this URL is responsible for the
following error message:
SEVERE: org.apache.solr.common.SolrException: ERROR: [http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10] multiple values encountered for non multiValued field title: [Universität Kassel, Fachbereich 6 ASL: Faculty Members, Lolita_Hörnlein.vcf]
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:242)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
The URL is:
http://cms.uni-kassel.de/asl/en/fb/staff.html?tx_wtdirectory_pi1%5BvCard%5D=10
When I download it separately, it delivers the following response:
Status=OK - 200
Date=Fri, 05 Aug 2011 11:09:12 GMT
Server=Apache/2.2.3 (Debian) mod_ssl/2.2.3 OpenSSL/0.9.8c
X-Powered-By=PHP/5.2.0-8+etch16
Content-Disposition=attachment; filename=Lolita_Hörnlein.vcf
Pragma=public
Content-Type=text/directory
Set-Cookie=fe_typo_user=316c4c91100f95fb57c5e8d39d32f99d; path=/asl/
Via=1.1 cms.uni-kassel.de
Vary=Accept-Encoding
Content-Encoding=gzip
Content-Length=5043
Keep-Alive=timeout=15, max=99
Connection=Keep-Alive
I have inspected this file and found that it is corrupted: besides the
proper vcf data, it also contains generated HTML code. This seems to be
a misbehaviour of some plugin in the CMS.
My question is how to handle such files. It looks like the parser puts
too many values into the title field (apparently the page title and the
attachment filename, judging by the error message), so Solr rejects the
document.
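One workaround I could imagine on the indexing side, sketched here under assumptions: collapse the title values to a single one before the document is sent to Solr. `TitleCollapser` and `firstTitle` are hypothetical names, not part of Solr or Tika, and where such a step would be hooked in depends on the crawler setup.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a Solr or Tika API.
class TitleCollapser {

    // Returns the first non-blank value (trimmed), or null if there is none.
    static String firstTitle(List<String> values) {
        if (values == null) {
            return null;
        }
        for (String v : values) {
            if (v != null && !v.trim().isEmpty()) {
                return v.trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The two values from the error message above (umlauts as escapes).
        List<String> titles = Arrays.asList(
            "Universit\u00e4t Kassel, Fachbereich 6 ASL: Faculty Members",
            "Lolita_H\u00f6rnlein.vcf");
        System.out.println(firstTitle(titles));
    }
}
```

Keeping the first value is an arbitrary choice here; joining the values into one string would work just as well for a single-valued field.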
As a quick solution, it would be best if I could configure Tika so that
it does not parse the vcf at all, but I don't know how to do that.
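Independent of any Tika setting, the response headers above would already be enough to recognise such a download before it reaches the parser. The following is only a sketch of that check in plain Java: `VCardFilter` and `looksLikeVCard` are hypothetical names, and how to wire it into the fetch or index step is left open because it depends on the setup.

```java
import java.util.Locale;

// Hypothetical pre-filter, not a Tika or Solr API.
class VCardFilter {

    // True if the Content-Type or the Content-Disposition header suggests a vCard.
    // Only handles the case where filename is the last parameter of the header.
    static boolean looksLikeVCard(String contentType, String contentDisposition) {
        String type = contentType == null
            ? "" : contentType.toLowerCase(Locale.ROOT).trim();
        String disp = contentDisposition == null
            ? "" : contentDisposition.toLowerCase(Locale.ROOT).replace("\"", "").trim();

        // MIME types commonly used for vCards.
        boolean typeMatch = type.startsWith("text/directory")
            || type.startsWith("text/vcard")
            || type.startsWith("text/x-vcard");
        boolean nameMatch = disp.endsWith(".vcf");
        return typeMatch || nameMatch;
    }

    public static void main(String[] args) {
        // Header values taken from the response shown above (umlaut as escape).
        System.out.println(looksLikeVCard(
            "text/directory",
            "attachment; filename=Lolita_H\u00f6rnlein.vcf"));
    }
}
```

A URL for which this returns true could then simply be skipped instead of being parsed and indexed.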
Any suggestions for this problem?
Thank you very much.