CLASSIFICATION: UNCLASSIFIED

Thanks Yonik and Erick,
If I set -filetypes csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,rtf,htm,html,txt would this prevent indexing of xml files?
Why does the simple post tool index .cfm files with this or default settings?

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.
US Army Research Lab
Aberdeen Proving Ground
Application Management & Development Branch
410-278-7251
kris.t.musshorn....@mail.mil
~~~~~~~~~~~~~~~~~~~~~~~~~~

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, July 15, 2016 12:30 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: [Non-DoD Source] Re: SimplePostTool error (UNCLASSIFIED)

simplePostTool is just that, simple. It's intended to get you started. It is not a full-featured web crawler. As such, if you're encountering wonky web pages that are not well formed HTML there's no guarantee that it'll handle them gracefully.

Crawling websites is a pain, so if you require something robust I'd investigate Nutch (which integrates with Solr/Lucene) or similar.

Best,
Erick

On Fri, Jul 15, 2016 at 9:01 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kris.t.musshorn....@mail.mil> wrote:
> How do I correct this error when running the simple post tool against a
> website?
> The tool successfully indexed for about 30 mins before throwing this error
> and terminating.
>
> [Fatal Error] :642:15: XML document structures must start and end within the same entity.
> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 642; columnNumber: 15; XML document structures must start and end within the same entity.
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1219)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:601)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:618)
>         at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:548)
>         at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:351)
>         at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:182)
>         at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:167)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 642; columnNumber: 15; XML document structures must start and end within the same entity.
>         at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
>         at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1028)
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1201)
>         ... 9 more
>
> Thanks,
> Kris
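
Judging from the stack trace, the crawl fails when a fetched page is handed to javax.xml.parsers.DocumentBuilder.parse, which is a strict XML parser and rejects markup that is not well formed. The sketch below reproduces that kind of failure in isolation. It is not SimplePostTool's code: the class name StrictParseDemo, the helper parsesAsXml, and the sample markup are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Solr.
public class StrictParseDemo {

    /** Returns true if the markup parses as well-formed XML, false if the parser rejects it. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error]" to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXParseException e) {
            // Same exception type as in the trace above: the input is not well formed.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected for in-memory input.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed markup: every element is closed.
        System.out.println(parsesAsXml("<html><body><p>ok</p></body></html>")); // true
        // Truncated markup: the open elements are never closed.
        System.out.println(parsesAsXml("<html><body><p>never closed"));         // false
    }
}
```

A crawler that has to survive pages like the second one would need either a lenient HTML parser or a catch around the parse step so that one bad page is skipped instead of ending the run. That is in line with Erick's suggestion to look at a dedicated crawler such as Nutch.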