You should look at Apache Nutch, which has Solr client support. It has all the indexing options you need, and it comes with a schema for building a Solr collection with all the fields required for indexing.
We have used it and it works well. It also supports sitemap.xml, which simplifies indexing.

On Fri, Apr 12, 2019 at 6:43 AM Jan Høydahl <jan....@cominvent.com> wrote:

> I think there may actually be a bug. I was not able to crawl some other
> web site either.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 11 Apr 2019, at 18:55, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > You are sending malformed XML to Solr. This can be something as silly as
> > having extra spaces at the beginning. I’d capture the page being sent to
> > Solr and put it in a formatter to check it.
> >
> > Best,
> > Erick
> >
> >> On Apr 11, 2019, at 3:49 AM, Shivprasad Shetty <shivpras...@orioninc.com> wrote:
> >>
> >> Hello Team,
> >>
> >> I am working on solr for the first time and got the setup done. Now I
> >> have created a core using command line and want to perform webcrawl of
> >> a third party site.
> >> If I try it with individual links, I am able to do the crawl and index
> >> it to the core. This was done using
> >>
> >> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com
> >>
> >> Now what I intend to do is to give a url and using the recursive option
> >> (-Drecursive) and let it crawl the entire site.
> >> Note that I am pointing to a website that has around 125 pages and I am
> >> using the below command
> >>
> >> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com and
> >> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com
> >>
> >> and I am getting the below error message.
> >> Error:
> >>
> >> POSTed web resource http://www.example.com (depth: 0)
> >> [Fatal Error] :1:1: Content is not allowed in prolog.
> >> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
> >>     at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
> >>     at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
> >>     at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
> >>     at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
> >>     at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
> >>     at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> >> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
> >>     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
> >>     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> >>     at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
> >>     at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
> >>     at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
> >>     ... 5 more
> >>
> >> I would be very grateful if anyone could get me to solve this issue I
> >> have been trying to fix for a couple of days.
> >>
> >> Regards,
> >> ShivprasadS
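
P.S. On Erick's suggestion of capturing the page and checking it: a rough sketch of that check is below, in case it helps. It only tests whether a captured page is well-formed XML and reports where the first problem is. It uses Python's built-in parser, not the Xerces parser that SimplePostTool calls in the stack trace, so the error text will differ from "Content is not allowed in prolog". The function name and the three sample strings are made up for illustration; in practice you would paste in the page you captured.

```python
import xml.etree.ElementTree as ET


def first_xml_error(text):
    """Return None if text is well-formed XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # err.position is (line, column); line is 1-based, column is 0-based.
        line, column = err.position
        return line, column, str(err)
    return None


# Hypothetical samples standing in for a captured page.
well_formed = "<html><body><a href='/docs'>Docs</a></body></html>"
junk_prefix = "HTTP error page <html></html>"    # text before the root element
unclosed_tag = "<html><body><br></body></html>"  # common in HTML, invalid in XML

for name, sample in [("well_formed", well_formed),
                     ("junk_prefix", junk_prefix),
                     ("unclosed_tag", unclosed_tag)]:
    print(name, "->", first_xml_error(sample))
```

If the captured page fails a check like this, the crawler's link extraction will most likely fail on it as well, since the stack trace shows it parsing the page with an XML DocumentBuilder.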