Re: solr 7.0.1: exception running post to crawl simple website
Toby,

Your mention of "-recursive" causing a problem reminded me of a simple crawl (of the 7.0 Ref Guide) using bin/post that I was trying to get to work the other day and couldn't. The order of the parameters seems to make a difference in which error you get (this is using 7.1):

1. "./bin/post -c gettingstarted -delay 10 https://lucene.apache.org/solr/guide/7_0 -recursive" yields the stack trace in the previous message:

POSTed web resource https://lucene.apache.org/solr/guide/7_0 (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
        at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
        ... 5 more

2. "./bin/post -c gettingstarted -delay 10 -recursive https://lucene.apache.org/solr/guide/7_0" yields:

No files, directories, URLs, -d strings, or stdin were specified.
See './bin/post -h' for usage instructions.

3. "./bin/post -c gettingstarted http://lucene.apache.org/solr/guide/7_0 -recursive -delay 10" yields:

Unrecognized argument: 10
If this was intended to be a data file, it does not exist relative to /Applications/Solr/solr-7.1.0

4. "./bin/post -c gettingstarted -delay 10 https://lucene.apache.org/solr/guide/7_0" successfully gets the document, but only the single page at that URL. It does not extract any of the content of the page besides the title and the metadata Tika adds.

I'd say we should probably file a JIRA for it. If the parsing is wrong (as it seems to me to be), that's a different problem, but the fact that you can't use -recursive at all is a bug AFAICT.

Cassandra

On Fri, Oct 27, 2017 at 11:03 AM, toby1851 wrote:
> Amrit Sarkar wrote
>> The above is a SAXParse runtime exception. Nothing can be done at the Solr end
>> except curating your own data.
>
> I'm trying to replace a solr-4.6.0 system (which has been working
> brilliantly for 3 years!) with solr-7.1.0. I'm running into this exact same
> problem.
>
> I do not believe it is a data curation problem. (Even if it were, it's very
> unfriendly just to bomb out with a stack trace. And it's seriously annoying
> that there's a 14-line error message about a parsing problem, but it
> entirely neglects to mention what it was trying to parse! Was it a file, a
> URL...?)
>
> Anyway, the symptoms I'm seeing are that a simple "post -c foo https://..."
> works fine. But the moment I turn on recursion, it fails before fetching a
> second page. It doesn't matter what the first page is. Really: when I made
> no progress with the site that I'm actually trying to index, I tried another
> of my sites, then Google, then eBay... In every case, I get something like
> this:
>
> $ post -c mycollection https://www.ebay.co.uk -recursive 1 -delay 10
> ...
> POSTed web resource https://www.ebay.co.uk (depth: 0)
> ... [ 10s delay ]
> [Fatal Error] :1:1: Content is not allowed in prolog.
> ...
>
> I've been looking at the code, and also at what's going on with strace. As far as
> I can see, at the point where the exception occurs, we are parsing data (a
> copy of the page, presumably) that has come from the solr server itself.
> That appears to be a chunk of JSON with embedded XML. The inner XML does
> look to at least start correctly. The fact that we're getting an error at
> line 1, column 1 every single time makes me suspect that we're feeding the
> wrong thing to the SAX parser.
>
> Anyway, I'm going to go and look at Nutch as I need something working very
> soon.
>
> But could somebody who is familiar with this code take another look?
>
> Cheers,
>
> Toby.
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
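The order-sensitivity in the four cases above is characteristic of a tool that walks argv by hand rather than using a declarative option parser. A small illustrative sketch (this is not bin/post's actual parser, and the option names merely mirror its flags) shows how a declarative parser accepts the same arguments in any order:

```python
import argparse

# Illustrative only: SimplePostTool parses its command line by hand, which
# is why the four orderings above behave differently. A declarative parser
# accepts flags and positionals interleaved in any order.
parser = argparse.ArgumentParser()
parser.add_argument("-c", dest="collection")
parser.add_argument("-delay", type=int, default=0)
parser.add_argument("-recursive", type=int, default=0)
parser.add_argument("urls", nargs="+")

flags_first = parser.parse_args(
    ["-c", "gettingstarted", "-delay", "10", "-recursive", "1",
     "https://example.com"])
url_in_middle = parser.parse_args(
    ["-c", "gettingstarted", "https://example.com",
     "-recursive", "1", "-delay", "10"])

# Both orderings produce an identical namespace.
assert flags_first == url_in_middle
```

Either way the fix belongs in the tool, but a JIRA could note that a declarative parser would also eliminate cases 2 and 3 outright.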
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote
> The above is a SAXParse runtime exception. Nothing can be done at the Solr end
> except curating your own data.

I'm trying to replace a solr-4.6.0 system (which has been working brilliantly for 3 years!) with solr-7.1.0. I'm running into this exact same problem.

I do not believe it is a data curation problem. (Even if it were, it's very unfriendly just to bomb out with a stack trace. And it's seriously annoying that there's a 14-line error message about a parsing problem, but it entirely neglects to mention what it was trying to parse! Was it a file, a URL...?)

Anyway, the symptoms I'm seeing are that a simple "post -c foo https://..." works fine. But the moment I turn on recursion, it fails before fetching a second page. It doesn't matter what the first page is. Really: when I made no progress with the site that I'm actually trying to index, I tried another of my sites, then Google, then eBay... In every case, I get something like this:

$ post -c mycollection https://www.ebay.co.uk -recursive 1 -delay 10
...
POSTed web resource https://www.ebay.co.uk (depth: 0)
... [ 10s delay ]
[Fatal Error] :1:1: Content is not allowed in prolog.
...

I've been looking at the code, and also at what's going on with strace. As far as I can see, at the point where the exception occurs, we are parsing data (a copy of the page, presumably) that has come from the solr server itself. That appears to be a chunk of JSON with embedded XML. The inner XML does look to at least start correctly. The fact that we're getting an error at line 1, column 1 every single time makes me suspect that we're feeding the wrong thing to the SAX parser.

Anyway, I'm going to go and look at Nutch as I need something working very soon.

But could somebody who is familiar with this code take another look?

Cheers,

Toby.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
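Toby's suspicion is easy to demonstrate: an XML parser rejects any payload whose first byte is not valid XML, and it always does so at line 1, column 1 — exactly the constant position in the Xerces trace. A minimal repro (the JSON below is a made-up stand-in; the real response body is not shown in the thread, and Python's expat words the error differently from Xerces's "Content is not allowed in prolog"):

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for a Solr JSON response with embedded XML/HTML.
payload = '{"responseHeader": {"status": 0}, "content": "<html>...</html>"}'

try:
    xml.dom.minidom.parseString(payload)
except ExpatError as err:
    # Like Xerces, expat fails on the very first byte: line 1.
    print("parse failed at line", err.lineno)
```

If SimplePostTool is handing `getLinksFromWebPage` the extract handler's JSON response instead of the fetched HTML, this is precisely the failure you would see on every site, every time.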
Re: solr 7.0.1: exception running post to crawl simple website
On 2017-10-13 04:19 PM, Kevin Layer wrote:
> Amrit Sarkar wrote:
>> Kevin,
>>
>> fileType => md is not a recognizable format in SimplePostTool; anyway, moving on.
>
> OK, thanks. Looks like I'll have to abandon using solr for this project
> (or find another way to crawl the site). Thank you for all the help,
> though. I appreciate it.

Ha, these messages crash my android mail client! Now... Did you try Nutch? Or the Norconex HTTP crawler? Tika? Or any Python crawler, posting its documents to the Solr API.

cheers -- Rick
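Rick's last suggestion — a small crawler of your own that posts pages to the Solr API — can be sketched in a few lines of stdlib Python. The Solr URL and collection name ("foo") below are assumptions to adjust for your install; /update/extract is Solr's extracting (Tika) request handler, and this sketch has no politeness delay, dedup, or error handling:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, quote
import urllib.request

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href=...> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    """Return every <a href> in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def post_page(page_url, raw_html,
              solr="http://localhost:8983/solr/foo/update/extract"):
    """POST one page to Solr's extracting handler; Tika does the parsing.
    literal.id sets the document id; commit=true makes it visible at once."""
    req = urllib.request.Request(
        "%s?literal.id=%s&commit=true" % (solr, quote(page_url, safe="")),
        data=raw_html.encode("utf-8"),
        headers={"Content-Type": "text/html"})
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A breadth-first loop over `extract_links` plus `post_page` gives the recursive behaviour bin/post fails to deliver above, with the crawler — not SimplePostTool — deciding what gets fed to which parser.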
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Kevin,
>>
>> fileType => md is not a recognizable format in SimplePostTool; anyway, moving
>> on.

OK, thanks. Looks like I'll have to abandon using solr for this project (or find another way to crawl the site). Thank you for all the help, though. I appreciate it.

>> The above is a SAXParse runtime exception. Nothing can be done at the Solr end
>> except curating your own data.
>> Some helpful links:
>> https://stackoverflow.com/questions/2599919/java-parsing-xml-document-gives-content-not-allowed-in-prolog-error
>> https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Fri, Oct 13, 2017 at 8:48 PM, Kevin Layer wrote:
>>
>> > Amrit Sarkar wrote:
>> >
>> > >> Kevin,
>> > >>
>> > >> I am not able to replicate the issue on my system, which is a bit annoying
>> > >> for me. Try this out one last time:
>> > >>
>> > >> docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
>> > >>
>> > >> and have Content-Type: "html" and "text/html", try with both.
>> >
>> > With text/html and your command I get
>> >
>> > quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
>> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
>> > SimplePostTool version 5.0.0
>> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> > Entering auto mode. Indexing pages with content-types corresponding to file endings html
>> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> > Entering recursive mode, depth=10, delay=0s
>> > Entering crawl at level 0 (1 links total, 1 new)
>> > POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
>> > [Fatal Error] :1:1: Content is not allowed in prolog.
>> > Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>> >         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
>> >         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
>> >         at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>> >         at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>> >         at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>> >         at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
>> > Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>> >         at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
>> >         at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
>> >         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>> >         at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
>> >         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
>> >         ... 5 more
>> >
>> > When I use "-filetype md" I'm back to the regular output that doesn't scan anything.
>> >
>> > >>
>> > >> If you get past this hurdle, let me know.
>> > >>
>> > >> Amrit Sarkar
>> > >> Search Engineer
>> > >> Lucidworks, Inc.
>> > >> 415-589-9269
>> > >> www.lucidworks.com
>> > >> Twitter http://twitter.com/lucidworks
>> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >>
>> > >> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer wrote:
>> > >>
>> > >> > Amrit Sarkar wrote:
>> > >> >
>> > >> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> > >> > >> the machine. I haven't played much with docker, any way you can get that
>> > >> > >> file from that location.
>> > >> >
>> > >> > I see these files:
>> > >> >
>> > >> > /opt/solr/server/logs/archived
>> > >> > /opt/solr/server/logs/solr_gc.log.0.current
>> > >> > /opt/solr/server/logs/solr.log
>> > >> > /opt/solr/server/solr/handbook/data/tlog
>> > >> >
>> > >> > The 3rd one has very little info. Attached:
>> > >> >
>> > >> > 2017-10-11 15:28:09.564 INFO (main) [ ] o.e.j.s.Server
>> >
Re: solr 7.0.1: exception running post to crawl simple website
Kevin,

fileType => md is not a recognizable format in SimplePostTool; anyway, moving on.

The above is a SAXParse runtime exception. Nothing can be done at the Solr end except curating your own data.

Some helpful links:
https://stackoverflow.com/questions/2599919/java-parsing-xml-document-gives-content-not-allowed-in-prolog-error
https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:48 PM, Kevin Layer wrote:

> Amrit Sarkar wrote:
>
> >> Kevin,
> >>
> >> I am not able to replicate the issue on my system, which is a bit annoying
> >> for me. Try this out one last time:
> >>
> >> docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
> >>
> >> and have Content-Type: "html" and "text/html", try with both.
>
> With text/html and your command I get
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to file endings html
> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
>         at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>         at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>         at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>         at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>         at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
>         at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>         at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
>         at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
>         ... 5 more
>
> When I use "-filetype md" I'm back to the regular output that doesn't scan anything.
>
> >>
> >> If you get past this hurdle, let me know.
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer wrote:
> >>
> >> > Amrit Sarkar wrote:
> >> >
> >> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
> >> > >> the machine. I haven't played much with docker, any way you can get that
> >> > >> file from that location.
> >> >
> >> > I see these files:
> >> >
> >> > /opt/solr/server/logs/archived
> >> > /opt/solr/server/logs/solr_gc.log.0.current
> >> > /opt/solr/server/logs/solr.log
> >> > /opt/solr/server/solr/handbook/data/tlog
> >> >
> >> > The 3rd one has very little info. Attached:
> >> >
> >> > 2017-10-11 15:28:09.564 INFO (main) [ ] o.e.j.s.Server jetty-9.3.14.v20161028
> >> > 2017-10-11 15:28:10.668 INFO (main) [ ] o.a.s.s.SolrDispatchFilter  ___ _  Welcome to Apache Solr™ version 7.0.1
> >> > 2017-10-11 15:28:10.669 INFO (main) [ ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _  Starting in standalone mode on port 8983
> >> > 2017-10-11 15:28:10.670 INFO (main) [ ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /opt/solr, Default config dir: /opt/solr/server/solr/configsets/_default/conf
> >> > 2017-10-11 15:28:10.707 INFO (main) [ ]
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Kevin,
>>
>> I am not able to replicate the issue on my system, which is a bit annoying
>> for me. Try this out one last time:
>>
>> docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
>>
>> and have Content-Type: "html" and "text/html", try with both.

With text/html and your command I get

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings html
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
        at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
        at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
        ... 5 more

When I use "-filetype md" I'm back to the regular output that doesn't scan anything.

>>
>> If you get past this hurdle, let me know.
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer wrote:
>>
>> > Amrit Sarkar wrote:
>> >
>> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> > >> the machine. I haven't played much with docker, any way you can get that
>> > >> file from that location.
>> >
>> > I see these files:
>> >
>> > /opt/solr/server/logs/archived
>> > /opt/solr/server/logs/solr_gc.log.0.current
>> > /opt/solr/server/logs/solr.log
>> > /opt/solr/server/solr/handbook/data/tlog
>> >
>> > The 3rd one has very little info. Attached:
>> >
>> > 2017-10-11 15:28:09.564 INFO (main) [ ] o.e.j.s.Server jetty-9.3.14.v20161028
>> > 2017-10-11 15:28:10.668 INFO (main) [ ] o.a.s.s.SolrDispatchFilter  ___ _  Welcome to Apache Solr™ version 7.0.1
>> > 2017-10-11 15:28:10.669 INFO (main) [ ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _  Starting in standalone mode on port 8983
>> > 2017-10-11 15:28:10.670 INFO (main) [ ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /opt/solr, Default config dir: /opt/solr/server/solr/configsets/_default/conf
>> > 2017-10-11 15:28:10.707 INFO (main) [ ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|  Start time: 2017-10-11T15:28:10.674Z
>> > 2017-10-11 15:28:10.747 INFO (main) [ ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /opt/solr/server/solr
>> > 2017-10-11 15:28:10.763 INFO (main) [ ] o.a.s.c.SolrXmlConfig Loading container configuration from /opt/solr/server/solr/solr.xml
>> > 2017-10-11 15:28:11.062 INFO (main) [ ] o.a.s.c.SolrResourceLoader [null] Added 0 libs to classloader, from paths: []
>> > 2017-10-11 15:28:12.514 INFO (main) [ ] o.a.s.c.CorePropertiesLocator Found 0 core definitions underneath /opt/solr/server/solr
>> > 2017-10-11 15:28:12.635 INFO (main) [ ] o.e.j.s.Server Started @4304ms
>> > 2017-10-11 15:29:00.971 INFO (qtp1911006827-13) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={wt=json} status=0 QTime=108
Re: solr 7.0.1: exception running post to crawl simple website
Kevin,

I am not able to replicate the issue on my system, which is a bit annoying for me. Try this out one last time:

docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html

and have Content-Type: "html" and "text/html", try with both.

If you get past this hurdle, let me know.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer wrote:

> Amrit Sarkar wrote:
>
> >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
> >> the machine. I haven't played much with docker, any way you can get that
> >> file from that location.
>
> I see these files:
>
> /opt/solr/server/logs/archived
> /opt/solr/server/logs/solr_gc.log.0.current
> /opt/solr/server/logs/solr.log
> /opt/solr/server/solr/handbook/data/tlog
>
> The 3rd one has very little info. Attached:
>
> 2017-10-11 15:28:09.564 INFO (main) [ ] o.e.j.s.Server jetty-9.3.14.v20161028
> 2017-10-11 15:28:10.668 INFO (main) [ ] o.a.s.s.SolrDispatchFilter  ___ _  Welcome to Apache Solr™ version 7.0.1
> 2017-10-11 15:28:10.669 INFO (main) [ ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _  Starting in standalone mode on port 8983
> 2017-10-11 15:28:10.670 INFO (main) [ ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /opt/solr, Default config dir: /opt/solr/server/solr/configsets/_default/conf
> 2017-10-11 15:28:10.707 INFO (main) [ ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|  Start time: 2017-10-11T15:28:10.674Z
> 2017-10-11 15:28:10.747 INFO (main) [ ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /opt/solr/server/solr
> 2017-10-11 15:28:10.763 INFO (main) [ ] o.a.s.c.SolrXmlConfig Loading container configuration from /opt/solr/server/solr/solr.xml
> 2017-10-11 15:28:11.062 INFO (main) [ ] o.a.s.c.SolrResourceLoader [null] Added 0 libs to classloader, from paths: []
> 2017-10-11 15:28:12.514 INFO (main) [ ] o.a.s.c.CorePropertiesLocator Found 0 core definitions underneath /opt/solr/server/solr
> 2017-10-11 15:28:12.635 INFO (main) [ ] o.e.j.s.Server Started @4304ms
> 2017-10-11 15:29:00.971 INFO (qtp1911006827-13) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={wt=json} status=0 QTime=108
> 2017-10-11 15:29:01.080 INFO (qtp1911006827-18) [ ] o.a.s.c.TransientSolrCoreCacheDefault Allocating transient cache for 2147483647 transient cores
> 2017-10-11 15:29:01.083 INFO (qtp1911006827-18) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=handbook=STATUS=json} status=0 QTime=5
> 2017-10-11 15:29:01.194 INFO (qtp1911006827-19) [ ] o.a.s.h.a.CoreAdminOperation core create command name=handbook=CREATE=handbook=json
> 2017-10-11 15:29:01.342 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.SolrResourceLoader [handbook] Added 51 libs to classloader, from paths: [/opt/solr/contrib/clustering/lib, /opt/solr/contrib/extraction/lib, /opt/solr/contrib/langid/lib, /opt/solr/contrib/velocity/lib, /opt/solr/dist]
> 2017-10-11 15:29:01.504 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.SolrConfig Using Lucene MatchVersion: 7.0.1
> 2017-10-11 15:29:01.969 INFO (qtp1911006827-19) [ x:handbook] o.a.s.s.IndexSchema [handbook] Schema name=default-config
> 2017-10-11 15:29:03.678 INFO (qtp1911006827-19) [ x:handbook] o.a.s.s.IndexSchema Loaded schema default-config/1.6 with uniqueid field id
> 2017-10-11 15:29:03.806 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.CoreContainer Creating SolrCore 'handbook' using configuration from instancedir /opt/solr/server/solr/handbook, trusted=true
> 2017-10-11 15:29:03.853 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.SolrCore solr.RecoveryStrategy.Builder
> 2017-10-11 15:29:03.866 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.SolrCore [[handbook] ] Opening new SolrCore at [/opt/solr/server/solr/handbook], dataDir=[/opt/solr/server/solr/handbook/data/]
> 2017-10-11 15:29:04.180 INFO (qtp1911006827-19) [ x:handbook] o.a.s.r.XSLTResponseWriter xsltCacheLifetimeSeconds=5
> 2017-10-11 15:29:05.100 INFO (qtp1911006827-19) [ x:handbook] o.a.s.u.UpdateHandler Using UpdateLog implementation: org.apache.solr.update.UpdateLog
> 2017-10-11 15:29:05.101 INFO (qtp1911006827-19) [ x:handbook] o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536
> 2017-10-11 15:29:05.150 INFO (qtp1911006827-19) [ x:handbook] o.a.s.u.CommitTracker Hard AutoCommit: if uncommited for 15000ms;
> 2017-10-11 15:29:05.151 INFO (qtp1911006827-19) [ x:handbook]
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> the machine. I haven't played much with docker, any way you can get that
>> file from that location.

I see these files:

/opt/solr/server/logs/archived
/opt/solr/server/logs/solr_gc.log.0.current
/opt/solr/server/logs/solr.log
/opt/solr/server/solr/handbook/data/tlog

The 3rd one has very little info. Attached:

2017-10-11 15:28:09.564 INFO (main) [ ] o.e.j.s.Server jetty-9.3.14.v20161028
2017-10-11 15:28:10.668 INFO (main) [ ] o.a.s.s.SolrDispatchFilter  ___ _  Welcome to Apache Solr™ version 7.0.1
2017-10-11 15:28:10.669 INFO (main) [ ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _  Starting in standalone mode on port 8983
2017-10-11 15:28:10.670 INFO (main) [ ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_|  Install dir: /opt/solr, Default config dir: /opt/solr/server/solr/configsets/_default/conf
2017-10-11 15:28:10.707 INFO (main) [ ] o.a.s.s.SolrDispatchFilter |___/\___/_|_|  Start time: 2017-10-11T15:28:10.674Z
2017-10-11 15:28:10.747 INFO (main) [ ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /opt/solr/server/solr
2017-10-11 15:28:10.763 INFO (main) [ ] o.a.s.c.SolrXmlConfig Loading container configuration from /opt/solr/server/solr/solr.xml
2017-10-11 15:28:11.062 INFO (main) [ ] o.a.s.c.SolrResourceLoader [null] Added 0 libs to classloader, from paths: []
2017-10-11 15:28:12.514 INFO (main) [ ] o.a.s.c.CorePropertiesLocator Found 0 core definitions underneath /opt/solr/server/solr
2017-10-11 15:28:12.635 INFO (main) [ ] o.e.j.s.Server Started @4304ms
2017-10-11 15:29:00.971 INFO (qtp1911006827-13) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={wt=json} status=0 QTime=108
2017-10-11 15:29:01.080 INFO (qtp1911006827-18) [ ] o.a.s.c.TransientSolrCoreCacheDefault Allocating transient cache for 2147483647 transient cores
2017-10-11 15:29:01.083 INFO (qtp1911006827-18) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={core=handbook=STATUS=json} status=0 QTime=5
2017-10-11 15:29:01.194 INFO (qtp1911006827-19) [ ] o.a.s.h.a.CoreAdminOperation core create command name=handbook=CREATE=handbook=json
2017-10-11 15:29:01.342 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.SolrResourceLoader [handbook] Added 51 libs to classloader, from paths: [/opt/solr/contrib/clustering/lib, /opt/solr/contrib/extraction/lib, /opt/solr/contrib/langid/lib, /opt/solr/contrib/velocity/lib, /opt/solr/dist]
2017-10-11 15:29:01.504 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.SolrConfig Using Lucene MatchVersion: 7.0.1
2017-10-11 15:29:01.969 INFO (qtp1911006827-19) [ x:handbook] o.a.s.s.IndexSchema [handbook] Schema name=default-config
2017-10-11 15:29:03.678 INFO (qtp1911006827-19) [ x:handbook] o.a.s.s.IndexSchema Loaded schema default-config/1.6 with uniqueid field id
2017-10-11 15:29:03.806 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.CoreContainer Creating SolrCore 'handbook' using configuration from instancedir /opt/solr/server/solr/handbook, trusted=true
2017-10-11 15:29:03.853 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.SolrCore solr.RecoveryStrategy.Builder
2017-10-11 15:29:03.866 INFO (qtp1911006827-19) [ x:handbook] o.a.s.c.SolrCore [[handbook] ] Opening new SolrCore at [/opt/solr/server/solr/handbook], dataDir=[/opt/solr/server/solr/handbook/data/]
2017-10-11 15:29:04.180 INFO (qtp1911006827-19) [ x:handbook] o.a.s.r.XSLTResponseWriter xsltCacheLifetimeSeconds=5
2017-10-11 15:29:05.100 INFO (qtp1911006827-19) [ x:handbook] o.a.s.u.UpdateHandler Using UpdateLog implementation: org.apache.solr.update.UpdateLog
2017-10-11 15:29:05.101 INFO (qtp1911006827-19) [ x:handbook] o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536
2017-10-11 15:29:05.150 INFO (qtp1911006827-19) [ x:handbook] o.a.s.u.CommitTracker Hard AutoCommit: if uncommited for 15000ms;
2017-10-11 15:29:05.151 INFO (qtp1911006827-19) [ x:handbook] o.a.s.u.CommitTracker Soft AutoCommit: disabled
2017-10-11 15:29:05.199 INFO (qtp1911006827-19) [ x:handbook] o.a.s.s.SolrIndexSearcher Opening [Searcher@2b9fd97b[handbook] main]
2017-10-11 15:29:05.229 INFO (qtp1911006827-19) [ x:handbook] o.a.s.r.ManagedResourceStorage File-based storage initialized to use dir: /opt/solr/server/solr/handbook/conf
2017-10-11 15:29:05.266 INFO (qtp1911006827-19) [ x:handbook] o.a.s.h.c.SpellCheckComponent Initializing spell checkers
2017-10-11 15:29:05.283 INFO (qtp1911006827-19) [ x:handbook] o.a.s.s.DirectSolrSpellChecker init: {name=default,field=_text_,classname=solr.DirectSolrSpellChecker,distanceMeasure=internal,accuracy=0.5,maxEdits=2,minPrefix=1,maxInspections=5,minQueryLength=4,maxQueryFrequency=0.01}
2017-10-11 15:29:05.318 INFO (qtp1911006827-19) [ x:handbook]
Re: solr 7.0.1: exception running post to crawl simple website
pardon: [solr-home]/server/log/solr.log

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:10 PM, Amrit Sarkar wrote:
> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
> the machine. I haven't played much with docker, any way you can get that
> file from that location.
Re: solr 7.0.1: exception running post to crawl simple website
ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
the machine. I haven't played much with docker, any way you can get that
file from that location.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:08 PM, Kevin Layer wrote:
> Amrit Sarkar wrote:
>
> >> Hi Kevin,
> >>
> >> Can you post the solr log in the mail thread. I don't think it handled the
> >> .md by itself by first glance at code.
>
> How do I extract the log you want?
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Hi Kevin,
>>
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself by first glance at code.

Note that when I use the admin web interface and click on "Logging" on the
left, I just see a spinner that implies it's trying to retrieve the logs (I
see the headers "Time (Local) Level Core Logger Message"), but no log
entries. It's been like this for 10 minutes.
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Hi Kevin,
>>
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself by first glance at code.

How do I extract the log you want?
Re: solr 7.0.1: exception running post to crawl simple website
Hi Kevin,

Can you post the solr log in the mail thread. I don't think it handled the
.md by itself by first glance at code.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer wrote:
> Amrit Sarkar wrote:
>
> >> Kevin,
> >>
> >> Just put "html" too and give it a shot. These are the types it is
> expecting:
>
> Same thing.
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Kevin,
>>
>> Just put "html" too and give it a shot. These are the types it is expecting:

Same thing.

>> mimeMap = new HashMap<>();
>> mimeMap.put("xml", "application/xml");
>> mimeMap.put("csv", "text/csv");
>> mimeMap.put("json", "application/json");
>> mimeMap.put("jsonl", "application/json");
>> mimeMap.put("pdf", "application/pdf");
>> mimeMap.put("rtf", "text/rtf");
>> mimeMap.put("html", "text/html");
>> mimeMap.put("htm", "text/html");
>> mimeMap.put("doc", "application/msword");
>> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> mimeMap.put("xls", "application/vnd.ms-excel");
>> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("txt", "text/plain");
>> mimeMap.put("log", "text/plain");
>>
>> The keys are the types supported.
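[Editor's aside: the skip is consistent with the map quoted above, since "md" never appears as a key, so no content type can ever map back to it. A minimal standalone sketch of that lookup, with an abridged copy of the map and a hypothetical `extensionSupported` helper:]

```java
import java.util.HashMap;
import java.util.Map;

public class MimeMapCheck {
    // Abridged from the mimeMap quoted above; "md" is deliberately
    // absent, matching SimplePostTool's map.
    static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("htm", "text/html");
        MIME_MAP.put("txt", "text/plain");
        MIME_MAP.put("log", "text/plain");
    }

    // Hypothetical helper: is any MIME type registered for this extension?
    static boolean extensionSupported(String ext) {
        return MIME_MAP.containsKey(ext);
    }

    public static void main(String[] args) {
        System.out.println(extensionSupported("html")); // true
        System.out.println(extensionSupported("md"));   // false: -filetypes md can never match
    }
}
```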
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Reference to the code:
>>
>> .
>>
>> String rawContentType = conn.getContentType();
>> String type = rawContentType.split(";")[0];
>> if(typeSupported(type) || "*".equals(fileTypes)) {
>>   String encoding = conn.getContentEncoding();
>>
>> .
>>
>> protected boolean typeSupported(String type) {
>>   for(String key : mimeMap.keySet()) {
>>     if(mimeMap.get(key).equals(type)) {
>>       if(fileTypes.contains(key))
>>         return true;
>>     }
>>   }
>>   return false;
>> }
>>
>> .
>>
>> It has another check for fileTypes, I can see the page ending with .md
>> (which you are indexing) and not .html. Let's hope now this is not the
>> issue.

Did you see the "-filetypes md" at the end of the post command line? Shouldn't that handle it?

Kevin
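[Editor's aside: the `typeSupported` check quoted above answers Kevin's question. It requires both that the response's Content-Type appear as a value in mimeMap and that the corresponding key be in the `-filetypes` list, so `-filetypes md` can never pass for a `text/html` page. A self-contained sketch of that logic, with an abridged map:]

```java
import java.util.HashMap;
import java.util.Map;

public class TypeSupportedDemo {
    static final Map<String, String> mimeMap = new HashMap<>();
    static {
        // Abridged from the full mimeMap quoted earlier; there is no "md" key.
        mimeMap.put("html", "text/html");
        mimeMap.put("htm", "text/html");
        mimeMap.put("txt", "text/plain");
    }

    // Same logic as the quoted SimplePostTool snippet: the content type must
    // map back to an extension that is also listed in fileTypes.
    static boolean typeSupported(String type, String fileTypes) {
        for (String key : mimeMap.keySet()) {
            if (mimeMap.get(key).equals(type) && fileTypes.contains(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The server returns text/html, but "md" maps to no MIME type, so
        // the page is skipped -- exactly the WARNING in Kevin's transcript.
        System.out.println(typeSupported("text/html", "md"));   // false
        System.out.println(typeSupported("text/html", "html")); // true
    }
}
```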
Re: solr 7.0.1: exception running post to crawl simple website
Ah!

Only supported type is: text/html; encoding=utf-8

I am not confident of this either :) but this should work.

See the code-snippet below:

..

if(res.httpStatus == 200) {
  // Raw content type of form "text/html; encoding=utf-8"
  String rawContentType = conn.getContentType();
  String type = rawContentType.split(";")[0];
  if(typeSupported(type) || "*".equals(fileTypes)) {
    String encoding = conn.getContentEncoding();

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer wrote:

> Amrit Sarkar wrote:
>
> >> Strange,
> >>
> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
> >> Content-Type. Let's see what it says now.
>
> Same thing. Verified Content-Type:
>
> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
> Content-Type: text/html;charset=utf-8
> quadra[git:master]$ ]
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> 0 web pages indexed.
> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> Time spent: 0:00:00.531
> quadra[git:master]$
>
> Kevin
>
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer wrote:
> >>
> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
> >> >
> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >
> >> > What is it expecting?
> >> >
> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> >> > SimplePostTool version 5.0.0
> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> >> > Entering recursive mode, depth=10, delay=0s
> >> > Entering crawl at level 0 (1 links total, 1 new)
> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> >> > 0 web pages indexed.
> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> >> > Time spent: 0:00:03.882
> >> > $
> >> >
> >> > Thanks.
> >> >
> >> > Kevin
> >> >
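[Editor's note: the quoted snippet strips any parameter such as "charset=utf-8" before the type lookup. A standalone sketch of just that parsing step; the class and method names are mine, not Solr's:]

```java
public class ContentTypeParse {
    // Mirrors the quoted one-liner: keep only the media type,
    // dropping parameters such as "charset=utf-8".
    static String mediaType(String rawContentType) {
        return rawContentType.split(";")[0].trim();
    }

    public static void main(String[] args) {
        System.out.println(mediaType("text/html;charset=utf-8")); // text/html
        System.out.println(mediaType("application/json"));        // application/json
    }
}
```

So whether markserv sends "text/html" or "text/html;charset=utf-8", the value compared against the supported types is the same bare "text/html".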
Re: solr 7.0.1: exception running post to crawl simple website
Kevin,

Just put "html" too and give it a shot. These are the types it is expecting:

mimeMap = new HashMap<>();
mimeMap.put("xml", "application/xml");
mimeMap.put("csv", "text/csv");
mimeMap.put("json", "application/json");
mimeMap.put("jsonl", "application/json");
mimeMap.put("pdf", "application/pdf");
mimeMap.put("rtf", "text/rtf");
mimeMap.put("html", "text/html");
mimeMap.put("htm", "text/html");
mimeMap.put("doc", "application/msword");
mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
mimeMap.put("ppt", "application/vnd.ms-powerpoint");
mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
mimeMap.put("xls", "application/vnd.ms-excel");
mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("txt", "text/plain");
mimeMap.put("log", "text/plain");

The keys are the types supported.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar wrote:

> Ah!
>
> Only supported type is: text/html; encoding=utf-8
>
> I am not confident of this either :) but this should work.
>
> See the code-snippet below:
>
> ..
>
> if(res.httpStatus == 200) {
>   // Raw content type of form "text/html; encoding=utf-8"
>   String rawContentType = conn.getContentType();
>   String type = rawContentType.split(";")[0];
>   if(typeSupported(type) || "*".equals(fileTypes)) {
>     String encoding = conn.getContentEncoding();
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer wrote:
>
>> Amrit Sarkar wrote:
>>
>> >> Strange,
>> >>
>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >> Content-Type. Let's see what it says now.
>>
>> Same thing. Verified Content-Type:
>>
>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>> Content-Type: text/html;charset=utf-8
>> quadra[git:master]$ ]
>>
>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> SimplePostTool version 5.0.0
>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> Entering recursive mode, depth=10, delay=0s
>> Entering crawl at level 0 (1 links total, 1 new)
>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> 0 web pages indexed.
>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> Time spent: 0:00:00.531
>> quadra[git:master]$
>>
>> Kevin
>>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com
>> >> Twitter http://twitter.com/lucidworks
>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>
>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer wrote:
>> >>
>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >> >
>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> >
>> >> > What is it expecting?
>> >> >
>> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >> > SimplePostTool version 5.0.0
>> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >> > Entering auto mode. Indexing pages with content-types corresponding to
>> >> >
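[Editor's note: the table quoted above keys on file endings, and "md" is not among them. A trimmed-down, standalone sketch (a subset of the quoted map; class name is mine) showing why markdown never matches:]

```java
import java.util.HashMap;
import java.util.Map;

public class MimeMapSketch {
    // A subset of the extension-to-MIME table quoted above;
    // keys are the file endings bin/post knows about.
    static final Map<String, String> mimeMap = new HashMap<>();
    static {
        mimeMap.put("html", "text/html");
        mimeMap.put("htm",  "text/html");
        mimeMap.put("xml",  "application/xml");
        mimeMap.put("txt",  "text/plain");
    }

    public static void main(String[] args) {
        System.out.println(mimeMap.containsKey("html")); // true
        System.out.println(mimeMap.containsKey("md"));   // false: "md" is not a known ending
    }
}
```

Since "md" has no entry, `-filetypes md` can never be matched against any served Content-Type; hence the advice to add "html" to the list.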
Re: solr 7.0.1: exception running post to crawl simple website
Reference to the code:

.

String rawContentType = conn.getContentType();
String type = rawContentType.split(";")[0];
if(typeSupported(type) || "*".equals(fileTypes)) {
  String encoding = conn.getContentEncoding();

.

protected boolean typeSupported(String type) {
  for(String key : mimeMap.keySet()) {
    if(mimeMap.get(key).equals(type)) {
      if(fileTypes.contains(key)) return true;
    }
  }
  return false;
}

.

It has another check for fileTypes; I can see the page ending with .md (which you are indexing) and not .html. Let's hope now this is not the issue.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 7:04 PM, Amrit Sarkar wrote:

> Kevin,
>
> Just put "html" too and give it a shot. These are the types it is expecting:
>
> mimeMap = new HashMap<>();
> mimeMap.put("xml", "application/xml");
> mimeMap.put("csv", "text/csv");
> mimeMap.put("json", "application/json");
> mimeMap.put("jsonl", "application/json");
> mimeMap.put("pdf", "application/pdf");
> mimeMap.put("rtf", "text/rtf");
> mimeMap.put("html", "text/html");
> mimeMap.put("htm", "text/html");
> mimeMap.put("doc", "application/msword");
> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
> mimeMap.put("xls", "application/vnd.ms-excel");
> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
> mimeMap.put("txt", "text/plain");
> mimeMap.put("log", "text/plain");
>
> The keys are the types supported.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar wrote:
>
>> Ah!
>>
>> Only supported type is: text/html; encoding=utf-8
>>
>> I am not confident of this either :) but this should work.
>>
>> See the code-snippet below:
>>
>> ..
>>
>> if(res.httpStatus == 200) {
>>   // Raw content type of form "text/html; encoding=utf-8"
>>   String rawContentType = conn.getContentType();
>>   String type = rawContentType.split(";")[0];
>>   if(typeSupported(type) || "*".equals(fileTypes)) {
>>     String encoding = conn.getContentEncoding();
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer wrote:
>>
>>> Amrit Sarkar wrote:
>>>
>>> >> Strange,
>>> >>
>>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>>> >> Content-Type. Let's see what it says now.
>>>
>>> Same thing. Verified Content-Type:
>>>
>>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>>> Content-Type: text/html;charset=utf-8
>>> quadra[git:master]$ ]
>>>
>>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>>> SimplePostTool version 5.0.0
>>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>>> Entering auto mode. Indexing pages with content-types corresponding to file endings md
>>> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>>> Entering recursive mode, depth=10, delay=0s
>>> Entering crawl at level 0 (1 links total, 1 new)
>>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>>> 0 web pages indexed.
>>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>>> Time spent: 0:00:00.531
>>> quadra[git:master]$
>>>
>>> Kevin
>>>
>>> >>
>>> >> Amrit Sarkar
>>> >> Search Engineer
>>> >> Lucidworks, Inc.
>>> >> 415-589-9269
>>> >>
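[Editor's note: putting the two quoted pieces together explains the "Skipping URL with unsupported type text/html" warning when only `-filetypes md` is given: typeSupported() maps the served Content-Type back to file endings via mimeMap and then requires one of those endings to appear in fileTypes. A self-contained re-implementation of that check, assumed faithful to the quoted code; class name and the fileTypes parameter are mine:]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class TypeSupportedSketch {
    // Subset of the quoted mimeMap: file ending -> MIME type.
    static final Map<String, String> mimeMap = new HashMap<>();
    static {
        mimeMap.put("html", "text/html");
        mimeMap.put("htm",  "text/html");
        mimeMap.put("txt",  "text/plain");
    }

    // Same logic as the quoted typeSupported(): a served MIME type is
    // accepted only if some file ending that maps to it was listed
    // via -filetypes.
    static boolean typeSupported(String type, Set<String> fileTypes) {
        for (String key : mimeMap.keySet()) {
            if (mimeMap.get(key).equals(type) && fileTypes.contains(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "-filetypes md": text/html maps to "html"/"htm", neither is listed -> skipped
        System.out.println(typeSupported("text/html", Set.of("md")));         // false
        // "-filetypes md,html": "html" is now listed -> indexed
        System.out.println(typeSupported("text/html", Set.of("md", "html"))); // true
    }
}
```

This matches the observed behavior: the page is served as text/html (Kevin's wget confirms it), but `-filetypes md` contains neither "html" nor "htm", so the page is skipped with a 415.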
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Strange,
>>
>> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> Content-Type. Let's see what it says now.

Same thing. Verified Content-Type:

quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
Content-Type: text/html;charset=utf-8
quadra[git:master]$ ]

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:00.531
quadra[git:master]$

Kevin

>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer wrote:
>>
>> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >
>> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >
>> > What is it expecting?
>> >
>> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> > SimplePostTool version 5.0.0
>> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> > Entering recursive mode, depth=10, delay=0s
>> > Entering crawl at level 0 (1 links total, 1 new)
>> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> > 0 web pages indexed.
>> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> > Time spent: 0:00:03.882
>> > $
>> >
>> > Thanks.
>> >
>> > Kevin
>> >
Re: solr 7.0.1: exception running post to crawl simple website
Strange,

Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's Content-Type. Let's see what it says now.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer wrote:

> OK, so I hacked markserv to add Content-Type text/html, but now I get
>
> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>
> What is it expecting?
>
> $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> 0 web pages indexed.
> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> Time spent: 0:00:03.882
> $
>
> Thanks.
>
> Kevin
>
Re: solr 7.0.1: exception running post to crawl simple website
OK, so I hacked markserv to add Content-Type text/html, but now I get

SimplePostTool: WARNING: Skipping URL with unsupported type text/html

What is it expecting?

$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
0 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:03.882
$

Thanks.

Kevin
Re: solr 7.0.1: exception running post to crawl simple website
Amrit Sarkar wrote:

>> Kevin,
>>
>> You are getting NPE at:
>>
>> String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL
>>
>> // related code
>>
>> String rawContentType = conn.getContentType();
>>
>> public String getContentType() {
>>   return getHeaderField("content-type");
>> }
>>
>> HttpURLConnection conn = (HttpURLConnection) u.openConnection();
>>
>> Can you check at your webpage level headers are properly set and it
>> has key "content-type".

Amrit, this is markserv, and I just used wget to prove you are correct: there is no Content-Type header. Thanks for the help!

I'll see if I can hack markserv to add that, and try again.

Kevin
Re: solr 7.0.1: exception running post to crawl simple website
Kevin,

You are getting NPE at:

String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL

// related code

String rawContentType = conn.getContentType();

public String getContentType() {
  return getHeaderField("content-type");
}

HttpURLConnection conn = (HttpURLConnection) u.openConnection();

Can you check at your webpage level headers are properly set and it has key "content-type".

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Wed, Oct 11, 2017 at 9:08 PM, Kevin Layer wrote:

> I want to use solr to index a markdown website. The files
> are in native markdown, but they are served in HTML (by markserv).
>
> Here's what I did:
>
> docker run --name solr -d -p 8983:8983 -t solr
> docker exec -it --user=solr solr bin/solr create_core -c handbook
>
> Then, to crawl the site:
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.solr.util.SimplePostTool$PageFetcher.readPageFromUrl(SimplePostTool.java:1138)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:603)
>         at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>         at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>         at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>         at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> quadra[git:master]$
>
> Any ideas on what I did wrong?
>
> Thanks.
>
> Kevin
>
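[Editor's note: the failure mode Amrit points to is easy to reproduce in isolation: getContentType() returns null when the server sends no Content-Type header, and calling split() on null throws. A minimal standalone sketch, not Solr code; the stub stands in for conn.getContentType() on a headerless response:]

```java
public class NullContentType {
    // Stand-in for conn.getContentType() when the response has no
    // Content-Type header (as markserv's responses did).
    static String getContentType() {
        return null;
    }

    public static void main(String[] args) {
        String rawContentType = getContentType();
        try {
            // Same expression that blows up in readPageFromUrl.
            String type = rawContentType.split(";")[0];
            System.out.println(type);
        } catch (NullPointerException e) {
            System.out.println("NPE: missing Content-Type header");
        }
    }
}
```

A null check before the split (falling back to skipping the page with a warning) would turn the stack trace into a diagnosable message.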
solr 7.0.1: exception running post to crawl simple website
I want to use solr to index a markdown website. The files are in native markdown, but they are served in HTML (by markserv).

Here's what I did:

docker run --name solr -d -p 8983:8983 -t solr
docker exec -it --user=solr solr bin/solr create_core -c handbook

Then, to crawl the site:

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
Exception in thread "main" java.lang.NullPointerException
        at org.apache.solr.util.SimplePostTool$PageFetcher.readPageFromUrl(SimplePostTool.java:1138)
        at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:603)
        at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
        at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
        at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
        at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
quadra[git:master]$

Any ideas on what I did wrong?

Thanks.

Kevin