Re: solr 7.0.1: exception running post to crawl simple website

2017-10-27 Thread Cassandra Targett
Toby,

Your mention of "-recursive" causing a problem reminded me of a simple
crawl (of the 7.0 Ref Guide) using bin/post I was trying to get to
work the other day and couldn't.

The order of the parameters seems to make a difference with what error
you get (this is using 7.1):

1. "./bin/post -c gettingstarted -delay 10
https://lucene.apache.org/solr/guide/7_0 -recursive"

yields the stack trace in the previous message:

POSTed web resource https://lucene.apache.org/solr/guide/7_0 (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
... 5 more

2. "./bin/post -c gettingstarted -delay 10 -recursive
https://lucene.apache.org/solr/guide/7_0"

yields:

No files, directories, URLs, -d strings, or stdin were specified.

See './bin/post -h' for usage instructions.

3. "./bin/post -c gettingstarted
http://lucene.apache.org/solr/guide/7_0 -recursive -delay 10"

yields:

Unrecognized argument: 10

If this was intended to be a data file, it does not exist relative to
/Applications/Solr/solr-7.1.0

4. "./bin/post -c gettingstarted -delay 10
https://lucene.apache.org/solr/guide/7_0"

successfully gets the document, but only the single page at that URL.
It does not extract any of the content of the page besides the title
and metadata Tika adds.

I'd say we should probably file a JIRA for it. If the parsing is wrong
(as it seems to me to be), that's a separate problem, but the fact that
you can't use -recursive at all is a bug AFAICT. The pattern above is
consistent with -recursive greedily consuming whatever token follows it:
in case 2 it swallows the URL (hence "No files, directories, URLs ...
were specified"), and in case 3 it swallows -delay, leaving the orphaned
"10" unrecognized.
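
As a toy illustration (this is NOT the actual bin/post code, just a sketch
of a parser whose -recursive flag greedily eats the next token), the
following reproduces both error shapes from cases 2 and 3:

import java.util.ArrayList;
import java.util.List;

public class GreedyFlagSketch {
    public static void main(String[] args) {
        String url = "https://lucene.apache.org/solr/guide/7_0";
        String[][] cases = {
            {"-delay", "10", "-recursive", url},  // case 2 above
            {url, "-recursive", "-delay", "10"},  // case 3 above
        };
        for (String[] argv : cases) {
            List<String> urls = new ArrayList<>();
            List<String> unrecognized = new ArrayList<>();
            for (int i = 0; i < argv.length; i++) {
                if (argv[i].equals("-recursive") || argv[i].equals("-delay")) {
                    i++;  // swallow the next token as the flag's value, whatever it is
                } else if (argv[i].startsWith("http")) {
                    urls.add(argv[i]);
                } else {
                    unrecognized.add(argv[i]);
                }
            }
            // case 2 prints urls=[]           -> "No files, directories, URLs ... were specified."
            // case 3 prints unrecognized=[10] -> "Unrecognized argument: 10"
            System.out.println("urls=" + urls + " unrecognized=" + unrecognized);
        }
    }
}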

Cassandra

On Fri, Oct 27, 2017 at 11:03 AM, toby1851  wrote:
> Amrit Sarkar wrote
>> The above is SAXParse, runtime exception. Nothing can be done at Solr end
>> except curating your own data.
>
> I'm trying to replace a solr-4.6.0 system (which has been working
> brilliantly for 3 years!) with solr-7.1.0. I'm running into this exact same
> problem.
>
> I do not believe it is a data curation problem. (Even if it were, it's very
> unfriendly just to bomb out with a stack trace. And it's seriously annoying
> that there's a 14-line error message about a parsing problem, but it
> entirely neglects to mention what it was trying to parse! Was it a file, a
> URL...?)
>
> Anyway, the symptoms I'm seeing are that a simple "post -c foo https://..."
> works fine. But the moment I turn on recursion, it fails before fetching a
> second page. It doesn't matter what the first page is. Really: when I made
> no progress with the site that I'm actually trying to index, I tried another
> of my sites, then Google, then eBay... In every case, I get something like
> this:
>
> $ post -c mycollection https://www.ebay.co.uk -recursive 1 -delay 10
> ...
> POSTed web resource https://www.ebay.co.uk (depth: 0)
> ... [ 10s delay ]
> [Fatal Error] :1:1: Content is not allowed in prolog.
> ...
>
> I've been looking at the code, and also at what's going on with strace. As far as
> I can see, at the point where the exception occurs, we are parsing data (a
> copy of the page, presumably) that has come from the solr server itself.
> That appears to be a chunk of JSON with embedded XML. The inner XML does
> look to at least start correctly. The fact that we're getting an error at
> line 1 column 1 every single time makes me suspect that we're feeding the
> wrong thing to the SAX parser.
>
> Anyway, I'm going to go and look at Nutch as I need something working very
> soon.
>
> But could somebody who is familiar with this code take another look?
>
> Cheers,
>
> Toby.
>
>
>
>


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-27 Thread toby1851
Amrit Sarkar wrote
> The above is SAXParse, runtime exception. Nothing can be done at Solr end
> except curating your own data.

I'm trying to replace a solr-4.6.0 system (which has been working
brilliantly for 3 years!) with solr-7.1.0. I'm running into this exact same
problem.

I do not believe it is a data curation problem. (Even if it were, it's very
unfriendly just to bomb out with a stack trace. And it's seriously annoying
that there's a 14-line error message about a parsing problem, but it
entirely neglects to mention what it was trying to parse! Was it a file, a
URL...?)

Anyway, the symptoms I'm seeing are that a simple "post -c foo https://..."
works fine. But the moment I turn on recursion, it fails before fetching a
second page. It doesn't matter what the first page is. Really: when I made
no progress with the site that I'm actually trying to index, I tried another
of my sites, then Google, then eBay... In every case, I get something like
this:

$ post -c mycollection https://www.ebay.co.uk -recursive 1 -delay 10
...
POSTed web resource https://www.ebay.co.uk (depth: 0)
... [ 10s delay ]
[Fatal Error] :1:1: Content is not allowed in prolog.
...

I've been looking at the code, and also at what's going on with strace. As far as
I can see, at the point where the exception occurs, we are parsing data (a
copy of the page, presumably) that has come from the solr server itself.
That appears to be a chunk of JSON with embedded XML. The inner XML does
look to at least start correctly. The fact that we're getting an error at
line 1 column 1 every single time makes me suspect that we're feeding the
wrong thing to the SAX parser.
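
For illustration, a minimal, self-contained sketch of that failure mode:
feeding JSON (or anything that isn't XML) to a DOM parser fails at exactly
line 1, column 1 with this message. The JSON string here is made up; only
the error shape matters.

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class PrologRepro {
    public static void main(String[] args) throws Exception {
        // Any non-XML input (here, a made-up JSON body) dies immediately with:
        // [Fatal Error] :1:1: Content is not allowed in prolog.
        String notXml = "{\"responseHeader\":{\"status\":0}}";
        DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(notXml.getBytes(StandardCharsets.UTF_8)));
    }
}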

Anyway, I'm going to go and look at Nutch as I need something working very
soon.

But could somebody who is familiar with this code take another look? 

Cheers,

Toby.






Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Rick Leir

On 2017-10-13 04:19 PM, Kevin Layer wrote:
> Amrit Sarkar wrote:
>
>> Kevin,
>>
>> fileType => md is not recognizable format in SimplePostTool, anyway, moving
>> on.
>
> OK, thanks.  Looks like I'll have to abandon using solr for this
> project (or find another way to crawl the site).
>
> Thank you for all the help, though.  I appreciate it.

Ha, these messages crash my Android mail client!  Now...

Did you try Nutch? Or the Norconex HTTP crawler? Tika? Or any Python
crawler, posting its documents to the Solr API.
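
If you roll your own crawler, the posting side is small with SolrJ. A
sketch, assuming a core named "handbook" on localhost; the field names are
placeholders ("_text_" is the catch-all field in the _default configset):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PostOnePage {
    public static void main(String[] args) throws Exception {
        // Base URL and core name are assumptions; point at your own core.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/handbook").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "https://www.example.com/page.html");  // page URL as the id
            doc.addField("_text_", "...text extracted from the fetched page...");
            solr.add(doc);
            solr.commit();
        }
    }
}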

cheers -- Rick


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> fileType => md is not recognizable format in SimplePostTool, anyway, moving
>> on.

OK, thanks.  Looks like I'll have to abandon using solr for this
project (or find another way to crawl the site).

Thank you for all the help, though.  I appreciate it.

>> The above is SAXParse, runtime exception. Nothing can be done at Solr end
>> except curating your own data.
>> Some helpful links:
>> https://stackoverflow.com/questions/2599919/java-parsing-xml-document-gives-content-not-allowed-in-prolog-error
>> https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 8:48 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> Kevin,
>> > >>
>> > >> I am not able to replicate the issue on my system, which is a bit annoying
>> > >> for me. Try this out one last time:
>> > >>
>> > >> docker exec -it --user=solr solr bin/post -c handbook
>> > >> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0
>> > -filetypes html
>> > >>
>> > >> and have Content-Type: "html" and "text/html", try with both.
>> >
>> > With text/html and your command I get
>> >
>> > quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
>> > http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes
>> > html
>> > /docker-java-home/jre/bin/java -classpath 
>> > /opt/solr/dist/solr-core-7.0.1.jar
>> > -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook
>> > -Ddata=web org.apache.solr.util.SimplePostTool
>> > http://quadra.franz.com:9091/index.md
>> > SimplePostTool version 5.0.0
>> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> > Entering auto mode. Indexing pages with content-types corresponding to
>> > file endings html
>> > SimplePostTool: WARNING: Never crawl an external web site faster than
>> > every 10 seconds, your IP will probably be blocked
>> > Entering recursive mode, depth=10, delay=0s
>> > Entering crawl at level 0 (1 links total, 1 new)
>> > POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
>> > [Fatal Error] :1:1: Content is not allowed in prolog.
>> > Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>> > at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
>> > at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
>> > at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>> > at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>> > at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>> > at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
>> > Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
>> > at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
>> > at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
>> > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>> > at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
>> > at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
>> > ... 5 more
>> >
>> >
>> > When I use "-filetype md", I'm back to the regular output that doesn't scan
>> > anything.
>> >
>> >
>> > >>
>> > >> If you get past this hurdle, let me know.
>> > >>
>> > >> Amrit Sarkar
>> > >> Search Engineer
>> > >> Lucidworks, Inc.
>> > >> 415-589-9269
>> > >> www.lucidworks.com
>> > >> Twitter http://twitter.com/lucidworks
>> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >>
>> > >> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:
>> > >>
>> > >> > Amrit Sarkar wrote:
>> > >> >
>> > >> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> > >> > >> the machine. I haven't played much with docker, any way you can get that
>> > >> > >> file from that location.
>> > >> >
>> > >> > I see these files:
>> > >> >
>> > >> > /opt/solr/server/logs/archived
>> > >> > /opt/solr/server/logs/solr_gc.log.0.current
>> > >> > /opt/solr/server/logs/solr.log
>> > >> > /opt/solr/server/solr/handbook/data/tlog
>> > >> >
>> > >> > The 3rd one has very little info.  Attached:
>> > >> >
>> > >> >
>> > >> > 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
>> > >> > 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,

fileType => md is not recognizable format in SimplePostTool, anyway, moving
on.

The above is SAXParse, runtime exception. Nothing can be done at Solr end
except curating your own data.
Some helpful links:
https://stackoverflow.com/questions/2599919/java-parsing-xml-document-gives-content-not-allowed-in-prolog-error
https://stackoverflow.com/questions/3030903/content-is-not-allowed-in-prolog-when-parsing-perfectly-valid-xml-on-gae

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:48 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> Kevin,
> >>
> >> I am not able to replicate the issue on my system, which is a bit annoying
> >> for me. Try this out one last time:
> >>
> >> docker exec -it --user=solr solr bin/post -c handbook
> >> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0
> -filetypes html
> >>
> >> and have Content-Type: "html" and "text/html", try with both.
>
> With text/html and your command I get
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes
> html
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook
> -Ddata=web org.apache.solr.util.SimplePostTool
> http://quadra.franz.com:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to
> file endings html
> SimplePostTool: WARNING: Never crawl an external web site faster than
> every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
> [Fatal Error] :1:1: Content is not allowed in prolog.
> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
> at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
> at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
> at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
> at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
> at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
> at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
> at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
> at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
> at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
> at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
> ... 5 more
>
>
> When I use "-filetype md", I'm back to the regular output that doesn't scan
> anything.
>
>
> >>
> >> If you get past this hurdle, let me know.
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:
> >>
> >> > Amrit Sarkar wrote:
> >> >
> >> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
> >> > >> the machine. I haven't played much with docker, any way you can get that
> >> > >> file from that location.
> >> >
> >> > I see these files:
> >> >
> >> > /opt/solr/server/logs/archived
> >> > /opt/solr/server/logs/solr_gc.log.0.current
> >> > /opt/solr/server/logs/solr.log
> >> > /opt/solr/server/solr/handbook/data/tlog
> >> >
> >> > The 3rd one has very little info.  Attached:
> >> >
> >> >
> >> > 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
> >> > jetty-9.3.14.v20161028
> >> > 2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> >> > ___  _   Welcome to Apache Solr™ version 7.0.1
> >> > 2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> /
> >> > __| ___| |_ _   Starting in standalone mode on port 8983
> >> > 2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> \__
> >> > \/ _ \ | '_|  Install dir: /opt/solr, Default config dir:
> >> > /opt/solr/server/solr/configsets/_default/conf
> >> > 2017-10-11 15:28:10.707 INFO  (main) [   ] 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> I am not able to replicate the issue on my system, which is a bit annoying
>> for me. Try this out one last time:
>> 
>> docker exec -it --user=solr solr bin/post -c handbook
>> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
>> 
>> and have Content-Type: "html" and "text/html", try with both.

With text/html and your command I get

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=html -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings html
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://quadra.franz.com:9091/index.md (depth: 0)
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:257)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
... 5 more


When I use "-filetype md", I'm back to the regular output that doesn't scan
anything.


>> 
>> If you get past this hurdle, let me know.
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> > >> the machine. I haven't played much with docker, any way you can get that
>> > >> file from that location.
>> >
>> > I see these files:
>> >
>> > /opt/solr/server/logs/archived
>> > /opt/solr/server/logs/solr_gc.log.0.current
>> > /opt/solr/server/logs/solr.log
>> > /opt/solr/server/solr/handbook/data/tlog
>> >
>> > The 3rd one has very little info.  Attached:
>> >
>> >
>> > 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
>> > jetty-9.3.14.v20161028
>> > 2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
>> > ___  _   Welcome to Apache Solr™ version 7.0.1
>> > 2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter /
>> > __| ___| |_ _   Starting in standalone mode on port 8983
>> > 2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__
>> > \/ _ \ | '_|  Install dir: /opt/solr, Default config dir:
>> > /opt/solr/server/solr/configsets/_default/conf
>> > 2017-10-11 15:28:10.707 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
>> > |___/\___/_|_|Start time: 2017-10-11T15:28:10.674Z
>> > 2017-10-11 15:28:10.747 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
>> > Using system property solr.solr.home: /opt/solr/server/solr
>> > 2017-10-11 15:28:10.763 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading
>> > container configuration from /opt/solr/server/solr/solr.xml
>> > 2017-10-11 15:28:11.062 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
>> > [null] Added 0 libs to classloader, from paths: []
>> > 2017-10-11 15:28:12.514 INFO  (main) [   ] o.a.s.c.CorePropertiesLocator
>> > Found 0 core definitions underneath /opt/solr/server/solr
>> > 2017-10-11 15:28:12.635 INFO  (main) [   ] o.e.j.s.Server Started @4304ms
>> > 2017-10-11 15:29:00.971 INFO  (qtp1911006827-13) [   ]
>> > o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system
>> > params={wt=json} status=0 QTime=108

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,

I am not able to replicate the issue on my system, which is a bit annoying
for me. Try this out one last time:

docker exec -it --user=solr solr bin/post -c handbook
http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes html

and have Content-Type: "html" and "text/html", try with both.

If you get past this hurdle, let me know.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:22 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
> >> the machine. I haven't played much with docker, any way you can get that
> >> file from that location.
>
> I see these files:
>
> /opt/solr/server/logs/archived
> /opt/solr/server/logs/solr_gc.log.0.current
> /opt/solr/server/logs/solr.log
> /opt/solr/server/solr/handbook/data/tlog
>
> The 3rd one has very little info.  Attached:
>
>
> 2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server
> jetty-9.3.14.v20161028
> 2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> ___  _   Welcome to Apache Solr™ version 7.0.1
> 2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter /
> __| ___| |_ _   Starting in standalone mode on port 8983
> 2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__
> \/ _ \ | '_|  Install dir: /opt/solr, Default config dir:
> /opt/solr/server/solr/configsets/_default/conf
> 2017-10-11 15:28:10.707 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter
> |___/\___/_|_|Start time: 2017-10-11T15:28:10.674Z
> 2017-10-11 15:28:10.747 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
> Using system property solr.solr.home: /opt/solr/server/solr
> 2017-10-11 15:28:10.763 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading
> container configuration from /opt/solr/server/solr/solr.xml
> 2017-10-11 15:28:11.062 INFO  (main) [   ] o.a.s.c.SolrResourceLoader
> [null] Added 0 libs to classloader, from paths: []
> 2017-10-11 15:28:12.514 INFO  (main) [   ] o.a.s.c.CorePropertiesLocator
> Found 0 core definitions underneath /opt/solr/server/solr
> 2017-10-11 15:28:12.635 INFO  (main) [   ] o.e.j.s.Server Started @4304ms
> 2017-10-11 15:29:00.971 INFO  (qtp1911006827-13) [   ]
> o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system
> params={wt=json} status=0 QTime=108
> 2017-10-11 15:29:01.080 INFO  (qtp1911006827-18) [   ] 
> o.a.s.c.TransientSolrCoreCacheDefault
> Allocating transient cache for 2147483647 transient cores
> 2017-10-11 15:29:01.083 INFO  (qtp1911006827-18) [   ]
> o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores
> params={core=handbook&action=STATUS&wt=json} status=0 QTime=5
> 2017-10-11 15:29:01.194 INFO  (qtp1911006827-19) [   ]
> o.a.s.h.a.CoreAdminOperation core create command
> name=handbook&action=CREATE&instanceDir=handbook&wt=json
> 2017-10-11 15:29:01.342 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.SolrResourceLoader [handbook] Added 51 libs to classloader, from
> paths: [/opt/solr/contrib/clustering/lib, /opt/solr/contrib/extraction/lib,
> /opt/solr/contrib/langid/lib, /opt/solr/contrib/velocity/lib,
> /opt/solr/dist]
> 2017-10-11 15:29:01.504 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.SolrConfig Using Lucene MatchVersion: 7.0.1
> 2017-10-11 15:29:01.969 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.s.IndexSchema [handbook] Schema name=default-config
> 2017-10-11 15:29:03.678 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.s.IndexSchema Loaded schema default-config/1.6 with uniqueid field id
> 2017-10-11 15:29:03.806 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.CoreContainer Creating SolrCore 'handbook' using configuration from
> instancedir /opt/solr/server/solr/handbook, trusted=true
> 2017-10-11 15:29:03.853 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.SolrCore solr.RecoveryStrategy.Builder
> 2017-10-11 15:29:03.866 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.c.SolrCore [[handbook] ] Opening new SolrCore at
> [/opt/solr/server/solr/handbook], dataDir=[/opt/solr/server/
> solr/handbook/data/]
> 2017-10-11 15:29:04.180 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.r.XSLTResponseWriter xsltCacheLifetimeSeconds=5
> 2017-10-11 15:29:05.100 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.u.UpdateHandler Using UpdateLog implementation:
> org.apache.solr.update.UpdateLog
> 2017-10-11 15:29:05.101 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH
> numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536
> 2017-10-11 15:29:05.150 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.u.CommitTracker Hard AutoCommit: if uncommited for 15000ms;
> 2017-10-11 15:29:05.151 INFO  (qtp1911006827-19) [   x:handbook]
> o.a.s.u.CommitTracker Soft AutoCommit: disabled
> 2017-10-11 15:29:05.199 INFO  (qtp1911006827-19) [   x:handbook]
> 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
>> the machine. I haven't played much with docker, any way you can get that
>> file from that location.

I see these files:

/opt/solr/server/logs/archived
/opt/solr/server/logs/solr_gc.log.0.current
/opt/solr/server/logs/solr.log
/opt/solr/server/solr/handbook/data/tlog

The 3rd one has very little info.  Attached:

2017-10-11 15:28:09.564 INFO  (main) [   ] o.e.j.s.Server jetty-9.3.14.v20161028
2017-10-11 15:28:10.668 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter  ___  
_   Welcome to Apache Solr™ version 7.0.1
2017-10-11 15:28:10.669 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter / __| 
___| |_ _   Starting in standalone mode on port 8983
2017-10-11 15:28:10.670 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter \__ \/ _ 
\ | '_|  Install dir: /opt/solr, Default config dir: 
/opt/solr/server/solr/configsets/_default/conf
2017-10-11 15:28:10.707 INFO  (main) [   ] o.a.s.s.SolrDispatchFilter 
|___/\___/_|_|Start time: 2017-10-11T15:28:10.674Z
2017-10-11 15:28:10.747 INFO  (main) [   ] o.a.s.c.SolrResourceLoader Using 
system property solr.solr.home: /opt/solr/server/solr
2017-10-11 15:28:10.763 INFO  (main) [   ] o.a.s.c.SolrXmlConfig Loading 
container configuration from /opt/solr/server/solr/solr.xml
2017-10-11 15:28:11.062 INFO  (main) [   ] o.a.s.c.SolrResourceLoader [null] 
Added 0 libs to classloader, from paths: []
2017-10-11 15:28:12.514 INFO  (main) [   ] o.a.s.c.CorePropertiesLocator Found 
0 core definitions underneath /opt/solr/server/solr
2017-10-11 15:28:12.635 INFO  (main) [   ] o.e.j.s.Server Started @4304ms
2017-10-11 15:29:00.971 INFO  (qtp1911006827-13) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/info/system params={wt=json} status=0 QTime=108
2017-10-11 15:29:01.080 INFO  (qtp1911006827-18) [   ] 
o.a.s.c.TransientSolrCoreCacheDefault Allocating transient cache for 2147483647 
transient cores
2017-10-11 15:29:01.083 INFO  (qtp1911006827-18) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/cores 
params={core=handbook&action=STATUS&wt=json} status=0 QTime=5
2017-10-11 15:29:01.194 INFO  (qtp1911006827-19) [   ] 
o.a.s.h.a.CoreAdminOperation core create command 
name=handbook&action=CREATE&instanceDir=handbook&wt=json
2017-10-11 15:29:01.342 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.SolrResourceLoader [handbook] Added 51 libs to classloader, from paths: 
[/opt/solr/contrib/clustering/lib, /opt/solr/contrib/extraction/lib, 
/opt/solr/contrib/langid/lib, /opt/solr/contrib/velocity/lib, /opt/solr/dist]
2017-10-11 15:29:01.504 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.SolrConfig Using Lucene MatchVersion: 7.0.1
2017-10-11 15:29:01.969 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.s.IndexSchema [handbook] Schema name=default-config
2017-10-11 15:29:03.678 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.s.IndexSchema Loaded schema default-config/1.6 with uniqueid field id
2017-10-11 15:29:03.806 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.CoreContainer Creating SolrCore 'handbook' using configuration from 
instancedir /opt/solr/server/solr/handbook, trusted=true
2017-10-11 15:29:03.853 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.SolrCore solr.RecoveryStrategy.Builder
2017-10-11 15:29:03.866 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.c.SolrCore [[handbook] ] Opening new SolrCore at 
[/opt/solr/server/solr/handbook], dataDir=[/opt/solr/server/solr/handbook/data/]
2017-10-11 15:29:04.180 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.r.XSLTResponseWriter xsltCacheLifetimeSeconds=5
2017-10-11 15:29:05.100 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.u.UpdateHandler Using UpdateLog implementation: 
org.apache.solr.update.UpdateLog
2017-10-11 15:29:05.101 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH 
numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536
2017-10-11 15:29:05.150 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.u.CommitTracker Hard AutoCommit: if uncommited for 15000ms; 
2017-10-11 15:29:05.151 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.u.CommitTracker Soft AutoCommit: disabled
2017-10-11 15:29:05.199 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.s.SolrIndexSearcher Opening [Searcher@2b9fd97b[handbook] main]
2017-10-11 15:29:05.229 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.r.ManagedResourceStorage File-based storage initialized to use dir: 
/opt/solr/server/solr/handbook/conf
2017-10-11 15:29:05.266 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.h.c.SpellCheckComponent Initializing spell checkers
2017-10-11 15:29:05.283 INFO  (qtp1911006827-19) [   x:handbook] 
o.a.s.s.DirectSolrSpellChecker init: 
{name=default,field=_text_,classname=solr.DirectSolrSpellChecker,distanceMeasure=internal,accuracy=0.5,maxEdits=2,minPrefix=1,maxInspections=5,minQueryLength=4,maxQueryFrequency=0.01}
2017-10-11 15:29:05.318 INFO  (qtp1911006827-19) [   x:handbook] 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
pardon: [solr-home]/server/logs/solr.log

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:10 PM, Amrit Sarkar  wrote:

> ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
> the machine. I haven't played much with docker, any way you can get that
> file from that location.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Fri, Oct 13, 2017 at 8:08 PM, Kevin Layer  wrote:
>
>> Amrit Sarkar wrote:
>>
>> >> Hi Kevin,
>> >>
>> >> Can you post the solr log in the mail thread. I don't think it handled the
>> >> .md by itself by first glance at code.
>>
>> How do I extract the log you want?
>>
>>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com
>> >> Twitter http://twitter.com/lucidworks
>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>
>> >> On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:
>> >>
>> >> > Amrit Sarkar wrote:
>> >> >
>> >> > >> Kevin,
>> >> > >>
>> >> > >> Just put "html" too and give it a shot. These are the types it is
>> >> > expecting:
>> >> >
>> >> > Same thing.
>> >> >
>> >> > >>
>> >> > >> mimeMap = new HashMap<>();
>> >> > >> mimeMap.put("xml", "application/xml");
>> >> > >> mimeMap.put("csv", "text/csv");
>> >> > >> mimeMap.put("json", "application/json");
>> >> > >> mimeMap.put("jsonl", "application/json");
>> >> > >> mimeMap.put("pdf", "application/pdf");
>> >> > >> mimeMap.put("rtf", "text/rtf");
>> >> > >> mimeMap.put("html", "text/html");
>> >> > >> mimeMap.put("htm", "text/html");
>> >> > >> mimeMap.put("doc", "application/msword");
>> >> > >> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> >> > >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> >> > >> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> >> > >> mimeMap.put("xls", "application/vnd.ms-excel");
>> >> > >> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> >> > >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> >> > >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> >> > >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> >> > >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> >> > >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> >> > >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> >> > >> mimeMap.put("txt", "text/plain");
>> >> > >> mimeMap.put("log", "text/plain");
>> >> > >>
>> >> > >> The keys are the types supported.
>> >> > >>
>> >> > >>
>> >> > >> Amrit Sarkar
>> >> > >> Search Engineer
>> >> > >> Lucidworks, Inc.
>> >> > >> 415-589-9269
>> >> > >> www.lucidworks.com
>> >> > >> Twitter http://twitter.com/lucidworks
>> >> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >> > >>
>> >> > >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>> >> > >>
>> >> > >> > Ah!
>> >> > >> >
>> >> > >> > Only supported type is: text/html; encoding=utf-8
>> >> > >> >
>> >> > >> > I am not confident of this either :) but this should work.
>> >> > >> >
>> >> > >> > See the code-snippet below:
>> >> > >> >
>> >> > >> > ..
>> >> > >> >
>> >> > >> > if(res.httpStatus == 200) {
>> >> > >> >   // Raw content type of form "text/html; encoding=utf-8"
>> >> > >> >   String rawContentType = conn.getContentType();
>> >> > >> >   String type = rawContentType.split(";")[0];
>> >> > >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
>> >> > >> > String encoding = conn.getContentEncoding();
>> >> > >> >
>> >> > >> > 
>> >> > >> >
>> >> > >> >
>> >> > >> > Amrit Sarkar
>> >> > >> > Search Engineer
>> >> > >> > Lucidworks, Inc.
>> >> > >> > 415-589-9269
>> >> > >> > www.lucidworks.com
>> >> > >> > Twitter http://twitter.com/lucidworks
>> >> > >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >> > >> >
>> >> > >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer 
>> wrote:
>> >> > >> >
>> >> > >> >> Amrit Sarkar wrote:
>> >> > >> >>
>> >> > >> >> >> Strange,
>> >> > >> >> >>
>> >> > >> >> >> Can you add: "text/html;charset=utf-8". This is
>> wiki.apache.org
>> >> > page's
>> >> > >> >> >> Content-Type. Let's see what it says now.
>> >> > >> >>
>> >> > >> >> Same thing.  Verified Content-Type:
>> >> > >> >>
>> >> > >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>> >> > >> >>   

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
ah oh, dockers. They are placed under [solr-home]/server/log/solr/log in
the machine. I haven't played much with docker, any way you can get that
file from that location.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 8:08 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> Hi Kevin,
> >>
> >> Can you post the solr log in the mail thread. I don't think it handled the
> >> .md by itself by first glance at code.
>
> How do I extract the log you want?
>
>
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:
> >>
> >> > Amrit Sarkar wrote:
> >> >
> >> > >> Kevin,
> >> > >>
> >> > >> Just put "html" too and give it a shot. These are the types it is
> >> > expecting:
> >> >
> >> > Same thing.
> >> >
> >> > >>
> >> > >> mimeMap = new HashMap<>();
> >> > >> mimeMap.put("xml", "application/xml");
> >> > >> mimeMap.put("csv", "text/csv");
> >> > >> mimeMap.put("json", "application/json");
> >> > >> mimeMap.put("jsonl", "application/json");
> >> > >> mimeMap.put("pdf", "application/pdf");
> >> > >> mimeMap.put("rtf", "text/rtf");
> >> > >> mimeMap.put("html", "text/html");
> >> > >> mimeMap.put("htm", "text/html");
> >> > >> mimeMap.put("doc", "application/msword");
> >> > >> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
> >> > >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> >> > >> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
> >> > >> mimeMap.put("xls", "application/vnd.ms-excel");
> >> > >> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
> >> > >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> >> > >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> >> > >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
> >> > >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
> >> > >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
> >> > >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
> >> > >> mimeMap.put("txt", "text/plain");
> >> > >> mimeMap.put("log", "text/plain");
> >> > >>
> >> > >> The keys are the types supported.
> >> > >>
> >> > >>
> >> > >> Amrit Sarkar
> >> > >> Search Engineer
> >> > >> Lucidworks, Inc.
> >> > >> 415-589-9269
> >> > >> www.lucidworks.com
> >> > >> Twitter http://twitter.com/lucidworks
> >> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >> > >>
> >> > >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
> >> > >>
> >> > >> > Ah!
> >> > >> >
> >> > >> > Only supported type is: text/html; encoding=utf-8
> >> > >> >
> >> > >> > I am not confident of this either :) but this should work.
> >> > >> >
> >> > >> > See the code-snippet below:
> >> > >> >
> >> > >> > ..
> >> > >> >
> >> > >> > if(res.httpStatus == 200) {
> >> > >> >   // Raw content type of form "text/html; encoding=utf-8"
> >> > >> >   String rawContentType = conn.getContentType();
> >> > >> >   String type = rawContentType.split(";")[0];
> >> > >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
> >> > >> > String encoding = conn.getContentEncoding();
> >> > >> >
> >> > >> > 
> >> > >> >
> >> > >> >
> >> > >> > Amrit Sarkar
> >> > >> > Search Engineer
> >> > >> > Lucidworks, Inc.
> >> > >> > 415-589-9269
> >> > >> > www.lucidworks.com
> >> > >> > Twitter http://twitter.com/lucidworks
> >> > >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >> > >> >
> >> > >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer 
> wrote:
> >> > >> >
> >> > >> >> Amrit Sarkar wrote:
> >> > >> >>
> >> > >> >> >> Strange,
> >> > >> >> >>
> >> > >> >> >> Can you add: "text/html;charset=utf-8". This is
> wiki.apache.org
> >> > page's
> >> > >> >> >> Content-Type. Let's see what it says now.
> >> > >> >>
> >> > >> >> Same thing.  Verified Content-Type:
> >> > >> >>
> >> > >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
> >> > >> >>   Content-Type: text/html;charset=utf-8
> >> > >> >> quadra[git:master]$ ]
> >> > >> >>
> >> > >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
> >> > handbook
> >> > >> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes
> md
> >> > >> >> /docker-java-home/jre/bin/java -classpath
> >> > /opt/solr/dist/solr-core-7.0.1.jar
> >> > >> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
> >> > 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Hi Kevin,
>> 
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself by first glance at code.

Note that when I use the admin web interface, and click on "Logging"
on the left, I just see a spinner that implies it's trying to retrieve
the logs (I see headers "Time (Local)   Level   Core   Logger   Message"),
but no log entries.  It's been like this for 10 minutes.

>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> Kevin,
>> > >>
>> > >> Just put "html" too and give it a shot. These are the types it is
>> > expecting:
>> >
>> > Same thing.
>> >
>> > >>
>> > >> mimeMap = new HashMap<>();
>> > >> mimeMap.put("xml", "application/xml");
>> > >> mimeMap.put("csv", "text/csv");
>> > >> mimeMap.put("json", "application/json");
>> > >> mimeMap.put("jsonl", "application/json");
>> > >> mimeMap.put("pdf", "application/pdf");
>> > >> mimeMap.put("rtf", "text/rtf");
>> > >> mimeMap.put("html", "text/html");
>> > >> mimeMap.put("htm", "text/html");
>> > >> mimeMap.put("doc", "application/msword");
>> > >> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> > >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> > >> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> > >> mimeMap.put("xls", "application/vnd.ms-excel");
>> > >> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> > >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> > >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> > >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> > >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> > >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> > >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> > >> mimeMap.put("txt", "text/plain");
>> > >> mimeMap.put("log", "text/plain");
>> > >>
>> > >> The keys are the types supported.
>> > >>
>> > >>
>> > >> Amrit Sarkar
>> > >> Search Engineer
>> > >> Lucidworks, Inc.
>> > >> 415-589-9269
>> > >> www.lucidworks.com
>> > >> Twitter http://twitter.com/lucidworks
>> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >>
>> > >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar  wrote:
>> > >>
>> > >> > Ah!
>> > >> >
>> > >> > Only supported type is: text/html; encoding=utf-8
>> > >> >
>> > >> > I am not confident of this either :) but this should work.
>> > >> >
>> > >> > See the code-snippet below:
>> > >> >
>> > >> > ..
>> > >> >
>> > >> > if(res.httpStatus == 200) {
>> > >> >   // Raw content type of form "text/html; encoding=utf-8"
>> > >> >   String rawContentType = conn.getContentType();
>> > >> >   String type = rawContentType.split(";")[0];
>> > >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
>> > >> > String encoding = conn.getContentEncoding();
>> > >> >
>> > >> > 
>> > >> >
>> > >> >
>> > >> > Amrit Sarkar
>> > >> > Search Engineer
>> > >> > Lucidworks, Inc.
>> > >> > 415-589-9269
>> > >> > www.lucidworks.com
>> > >> > Twitter http://twitter.com/lucidworks
>> > >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >> >
>> > >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>> > >> >
>> > >> >> Amrit Sarkar wrote:
>> > >> >>
>> > >> >> >> Strange,
>> > >> >> >>
>> > >> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
>> > page's
>> > >> >> >> Content-Type. Let's see what it says now.
>> > >> >>
>> > >> >> Same thing.  Verified Content-Type:
>> > >> >>
>> > >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>> > >> >>   Content-Type: text/html;charset=utf-8
>> > >> >> quadra[git:master]$ ]
>> > >> >>
>> > >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
>> > handbook
>> > >> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> > >> >> /docker-java-home/jre/bin/java -classpath
>> > /opt/solr/dist/solr-core-7.0.1.jar
>> > >> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
>> > -Ddata=web
>> > >> >> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> > >> >> SimplePostTool version 5.0.0
>> > >> >> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> > >> >> Entering auto mode. Indexing pages with content-types corresponding
>> > to
>> > >> >> file endings md
>> > >> >> SimplePostTool: WARNING: Never crawl an external web site faster than
>> > >> >> every 10 seconds, 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Hi Kevin,
>> 
>> Can you post the solr log in the mail thread. I don't think it handled the
>> .md by itself by first glance at code.

How do I extract the log you want?


>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:
>> 
>> > Amrit Sarkar wrote:
>> >
>> > >> Kevin,
>> > >>
>> > >> Just put "html" too and give it a shot. These are the types it is
>> > expecting:
>> >
>> > Same thing.
>> >
>> > >>
>> > >> mimeMap = new HashMap<>();
>> > >> mimeMap.put("xml", "application/xml");
>> > >> mimeMap.put("csv", "text/csv");
>> > >> mimeMap.put("json", "application/json");
>> > >> mimeMap.put("jsonl", "application/json");
>> > >> mimeMap.put("pdf", "application/pdf");
>> > >> mimeMap.put("rtf", "text/rtf");
>> > >> mimeMap.put("html", "text/html");
>> > >> mimeMap.put("htm", "text/html");
>> > >> mimeMap.put("doc", "application/msword");
>> > >> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> > >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> > >> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> > >> mimeMap.put("xls", "application/vnd.ms-excel");
>> > >> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> > >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> > >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> > >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> > >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> > >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> > >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> > >> mimeMap.put("txt", "text/plain");
>> > >> mimeMap.put("log", "text/plain");
>> > >>
>> > >> The keys are the types supported.
>> > >>
>> > >>
>> > >> Amrit Sarkar
>> > >> Search Engineer
>> > >> Lucidworks, Inc.
>> > >> 415-589-9269
>> > >> www.lucidworks.com
>> > >> Twitter http://twitter.com/lucidworks
>> > >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >>
>> > >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar  wrote:
>> > >>
>> > >> > Ah!
>> > >> >
>> > >> > Only supported type is: text/html; encoding=utf-8
>> > >> >
>> > >> > I am not confident of this either :) but this should work.
>> > >> >
>> > >> > See the code-snippet below:
>> > >> >
>> > >> > ..
>> > >> >
>> > >> > if(res.httpStatus == 200) {
>> > >> >   // Raw content type of form "text/html; encoding=utf-8"
>> > >> >   String rawContentType = conn.getContentType();
>> > >> >   String type = rawContentType.split(";")[0];
>> > >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
>> > >> > String encoding = conn.getContentEncoding();
>> > >> >
>> > >> > 
>> > >> >
>> > >> >
>> > >> > Amrit Sarkar
>> > >> > Search Engineer
>> > >> > Lucidworks, Inc.
>> > >> > 415-589-9269
>> > >> > www.lucidworks.com
>> > >> > Twitter http://twitter.com/lucidworks
>> > >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> > >> >
>> > >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>> > >> >
>> > >> >> Amrit Sarkar wrote:
>> > >> >>
>> > >> >> >> Strange,
>> > >> >> >>
>> > >> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
>> > page's
>> > >> >> >> Content-Type. Let's see what it says now.
>> > >> >>
>> > >> >> Same thing.  Verified Content-Type:
>> > >> >>
>> > >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>> > >> >>   Content-Type: text/html;charset=utf-8
>> > >> >> quadra[git:master]$ ]
>> > >> >>
>> > >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
>> > handbook
>> > >> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> > >> >> /docker-java-home/jre/bin/java -classpath
>> > /opt/solr/dist/solr-core-7.0.1.jar
>> > >> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
>> > -Ddata=web
>> > >> >> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> > >> >> SimplePostTool version 5.0.0
>> > >> >> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> > >> >> Entering auto mode. Indexing pages with content-types corresponding
>> > to
>> > >> >> file endings md
>> > >> >> SimplePostTool: WARNING: Never crawl an external web site faster than
>> > >> >> every 10 seconds, your IP will probably be blocked
>> > >> >> Entering recursive mode, depth=10, delay=0s
>> > >> >> Entering crawl at level 0 (1 links total, 1 new)
>> > >> >> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> > >> 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Hi Kevin,

Can you post the solr log in the mail thread. I don't think it handled the
.md by itself by first glance at code.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 7:42 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> Kevin,
> >>
> >> Just put "html" too and give it a shot. These are the types it is
> expecting:
>
> Same thing.
>
> >>
> >> mimeMap = new HashMap<>();
> >> mimeMap.put("xml", "application/xml");
> >> mimeMap.put("csv", "text/csv");
> >> mimeMap.put("json", "application/json");
> >> mimeMap.put("jsonl", "application/json");
> >> mimeMap.put("pdf", "application/pdf");
> >> mimeMap.put("rtf", "text/rtf");
> >> mimeMap.put("html", "text/html");
> >> mimeMap.put("htm", "text/html");
> >> mimeMap.put("doc", "application/msword");
> >> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
> >> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> >> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
> >> mimeMap.put("xls", "application/vnd.ms-excel");
> >> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
> >> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> >> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> >> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
> >> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
> >> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
> >> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
> >> mimeMap.put("txt", "text/plain");
> >> mimeMap.put("log", "text/plain");
> >>
> >> The keys are the types supported.
> >>
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar  wrote:
> >>
> >> > Ah!
> >> >
> >> > Only supported type is: text/html; encoding=utf-8
> >> >
> >> > I am not confident of this either :) but this should work.
> >> >
> >> > See the code-snippet below:
> >> >
> >> > ..
> >> >
> >> > if(res.httpStatus == 200) {
> >> >   // Raw content type of form "text/html; encoding=utf-8"
> >> >   String rawContentType = conn.getContentType();
> >> >   String type = rawContentType.split(";")[0];
> >> >   if(typeSupported(type) || "*".equals(fileTypes)) {
> >> > String encoding = conn.getContentEncoding();
> >> >
> >> > 
> >> >
> >> >
> >> > Amrit Sarkar
> >> > Search Engineer
> >> > Lucidworks, Inc.
> >> > 415-589-9269
> >> > www.lucidworks.com
> >> > Twitter http://twitter.com/lucidworks
> >> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >> >
> >> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
> >> >
> >> >> Amrit Sarkar wrote:
> >> >>
> >> >> >> Strange,
> >> >> >>
> >> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
> page's
> >> >> >> Content-Type. Let's see what it says now.
> >> >>
> >> >> Same thing.  Verified Content-Type:
> >> >>
> >> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
> >> >>   Content-Type: text/html;charset=utf-8
> >> >> quadra[git:master]$ ]
> >> >>
> >> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
> handbook
> >> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> >> >> /docker-java-home/jre/bin/java -classpath
> /opt/solr/dist/solr-core-7.0.1.jar
> >> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
> -Ddata=web
> >> >> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> >> >> SimplePostTool version 5.0.0
> >> >> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> >> >> Entering auto mode. Indexing pages with content-types corresponding
> to
> >> >> file endings md
> >> >> SimplePostTool: WARNING: Never crawl an external web site faster than
> >> >> every 10 seconds, your IP will probably be blocked
> >> >> Entering recursive mode, depth=10, delay=0s
> >> >> Entering crawl at level 0 (1 links total, 1 new)
> >> >> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >> SimplePostTool: WARNING: The URL http://quadra:9091/index.md
> returned a
> >> >> HTTP result status of 415
> >> >> 0 web pages indexed.
> >> >> COMMITting Solr index changes to http://localhost:8983/solr/han
> >> >> dbook/update/extract...
> >> >> Time spent: 0:00:00.531
> >> >> quadra[git:master]$
> >> >>
> >> >> Kevin
> >> >>
> >> >> >>
> >> >> >> Amrit Sarkar
> >> >> >> Search Engineer
> >> >> >> Lucidworks, Inc.
> >> >> >> 415-589-9269
> >> >> >> 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> Just put "html" too and give it a shot. These are the types it is expecting:

Same thing.

>> 
>> mimeMap = new HashMap<>();
>> mimeMap.put("xml", "application/xml");
>> mimeMap.put("csv", "text/csv");
>> mimeMap.put("json", "application/json");
>> mimeMap.put("jsonl", "application/json");
>> mimeMap.put("pdf", "application/pdf");
>> mimeMap.put("rtf", "text/rtf");
>> mimeMap.put("html", "text/html");
>> mimeMap.put("htm", "text/html");
>> mimeMap.put("doc", "application/msword");
>> mimeMap.put("docx",
>> "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> mimeMap.put("pptx",
>> "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> mimeMap.put("xls", "application/vnd.ms-excel");
>> mimeMap.put("xlsx",
>> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> mimeMap.put("txt", "text/plain");
>> mimeMap.put("log", "text/plain");
>> 
>> The keys are the types supported.
>> 
>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
>> wrote:
>> 
>> > Ah!
>> >
>> > Only supported type is: text/html; encoding=utf-8
>> >
>> > I am not confident of this either :) but this should work.
>> >
>> > See the code-snippet below:
>> >
>> > ..
>> >
>> > if(res.httpStatus == 200) {
>> >   // Raw content type of form "text/html; encoding=utf-8"
>> >   String rawContentType = conn.getContentType();
>> >   String type = rawContentType.split(";")[0];
>> >   if(typeSupported(type) || "*".equals(fileTypes)) {
>> > String encoding = conn.getContentEncoding();
>> >
>> > 
>> >
>> >
>> > Amrit Sarkar
>> > Search Engineer
>> > Lucidworks, Inc.
>> > 415-589-9269
>> > www.lucidworks.com
>> > Twitter http://twitter.com/lucidworks
>> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >
>> > On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>> >
>> >> Amrit Sarkar wrote:
>> >>
>> >> >> Strange,
>> >> >>
>> >> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >> >> Content-Type. Let's see what it says now.
>> >>
>> >> Same thing.  Verified Content-Type:
>> >>
>> >> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |&
>> >> grep Content-Type
>> >>   Content-Type: text/html;charset=utf-8
>> >> quadra[git:master]$ ]
>> >>
>> >> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
>> >> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> /docker-java-home/jre/bin/java -classpath 
>> >> /opt/solr/dist/solr-core-7.0.1.jar
>> >> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
>> >> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >> SimplePostTool version 5.0.0
>> >> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >> Entering auto mode. Indexing pages with content-types corresponding to
>> >> file endings md
>> >> SimplePostTool: WARNING: Never crawl an external web site faster than
>> >> every 10 seconds, your IP will probably be blocked
>> >> Entering recursive mode, depth=10, delay=0s
>> >> Entering crawl at level 0 (1 links total, 1 new)
>> >> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
>> >> HTTP result status of 415
>> >> 0 web pages indexed.
>> >> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> >> Time spent: 0:00:00.531
>> >> quadra[git:master]$
>> >>
>> >> Kevin
>> >>
>> >> >>
>> >> >> Amrit Sarkar
>> >> >> Search Engineer
>> >> >> Lucidworks, Inc.
>> >> >> 415-589-9269
>> >> >> www.lucidworks.com
>> >> >> Twitter http://twitter.com/lucidworks
>> >> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >> >>
>> >> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer  wrote:
>> >> >>
>> >> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >> >> >
>> >> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> >> >
>> >> >> > What is it expecting?
>> >> >> >
>> >> >> > $ docker exec -it --user=solr solr bin/post -c handbook
>> >> >> > http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> >> > 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Reference to the code:
>> 
>> .
>> 
>> String rawContentType = conn.getContentType();
>> String type = rawContentType.split(";")[0];
>> if(typeSupported(type) || "*".equals(fileTypes)) {
>>   String encoding = conn.getContentEncoding();
>> 
>> .
>> 
>> protected boolean typeSupported(String type) {
>>   for(String key : mimeMap.keySet()) {
>> if(mimeMap.get(key).equals(type)) {
>>   if(fileTypes.contains(key))
>> return true;
>> }
>>   }
>>   return false;
>> }
>> 
>> .
>> 
>> It has another check for fileTypes: I can see that the page you are indexing
>> ends with .md, not .html. Let's hope this is not the issue.

Did you see the "-filetypes md" at the end of the post command line?
Shouldn't that handle it?

Kevin

>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 7:04 PM, Amrit Sarkar 
>> wrote:
>> 
>> > Kevin,
>> >
>> > Just put "html" too and give it a shot. These are the types it is
>> > expecting:
>> >
>> > mimeMap = new HashMap<>();
>> > mimeMap.put("xml", "application/xml");
>> > mimeMap.put("csv", "text/csv");
>> > mimeMap.put("json", "application/json");
>> > mimeMap.put("jsonl", "application/json");
>> > mimeMap.put("pdf", "application/pdf");
>> > mimeMap.put("rtf", "text/rtf");
>> > mimeMap.put("html", "text/html");
>> > mimeMap.put("htm", "text/html");
>> > mimeMap.put("doc", "application/msword");
>> > mimeMap.put("docx", 
>> > "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
>> > mimeMap.put("ppt", "application/vnd.ms-powerpoint");
>> > mimeMap.put("pptx", 
>> > "application/vnd.openxmlformats-officedocument.presentationml.presentation");
>> > mimeMap.put("xls", "application/vnd.ms-excel");
>> > mimeMap.put("xlsx", 
>> > "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
>> > mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
>> > mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
>> > mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
>> > mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
>> > mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
>> > mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
>> > mimeMap.put("txt", "text/plain");
>> > mimeMap.put("log", "text/plain");
>> >
>> > The keys are the types supported.
>> >
>> >
>> > Amrit Sarkar
>> > Search Engineer
>> > Lucidworks, Inc.
>> > 415-589-9269
>> > www.lucidworks.com
>> > Twitter http://twitter.com/lucidworks
>> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >
>> > On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
>> > wrote:
>> >
>> >> Ah!
>> >>
>> >> Only supported type is: text/html; encoding=utf-8
>> >>
>> >> I am not confident of this either :) but this should work.
>> >>
>> >> See the code-snippet below:
>> >>
>> >> ..
>> >>
>> >> if(res.httpStatus == 200) {
>> >>   // Raw content type of form "text/html; encoding=utf-8"
>> >>   String rawContentType = conn.getContentType();
>> >>   String type = rawContentType.split(";")[0];
>> >>   if(typeSupported(type) || "*".equals(fileTypes)) {
>> >> String encoding = conn.getContentEncoding();
>> >>
>> >> 
>> >>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com
>> >> Twitter http://twitter.com/lucidworks
>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>
>> >> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>> >>
>> >>> Amrit Sarkar wrote:
>> >>>
>> >>> >> Strange,
>> >>> >>
>> >>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
>> >>> page's
>> >>> >> Content-Type. Let's see what it says now.
>> >>>
>> >>> Same thing.  Verified Content-Type:
>> >>>
>> >>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |&
>> >>> grep Content-Type
>> >>>   Content-Type: text/html;charset=utf-8
>> >>> quadra[git:master]$ ]
>> >>>
>> >>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
>> >>> handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes
>> >>> md
>> >>> /docker-java-home/jre/bin/java -classpath 
>> >>> /opt/solr/dist/solr-core-7.0.1.jar
>> >>> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook 
>> >>> -Ddata=web
>> >>> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >>> SimplePostTool version 5.0.0
>> >>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >>> Entering auto mode. Indexing pages with content-types corresponding to
>> >>> file endings md
>> >>> SimplePostTool: WARNING: Never crawl an external web site faster than
>> >>> every 10 seconds, your IP will probably be blocked
>> >>> Entering 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Ah!

Only supported type is: text/html; encoding=utf-8

I am not confident of this either :) but this should work.

See the code-snippet below:

..

if(res.httpStatus == 200) {
  // Raw content type of form "text/html; encoding=utf-8"
  String rawContentType = conn.getContentType();
  String type = rawContentType.split(";")[0];
  if(typeSupported(type) || "*".equals(fileTypes)) {
String encoding = conn.getContentEncoding();
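To make that concrete, here is a minimal standalone sketch (illustrative only,
not Solr source; the class name is invented) of what that split does to the
header Kevin's server sends:

// The charset suffix is discarded before the type is compared against
// the mimeMap values, so only "text/html" is ever looked up.
public class ContentTypeSplitSketch {
    public static void main(String[] args) {
        String rawContentType = "text/html;charset=utf-8"; // the value wget reported
        String type = rawContentType.split(";")[0];
        System.out.println(type); // prints "text/html"
    }
}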




Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:

> Amrit Sarkar wrote:
>
> >> Strange,
> >>
> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
> >> Content-Type. Let's see what it says now.
>
> Same thing.  Verified Content-Type:
>
> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |&
> grep Content-Type
>   Content-Type: text/html;charset=utf-8
> quadra[git:master]$ ]
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/
> handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to
> file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than
> every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
> HTTP result status of 415
> 0 web pages indexed.
> COMMITting Solr index changes to http://localhost:8983/solr/
> handbook/update/extract...
> Time spent: 0:00:00.531
> quadra[git:master]$
>
> Kevin
>
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer  wrote:
> >>
> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
> >> >
> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >
> >> > What is it expecting?
> >> >
> >> > $ docker exec -it --user=solr solr bin/post -c handbook
> >> > http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> >> > /docker-java-home/jre/bin/java -classpath
> /opt/solr/dist/solr-core-7.0.1.jar
> >> > -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
> -Ddata=web
> >> > org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> >> > SimplePostTool version 5.0.0
> >> > Posting web pages to Solr url http://localhost:8983/solr/
> >> > handbook/update/extract
> >> > Entering auto mode. Indexing pages with content-types corresponding to
> >> > file endings md
> >> > SimplePostTool: WARNING: Never crawl an external web site faster than
> >> > every 10 seconds, your IP will probably be blocked
> >> > Entering recursive mode, depth=10, delay=0s
> >> > Entering crawl at level 0 (1 links total, 1 new)
> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md
> returned a
> >> > HTTP result status of 415
> >> > 0 web pages indexed.
> >> > COMMITting Solr index changes to http://localhost:8983/solr/
> >> > handbook/update/extract...
> >> > Time spent: 0:00:03.882
> >> > $
> >> >
> >> > Thanks.
> >> >
> >> > Kevin
> >> >
>


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,

Just put "html" too and give it a shot. These are the types it is expecting:

mimeMap = new HashMap<>();
mimeMap.put("xml", "application/xml");
mimeMap.put("csv", "text/csv");
mimeMap.put("json", "application/json");
mimeMap.put("jsonl", "application/json");
mimeMap.put("pdf", "application/pdf");
mimeMap.put("rtf", "text/rtf");
mimeMap.put("html", "text/html");
mimeMap.put("htm", "text/html");
mimeMap.put("doc", "application/msword");
mimeMap.put("docx",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document");
mimeMap.put("ppt", "application/vnd.ms-powerpoint");
mimeMap.put("pptx",
"application/vnd.openxmlformats-officedocument.presentationml.presentation");
mimeMap.put("xls", "application/vnd.ms-excel");
mimeMap.put("xlsx",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("txt", "text/plain");
mimeMap.put("log", "text/plain");

The keys are the types supported.
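
Note that "md" is not among those keys. A quick standalone check (a
hypothetical sketch, not Solr source; only two entries of the map above are
reproduced) makes the consequence visible:

import java.util.HashMap;
import java.util.Map;

public class MimeMapKeySketch {
    public static void main(String[] args) {
        Map<String, String> mimeMap = new HashMap<>();
        mimeMap.put("html", "text/html");
        mimeMap.put("htm", "text/html");
        // ... remaining entries as listed above ...

        System.out.println(mimeMap.containsKey("html")); // true
        System.out.println(mimeMap.containsKey("md"));   // false: nothing for "-filetypes md" to match
    }
}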


Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
wrote:

> Ah!
>
> Only supported type is: text/html; encoding=utf-8
>
> I am not confident of this either :) but this should work.
>
> See the code-snippet below:
>
> ..
>
> if(res.httpStatus == 200) {
>   // Raw content type of form "text/html; encoding=utf-8"
>   String rawContentType = conn.getContentType();
>   String type = rawContentType.split(";")[0];
>   if(typeSupported(type) || "*".equals(fileTypes)) {
> String encoding = conn.getContentEncoding();
>
> 
>
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>
>> Amrit Sarkar wrote:
>>
>> >> Strange,
>> >>
>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >> Content-Type. Let's see what it says now.
>>
>> Same thing.  Verified Content-Type:
>>
>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |&
>> grep Content-Type
>>   Content-Type: text/html;charset=utf-8
>> quadra[git:master]$ ]
>>
>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
>> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
>> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
>> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> SimplePostTool version 5.0.0
>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> Entering auto mode. Indexing pages with content-types corresponding to
>> file endings md
>> SimplePostTool: WARNING: Never crawl an external web site faster than
>> every 10 seconds, your IP will probably be blocked
>> Entering recursive mode, depth=10, delay=0s
>> Entering crawl at level 0 (1 links total, 1 new)
>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
>> HTTP result status of 415
>> 0 web pages indexed.
>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> Time spent: 0:00:00.531
>> quadra[git:master]$
>>
>> Kevin
>>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com
>> >> Twitter http://twitter.com/lucidworks
>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>
>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer  wrote:
>> >>
>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >> >
>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> >
>> >> > What is it expecting?
>> >> >
>> >> > $ docker exec -it --user=solr solr bin/post -c handbook
>> >> > http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> > /docker-java-home/jre/bin/java -classpath
>> /opt/solr/dist/solr-core-7.0.1.jar
>> >> > -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook
>> -Ddata=web
>> >> > org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >> > SimplePostTool version 5.0.0
>> >> > Posting web pages to Solr url http://localhost:8983/solr/
>> >> > handbook/update/extract
>> >> > Entering auto mode. Indexing pages with content-types corresponding
>> to
>> >> > 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Reference to the code:

.

String rawContentType = conn.getContentType();
String type = rawContentType.split(";")[0];
if(typeSupported(type) || "*".equals(fileTypes)) {
  String encoding = conn.getContentEncoding();

.

protected boolean typeSupported(String type) {
  for(String key : mimeMap.keySet()) {
if(mimeMap.get(key).equals(type)) {
  if(fileTypes.contains(key))
return true;
}
  }
  return false;
}

.

It has another check for fileTypes: I can see that the page you are indexing
ends with .md, not .html. Let's hope this is not the issue.
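
Putting the two checks together, here is a self-contained sketch (illustrative,
not Solr source; it assumes fileTypes holds the raw "-filetypes" value) showing
why a page served as "text/html" is rejected under "-filetypes md":

import java.util.HashMap;
import java.util.Map;

public class TypeSupportedSketch {
    static Map<String, String> mimeMap = new HashMap<>();
    static String fileTypes = "md"; // what "-filetypes md" sets

    // Same logic as the snippet above: the content type must map back to a
    // key that also appears in fileTypes.
    static boolean typeSupported(String type) {
        for (String key : mimeMap.keySet()) {
            if (mimeMap.get(key).equals(type) && fileTypes.contains(key))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        mimeMap.put("html", "text/html");
        mimeMap.put("htm", "text/html");
        System.out.println(typeSupported("text/html")); // false: only "html" and "htm" map to it
    }
}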

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 7:04 PM, Amrit Sarkar 
wrote:

> Kevin,
>
> Just put "html" too and give it a shot. These are the types it is
> expecting:
>
> mimeMap = new HashMap<>();
> mimeMap.put("xml", "application/xml");
> mimeMap.put("csv", "text/csv");
> mimeMap.put("json", "application/json");
> mimeMap.put("jsonl", "application/json");
> mimeMap.put("pdf", "application/pdf");
> mimeMap.put("rtf", "text/rtf");
> mimeMap.put("html", "text/html");
> mimeMap.put("htm", "text/html");
> mimeMap.put("doc", "application/msword");
> mimeMap.put("docx", 
> "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> mimeMap.put("pptx", 
> "application/vnd.openxmlformats-officedocument.presentationml.presentation");
> mimeMap.put("xls", "application/vnd.ms-excel");
> mimeMap.put("xlsx", 
> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
> mimeMap.put("txt", "text/plain");
> mimeMap.put("log", "text/plain");
>
> The keys are the types supported.
>
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar 
> wrote:
>
>> Ah!
>>
>> Only supported type is: text/html; encoding=utf-8
>>
>> I am not confident of this either :) but this should work.
>>
>> See the code-snippet below:
>>
>> ..
>>
>> if(res.httpStatus == 200) {
>>   // Raw content type of form "text/html; encoding=utf-8"
>>   String rawContentType = conn.getContentType();
>>   String type = rawContentType.split(";")[0];
>>   if(typeSupported(type) || "*".equals(fileTypes)) {
>> String encoding = conn.getContentEncoding();
>>
>> 
>>
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer  wrote:
>>
>>> Amrit Sarkar wrote:
>>>
>>> >> Strange,
>>> >>
>>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org
>>> page's
>>> >> Content-Type. Let's see what it says now.
>>>
>>> Same thing.  Verified Content-Type:
>>>
>>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |&
>>> grep Content-Type
>>>   Content-Type: text/html;charset=utf-8
>>> quadra[git:master]$ ]
>>>
>>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c
>>> handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes
>>> md
>>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
>>> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
>>> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>>> SimplePostTool version 5.0.0
>>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>>> Entering auto mode. Indexing pages with content-types corresponding to
>>> file endings md
>>> SimplePostTool: WARNING: Never crawl an external web site faster than
>>> every 10 seconds, your IP will probably be blocked
>>> Entering recursive mode, depth=10, delay=0s
>>> Entering crawl at level 0 (1 links total, 1 new)
>>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
>>> HTTP result status of 415
>>> 0 web pages indexed.
>>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>>> Time spent: 0:00:00.531
>>> quadra[git:master]$
>>>
>>> Kevin
>>>
>>> >>
>>> >> Amrit Sarkar
>>> >> Search Engineer
>>> >> Lucidworks, Inc.
>>> >> 415-589-9269
>>> >> 

Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Strange,
>> 
>> Can you add "text/html;charset=utf-8"? This is the wiki.apache.org page's
>> Content-Type. Let's see what it says now.

Same thing.  Verified Content-Type:

quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep 
Content-Type
  Content-Type: text/html;charset=utf-8
quadra[git:master]$ ]

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP 
result status of 415
0 web pages indexed.
COMMITting Solr index changes to 
http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:00.531
quadra[git:master]$ 

Kevin

>> 
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> 
>> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer  wrote:
>> 
>> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >
>> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >
>> > What is it expecting?
>> >
>> > $ docker exec -it --user=solr solr bin/post -c handbook
>> > http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> > /docker-java-home/jre/bin/java -classpath 
>> > /opt/solr/dist/solr-core-7.0.1.jar
>> > -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
>> > org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> > SimplePostTool version 5.0.0
>> > Posting web pages to Solr url http://localhost:8983/solr/
>> > handbook/update/extract
>> > Entering auto mode. Indexing pages with content-types corresponding to
>> > file endings md
>> > SimplePostTool: WARNING: Never crawl an external web site faster than
>> > every 10 seconds, your IP will probably be blocked
>> > Entering recursive mode, depth=10, delay=0s
>> > Entering crawl at level 0 (1 links total, 1 new)
>> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
>> > HTTP result status of 415
>> > 0 web pages indexed.
>> > COMMITting Solr index changes to http://localhost:8983/solr/
>> > handbook/update/extract...
>> > Time spent: 0:00:03.882
>> > $
>> >
>> > Thanks.
>> >
>> > Kevin
>> >


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Strange,

Can you add "text/html;charset=utf-8"? This is the wiki.apache.org page's
Content-Type. Let's see what it says now.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer  wrote:

> OK, so I hacked markserv to add Content-Type text/html, but now I get
>
> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>
> What is it expecting?
>
> $ docker exec -it --user=solr solr bin/post -c handbook
> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/
> handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to
> file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than
> every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
> HTTP result status of 415
> 0 web pages indexed.
> COMMITting Solr index changes to http://localhost:8983/solr/
> handbook/update/extract...
> Time spent: 0:00:03.882
> $
>
> Thanks.
>
> Kevin
>


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
OK, so I hacked markserv to add Content-Type text/html, but now I get

SimplePostTool: WARNING: Skipping URL with unsupported type text/html

What is it expecting?

$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
SimplePostTool: WARNING: Skipping URL with unsupported type text/html
SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP 
result status of 415
0 web pages indexed.
COMMITting Solr index changes to 
http://localhost:8983/solr/handbook/update/extract...
Time spent: 0:00:03.882
$ 

Thanks.

Kevin


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Kevin Layer
Amrit Sarkar wrote:

>> Kevin,
>> 
>> You are getting the NPE at:
>> 
>> String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL
>> 
>> // related code
>> 
>> String rawContentType = conn.getContentType();
>> 
>> public String getContentType() {
>>     return getHeaderField("content-type");
>> }
>> 
>> HttpURLConnection conn = (HttpURLConnection) u.openConnection();
>> 
>> Can you check that the headers your web page serves are properly set and
>> include the key "content-type"?

Amrit, this is markserv, and I just used wget to prove you are
correct: there is no Content-Type header.

Thanks for the help!  I'll see if I can hack markserv to add that, and
try again.

Kevin


Re: solr 7.0.1: exception running post to crawl simple website

2017-10-13 Thread Amrit Sarkar
Kevin,

You are getting the NPE at:

String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL

// related code

String rawContentType = conn.getContentType();

public String getContentType() {
    return getHeaderField("content-type");
}

HttpURLConnection conn = (HttpURLConnection) u.openConnection();

Can you check that the headers your web page serves are properly set and
include the key "content-type"?
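
One way to see (and guard against) that failure is a defensive sketch like the
following (illustrative only, not a patch to Solr; the URL is Kevin's, used
here for demonstration):

import java.net.HttpURLConnection;
import java.net.URL;

public class GuardedContentTypeSketch {
    public static void main(String[] args) throws Exception {
        URL u = new URL("http://quadra:9091/index.md");
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        String rawContentType = conn.getContentType(); // null if the header is absent
        if (rawContentType == null) {
            System.err.println("No Content-Type header; skipping " + u);
            return;
        }
        String type = rawContentType.split(";")[0];
        System.out.println("Content type: " + type);
    }
}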


Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Wed, Oct 11, 2017 at 9:08 PM, Kevin Layer  wrote:

> I want to use Solr to index a markdown website.  The files
> are in native markdown, but they are served as HTML (by markserv).
>
> Here's what I did:
>
> docker run --name solr -d -p 8983:8983 -t solr
> docker exec -it --user=solr solr bin/solr create_core -c handbook
>
> Then, to crawl the site:
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
> http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
> org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/
> handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to
> file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than
> every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.solr.util.SimplePostTool$PageFetcher.readPageFromUrl(SimplePostTool.java:1138)
> at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:603)
> at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
> at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
> at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
> at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> quadra[git:master]$
>
>
> Any ideas on what I did wrong?
>
> Thanks.
>
> Kevin
>


solr 7.0.1: exception running post to crawl simple website

2017-10-11 Thread Kevin Layer
I want to use Solr to index a markdown website.  The files
are in native markdown, but they are served as HTML (by markserv).

Here's what I did:

docker run --name solr -d -p 8983:8983 -t solr
docker exec -it --user=solr solr bin/solr create_core -c handbook

Then, to crawl the site:

quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook 
http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
/docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar 
-Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web 
org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file 
endings md
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 
seconds, your IP will probably be blocked
Entering recursive mode, depth=10, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
Exception in thread "main" java.lang.NullPointerException
at org.apache.solr.util.SimplePostTool$PageFetcher.readPageFromUrl(SimplePostTool.java:1138)
at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:603)
at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
quadra[git:master]$ 


Any ideas on what I did wrong?

Thanks.

Kevin