Kevin,

Just add "html" to -filetypes too and give it a shot. These are the types it is expecting:
mimeMap = new HashMap<>();
mimeMap.put("xml", "application/xml");
mimeMap.put("csv", "text/csv");
mimeMap.put("json", "application/json");
mimeMap.put("jsonl", "application/json");
mimeMap.put("pdf", "application/pdf");
mimeMap.put("rtf", "text/rtf");
mimeMap.put("html", "text/html");
mimeMap.put("htm", "text/html");
mimeMap.put("doc", "application/msword");
mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
mimeMap.put("ppt", "application/vnd.ms-powerpoint");
mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
mimeMap.put("xls", "application/vnd.ms-excel");
mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("txt", "text/plain");
mimeMap.put("log", "text/plain");

The keys are the types supported.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:

> Ah!
>
> Only supported type is: text/html; encoding=utf-8
>
> I am not confident of this either :) but this should work.
>
> See the code-snippet below:
>
> ......
>
> if(res.httpStatus == 200) {
>   // Raw content type of form "text/html; encoding=utf-8"
>   String rawContentType = conn.getContentType();
>   String type = rawContentType.split(";")[0];
>   if(typeSupported(type) || "*".equals(fileTypes)) {
>     String encoding = conn.getContentEncoding();
>
> ....
>
> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <la...@franz.com> wrote:
>
>> Amrit Sarkar wrote:
>>
>> >> Strange,
>> >>
>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >> Content-Type. Let's see what it says now.
>>
>> Same thing. Verified Content-Type:
>>
>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>> Content-Type: text/html;charset=utf-8
>> quadra[git:master]$ ]
>>
>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> SimplePostTool version 5.0.0
>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> Entering recursive mode, depth=10, delay=0s
>> Entering crawl at level 0 (1 links total, 1 new)
>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> 0 web pages indexed.
>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> Time spent: 0:00:00.531
>> quadra[git:master]$
>>
>> Kevin
>>
>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <la...@franz.com> wrote:
>> >>
>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >> >
>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> >
>> >> > What is it expecting?
>> >> >
>> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >> > SimplePostTool version 5.0.0
>> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> >> > Entering recursive mode, depth=10, delay=0s
>> >> > Entering crawl at level 0 (1 links total, 1 new)
>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> >> > 0 web pages indexed.
>> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> >> > Time spent: 0:00:03.882
>> >> > $
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Kevin
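
For anyone following the thread, here is a rough Java sketch of why "-filetypes md" ends up rejecting a text/html response. It is not the SimplePostTool source: the class name FileTypeCheckSketch, the helper baseType, and the comma-separated handling of the -filetypes value are assumptions made for illustration, and the map holds only a few of the entries listed above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch for illustration; not the actual SimplePostTool code.
class FileTypeCheckSketch {
    // A few entries from the file-ending -> MIME type map quoted in this thread.
    static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("htm", "text/html");
        MIME_MAP.put("txt", "text/plain");
        MIME_MAP.put("pdf", "application/pdf");
    }

    // Drop parameters such as ";charset=utf-8" from a raw Content-Type header.
    static String baseType(String rawContentType) {
        return rawContentType.split(";")[0].trim();
    }

    // Assumed behaviour: a response is accepted only if one of the file
    // endings given to -filetypes maps to its MIME type, or -filetypes is "*".
    static boolean typeSupported(String rawContentType, String fileTypes) {
        if ("*".equals(fileTypes)) {
            return true;
        }
        String type = baseType(rawContentType);
        for (String ending : fileTypes.split(",")) {
            // MIME_MAP.get returns null for unknown endings such as "md",
            // and equals(null) is simply false.
            if (type.equals(MIME_MAP.get(ending.trim()))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String header = "text/html;charset=utf-8";
        System.out.println(typeSupported(header, "md"));      // false
        System.out.println(typeSupported(header, "md,html")); // true
        System.out.println(typeSupported(header, "*"));       // true
    }
}
```

Under these assumptions, "md" is not a key in the map, so no MIME type is admitted and the page is skipped no matter what Content-Type the server sends. Adding "html" to the list is what lets text/html through.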