Jason Gerlowski created SOLR-11640:
--------------------------------------

             Summary: QuickStart Tutorial indexes post.jar, other unexpected files
                 Key: SOLR-11640
                 URL: https://issues.apache.org/jira/browse/SOLR-11640
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: documentation, scripts and tools
    Affects Versions: master (8.0)
            Reporter: Jason Gerlowski
            Priority: Trivial

Currently, the QuickStart tutorial included in the ref guide involves running the following command to index some example documents:

{{bin/post -c techproducts example/exampledocs/*}}

This ends up attempting to index _all_ the files in that directory, which includes the expected example files, but also a bash script called {{test_utf8.sh}} and the {{post.jar}} JAR file itself. The subsequent tutorial step involves searching results, which can bring up the ugly result:

{code}
{
"id":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
"resourcename":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
"content_type":["application/java-archive"],
"content":[" \n \n \n \n \n \n \n \n \n \n \n META-INF/MANIFEST.MF \n Manifest-Version: 1.0\r\nAnt-Version: Apache Ant 1.9.6\r\nCreated-By: 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 (Oracle Corporati\r\n on)\r\nMain-Class: org.apache.solr.util.SimplePostTool\r\n\r\n \n\n \n \n org/apache/solr/util/RTimer$1.class \n package org.apache.solr.util;\n synchronized class RTimer$1 {\n}\n \n\n \n \n org/apache/solr/util/RTimer$NanoTimeTimerImpl.class \n package org.apache.solr.util;\n synchronized class RTimer$NanoTimeTimerImpl implements RTimer$TimerImpl {\n private long start ;\n private void RTimer$NanoTimeTimerImpl();\n public void start ();\n public double elapsed ();\n}\n \n\n \n \n org/apache/solr/util/RTimer$TimerImpl.class \n package org.apache.solr.util;\n public abstract interface RTimer$TimerImpl {\n public abstract void start ();\n public abstract double elapsed ();\n}\n \n\n \n \n org/apache/solr/util/RTimer.class \n package org.apache.solr.util;\n public synchronized class RTimer {\n public static final int STARTED = 0;\n public static final int STOPPED = 1;\n public static final int PAUSED = 2;\n protected int s ......[remaining code skipped for brevity]........"],
"_version_":1583971861929132032},
{code}

It's honestly pretty cool that TIKA can extract code from our post.jar
file. It makes sense, but I didn't expect it. But it's probably not what we intended to show to new users, especially considering that the bin/post invocation in the quick-start tutorial claims to be choosy about what filetypes it will index:

{code}
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
{code}

From a quick glance at things, it looks like {{bin/post}} does pass a list of permissible filetypes to the underlying {{SimplePostTool}}, but that SimplePostTool doesn't follow this extension whitelist in the particular mode being invoked by the quickstart tutorial. So this is probably a wider bug that the quickstart/tutorial just happens to expose.
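
For illustration only, here is a minimal sketch of the filtering that the "File endings considered" message seems to promise. This is not SimplePostTool's actual code: the class {{ExtensionWhitelist}} and its methods are hypothetical names, and {{books.json}} and {{monitor.xml}} are placeholder file names. It only shows that, with the whitelist applied, {{post.jar}} and {{test_utf8.sh}} would be skipped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration; NOT the real SimplePostTool implementation.
public class ExtensionWhitelist {
    private final Set<String> allowed = new HashSet<>();

    // Takes a comma-separated list of extensions, as in the bin/post log line.
    public ExtensionWhitelist(String commaSeparatedTypes) {
        for (String type : commaSeparatedTypes.split(",")) {
            String trimmed = type.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                allowed.add(trimmed);
            }
        }
    }

    // True only if the name has an extension and that extension is whitelisted.
    public boolean accepts(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return allowed.contains(ext);
    }

    // Keeps only the accepted names, preserving input order.
    public List<String> filter(List<String> fileNames) {
        return fileNames.stream().filter(this::accepts).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ExtensionWhitelist whitelist = new ExtensionWhitelist(
            "xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log");
        List<String> candidates = Arrays.asList(
            "example/exampledocs/books.json",   // placeholder name
            "example/exampledocs/monitor.xml",  // placeholder name
            "example/exampledocs/test_utf8.sh",
            "example/exampledocs/post.jar");
        // Prints: [example/exampledocs/books.json, example/exampledocs/monitor.xml]
        System.out.println(whitelist.filter(candidates));
    }
}
```

Whether the real fix belongs in {{bin/post}} or in {{SimplePostTool}} itself is a separate question; the sketch only pins down the expected outcome.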