Jason Gerlowski created SOLR-11640:
--------------------------------------

             Summary: QuickStart Tutorial indexes post.jar, other unexpected files
                 Key: SOLR-11640
                 URL: https://issues.apache.org/jira/browse/SOLR-11640
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: documentation, scripts and tools
    Affects Versions: master (8.0)
            Reporter: Jason Gerlowski
            Priority: Trivial


Currently, the QuickStart tutorial included in the ref guide involves running 
the following command to index some example documents: 
{{bin/post -c techproducts example/exampledocs/*}}

This ends up attempting to index _all_ the files in that directory, which 
includes the expected example files, but also a bash script called 
{{test_utf8.sh}} and the {{post.jar}} JAR file itself.

The subsequent tutorial step involves running searches, which can bring up 
this ugly result:
{code}
      {
        "id":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
        "resourcename":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
        "content_type":["application/java-archive"],
        "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n   \n  
META-INF/MANIFEST.MF \n Manifest-Version: 1.0\r\nAnt-Version: Apache Ant 
1.9.6\r\nCreated-By: 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 (Oracle 
Corporati\r\n on)\r\nMain-Class: org.apache.solr.util.SimplePostTool\r\n\r\n \n\n 
\n  \n  org/apache/solr/util/RTimer$1.class \n  package  
org.apache.solr.util;\n synchronized   class  RTimer$1 {\n}\n \n\n \n  \n  
org/apache/solr/util/RTimer$NanoTimeTimerImpl.class \n  package  
org.apache.solr.util;\n synchronized   class  RTimer$NanoTimeTimerImpl  
implements  RTimer$TimerImpl {\n     private  long  start ;\n     private  
void RTimer$NanoTimeTimerImpl();\n     public  void  start ();\n     public  
double  elapsed ();\n}\n \n\n \n  \n  
org/apache/solr/util/RTimer$TimerImpl.class \n  package  
org.apache.solr.util;\n public   abstract   interface  RTimer$TimerImpl {\n
     public   abstract  void  start ();\n   
  public   abstract  double  elapsed ();\n}\n \n\n \n  \n  
org/apache/solr/util/RTimer.class \n  package  org.apache.solr.util;\n 
public   synchronized   class  RTimer {\n     public   static   final  int  
STARTED  = 0;\n     public   static   final  int  STOPPED  = 1;\n     public   
static   final  int  PAUSED  = 2;\n     protected  int  s
  ......[remaining code skipped for brevity]........"],
        "_version_":1583971861929132032},
{code}

It's honestly pretty cool that Tika can extract code from our post.jar file. 
It makes sense, but I didn't expect it. Still, it's probably not what we intended 
to show to new users, especially considering that the bin/post invocation in 
the quick-start tutorial claims to be choosy about which file types it will index:

{code}
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
{code}

From a quick glance at things, it looks like {{bin/post}} does pass a list of 
permissible filetypes to the underlying {{SimplePostTool}}, but that 
SimplePostTool doesn't follow this extension whitelist in the particular mode 
being invoked by the quickstart tutorial.  So this is probably a wider bug, 
that the quickstart/tutorial just happens to expose.



