[ 
https://issues.apache.org/jira/browse/SOLR-11640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249893#comment-16249893
 ] 

Jason Gerlowski commented on SOLR-11640:
----------------------------------------

It's possible this is a "feature" and not a "bug".  If that's the case, maybe 
we should clarify the "File endings considered are...." message output by 
SimplePostTool, as it implies a whitelist of file extensions.

> QuickStart Tutorial indexes post.jar, other unexpected files
> ------------------------------------------------------------
>
>                 Key: SOLR-11640
>                 URL: https://issues.apache.org/jira/browse/SOLR-11640
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: documentation, scripts and tools
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Priority: Trivial
>
> Currently, the QuickStart tutorial included in the ref guide involves running 
> the following command to index some example documents: {{bin/post -c 
> techproducts example/exampledocs/*}}
> This ends up attempting to index _all_ the files in that directory, which 
> includes the expected example files, but also a bash script called 
> {{test_utf8.sh}} and the {{post.jar}} JAR file itself.
> The subsequent tutorial step involves searching results, which can bring up 
> the ugly result:
> {code}
>       {
>         "id":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
>         "resourcename":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
>         "content_type":["application/java-archive"],
>         "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n   \n  
> META-INF/MANIFEST.MF \n Manifest-Version: 1.0\r\nAnt-Version: Apache Ant 
> 1.9.6\r\nCreated-By: 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 (Oracle Corp
> orati\r\n on)\r\nMain-Class: org.apache.solr.util.SimplePostTool\r\n\r\n \n\n 
> \n  \n  org/apache/solr/util/RTimer$1.class \n  package  
> org.apache.solr.util;\n synchronized   class  RTimer$1 {\n}\n \n\n \n  \n  o
> rg/apache/solr/util/RTimer$NanoTimeTimerImpl.class \n  package  
> org.apache.solr.util;\n synchronized   class  RTimer$NanoTimeTimerImpl  
> implements  RTimer$TimerImpl {\n     private  long  start ;\n     private  
> void RTimer$NanoTimeTimerImpl();\n     public  void  start ();\n     public  
> double  elapsed ();\n}\n \n\n \n  \n  
> org/apache/solr/util/RTimer$TimerImpl.class \n  package  
> org.apache.solr.util;\n public   abstra
> ct   interface  RTimer$TimerImpl {\n     public   abstract  void  start ();\n 
>     public   abstract  double  elapsed ();\n}\n \n\n \n  \n  
> org/apache/solr/util/RTimer.class \n  package  org.apache.solr.util;\n p
> ublic   synchronized   class  RTimer {\n     public   static   final  int  
> STARTED  = 0;\n     public   static   final  int  STOPPED  = 1;\n     public  
>  static   final  int  PAUSED  = 2;\n     protected  int  s
>   ......[remaining code skipped for brevity]........"],
>         "_version_":1583971861929132032},
> {code}
> It's honestly pretty cool that Tika can extract code from our post.jar file.  
> It makes sense, but I didn't expect it.  Still, it's probably not what we 
> intended to show new users, especially since the bin/post invocation in the 
> quick-start tutorial claims to be choosy about which filetypes it will index:
> {code}
> Entering auto mode. File endings considered are 
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> {code}
> From a quick glance, it looks like {{bin/post}} does pass a list of 
> permissible filetypes to the underlying {{SimplePostTool}}, but 
> {{SimplePostTool}} doesn't apply this extension whitelist in the particular 
> mode invoked by the quickstart tutorial.  So this is probably a wider bug 
> that the quickstart tutorial just happens to expose.
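
To make the expected behaviour concrete, here is a minimal sketch of what honouring an extension whitelist might look like. It is not SimplePostTool's actual code: the names `CONSIDERED_ENDINGS`, `is_considered`, and `filter_considered` are hypothetical, and the only thing taken from the issue is the extension list printed in the "File endings considered are" message.

```python
from pathlib import Path

# Extension list copied from the "File endings considered are" message.
# The constant and function names below are hypothetical, for illustration only.
CONSIDERED_ENDINGS = frozenset(
    "xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,"
    "ott,otp,ots,rtf,htm,html,txt,log".split(",")
)


def is_considered(path, endings=CONSIDERED_ENDINGS):
    """Return True if the file's extension is in the whitelist (case-insensitive)."""
    suffix = Path(path).suffix  # e.g. ".jar"; "" when there is no extension
    return suffix[1:].lower() in endings


def filter_considered(paths, endings=CONSIDERED_ENDINGS):
    """Keep only the paths whose extension is whitelisted, preserving order."""
    return [p for p in paths if is_considered(p, endings)]


if __name__ == "__main__":
    # sample.xml is a made-up file name; the other two are the files named in the issue.
    files = [
        "example/exampledocs/sample.xml",
        "example/exampledocs/test_utf8.sh",
        "example/exampledocs/post.jar",
    ]
    print(filter_considered(files))
```

With a filter like this applied to every file the shell glob expands to, {{test_utf8.sh}} and {{post.jar}} would be skipped rather than sent to Solr. Whether the right fix is to filter, or to reword the message so it no longer implies a whitelist, is the open question in this issue.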



