Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread Shawn Heisey

On 2/21/2021 3:07 PM, cratervoid wrote:

Thanks Shawn, I copied the solrconfig.xml file from the gettingstarted
example on 7.7.3 installation to the 8.8.0 installation, restarted the
server and it now works. Comparing the two files it looks like as you said
this section was left out of the _default/solrconfig.xml file in version
8.8.0:


 
  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

So those trying out the tutorial will need to add this section to get it to
work for sample.html.



This line from that config also is involved:

  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib"
       regex=".*\.jar" />


That loads the contrib jars needed for the ExtractingRequestHandler to 
work right.  There are a LOT of jars there.  Tika is a very heavyweight 
piece of software.


Thanks,
Shawn


Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread cratervoid
Thanks Shawn, I copied the solrconfig.xml file from the gettingstarted
example on 7.7.3 installation to the 8.8.0 installation, restarted the
server and it now works. Comparing the two files it looks like as you said
this section was left out of the _default/solrconfig.xml file in version
8.8.0:



  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

So those trying out the tutorial will need to add this section to get it to
work for sample.html.



On Sat, Feb 20, 2021 at 4:21 PM Shawn Heisey  wrote:

> On 2/20/2021 3:58 PM, cratervoid wrote:
> > SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
>
> The problem here is that the solrconfig.xml in use by the index named
> "gettingstarted" does not define a handler at /update/extract.
>
> Typically a handler defined at that URL path will utilize the extracting
> request handler class.  This handler uses Tika (another Apache project)
> to extract usable data from rich text formats like PDF, HTML, etc.
>
>
> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="fmap.meta">ignored_</str>
>     <str name="fmap.content">_text_</str>
>   </lst>
> </requestHandler>
>
> Note that using this handler will require adding some contrib jars to Solr.
>
> Tika can become very unstable because it deals with undocumented file
> formats, so we do not recommend using that handler in production.  If
> the functionality is important, Tika should be included in a program
> that's separate from Solr, so that if it crashes, it does not take Solr
> down with it.
>
> Thanks,
> Shawn
>


Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread cratervoid
Thanks Alex. I copied the solrconfig.xml over from 7.7.3 to the 8.8.0 conf
folder and restarted the server.  Now indexing works without erroring on
sample.html.  There is 1K difference between the 2 files so I'll diff them
to see what was left out of the 8.8 version.

On Sat, Feb 20, 2021 at 4:27 PM Alexandre Rafalovitch 
wrote:

> Most likely issue is that your core configuration (solrconfig.xml)
> does not have the request handler for that. The same config may have
> had that in 7.x, but changed since.
>
> More details:
> https://lucene.apache.org/solr/guide/8_8/uploading-data-with-solr-cell-using-apache-tika.html
>
> Regards,
>Alex.
>
> On Sat, 20 Feb 2021 at 17:59, cratervoid  wrote:
> >
> > I am trying out indexing the exampledocs in the examples folder with the
> > SimplePostTool on windows 10 using solr 8.8.  All the documents index
> > except sample.html. For that file I get the errors below.  I then
> > downloaded solr 7.7.3 and indexed the exampledocs folder with no errors,
> > including sample.html.
> > ```
> > PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> > example\exampledocs\post.jar example\exampledocs\sample.html
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://localhost:8983/solr/gettingstarted/update...
> > Entering auto mode. File endings considered are
> >
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > POSTing file sample.html (text/html) to [base]/extract
> > SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> > SimplePostTool: WARNING: Response: 
> > 
> > 
> > Error 404 Not Found
> > 
> > HTTP ERROR 404 Not Found
> > 
> > URI:/solr/gettingstarted/update/extract
> > STATUS:404
> > MESSAGE:Not Found
> > SERVLET:default
> > 
> >
> > 
> > 
> > SimplePostTool: WARNING: IOException while reading response:
> > java.io.FileNotFoundException:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> > 1 files indexed.
> > COMMITting Solr index changes to
> > http://localhost:8983/solr/gettingstarted/update...
> > Time spent: 0:00:00.086
> > ```
> >
> > However the json and all other file types index with no problem. For
> > example:
> > ```
> > PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> > example\exampledocs\post.jar example\exampledocs\books.json
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://localhost:8983/solr/gettingstarted/update...
> > Entering auto mode. File endings considered are
> >
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > POSTing file books.json (application/json) to [base]/json/docs
> > 1 files indexed.
> > COMMITting Solr index changes to
> > http://localhost:8983/solr/gettingstarted/update...
> > ```
> > Just following this tutorial:[
> >
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support][1
> > ]
> >
> >   [1]:
> >
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support
>


Re: HTML sample.html not indexing in Solr 8.8

2021-02-20 Thread Alexandre Rafalovitch
Most likely issue is that your core configuration (solrconfig.xml)
does not have the request handler for that. The same config may have
had that in 7.x, but changed since.

More details: 
https://lucene.apache.org/solr/guide/8_8/uploading-data-with-solr-cell-using-apache-tika.html

Regards,
   Alex.

On Sat, 20 Feb 2021 at 17:59, cratervoid  wrote:
>
> I am trying out indexing the exampledocs in the examples folder with the
> SimplePostTool on windows 10 using solr 8.8.  All the documents index
> except sample.html. For that file I get the errors below.  I then
> downloaded solr 7.7.3 and indexed the exampledocs folder with no errors,
> including sample.html.
> ```
> PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> example\exampledocs\post.jar example\exampledocs\sample.html
> SimplePostTool version 5.0.0
> Posting files to [base] url
> http://localhost:8983/solr/gettingstarted/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file sample.html (text/html) to [base]/extract
> SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> SimplePostTool: WARNING: Response: 
> 
> 
> Error 404 Not Found
> 
> HTTP ERROR 404 Not Found
> 
> URI:/solr/gettingstarted/update/extract
> STATUS:404
> MESSAGE:Not Found
> SERVLET:default
> 
>
> 
> 
> SimplePostTool: WARNING: IOException while reading response:
> java.io.FileNotFoundException:
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/gettingstarted/update...
> Time spent: 0:00:00.086
> ```
>
> However the json and all other file types index with no problem. For
> example:
> ```
> PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> example\exampledocs\post.jar example\exampledocs\books.json
> SimplePostTool version 5.0.0
> Posting files to [base] url
> http://localhost:8983/solr/gettingstarted/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file books.json (application/json) to [base]/json/docs
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/gettingstarted/update...
> ```
> Just following this tutorial:[
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support][1
> ]
>
>   [1]:
> https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support


Re: HTML sample.html not indexing in Solr 8.8

2021-02-20 Thread Shawn Heisey

On 2/20/2021 3:58 PM, cratervoid wrote:

SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html


The problem here is that the solrconfig.xml in use by the index named 
"gettingstarted" does not define a handler at /update/extract.


Typically a handler defined at that URL path will utilize the extracting 
request handler class.  This handler uses Tika (another Apache project) 
to extract usable data from rich text formats like PDF, HTML, etc.


  
  

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

Note that using this handler will require adding some contrib jars to Solr.
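
For reference, the sample configs that ship with Solr pull those jars in with
<lib> directives along these lines (the exact dir values depend on your install
layout, so treat this as a sketch):

```
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
```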

Tika can become very unstable because it deals with undocumented file 
formats, so we do not recommend using that handler in production.  If 
the functionality is important, Tika should be included in a program 
that's separate from Solr, so that if it crashes, it does not take Solr 
down with it.
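
A minimal sketch of that separate-process approach with Tika and SolrJ (the core
name, id, and field names below are only examples, not a finished tool):

```
import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class ExternalTikaIndexer {
    public static void main(String[] args) throws Exception {
        // Tika runs in this separate JVM, so a parser crash cannot take Solr down
        Tika tika = new Tika();
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/gettingstarted").build()) {
            for (String path : args) {
                File f = new File(path);
                // extract plain text from PDF/HTML/Office/... with Tika
                String body = tika.parseToString(f);

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", f.getAbsolutePath());   // id choice is just an example
                doc.addField("_text_", body);              // field name is just an example
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```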


Thanks,
Shawn


HTML sample.html not indexing in Solr 8.8

2021-02-20 Thread cratervoid
I am trying out indexing the exampledocs in the examples folder with the
SimplePostTool on windows 10 using solr 8.8.  All the documents index
except sample.html. For that file I get the errors below.  I then
downloaded solr 7.7.3 and indexed the exampledocs folder with no errors,
including sample.html.
```
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
example\exampledocs\post.jar example\exampledocs\sample.html
SimplePostTool version 5.0.0
Posting files to [base] url
http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file sample.html (text/html) to [base]/extract
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
SimplePostTool: WARNING: Response: 


Error 404 Not Found

HTTP ERROR 404 Not Found

URI:/solr/gettingstarted/update/extract
STATUS:404
MESSAGE:Not Found
SERVLET:default




SimplePostTool: WARNING: IOException while reading response:
java.io.FileNotFoundException:
http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
1 files indexed.
COMMITting Solr index changes to
http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:00.086
```

However the json and all other file types index with no problem. For
example:
```
PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
example\exampledocs\post.jar example\exampledocs\books.json
SimplePostTool version 5.0.0
Posting files to [base] url
http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]/json/docs
1 files indexed.
COMMITting Solr index changes to
http://localhost:8983/solr/gettingstarted/update...
```
Just following this tutorial:[
https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support][1
]

  [1]:
https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support


Re: Urgent- General Question about document Indexing frequency in solr

2021-02-04 Thread Scott Stults
Manisha,

The most general recommendation around commits is to not explicitly commit
after every update. There are settings that will let Solr automatically
commit after some threshold is met, and by delegating commits to that
mechanism you can generally ingest faster.

See this blog post that goes into detail about how to set that up for your
situation:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
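
As a rough illustration, the relevant solrconfig.xml section ends up looking
something like this (the interval values below are only examples; tune them to
your own latency and durability needs):

```
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes to disk regularly but does not open a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: controls how quickly new documents become visible to searches -->
  <autoSoftCommit>
    <maxTime>30000</maxTime>
  </autoSoftCommit>
</updateHandler>
```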


Kind regards,
Scott


On Wed, Feb 3, 2021 at 5:44 PM Manisha Rahatadkar <
manisha.rahatad...@anjusoftware.com> wrote:

> Hi All
>
> Looking for some help on document indexing frequency. I am using apache
> solr 7.7 and SolrNet library to commit documents to Solr. Summary for this
> function is:
> // Summary:
> // Commits posted documents, blocking until index changes are flushed
> to disk and
> // blocking until a new searcher is opened and registered as the main
> query searcher,
> // making the changes visible.
>
> I understand that, the document gets reindexed after every commit. I have
> noticed that as the number of documents are increasing, the reindexing
> takes time. and sometimes I am getting solr connection time out error.
> I have following questions:
>
>   1.  Is there any frequency suggested by Solr for document insert/update
> and reindex? Is there any standard recommendation?
>   2.  If I remove the copy fields from managed-schema.xml, do I need to
> delete the existing indexed data from solr core and then insert data and
> reindex it again?
>
> Thanks in advance.
>
> Regards
> Manisha
>
>
>
> Confidentiality Notice
> 
> This email message, including any attachments, is for the sole use of the
> intended recipient and may contain confidential and privileged information.
> Any unauthorized view, use, disclosure or distribution is prohibited. If
> you are not the intended recipient, please contact the sender by reply
> email and destroy all copies of the original message. Anju Software, Inc.
> 4500 S. Lakeshore Drive, Suite 620, Tempe, AZ USA 85282.
>


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com


Urgent- General Question about document Indexing frequency in solr

2021-02-03 Thread Manisha Rahatadkar
Hi All

Looking for some help on document indexing frequency. I am using apache solr 
7.7 and SolrNet library to commit documents to Solr. Summary for this function 
is:
// Summary:
// Commits posted documents, blocking until index changes are flushed to 
disk and
// blocking until a new searcher is opened and registered as the main query 
searcher,
// making the changes visible.

I understand that, the document gets reindexed after every commit. I have 
noticed that as the number of documents are increasing, the reindexing takes 
time. and sometimes I am getting solr connection time out error.
I have following questions:

  1.  Is there any frequency suggested by Solr for document insert/update and 
reindex? Is there any standard recommendation?
  2.  If I remove the copy fields from managed-schema.xml, do I need to delete 
the existing indexed data from solr core and then insert data and reindex it 
again?

Thanks in advance.

Regards
Manisha



Confidentiality Notice

This email message, including any attachments, is for the sole use of the 
intended recipient and may contain confidential and privileged information. Any 
unauthorized view, use, disclosure or distribution is prohibited. If you are 
not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message. Anju Software, Inc. 4500 S. 
Lakeshore Drive, Suite 620, Tempe, AZ USA 85282.


Re: NRT - Indexing

2021-02-02 Thread Dominique Bejean
Hi,

The issue was buildOnCommit=true on a SuggestComponent.

Dominique

Le mar. 2 févr. 2021 à 00:54, Shawn Heisey  a écrit :

> On 2/1/2021 12:08 AM, haris.k...@vnc.biz wrote:
> > Hope you're doing good. I am trying to configure NRT - Indexing in my
> > project. For this reason, I have configured *autoSoftCommit* to execute
> > every second and *autoCommit* to execute every 5 minutes. Everything
> > works as expected on the dev and test server. But on the production
> > server, there are more than 6 million documents indexed in Solr, so
> > whenever a new document is indexed it takes 2-3 minutes before appearing
> > in the search despite the setting I have described above. Since the
> > target is to develop a real-time system, this delay of 2-3 minutes is
> > not acceptable. How can I reduce this time window?
>
> Setting autoSoftCommit with a max time of 1000 (one second) does not
> mean you will see changes within one second.  It means that one second
> after indexing begins, Solr will start a soft commit operation.  That
> commit operation must fully complete and the new searcher must come
> online before changes are visible.  Those steps may take much longer
> than one second, which seems to be happening on your system.
>
> With the information available, I cannot tell you why your commits are
> taking so long.  One of the most common reasons for poor Solr
> performance is a lack of free memory on the system for caching purposes.
>
> Thanks,
> Shawn
>


Re: NRT - Indexing

2021-02-01 Thread Shawn Heisey

On 2/1/2021 12:08 AM, haris.k...@vnc.biz wrote:
Hope you're doing good. I am trying to configure NRT - Indexing in my 
project. For this reason, I have configured *autoSoftCommit* to execute 
every second and *autoCommit* to execute every 5 minutes. Everything 
works as expected on the dev and test server. But on the production 
server, there are more than 6 million documents indexed in Solr, so 
whenever a new document is indexed it takes 2-3 minutes before appearing 
in the search despite the setting I have described above. Since the 
target is to develop a real-time system, this delay of 2-3 minutes is 
not acceptable. How can I reduce this time window?


Setting autoSoftCommit with a max time of 1000 (one second) does not 
mean you will see changes within one second.  It means that one second 
after indexing begins, Solr will start a soft commit operation.  That 
commit operation must fully complete and the new searcher must come 
online before changes are visible.  Those steps may take much longer 
than one second, which seems to be happening on your system.


With the information available, I cannot tell you why your commits are 
taking so long.  One of the most common reasons for poor Solr 
performance is a lack of free memory on the system for caching purposes.


Thanks,
Shawn


Re: NRT - Indexing

2021-02-01 Thread Dominique Bejean
Hi,

It is not the cause of your issue, but your Solr version is 8.6.0 and your
solrconfig.xml includes <luceneMatchVersion>7.5.0</luceneMatchVersion>.

By "I am using a service that fetches data from the Postgres database and
indexes it to solr. The service runs with a delay of 5 seconds.", you mean
you are using DIH and launching a delta-import every 5 seconds?

Solr logs may help.

Dominique



Le lun. 1 févr. 2021 à 13:00,  a écrit :

> Hello,
>
>
> I am attaching the solrconfig.xml along with this email, also I am
> attaching a text document that has JSON object regarding the system
> information I am using a service that fetches data from the Postgres
> database and indexes it to solr. The service runs with a delay of 5 seconds.
>
>
> Regards
>
>
> Mit freundlichen Grüssen / Kind regards
>
>
> Muhammad Haris Khan
>
>
> *VNC - Virtual Network Consult*
>
>
> *-- Solr Ingenieur --*
>
>
> - On 1 February, 2021 3:50 PM, Dominique Bejean <
> dominique.bej...@eolya.fr> wrote:
>
>
>
> Hi,
>
>
> What is your Solr version ?
>
> Can you share your solrconfig.xml ?
>
> How is your sharding ?
>
> Did you grep your solr logs on with the "commit' pattern in order to see
>
> hard and soft commit occurrences ?
>
> How are you pushing new docs or updates in the collection ?
>
>
> Regards.
>
>
> Dominique
>
>
>
>
>
> Le lun. 1 févr. 2021 à 08:08,  a écrit :
>
>
> > Hello,
>
> >
>
> > Hope you're doing good. I am trying to configure NRT - Indexing in my
>
> > project. For this reason, I have configured *autoSoftCommit* to execute
>
> > every second and *autoCommit* to execute every 5 minutes. Everything
>
> > works as expected on the dev and test server. But on the production
> server,
>
> > there are more than 6 million documents indexed in Solr, so whenever a
> new
>
> > document is indexed it takes 2-3 minutes before appearing in the search
>
> > despite the setting I have described above. Since the target is to
> develop
>
> > a real-time system, this delay of 2-3 minutes is not acceptable. How can
> I
>
> > reduce this time window?
>
> >
>
> > Plus any advice on better scaling the Solr considering more than 6
> million
>
> > records would be very helpful. Thank you in advance.
>
> >
>
> >
>
> >
>
> > Mit freundlichen Grüssen / Kind regards
>
> >
>
> > Muhammad Haris Khan
>
> >
>
> > *VNC - Virtual Network Consult*
>
> >
>
> > *-- Solr Ingenieur --*
>
> >
>


Re: NRT - Indexing

2021-02-01 Thread haris . khan
Hello,

I am attaching the solrconfig.xml along with this email, also I am attaching a
text document that has a JSON object regarding the system information. I am
using a service that fetches data from the Postgres database and indexes it to
solr. The service runs with a delay of 5 seconds.

Regards

Mit freundlichen Grüssen / Kind regards

Muhammad Haris Khan

VNC - Virtual Network Consult

-- Solr Ingenieur --

- On 1 February, 2021 3:50 PM, Dominique Bejean dominique.bej...@eolya.fr wrote:

> Hi,
>
> What is your Solr version ?
> Can you share your solrconfig.xml ?
> How is your sharding ?
> Did you grep your solr logs with the "commit" pattern in order to see
> hard and soft commit occurrences ?
> How are you pushing new docs or updates in the collection ?
>
> Regards.
> Dominique
>
> Le lun. 1 févr. 2021 à 08:08, haris.k...@vnc.biz a écrit :
>
> > Hello,
> >
> > Hope you're doing good. I am trying to configure NRT - Indexing in my
> > project. For this reason, I have configured *autoSoftCommit* to execute
> > every second and *autoCommit* to execute every 5 minutes. Everything works
> > as expected on the dev and test server. But on the production server, there
> > are more than 6 million documents indexed in Solr, so whenever a new
> > document is indexed it takes 2-3 minutes before appearing in the search
> > despite the setting I have described above. Since the target is to develop
> > a real-time system, this delay of 2-3 minutes is not acceptable. How can I
> > reduce this time window?
> >
> > Plus any advice on better scaling the Solr considering more than 6 million
> > records would be very helpful. Thank you in advance.
> >
> > Mit freundlichen Grüssen / Kind regards
> >
> > Muhammad Haris Khan
> >
> > *VNC - Virtual Network Consult*
> >
> > *-- Solr Ingenieur --*

solrconfig.xml
Description: XML document


Re: NRT - Indexing

2021-02-01 Thread Dominique Bejean
Hi,

What is your Solr version ?
Can you share your solrconfig.xml ?
How is your sharding ?
Did you grep your solr logs on with the "commit' pattern in order to see
hard and soft commit occurrences ?
How are you pushing new docs or updates in the collection ?

Regards.

Dominique




Le lun. 1 févr. 2021 à 08:08,  a écrit :

> Hello,
>
> Hope you're doing good. I am trying to configure NRT - Indexing in my
> project. For this reason, I have configured *autoSoftCommit* to execute
> every second and *autoCommit* to execute every 5 minutes. Everything
> works as expected on the dev and test server. But on the production server,
> there are more than 6 million documents indexed in Solr, so whenever a new
> document is indexed it takes 2-3 minutes before appearing in the search
> despite the setting I have described above. Since the target is to develop
> a real-time system, this delay of 2-3 minutes is not acceptable. How can I
> reduce this time window?
>
> Plus any advice on better scaling the Solr considering more than 6 million
> records would be very helpful. Thank you in advance.
>
>
>
> Mit freundlichen Grüssen / Kind regards
>
> Muhammad Haris Khan
>
> *VNC - Virtual Network Consult*
>
> *-- Solr Ingenieur --*
>


Re: NRT - Indexing

2021-02-01 Thread Mr Havercamp
I'm running into the same issue. I've set autoSoftCommit and autoCommit but
the speed at which docs are indexed seems to be inconsistent with the
settings. I have lowered the autoCommit to a minute but it still takes a
few minutes for docs to show after indexing. Soft commit settings also seem
to have no effect (from what I understand of the docs, Soft commit makes
items viewable but I'm not seeing them until well after the autoCommit
period has passed.

On Mon, 1 Feb 2021 at 15:08,  wrote:

> Hello,
>
> Hope you're doing good. I am trying to configure NRT - Indexing in my
> project. For this reason, I have configured *autoSoftCommit* to execute
> every second and *autoCommit* to execute every 5 minutes. Everything
> works as expected on the dev and test server. But on the production server,
> there are more than 6 million documents indexed in Solr, so whenever a new
> document is indexed it takes 2-3 minutes before appearing in the search
> despite the setting I have described above. Since the target is to develop
> a real-time system, this delay of 2-3 minutes is not acceptable. How can I
> reduce this time window?
>
> Plus any advice on better scaling the Solr considering more than 6 million
> records would be very helpful. Thank you in advance.
>
>
>
> Mit freundlichen Grüssen / Kind regards
>
> Muhammad Haris Khan
>
> *VNC - Virtual Network Consult*
>
> *-- Solr Ingenieur --*
>


NRT - Indexing

2021-01-31 Thread haris . khan
Hello,

Hope you're doing good. I am trying to configure NRT - Indexing in my project. 
For this reason, I have configured autoSoftCommit to execute every
second and autoCommit to execute every 5 minutes. Everything works as
expected on the dev and test server. But on the production server, there are 
more than 6 million documents indexed in Solr, so whenever a new document is 
indexed it takes 2-3 minutes before appearing in the search despite the setting 
I have described above. Since the target is to develop a real-time system, this 
delay of 2-3 minutes is not acceptable. How can I reduce this time window?

Plus any advice on better scaling the Solr considering more than 6 million 
records would be very helpful. Thank you in advance.



Mit freundlichen Grüssen / Kind regards

Muhammad Haris Khan

VNC - Virtual Network Consult

-- Solr Ingenieur --


NRT - Indexing

2021-01-29 Thread haris . khan
Hello,

Hope you're doing good. I am trying to configure NRT - Indexing in my project. 
For this reason, I have configured autoSoftCommit to execute every second and 
autoCommit to execute every 5 minutes. Everything works as expected on the dev 
and test server. But on the production server, there are more than 6 million 
documents indexed in Solr, so whenever a new document is indexed it takes 2-3 
minutes before appearing in the search despite the setting I have described 
above. Since the target is to develop a real-time system, this delay of 2-3 
minutes is not acceptable. How can I reduce this time window?

Plus any advice on better scaling the Solr considering more than 6 million 
records would be very helpful. Thank you in advance.


Mit freundlichen Grüssen / Kind regards

Muhammad Haris Khan

VNC - Virtual Network Consult

-- Solr Ingenieur --


Re: Re:Interpreting Solr indexing times

2021-01-13 Thread Alessandro Benedetti
I agree, documents may be gigantic or very small,  with heavy text analysis
or simple strings ...
so it's not possible to give an evaluation here.
But you could make use of the nightly benchmark to give you an idea of
Lucene indexing speed (the engine inside Apache Solr) :

http://home.apache.org/~mikemccand/lucenebench/indexing.html

Not sure we have something similar for Apache Solr officially.
https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceData -> this
should be a bit outdated

Cheers



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: [Solr8.7] Indexing only some language ?

2021-01-10 Thread Bruno Mannina
Perfect ! Thanks !

-Message d'origine-
De : xiefengchang [mailto:fengchang_fi...@163.com]
Envoyé : dimanche 10 janvier 2021 04:50
À : solr-user@lucene.apache.org
Objet : Re:[Solr8.7] Indexing only some language ?

Take a look at the document here:
https://lucene.apache.org/solr/guide/8_7/dynamic-fields.html#dynamic-fields


here's the point: "a field that does not match any explicitly defined fields
can be matched with a dynamic field."


so I guess the priority is quite clear~

















At 2021-01-10 03:38:01, "Bruno Mannina"  wrote:
>Hello,
>
>
>
>I would like to define in my schema.xml some text_xx fields.
>
>I have patent titles in several languages.
>
>Only 6 of them (EN, IT, FR, PT, ES, DE) interest me.
>
>
>
>I know how to define these 6 fields, I use text_en, text_it etc.
>
>
>
>i.e. for English language:
>
>stored="true" termVectors="true" termPositions="true"
>termOffsets="true"/>
>
>
>
>But I have more than 6 languages like: AR, CN, JP, KR etc.
>
>I can't analyze all source files to detect all languages and define
>them in my schema.
>
>
>
>I would like to use a dynamic field to index other languages.
>
>indexed="true" stored="true" omitTermFreqAndPositions="true"
>omitNorms="true"/>
>
>
>
>Is it ok to do that?
>
>Is TIEN field will be indexed twice internally or as tien is already
>defined
>ti* will not process tien?
>
>
>
>Thanks for your kind reply,
>
>
>
>Sincerely
>
>Bruno
>
>
>
>
>
>
>
>
>
>--
>L'absence de virus dans ce courrier électronique a été vérifiée par le
logiciel antivirus Avast.
>https://www.avast.com/antivirus


--
L'absence de virus dans ce courrier électronique a été vérifiée par le logiciel 
antivirus Avast.
https://www.avast.com/antivirus



Re:Interpreting Solr indexing times

2021-01-10 Thread xiefengchang
it's hard to answer your question without your solrconfig.xml, 
managed-schema(or schema.xml), and good to have some log snippet as well~

















At 2021-01-07 21:28:00, "ufuk yılmaz"  wrote:
>Hello all,
>
>I have been looking at our SolrCloud indexing performance statistics and 
>trying to make sense of the numbers. We are using a custom Flume sink and 
>sending updates to Solr (8.4) using SolrJ.
>
>I know these stuff depend on a lot of things but can you tell me if these 
>statistics are horribly bad (which means something is going obviously wrong), 
>or something expectable from a Solr cluster under right circumstances?
>
>We are sending documents in batches of 1000.
>
>{
>  "UPDATE./update.distrib.requestTimes": {
>"count": 7579,
>"meanRate": 0.044953336300254124,
>"1minRate": 0.2855655259375961,
>"5minRate": 0.29214637836736357,
>"15minRate": 0.29510868125823914,
>"min_ms": 5.854106,
>"max_ms": 56854.784017,
>"mean_ms": 3100.877968690649,
>"median_ms": 1084.258683,
>"stddev_ms": 4643.097311691323,
>"p75_ms": 2407.196867,
>"p95_ms": 15509.748909,
>"p99_ms": 16206.134345,
>"p999_ms": 16206.134345
>  },
>  "UPDATE./update.local.totalTime": 0,
>  "UPDATE./update.requestTimes": {
>"count": 7579,
>"meanRate": 0.044953336230621366,
>"1minRate": 0.2855655259375961,
>"5minRate": 0.29214637836736357,
>"15minRate": 0.29510868125823914,
>"min_ms": 5.857796,
>"max_ms": 56854.792298,
>"mean_ms": 3100.885675292589,
>"median_ms": 1084.264825,
>"stddev_ms": 4643.097457508117,
>"p75_ms": 2407.201642,
>"p95_ms": 15509.755934,
>"p99_ms": 16206.141754,
>"p999_ms": 16206.141754
>  },
>  "UPDATE./update.requests": 7580,
>  "UPDATE./update.totalTime": 33520426747162,
>  "UPDATE.update.totalTime": 0,
>  "UPDATE.updateHandler.adds": 854,
>  "UPDATE.updateHandler.autoCommitMaxTime": "15000ms",
>  "UPDATE.updateHandler.autoCommits": 2428,
>  "UPDATE.updateHandler.softAutoCommitMaxTime":"1ms",
>  "UPDATE.updateHandler.softAutoCommits":3380,
>  "UPDATE.updateHandler.commits": {
>"count": 5777,
>"meanRate": 0.034265134931240636,
>"1minRate": 0.13653886429826526,
>"5minRate": 0.12997330621941325,
>"15minRate": 0.12634106125326003
>  },
>  "UPDATE.updateHandler.cumulativeAdds": {
>"count": 2578492,
>"meanRate": 15.293816240408821,
>"1minRate": 90.7054223213904,
>"5minRate": 99.48315440730897,
>"15minRate": 101.77967003607128
>  },
>}
>
>
>Sent from Mail for Windows 10
>


Re:[Solr8.7] Indexing only some language ?

2021-01-09 Thread xiefengchang
Take a look at the document here: 
https://lucene.apache.org/solr/guide/8_7/dynamic-fields.html#dynamic-fields


here's the point: "a field that does not match any explicitly defined fields 
can be matched with a dynamic field."


so I guess the priority is quite clear~
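
As a rough sketch of what that means for Bruno's case (the field and type names
below are only illustrative, not his actual schema):

```
<!-- explicitly defined English title field: matched before any dynamic field -->
<field name="tien" type="text_en" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<!-- catch-all dynamic field: used only when no explicit field matches -->
<dynamicField name="ti*" type="text_general" indexed="true" stored="true"
              omitTermFreqAndPositions="true" omitNorms="true"/>
```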

















At 2021-01-10 03:38:01, "Bruno Mannina"  wrote:
>Hello,
>
>
>
>I would like to define in my schema.xml some text_xx fields.
>
>I have patent titles in several languages.
>
>Only 6 of them (EN, IT, FR, PT, ES, DE) interest me.
>
>
>
>I know how to define these 6 fields, I use text_en, text_it etc.
>
>
>
>i.e. for English language:
>
>stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
>
>
>
>But I have more than 6 languages like: AR, CN, JP, KR etc.
>
>I can't analyze all source files to detect all languages and define them in
>my schema.
>
>
>
>I would like to use a dynamic field to index other languages.
>
>indexed="true" stored="true" omitTermFreqAndPositions="true"
>omitNorms="true"/>
>
>
>
>Is it ok to do that?
>
>Is TIEN field will be indexed twice internally or as tien is already defined
>ti* will not process tien?
>
>
>
>Thanks for your kind reply,
>
>
>
>Sincerely
>
>Bruno
>
>
>
>
>
>
>
>
>
>--
>L'absence de virus dans ce courrier électronique a été vérifiée par le 
>logiciel antivirus Avast.
>https://www.avast.com/antivirus


[Solr8.7] Indexing only some language ?

2021-01-09 Thread Bruno Mannina
Hello,



I would like to define in my schema.xml some text_xx fields.

I have patent titles in several languages.

Only 6 of them (EN, IT, FR, PT, ES, DE) interest me.



I know how to define these 6 fields, I use text_en, text_it etc.



i.e. for English language:





But I have more than 6 languages like: AR, CN, JP, KR etc.

I can't analyze all source files to detect all languages and define them in
my schema.



I would like to use a dynamic field to index other languages.





Is it ok to do that?

Is TIEN field will be indexed twice internally or as tien is already defined
ti* will not process tien?



Thanks for your kind reply,



Sincerely

Bruno









--
L'absence de virus dans ce courrier électronique a été vérifiée par le logiciel 
antivirus Avast.
https://www.avast.com/antivirus


Interpreting Solr indexing times

2021-01-07 Thread ufuk yılmaz
Hello all,

I have been looking at our SolrCloud indexing performance statistics and trying 
to make sense of the numbers. We are using a custom Flume sink and sending 
updates to Solr (8.4) using SolrJ.

I know this stuff depends on a lot of things, but can you tell me if these
statistics are horribly bad (which means something is obviously going wrong),
or something expectable from a Solr cluster under the right circumstances?

We are sending documents in batches of 1000.

{
  "UPDATE./update.distrib.requestTimes": {
"count": 7579,
"meanRate": 0.044953336300254124,
"1minRate": 0.2855655259375961,
"5minRate": 0.29214637836736357,
"15minRate": 0.29510868125823914,
"min_ms": 5.854106,
"max_ms": 56854.784017,
"mean_ms": 3100.877968690649,
"median_ms": 1084.258683,
"stddev_ms": 4643.097311691323,
"p75_ms": 2407.196867,
"p95_ms": 15509.748909,
"p99_ms": 16206.134345,
"p999_ms": 16206.134345
  },
  "UPDATE./update.local.totalTime": 0,
  "UPDATE./update.requestTimes": {
"count": 7579,
"meanRate": 0.044953336230621366,
"1minRate": 0.2855655259375961,
"5minRate": 0.29214637836736357,
"15minRate": 0.29510868125823914,
"min_ms": 5.857796,
"max_ms": 56854.792298,
"mean_ms": 3100.885675292589,
"median_ms": 1084.264825,
"stddev_ms": 4643.097457508117,
"p75_ms": 2407.201642,
"p95_ms": 15509.755934,
"p99_ms": 16206.141754,
"p999_ms": 16206.141754
  },
  "UPDATE./update.requests": 7580,
  "UPDATE./update.totalTime": 33520426747162,
  "UPDATE.update.totalTime": 0,
  "UPDATE.updateHandler.adds": 854,
  "UPDATE.updateHandler.autoCommitMaxTime": "15000ms",
  "UPDATE.updateHandler.autoCommits": 2428,
  "UPDATE.updateHandler.softAutoCommitMaxTime":"1ms",
  "UPDATE.updateHandler.softAutoCommits":3380,
  "UPDATE.updateHandler.commits": {
"count": 5777,
"meanRate": 0.034265134931240636,
"1minRate": 0.13653886429826526,
"5minRate": 0.12997330621941325,
"15minRate": 0.12634106125326003
  },
  "UPDATE.updateHandler.cumulativeAdds": {
"count": 2578492,
"meanRate": 15.293816240408821,
"1minRate": 90.7054223213904,
"5minRate": 99.48315440730897,
"15minRate": 101.77967003607128
  },
}


Sent from Mail for Windows 10



Re: Indexing performance 7.3 vs 8.7

2020-12-23 Thread Bram Van Dam
On 23/12/2020 16:00, Ron Buchanan wrote:
>   - both run Java 1.8, but 7.3 is running HotSpot and 8.7 is running
>   OpenJDK (and a bit newer)

If you're using G1GC, you probably want to give Java 11 a go. It's an
easy thing to test, and it's had a positive impact for us. Your mileage
may vary.

 - Bram


Indexing performance 7.3 vs 8.7

2020-12-23 Thread Ron Buchanan
(this is long, just trying to be thorough)

I'm working on upgrading from Solr 7.3 to Solr 8.7 and I am seeing a
significant drop in indexing throughput during a full index reload - from
~1300 documents per second to ~450 documents/sec

Background:

VM hosts (these are configured identically):


   - Our Solr clusters run in a virtualized environment.
  - Each Virtual Machine has 8 CPUs and 64Gb RAM.
  - The hosts are organized into 2 4-host clusters - one for 7.3 and
  one for 8.7.
  - Each cluster has its own 3 VM Zookeeper cluster (running the
  version that was current at the time of install).


JVM:


   - all the JVMs are set-up with -Xms28G and -Xmx28Gb
  - the Solr 8.7 cluster is running with the default JVM settings
  (i.e., as configured by the Solr install script) **other than memory**
  - the Solr 7.3 cluster was configured awhile ago, but I'm fairly sure
  it's running pretty vanilla JVM settings (if not outright
default) **other
  than memory**
  - the most obvious difference between the JVM settings for the
  environments is the garbage collector: ConcurrentMarkSweep for
7.3 and G1GC
  for 8.7
  - both run Java 1.8, but 7.3 is running HotSpot and 8.7 is running
  OpenJDK (and a bit newer)


Solr:


   - 1 shard, 1 replica per host - all NRT (both clusters)
  - Both the Solr 7.3 and 8.7 clusters are running the same schema
  - with one exception, only the most minimal changes were made to the
  default Solr 8.7 solrconfig.xml to keep it in-line with the 7.3
solrconfig
  (mostly around Cache settings)
 - the exception: running with luceneMatchVersion=7.3.0


Data Loading:


   - Data is loaded by a completely separate VM running a custom Java
  process that collects data from source and generates SolrInputDocuments
  from that source and sends it via CloudSolrClient
  - this Java process is multi-threaded with an upper-limit on the
  number of simultaneous threads sending documents and the size of the
  document payload
  - we are loading ~10 million documents during a full-reload - this is
  a product catalog, so the documents actually represent data about SKUs we
  sell (and they aren't particularly large, though the size is variable)
  - the existing Solr 7.3 cluster has a full-reload time of around 2.5
  hours, the Solr 8.7 cluster requires around 6.25 hours


Efforts so far:

   - checked network speed from the VM generating updates (it's the same
   server for both 7.3 and 8.7) and the clusters
  - performance to the 8.7 cluster is actually better
   - as best as possible, controlling for VM topology (i.e., distribution
   of the VMs across hosts within the VM cluster)
   - real-time JVM monitoring with VisualVM during indexing on 8.7 cluster
  - looked nice - same as I've always seen for the 7.3 cluster
   - checked the GC logs with GCEasy
  - reported as healthy


Thoughts/questions/considerations:

   - could running an older LuceneMatchVersion affect indexing performance?
   - still a little concerned that the VM topology is affecting things (our
   VM-crew split the 7.3 cluster across VM clusters in an attempt to improve
   resiliency in case VM cluster failure and that's not something we can or
   want to replicate) - that said, the performance difference is consistent
   with what I've seen in our QA environment and that environment has a less
   even spread of VMs across hosts (e.g., multiple Solr VMs on the same VM
   host)
   - we have a couple of custom tokenizers and tokenFilters - those were
   rebuilt using the 8.7.0 versions of solr-core and apache-core - they're
   pretty simple and I'm not terribly concerned about this, but it is
   non-standard
   - query performance is comparable between 7.3 and 8.7 and documents
   returned are reasonably consistent (few really big differences, mostly just
   scoring differences that affect ordering)
   - after watching the 8.7 JVMs in real-time during indexing, I decided to
   drop the memory to -Xms20g and -Xmx20g - this had no effect on indexing
   speed (or GC impacts) - so, I think it's at least safe to say this is not
   memory-bound


Final question:

is it simply typical to see significantly worse indexing performance on 8.7
than 7.3?

Any suggestions on where to look would be highly appreciated.

Thanks,

Ron


Re: SOLR 8.6.0 date Indexing Issues.

2020-11-20 Thread Jörn Franke
You should format the date according to the ISO standard:

https://lucene.apache.org/solr/guide/6_6/working-with-dates.html

Eg. 2018-07-12T00:00:00Z

You can either transform the date that you have in Solr or in your client 
pushing the doc to Solr. 
All major programming languages have date utilities that allow you to do this
transformation easily.
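
For example, a minimal Java conversion along those lines (assuming the source
values always look like "12-Jul-18") could be:

```
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class SolrDateConverter {
    // parses values such as "12-Jul-18" as they come out of MongoDB
    private static final DateTimeFormatter SOURCE =
            DateTimeFormatter.ofPattern("d-MMM-yy", Locale.ENGLISH);

    public static String toSolrDate(String raw) {
        LocalDate date = LocalDate.parse(raw, SOURCE);
        // Solr date fields expect ISO-8601 UTC, e.g. 2018-07-12T00:00:00Z
        return date.atStartOfDay(ZoneOffset.UTC)
                   .format(DateTimeFormatter.ISO_INSTANT);
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate("12-Jul-18")); // prints 2018-07-12T00:00:00Z
    }
}
```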


> Am 20.11.2020 um 21:50 schrieb Fiz N :
> 
> Hello Experts,
> 
> I am having  issues with indexing Date field in SOLR 8.6.0. I am indexing
> from MongoDB. In MongoDB the Format is as follows
> 
> 
> * "R_CREATION_DATE" : "12-Jul-18",  "R_MODIFY_DATE" : "30-Apr-19", *
> 
> In my Managed Schema I have the following entries.
> 
> 
> 
> 
> .
> 
> I am getting an error in the Solr log.
> 
> * org.apache.solr.common.SolrException: ERROR: [doc=mt_100] Error adding
> field 'R_MODIFY_DATE'='15-Jul-19' msg=Couldn't parse date because:
> Improperly formatted datetime: 15-Jul-19*
> 
> Please let me know how to handle this usecase with Date format
> "12-JUL-18". what changes should I do to make it work ?
> 
> Thanks
> Fiz N.


SOLR 8.6.0 date Indexing Issues.

2020-11-20 Thread Fiz N
Hello Experts,

I am having  issues with indexing Date field in SOLR 8.6.0. I am indexing
from MongoDB. In MongoDB the Format is as follows


* "R_CREATION_DATE" : "12-Jul-18",  "R_MODIFY_DATE" : "30-Apr-19", *

 In my Managed Schema I have the following entries.
 
 


 .

 I am getting an error in the Solr log.

* org.apache.solr.common.SolrException: ERROR: [doc=mt_100] Error adding
field 'R_MODIFY_DATE'='15-Jul-19' msg=Couldn't parse date because:
Improperly formatted datetime: 15-Jul-19*

 Please let me know how to handle this usecase with Date format
"12-JUL-18". what changes should I do to make it work ?

 Thanks
 Fiz N.


Solr 7.7 Indexing issue

2020-09-30 Thread Manisha Rahatadkar
Hello all

We are using Apache Solr 7.7 on Windows platform. The data is synced to Solr 
using Solr.Net commit. The data is being synced to SOLR in batches. The 
document size is very huge (~0.5GB average) and solr indexing is taking long 
time. Total document size is ~200GB. As the solr commit is done as a part of 
API, the API calls are failing as document indexing is not completed.

  1.  What is your advice on syncing such a large volume of data to Solr KB.
  2.  Because of the search fields requirements, almost 8 fields are defined as 
Text fields.
  3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a large 
volume of data? ( IF "%SOLR_JAVA_MEM%"=="" set SOLR_JAVA_MEM=-Xms2g -Xmx2g)
  4.  How to set up Solr in production on Windows? Currently it's set up as a 
standalone engine and client is requested to take the backup of the drive. Is 
there any other better way to do? How to set up for the disaster recovery?

Thanks in advance.

Regards
Manisha Rahatadkar


Confidentiality Notice

This email message, including any attachments, is for the sole use of the 
intended recipient and may contain confidential and privileged information. Any 
unauthorized view, use, disclosure or distribution is prohibited. If you are 
not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message. Anju Software, Inc. 4500 S. 
Lakeshore Drive, Suite 620, Tempe, AZ USA 85282.


Re: Exclude a folder/directory from indexing

2020-08-28 Thread Walter Underwood
For building a crawler, I’d start with Scrapy (https://scrapy.org 
<https://scrapy.org/>). It is a solid design and
should be easy to use for crawling web pages, files, or an API. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 28, 2020, at 4:16 AM, Joe Doupnik  wrote:
> 
> Some time ago I faced a roughly similar challenge. After many trials and 
> tests I ended up creating my own programs to accomplish the tasks of fetching 
> files, selecting which are allowed to be indexed, and feeding them into Solr 
> (POST style). This work is open source, found on https://netlab1.net/, web 
> page section titled Presentations of long term utility, item Solr/Lucene 
> Search Service. This is a set of docs, three small PHP programs, and a Solr 
> schema etc bundle, all within one downloadable zip file.
> On filtering found files, my solution uses a list of regular expressions 
> which are simple to state and to process. The docs discuss the rules. 
> Luckily, the code dealing with rules per se and doing the filtering is very 
> short and simple; see crawler.php for convertfilter() and filterbyname(). 
> Thus you may wish to consider them or equivalents for inclusion in your 
> system, whatever that may be.
> Thanks,
> Joe D.
> 
> On 27/08/2020 20:32, Alexandre Rafalovitch wrote:
>> If you are indexing from Drupal into Solr, that's the question for
>> Drupal's solr module. If you are doing it some other way, which way
>> are you doing it? bin/post command?
>> 
>> Most likely this is not the Solr question, but whatever you have
>> feeding data into Solr.
>> 
>> Regards,
>>   Alex.
>> 
>> On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
>>  wrote:
>>> Can you or how do you exclude a specific folder/directory from indexing in 
>>> SOLR version 7.x or 8.x?   Also our CMS is Drupal 8
>>> 
>>> Thanks,
>>> 
>>> Phil Staley
>>> DCF Webmaster
>>> 608 422-6569
>>> phil.sta...@wisconsin.gov
>>> 
>>> 
> 



Re: Exclude a folder/directory from indexing

2020-08-28 Thread Joe Doupnik
    Some time ago I faced a roughly similar challenge. After many 
trials and tests I ended up creating my own programs to accomplish the 
tasks of fetching files, selecting which are allowed to be indexed, and 
feeding them into Solr (POST style). This work is open source, found on 
https://netlab1.net/, web page section titled Presentations of long term 
utility, item Solr/Lucene Search Service. This is a set of docs, three 
small PHP programs, and a Solr schema etc bundle, all within one 
downloadable zip file.
    On filtering found files, my solution uses a list of regular 
expressions which are simple to state and to process. The docs discuss 
the rules. Luckily, the code dealing with rules per se and doing the 
filtering is very short and simple; see crawler.php for convertfilter() 
and filterbyname(). Thus you may wish to consider them or equivalents 
for inclusion in your system, whatever that may be.

    Thanks,
    Joe D.

On 27/08/2020 20:32, Alexandre Rafalovitch wrote:

If you are indexing from Drupal into Solr, that's the question for
Drupal's solr module. If you are doing it some other way, which way
are you doing it? bin/post command?

Most likely this is not the Solr question, but whatever you have
feeding data into Solr.

Regards,
   Alex.

On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
 wrote:

Can you or how do you exclude a specific folder/directory from indexing in SOLR 
version 7.x or 8.x?   Also our CMS is Drupal 8

Thanks,

Phil Staley
DCF Webmaster
608 422-6569
phil.sta...@wisconsin.gov






Re: Exclude a folder/directory from indexing

2020-08-27 Thread Alexandre Rafalovitch
If you are indexing from Drupal into Solr, that's the question for
Drupal's solr module. If you are doing it some other way, which way
are you doing it? bin/post command?

Most likely this is not the Solr question, but whatever you have
feeding data into Solr.

Regards,
  Alex.

On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
 wrote:
>
> Can you or how do you exclude a specific folder/directory from indexing in 
> SOLR version 7.x or 8.x?   Also our CMS is Drupal 8
>
> Thanks,
>
> Phil Staley
> DCF Webmaster
> 608 422-6569
> phil.sta...@wisconsin.gov
>
>


Exclude a folder/directory from indexing

2020-08-27 Thread Staley, Phil R - DCF
Can you or how do you exclude a specific folder/directory from indexing in SOLR 
version 7.x or 8.x?   Also our CMS is Drupal 8

Thanks,

Phil Staley
DCF Webmaster
608 422-6569
phil.sta...@wisconsin.gov




Re: SOLR indexing takes longer time

2020-08-18 Thread Walter Underwood
Instead of writing code, I’d fire up SQL Workbench/J, load the same JDBC driver
that is being used in Solr, and run the query.

https://www.sql-workbench.eu <https://www.sql-workbench.eu/>

If that takes 3.5 hours, you have isolated the problem.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 18, 2020, at 6:50 AM, David Hastings  
> wrote:
> 
> Another thing to mention is to make sure the indexer you build doesnt send
> commits until its actually done.  Made that mistake with some early in
> house indexers.
> 
> On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:
> 
>> 1. You could write some code to pull the items out of Mongo and dump
>> them to disk - if this is still slow, then it's Mongo that's the problem.
>> 2. Write a standalone indexer to replace DIH, it's single threaded and
>> deprecated anyway.
>> 3. Minor point - consider whether you need to index everything every
>> time or just the deltas.
>> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
>> old version you're running.
>> 
>> HTH
>> 
>> Charlie
>> 
>> On 17/08/2020 19:22, Abhijit Pawar wrote:
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 
>> --
>> Charlie Hull
>> OpenSource Connections, previously Flax
>> 
>> tel/fax: +44 (0)8700 118334
>> mobile:  +44 (0)7767 825828
>> web: www.o19s.com
>> 
>> 



Re: SOLR indexing takes longer time

2020-08-18 Thread David Hastings
Another thing to mention is to make sure the indexer you build doesn't send
commits until it's actually done.  Made that mistake with some early in-house
indexers.

On Tue, Aug 18, 2020 at 9:38 AM Charlie Hull  wrote:

> 1. You could write some code to pull the items out of Mongo and dump
> them to disk - if this is still slow, then it's Mongo that's the problem.
> 2. Write a standalone indexer to replace DIH, it's single threaded and
> deprecated anyway.
> 3. Minor point - consider whether you need to index everything every
> time or just the deltas.
> 4. Upgrade Solr anyway, not for speed reasons but because that's a very
> old version you're running.
>
> HTH
>
> Charlie
>
> On 17/08/2020 19:22, Abhijit Pawar wrote:
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrade to newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>


Re: SOLR indexing takes longer time

2020-08-18 Thread Charlie Hull
1. You could write some code to pull the items out of Mongo and dump 
them to disk - if this is still slow, then it's Mongo that's the problem.
2. Write a standalone indexer to replace DIH, it's single threaded and 
deprecated anyway.
3. Minor point - consider whether you need to index everything every 
time or just the deltas.
4. Upgrade Solr anyway, not for speed reasons but because that's a very 
old version you're running.


HTH

Charlie

On 17/08/2020 19:22, Abhijit Pawar wrote:

Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?

Appreciate your help!

Regards,
Abhijit



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: SOLR indexing takes longer time

2020-08-17 Thread Aroop Ganguly
Adding on to what others have said, indexing speed in general is largely 
affected by the parallelism and isolation you can give to each node.
Is there a reason why you cannot have more than 1 shard?
If you have a 5 node cluster, why not have 5 shards? maxShardsPerNode=1 with replica=1
is ok. You should see dramatic gains.
Solr’s power and speed in doing everything comes from using it as a distributed
system. By sharding more you will be using the benefit of that distributed
capability.
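
For example, a 5-shard collection can be created with the Collections API
roughly like this (the collection name here is just an example):

```
http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=5&replicationFactor=1&maxShardsPerNode=1
```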

HTH

Regards
Aroop

> On Aug 17, 2020, at 11:22 AM, Abhijit Pawar  wrote:
> 
> Hello,
> 
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
> 
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
> 
> Appreciate your help!
> 
> Regards,
> Abhijit



Re: SOLR indexing takes longer time

2020-08-17 Thread Shawn Heisey

On 8/17/2020 12:22 PM, Abhijit Pawar wrote:

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?


There's not enough information here to provide a diagnosis.

Are you running Solr in cloud mode (with zookeeper)?

3.5 hours for 200K documents sounds like slowness with the data 
source, not a problem with Solr, but it's too soon to rule anything out.


Would you be able to write a program that pulls data from your mongo 
database but doesn't send it to Solr?  Ideally it would be a Java 
program using the same JDBC driver you're using with DIH.


Thanks,
Shawn



Re: SOLR indexing takes longer time

2020-08-17 Thread Walter Underwood
I’m seeing multiple red flags for performance here. The top ones are “DIH”,
“MongoDB”, and “SQL on MongoDB”. MongoDB is not a relational database.

Our multi-threaded extractor using the Mongo API was still three times slower
than the same approach on MySQL.

Check the CPU usage on the Solr hosts while you are indexing. If it is under 
50%, the bottleneck is MongoDB and single-threaded indexing.

For another check, run that same query in a regular database client and time it.
The Solr indexing will never be faster than that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 17, 2020, at 11:58 AM, Abhijit Pawar  wrote:
> 
> Sure Divye,
> 
> *Here's the config.*
> 
> *conf/solr-config.xml:*
> 
> <requestHandler name="..." class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml</str>
>   </lst>
> </requestHandler>
> 
> *schema.xml:*
> has all of the field definitions
> 
> *conf/dataimport/data-source-config.xml*
> 
> <dataConfig>
> <dataSource name="mongod" driver="com.mongodb.jdbc.MongoDriver" url="mongodb://<<ADDRESS>>:27017/<>"/>
> <document>
> <entity name="..."
> dataSource="mongod"
> transformer="<>,TemplateTransformer"
> onError="continue"
> pk="uuid"
> query="SELECT field1,field2,field3,.. FROM products"
> deltaImportQuery="SELECT field1,field2,field3,.. FROM products WHERE
> orgidStr = '${dataimporter.request.orgid}' AND idStr =
> '${dataimporter.delta.idStr}'"
> deltaQuery="SELECT idStr FROM products WHERE orgidStr =
> '${dataimporter.request.orgid}' AND updatedAt >
> '${dataimporter.last_index_time}'"
> >
> .
> .
> . 4-5 more nested entities...
> </entity>
> </document>
> </dataConfig>
> 
> On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
> wrote:
> 
>> Can you share the dih configuration you are using for same?
>> 
>> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>> 
>>> Hello,
>>> 
>>> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
>>> replicas and just single core.
>>> It takes almost 3.5 hours to index that data.
>>> I am using a data import handler to import data from the mongo database.
>>> 
>>> Is there something we can do to reduce the time taken to index?
>>> Will upgrade to newer version help?
>>> 
>>> Appreciate your help!
>>> 
>>> Regards,
>>> Abhijit
>>> 
>> 



Re: SOLR indexing takes longer time

2020-08-17 Thread Abhijit Pawar
Sure Divye,

*Here's the config.*

*conf/solr-config.xml:*

<requestHandler name="..." class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/home/ec2-user/solr/solr-5.4.1/server/solr/test_core/conf/dataimport/data-source-config.xml</str>
  </lst>
</requestHandler>

*schema.xml:*
has all of the field definitions

*conf/dataimport/data-source-config.xml*

<dataConfig>
<dataSource name="mongod" driver="com.mongodb.jdbc.MongoDriver" url="mongodb://<<ADDRESS>>:27017/<>"/>
<document>
<entity name="..."
dataSource="mongod"
transformer="<>,TemplateTransformer"
onError="continue"
pk="uuid"
query="SELECT field1,field2,field3,.. FROM products"
deltaImportQuery="SELECT field1,field2,field3,.. FROM products WHERE
orgidStr = '${dataimporter.request.orgid}' AND idStr = '${dataimporter.delta.idStr}'"
deltaQuery="SELECT idStr FROM products WHERE orgidStr =
'${dataimporter.request.orgid}' AND updatedAt > '${dataimporter.last_index_time}'">
.
.
. 4-5 more nested entities...
</entity>
</document>
</dataConfig>

On Mon, Aug 17, 2020 at 1:32 PM Divye Handa 
wrote:

> Can you share the dih configuration you are using for same?
>
> On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:
>
> > Hello,
> >
> > We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> > replicas and just single core.
> > It takes almost 3.5 hours to index that data.
> > I am using a data import handler to import data from the mongo database.
> >
> > Is there something we can do to reduce the time taken to index?
> > Will upgrade to newer version help?
> >
> > Appreciate your help!
> >
> > Regards,
> > Abhijit
> >
>


Re: SOLR indexing takes longer time

2020-08-17 Thread Jörn Franke
The DIH is single-threaded and deprecated. Your best bet is a script/program that 
extracts data from MongoDB and writes it to Solr in batches using multiple threads. 
You will see significantly higher indexing performance for your data.

> Am 17.08.2020 um 20:23 schrieb Abhijit Pawar :
> 
> Hello,
> 
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
> 
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
> 
> Appreciate your help!
> 
> Regards,
> Abhijit


Re: SOLR indexing takes longer time

2020-08-17 Thread Divye Handa
Can you share the dih configuration you are using for same?

On Mon, 17 Aug, 2020, 23:52 Abhijit Pawar,  wrote:

> Hello,
>
> We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
> replicas and just single core.
> It takes almost 3.5 hours to index that data.
> I am using a data import handler to import data from the mongo database.
>
> Is there something we can do to reduce the time taken to index?
> Will upgrade to newer version help?
>
> Appreciate your help!
>
> Regards,
> Abhijit
>


SOLR indexing takes longer time

2020-08-17 Thread Abhijit Pawar
Hello,

We are indexing some 200K plus documents in SOLR 5.4.1 with no shards /
replicas and just single core.
It takes almost 3.5 hours to index that data.
I am using a data import handler to import data from the mongo database.

Is there something we can do to reduce the time taken to index?
Will upgrade to newer version help?

Appreciate your help!

Regards,
Abhijit


RE: Time-out errors while indexing (Solr 7.7.1)

2020-07-07 Thread Kommu, Vinodh K.
Hi Eric, Toke,

Can you please look at the details shared in the email trail below & respond with your 
suggestions/feedback?


Thanks & Regards,
Vinodh

From: Kommu, Vinodh K.
Sent: Monday, July 6, 2020 4:58 PM
To: solr-user@lucene.apache.org
Subject: RE: Time-out errors while indexing (Solr 7.7.1)


Thanks Eric & Toke for your response over this.





Just wanted to correct a few things here about the number of docs:



Total number of documents exists in the entire cluster (all collections) = 
6393876826 (6.3B)

Total number of documents exists on 2 bigger collections (3749389864 & 
1780147848) = 5529537712 (5.5B)

Total number of documents exists on remaining collections = 864339114 (864M)



So all collections together do not hold 13B docs. As the numbers above show, the 
biggest collection in the cluster holds close to 3.7B docs and the second biggest 
holds up to 1.7B docs, whereas the remaining 20 collections in the cluster hold only 
864M docs, which puts the total for the cluster at 6.3B docs.



On the hardware side, the cluster sits on 6 Solr VMs; each VM has 170G total memory 
(with 2 Solr instances running per VM) and 16 vCPUs, and each Solr JVM runs with a 31G 
heap. The remaining memory is left for the OS disk cache and other OS operations. 
vm.swappiness on each VM is set to 0, so swap will never be used. Each collection is 
created using the rule-based replica placement API with 6 shards and a replication 
factor of 3.



One other observation on core placement: as mentioned above, we create collections 
using rule-based replica placement, i.e. a rule to ensure that no two replicas of the 
same shard sit on the same VM, using the following command.



curl -s -k -u user:password 
"https://localhost:22010/solr/admin/collections?action=CREATE&name=$SOLR_COLLECTION&numShards=${SHARDS_NO?}&replicationFactor=${REPLICATION_FACTOR?}&maxShardsPerNode=${MAX_SHARDS_PER_NODE?}&collection.configName=$SOLR_COLLECTION&rule=shard:*,replica:<2,host:*"



Variable values in above command:



SOLR_COLLECTION = collection name

SHARDS_NO = 6

REPLICATION_FACTOR = 3

MAX_SHARDS_PER_NODE = (computed from the number of Solr VMs, the number of nodes per 
VM and the total number of replicas, i.e. total number of replicas / number of VMs; in 
this cluster that is 18/6 = 3 max shards per machine)





Ideally rule-based replica placement should create 3 cores per VM for each collection, 
but from the listing below, the VMs ended up with 2, 3 or 4 cores per collection. VM2 
and VM6 apparently have more cores than the other VMs, so I presume this could be one 
reason they see more IO operations than the remaining 4 VMs.





That said, I believe Solr does this replica placement considering other factors, like 
free disk on each VM, while creating a new collection, correct? If so, is this replica 
placement across the VMs fine? If not, what's needed to correct it? Can an additional 
210G core create more disk IO operations? If yes, would moving the additional core from 
these VMs to a VM with fewer cores make any difference (i.e. ensuring each VM has at 
most 3 shards)?
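
(If rebalancing turns out to be needed, one option is the Collections API MOVEREPLICA 
command; the collection, replica and target node below are illustrative, not taken 
from this cluster:)

curl -s -k -u user:password "https://localhost:22010/solr/admin/collections?action=MOVEREPLICA&collection=Collection1&replica=core_node14&targetNode=<host>:<port>_solr"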



We have also been noticing a significant surge in IO operations at the storage level. 
We are wondering whether a storage IOPS limit could be starving Solr of IO, or whether 
it is the other way around: Solr issuing so many read/write operations that storage 
IOPS reaches its upper limit.





VM1:

176G  node1/solr/Collection2_shard5_replica_n30
176G  node2/solr/Collection2_shard2_replica_n24
176G  node2/solr/Collection2_shard3_replica_n2
177G  node1/solr/Collection2_shard6_replica_n10
208G  node1/solr/Collection1_shard5_replica_n18
208G  node2/solr/Collection1_shard2_replica_n1
1.1T  total

VM2:

176G  node2/solr/Collection2_shard4_replica_n16
176G  node2/solr/Collection2_shard6_replica_n34
177G  node1/solr/Collection2_shard5_replica_n6
207G  node2/solr/Collection1_shard6_replica_n10
208G  node1/solr/Collection1_shard1_replica_n32
208G  node2/solr/Collection1_shard5_replica_n30
210G  node1/solr/Collection1_shard3_replica_n14
1.4T  total

VM3:

175G  node2/solr/Collection2_shard2_replica_n12
177G  node1/solr/Collection2_shard1_replica_n20
208G  node1/solr/Collection1_shard1_replica_n8
208G  node2/solr/Collection1_shard2_replica_n12
209G  node1/solr/Collection1_shard4_replica_n28
976G  total

VM4:

176G  node1/solr/Collection2_shard4_replica_n28
177G  node1/solr/Collection2_shard1_replica_n8
207G  node2/solr/Collection1_shard6_replica_n22
208G  node1/solr/Collection1_shard5_replica_n6
210G  node1/solr/Collection1_shard3_replica_n26
975G  total

VM5:

176G  node2/solr/Collection2_shard3_replica_n14
177G  node1/solr/Collection2_shard5_replica_n18
177G  node2/solr/Collection2_shard1_replica_n32
208G  node1/solr/Collection1_shard2_replica_n24
210G  node1/solr/Collection1_shard3_replica_n2
210G  node2/solr/Coll

Re: Out of memory errors with Spatial indexing

2020-07-06 Thread David Smiley
I believe you are experiencing this bug: LUCENE-5056
<https://issues.apache.org/jira/browse/LUCENE-5056>
The fix would probably be adjusting code in here
org.apache.lucene.spatial.query.SpatialArgs#calcDistanceFromErrPct

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Jul 6, 2020 at 5:18 AM Sunil Varma  wrote:

> Hi David
> Thanks for your response. Yes, I noticed that all the data causing issue
> were at the poles. I tried the "RptWithGeometrySpatialField" field type
> definition but get a "Spatial context does not support S2 spatial
> index"error. Setting "spatialContextFactory="Geo3D" I still see the
> original OOM error .
>
> On Sat, 4 Jul 2020 at 05:49, David Smiley  wrote:
>
> > Hi Sunil,
> >
> > Your shape is at a pole, and I'm aware of a bug causing an exponential
> > explosion of needed grid squares when you have polygons super-close to
> the
> > pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
> > or not by itself.  For indexing non-point data, I recommend
> > class="solr.RptWithGeometrySpatialField" which internally is based off a
> > combination of a course grid and storing the original vector geometry for
> > accurate verification:
> >  > class="solr.RptWithGeometrySpatialField"
> >   prefixTree="s2" />
> > The internally coarser grid will lessen the impact of that pole bug.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma 
> > wrote:
> >
> > > We are seeing OOM errors  when trying to index some spatial data. I
> > believe
> > > the data itself might not be valid but it shouldn't cause the Server to
> > > crash. We see this on both Solr 7.6 and Solr 8. Below is the input that
> > is
> > > causing the error.
> > >
> > > {
> > > "id": "bad_data_1",
> > > "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> > > 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> > > 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> > > 1.000150474662E30)"
> > > }
> > >
> > > Above dynamic field is mapped to field type "location_rpt" (
> > > solr.SpatialRecursivePrefixTreeFieldType).
> > >
> > >   Any pointers to get around this issue would be highly appreciated.
> > >
> > > Thanks!
> > >
> >
>


RE: Time-out errors while indexing (Solr 7.7.1)

2020-07-06 Thread Kommu, Vinodh K.
a_n34

208G  node1/solr/Collection1_shard1_replica_n20
209G  node2/solr/Collection1_shard4_replica_n16
1.3T  total

Thanks & Regards,

Vinodh



-Original Message-
From: Erick Erickson 
Sent: Saturday, July 4, 2020 7:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Time-out errors while indexing (Solr 7.7.1)



ATTENTION: External Email – Be Suspicious of Attachments, Links and Requests 
for Login Information.



You need more shards. And, I’m pretty certain, more hardware.



You say you have 13 billion documents and 6 shards. Solr/Lucene has an absolute 
upper limit of 2B (2^31) docs per shard. I don’t quite know how you’re running 
at all unless that 13B is a round number. If you keep adding documents, your 
installation will shortly, at best, stop accepting new documents for indexing. 
At worst you’ll start seeing weird errors and possibly corrupt indexes and have 
to re-index everything from scratch.



You’ve backed yourself in to a pretty tight corner here. You either have to 
re-index to a properly-sized cluster or use SPLITSHARD. This latter will double 
the index-on-disk size (it creates two child indexes per replica and keeps the 
old one for safety’s sake that you have to clean up later). I strongly 
recommend you stop ingesting more data while you do this.



You say you have 6 VMs with 2 nodes running on each. If those VMs are 
co-located with anything else, the physical hardware is going to be stressed. 
VMs themselves aren’t bad, but somewhere there’s physical hardware that runs it…



In fact, I urge you to stop ingesting data immediately and address this issue. 
You have a cluster that’s mis-configured, and you must address that before Bad 
Things Happen.



Best,

Erick



> On Jul 4, 2020, at 5:09 AM, Mad have 
> mailto:madhava.a.re...@gmail.com>> wrote:

>

> Hi Eric,

>

> There are total 6 VM’s in Solr clusters and 2 nodes are running on each VM. 
> Total number of shards are 6 with 3 replicas. I can see the index size is 
> more than 220GB on each node for the collection where we are facing the 
> performance issue.

>

> The more documents we add to the collection the indexing become slow and I 
> also have same impression that the size of the collection is creating this 
> issue. Appreciate if you can suggests any solution on this.

>

>

> Regards,

> Madhava

> Sent from my iPhone

>

>> On 3 Jul 2020, at 23:30, Erick Erickson 
>> mailto:erickerick...@gmail.com>> wrote:

>>

>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
>> _that’s_ a red flag.

>>

>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson 
>>> mailto:erickerick...@gmail.com>> wrote:

>>>

>>> You haven’t said how many _shards_ are present. Nor how many replicas of 
>>> the collection you’re hosting per physical machine. Nor how large the 
>>> indexes are on disk. Those are the numbers that count. The latter is 
>>> somewhat fuzzy, but if your aggregate index size on a machine with, say, 
>>> 128G of memory is a terabyte, that’s a red flag.

>>>

>>> Short form, though is yes. Subject to the questions above, this is what I’d 
>>> be looking at first.

>>>

>>> And, as I said, if you’ve been steadily increasing the total number of 
>>> documents, you’ll reach a tipping point sometime.

>>>

>>> Best,

>>> Erick

>>>

>>>>> On Jul 3, 2020, at 5:32 PM, Mad have 
>>>>> mailto:madhava.a.re...@gmail.com>> wrote:

>>>>

>>>> Hi Eric,

>>>>

>>>> The collection has almost 13billion documents with each document around 
>>>> 5kb size, all the columns around 150 are the indexed. Do you think that 
>>>> number of documents in the collection causing this issue. Appreciate your 
>>>> response.

>>>>

>>>> Regards,

>>>> Madhava

>>>>

>>>> Sent from my iPhone

>>>>

>>>>> On 3 Jul 2020, at 12:42, Erick Erickson 
>>>>> mailto:erickerick...@gmail.com>> wrote:

>>>>>

>>>>> If you’re seeing low CPU utilization at the same time, you

>>>>> probably just have too much data on too little hardware. Check

>>>>> your swapping, how much of your I/O is just because Lucene can’t

>>>>> hold all the parts of the index it needs in memory at once? Lucene

>>>>> uses MMapDirectory to hold the index and you may well be swapping,

>>>>> see:

>>>>>

>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bi

>>>>> t.html

>>>>>

>>>>>

Re: Out of memory errors with Spatial indexing

2020-07-06 Thread Sunil Varma
Hi David
Thanks for your response. Yes, I noticed that all the data causing the issue
were at the poles. I tried the "RptWithGeometrySpatialField" field type
definition but get a "Spatial context does not support S2 spatial
index" error. Setting spatialContextFactory="Geo3D" I still see the
original OOM error.

On Sat, 4 Jul 2020 at 05:49, David Smiley  wrote:

> Hi Sunil,
>
> Your shape is at a pole, and I'm aware of a bug causing an exponential
> explosion of needed grid squares when you have polygons super-close to the
> pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
> or not by itself.  For indexing non-point data, I recommend
> class="solr.RptWithGeometrySpatialField" which internally is based off a
> combination of a course grid and storing the original vector geometry for
> accurate verification:
>  class="solr.RptWithGeometrySpatialField"
>   prefixTree="s2" />
> The internally coarser grid will lessen the impact of that pole bug.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma 
> wrote:
>
> > We are seeing OOM errors  when trying to index some spatial data. I
> believe
> > the data itself might not be valid but it shouldn't cause the Server to
> > crash. We see this on both Solr 7.6 and Solr 8. Below is the input that
> is
> > causing the error.
> >
> > {
> > "id": "bad_data_1",
> > "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> > 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> > 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> > 1.000150474662E30)"
> > }
> >
> > Above dynamic field is mapped to field type "location_rpt" (
> > solr.SpatialRecursivePrefixTreeFieldType).
> >
> >   Any pointers to get around this issue would be highly appreciated.
> >
> > Thanks!
> >
>


Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-04 Thread Mad have
Thanks a lot for your inputs and suggestions. I was thinking along similar lines: 
creating another collection of the same kind (hot and cold) and moving documents older 
than a certain number of days, e.g. 180 days, from the original (hot) collection to the 
new (cold) collection.

Thanks,
Madhava

Sent from my iPhone

> On 4 Jul 2020, at 14:37, Erick Erickson  wrote:
> 
> You need more shards. And, I’m pretty certain, more hardware.
> 
> You say you have 13 billion documents and 6 shards. Solr/Lucene has an 
> absolute upper limit of 2B (2^31) docs per shard. I don’t quite know how 
> you’re running at all unless that 13B is a round number. If you keep adding 
> documents, your installation will shortly, at best, stop accepting new 
> documents for indexing. At worst you’ll start seeing weird errors and 
> possibly corrupt indexes and have to re-index everything from scratch.
> 
> You’ve backed yourself in to a pretty tight corner here. You either have to 
> re-index to a properly-sized cluster or use SPLITSHARD. This latter will 
> double the index-on-disk size (it creates two child indexes per replica and 
> keeps the old one for safety’s sake that you have to clean up later). I 
> strongly recommend you stop ingesting more data while you do this.
> 
> You say you have 6 VMs with 2 nodes running on each. If those VMs are 
> co-located with anything else, the physical hardware is going to be stressed. 
> VMs themselves aren’t bad, but somewhere there’s physical hardware that runs 
> it…
> 
> In fact, I urge you to stop ingesting data immediately and address this 
> issue. You have a cluster that’s mis-configured, and you must address that 
> before Bad Things Happen.
> 
> Best,
> Erick
> 
>> On Jul 4, 2020, at 5:09 AM, Mad have  wrote:
>> 
>> Hi Eric,
>> 
>> There are total 6 VM’s in Solr clusters and 2 nodes are running on each VM. 
>> Total number of shards are 6 with 3 replicas. I can see the index size is 
>> more than 220GB on each node for the collection where we are facing the 
>> performance issue.
>> 
>> The more documents we add to the collection the indexing become slow and I 
>> also have same impression that the size of the collection is creating this 
>> issue. Appreciate if you can suggests any solution on this.
>> 
>> 
>> Regards,
>> Madhava 
>> Sent from my iPhone
>> 
>>>> On 3 Jul 2020, at 23:30, Erick Erickson  wrote:
>>> 
>>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
>>> _that’s_ a red flag.
>>> 
>>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
>>>> 
>>>> You haven’t said how many _shards_ are present. Nor how many replicas of 
>>>> the collection you’re hosting per physical machine. Nor how large the 
>>>> indexes are on disk. Those are the numbers that count. The latter is 
>>>> somewhat fuzzy, but if your aggregate index size on a machine with, say, 
>>>> 128G of memory is a terabyte, that’s a red flag.
>>>> 
>>>> Short form, though is yes. Subject to the questions above, this is what 
>>>> I’d be looking at first.
>>>> 
>>>> And, as I said, if you’ve been steadily increasing the total number of 
>>>> documents, you’ll reach a tipping point sometime.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>>>>> 
>>>>> Hi Eric,
>>>>> 
>>>>> The collection has almost 13billion documents with each document around 
>>>>> 5kb size, all the columns around 150 are the indexed. Do you think that 
>>>>> number of documents in the collection causing this issue. Appreciate your 
>>>>> response.
>>>>> 
>>>>> Regards,
>>>>> Madhava 
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>>>>> 
>>>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>>>> just have too much data on too little hardware. Check your
>>>>>> swapping, how much of your I/O is just because Lucene can’t
>>>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>>>> uses MMapDirectory to hold the index and you may well be
>>>>>> swapping, see:
>>>>>> 
>>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>>> 
>>

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-04 Thread Erick Erickson
You need more shards. And, I’m pretty certain, more hardware.

You say you have 13 billion documents and 6 shards. Solr/Lucene has an absolute 
upper limit of 2B (2^31) docs per shard. I don’t quite know how you’re running 
at all unless that 13B is a round number. If you keep adding documents, your 
installation will shortly, at best, stop accepting new documents for indexing. 
At worst you’ll start seeing weird errors and possibly corrupt indexes and have 
to re-index everything from scratch.

You’ve backed yourself in to a pretty tight corner here. You either have to 
re-index to a properly-sized cluster or use SPLITSHARD. This latter will double 
the index-on-disk size (it creates two child indexes per replica and keeps the 
old one for safety’s sake that you have to clean up later). I strongly 
recommend you stop ingesting more data while you do this.
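
(For reference, SPLITSHARD is issued per shard through the Collections API; the
collection and shard names below are illustrative, and async mode is advisable for
shards this large:)

curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=Collection1&shard=shard1&async=split-shard1"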

You say you have 6 VMs with 2 nodes running on each. If those VMs are 
co-located with anything else, the physical hardware is going to be stressed. 
VMs themselves aren’t bad, but somewhere there’s physical hardware that runs it…

In fact, I urge you to stop ingesting data immediately and address this issue. 
You have a cluster that’s mis-configured, and you must address that before Bad 
Things Happen.

Best,
Erick

> On Jul 4, 2020, at 5:09 AM, Mad have  wrote:
> 
> Hi Eric,
> 
> There are total 6 VM’s in Solr clusters and 2 nodes are running on each VM. 
> Total number of shards are 6 with 3 replicas. I can see the index size is 
> more than 220GB on each node for the collection where we are facing the 
> performance issue.
> 
> The more documents we add to the collection the indexing become slow and I 
> also have same impression that the size of the collection is creating this 
> issue. Appreciate if you can suggests any solution on this.
> 
> 
> Regards,
> Madhava 
> Sent from my iPhone
> 
>> On 3 Jul 2020, at 23:30, Erick Erickson  wrote:
>> 
>> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
>> _that’s_ a red flag.
>> 
>>> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
>>> 
>>> You haven’t said how many _shards_ are present. Nor how many replicas of 
>>> the collection you’re hosting per physical machine. Nor how large the 
>>> indexes are on disk. Those are the numbers that count. The latter is 
>>> somewhat fuzzy, but if your aggregate index size on a machine with, say, 
>>> 128G of memory is a terabyte, that’s a red flag.
>>> 
>>> Short form, though is yes. Subject to the questions above, this is what I’d 
>>> be looking at first.
>>> 
>>> And, as I said, if you’ve been steadily increasing the total number of 
>>> documents, you’ll reach a tipping point sometime.
>>> 
>>> Best,
>>> Erick
>>> 
>>>>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>>>> 
>>>> Hi Eric,
>>>> 
>>>> The collection has almost 13billion documents with each document around 
>>>> 5kb size, all the columns around 150 are the indexed. Do you think that 
>>>> number of documents in the collection causing this issue. Appreciate your 
>>>> response.
>>>> 
>>>> Regards,
>>>> Madhava 
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>>>> 
>>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>>> just have too much data on too little hardware. Check your
>>>>> swapping, how much of your I/O is just because Lucene can’t
>>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>>> uses MMapDirectory to hold the index and you may well be
>>>>> swapping, see:
>>>>> 
>>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>>> 
>>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>>> 
>>>>> "From last 2-3 weeks we have been noticing either slow indexing or 
>>>>> timeout errors while indexing”
>>>>> 
>>>>> So have you been continually adding more documents to your
>>>>> collections for more than the 2-3 weeks? If so you may have just
>>>>> put so much data on the same boxes that you’ve gone over
>>>>> the capacity of your hardware. As Toke says, adding physical
>>>>> memory for the OS to use to hold relevant parts of the index may
>>>>> alleviate the problem (again, refer to Uwe’s article for 

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-04 Thread Mad have
Hi Eric,

There are 6 VMs in total in the Solr cluster and 2 nodes are running on each VM. The 
total number of shards is 6, with 3 replicas. I can see the index size is more than 
220GB on each node for the collection where we are facing the performance issue.

The more documents we add to the collection, the slower indexing becomes, and I also 
have the impression that the size of the collection is causing this issue. I would 
appreciate it if you could suggest a solution.


Regards,
Madhava 
Sent from my iPhone

> On 3 Jul 2020, at 23:30, Erick Erickson  wrote:
> 
> Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
> _that’s_ a red flag.
> 
>> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
>> 
>> You haven’t said how many _shards_ are present. Nor how many replicas of the 
>> collection you’re hosting per physical machine. Nor how large the indexes 
>> are on disk. Those are the numbers that count. The latter is somewhat fuzzy, 
>> but if your aggregate index size on a machine with, say, 128G of memory is a 
>> terabyte, that’s a red flag.
>> 
>> Short form, though is yes. Subject to the questions above, this is what I’d 
>> be looking at first.
>> 
>> And, as I said, if you’ve been steadily increasing the total number of 
>> documents, you’ll reach a tipping point sometime.
>> 
>> Best,
>> Erick
>> 
>>>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>>> 
>>> Hi Eric,
>>> 
>>> The collection has almost 13billion documents with each document around 5kb 
>>> size, all the columns around 150 are the indexed. Do you think that number 
>>> of documents in the collection causing this issue. Appreciate your response.
>>> 
>>> Regards,
>>> Madhava 
>>> 
>>> Sent from my iPhone
>>> 
>>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>>> 
>>>> If you’re seeing low CPU utilization at the same time, you probably
>>>> just have too much data on too little hardware. Check your
>>>> swapping, how much of your I/O is just because Lucene can’t
>>>> hold all the parts of the index it needs in memory at once? Lucene
>>>> uses MMapDirectory to hold the index and you may well be
>>>> swapping, see:
>>>> 
>>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>> 
>>>> But my guess is that you’ve just reached a tipping point. You say:
>>>> 
>>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>>>> errors while indexing”
>>>> 
>>>> So have you been continually adding more documents to your
>>>> collections for more than the 2-3 weeks? If so you may have just
>>>> put so much data on the same boxes that you’ve gone over
>>>> the capacity of your hardware. As Toke says, adding physical
>>>> memory for the OS to use to hold relevant parts of the index may
>>>> alleviate the problem (again, refer to Uwe’s article for why).
>>>> 
>>>> All that said, if you’re going to keep adding document you need to
>>>> seriously think about adding new machines and moving some of
>>>> your replicas to them.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>>>>> 
>>>>>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>>>>>> We are performing QA performance testing on couple of collections
>>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>>> 
>>>>> How many shards?
>>>>> 
>>>>>> 1.  Our performance team noticed that read operations are pretty
>>>>>> more than write operations like 100:1 ratio, is this expected during
>>>>>> indexing or solr nodes are doing any other operations like syncing?
>>>>> 
>>>>> Are you saying that there are 100 times more read operations when you
>>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>>> might be filled with the data that the writers are flushing.
>>>>> 
>>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>>> but such massive difference in IO-utilization does indicate that you
>>>>> are starved for cache.
>>>>> 
>>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>&

Re: Out of memory errors with Spatial indexing

2020-07-03 Thread David Smiley
Hi Sunil,

Your shape is at a pole, and I'm aware of a bug causing an exponential
explosion of needed grid squares when you have polygons super-close to the
pole.  Might you try S2PrefixTree instead?  I forget if this would fix it
or not by itself.  For indexing non-point data, I recommend
class="solr.RptWithGeometrySpatialField" which internally is based off a
combination of a coarse grid and storing the original vector geometry for
accurate verification:
 <fieldType name="..." class="solr.RptWithGeometrySpatialField"
   prefixTree="s2" />
The internally coarser grid will lessen the impact of that pole bug.
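
(Putting David's suggestion together with the Geo3D context factory that Sunil mentions
elsewhere in this thread, such a field type might look like the following; the name and
distErrPct are illustrative, not from the thread, and Sunil reports the OOM persisted
even with Geo3D, pointing back to the LUCENE-5056 bug noted above:)

<fieldType name="location_rpt_geom"
           class="solr.RptWithGeometrySpatialField"
           spatialContextFactory="Geo3D"
           prefixTree="s2"
           distErrPct="0.15" />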

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Jul 3, 2020 at 7:48 AM Sunil Varma  wrote:

> We are seeing OOM errors  when trying to index some spatial data. I believe
> the data itself might not be valid but it shouldn't cause the Server to
> crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is
> causing the error.
>
> {
> "id": "bad_data_1",
> "spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
> 1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
> 74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
> 1.000150474662E30)"
> }
>
> Above dynamic field is mapped to field type "location_rpt" (
> solr.SpatialRecursivePrefixTreeFieldType).
>
>   Any pointers to get around this issue would be highly appreciated.
>
> Thanks!
>


Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
Oops, I transposed that. If your index is a terabyte and your RAM is 128M, 
_that’s_ a red flag.

> On Jul 3, 2020, at 5:53 PM, Erick Erickson  wrote:
> 
> You haven’t said how many _shards_ are present. Nor how many replicas of the 
> collection you’re hosting per physical machine. Nor how large the indexes are 
> on disk. Those are the numbers that count. The latter is somewhat fuzzy, but 
> if your aggregate index size on a machine with, say, 128G of memory is a 
> terabyte, that’s a red flag.
> 
> Short form, though is yes. Subject to the questions above, this is what I’d 
> be looking at first.
> 
> And, as I said, if you’ve been steadily increasing the total number of 
> documents, you’ll reach a tipping point sometime.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
>> 
>> Hi Eric,
>> 
>> The collection has almost 13billion documents with each document around 5kb 
>> size, all the columns around 150 are the indexed. Do you think that number 
>> of documents in the collection causing this issue. Appreciate your response.
>> 
>> Regards,
>> Madhava 
>> 
>> Sent from my iPhone
>> 
>>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>>> 
>>> If you’re seeing low CPU utilization at the same time, you probably
>>> just have too much data on too little hardware. Check your
>>> swapping, how much of your I/O is just because Lucene can’t
>>> hold all the parts of the index it needs in memory at once? Lucene
>>> uses MMapDirectory to hold the index and you may well be
>>> swapping, see:
>>> 
>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>> 
>>> But my guess is that you’ve just reached a tipping point. You say:
>>> 
>>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>>> errors while indexing”
>>> 
>>> So have you been continually adding more documents to your
>>> collections for more than the 2-3 weeks? If so you may have just
>>> put so much data on the same boxes that you’ve gone over
>>> the capacity of your hardware. As Toke says, adding physical
>>> memory for the OS to use to hold relevant parts of the index may
>>> alleviate the problem (again, refer to Uwe’s article for why).
>>> 
>>> All that said, if you’re going to keep adding document you need to
>>> seriously think about adding new machines and moving some of
>>> your replicas to them.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>>>> 
>>>>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>>>>> We are performing QA performance testing on couple of collections
>>>>> which holds 2 billion and 3.5 billion docs respectively.
>>>> 
>>>> How many shards?
>>>> 
>>>>> 1.  Our performance team noticed that read operations are pretty
>>>>> more than write operations like 100:1 ratio, is this expected during
>>>>> indexing or solr nodes are doing any other operations like syncing?
>>>> 
>>>> Are you saying that there are 100 times more read operations when you
>>>> are indexing? That does not sound too unrealistic as the disk cache
>>>> might be filled with the data that the writers are flushing.
>>>> 
>>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>>> but such massive difference in IO-utilization does indicate that you
>>>> are starved for cache.
>>>> 
>>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>>> check: How many replicas are each physical box handling? If they are
>>>> sharing resources, fewer replicas would probably be better.
>>>> 
>>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>>> more? Would that help or create any other problems?
>>>> 
>>>> It does not hurt the server to increase the client timeout as the
>>>> initiated query will keep running until it is finished, independent of
>>>> whether or not there is a client to receive the result.
>>>> 
>>>> If you want a better max time for query processing, you should look at 
>>>> 
>>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>>> but due to its inherent limitations it might not help in your
>

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
You haven’t said how many _shards_ are present. Nor how many replicas of the 
collection you’re hosting per physical machine. Nor how large the indexes are 
on disk. Those are the numbers that count. The latter is somewhat fuzzy, but if 
your aggregate index size on a machine with, say, 128G of memory is a terabyte, 
that’s a red flag.

Short form, though is yes. Subject to the questions above, this is what I’d be 
looking at first.

And, as I said, if you’ve been steadily increasing the total number of 
documents, you’ll reach a tipping point sometime.

Best,
Erick

> On Jul 3, 2020, at 5:32 PM, Mad have  wrote:
> 
> Hi Eric,
> 
> The collection has almost 13billion documents with each document around 5kb 
> size, all the columns around 150 are the indexed. Do you think that number of 
> documents in the collection causing this issue. Appreciate your response.
> 
> Regards,
> Madhava 
> 
> Sent from my iPhone
> 
>> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
>> 
>> If you’re seeing low CPU utilization at the same time, you probably
>> just have too much data on too little hardware. Check your
>> swapping, how much of your I/O is just because Lucene can’t
>> hold all the parts of the index it needs in memory at once? Lucene
>> uses MMapDirectory to hold the index and you may well be
>> swapping, see:
>> 
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> 
>> But my guess is that you’ve just reached a tipping point. You say:
>> 
>> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
>> errors while indexing”
>> 
>> So have you been continually adding more documents to your
>> collections for more than the 2-3 weeks? If so you may have just
>> put so much data on the same boxes that you’ve gone over
>> the capacity of your hardware. As Toke says, adding physical
>> memory for the OS to use to hold relevant parts of the index may
>> alleviate the problem (again, refer to Uwe’s article for why).
>> 
>> All that said, if you’re going to keep adding document you need to
>> seriously think about adding new machines and moving some of
>> your replicas to them.
>> 
>> Best,
>> Erick
>> 
>>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>>> 
>>>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>>>> We are performing QA performance testing on couple of collections
>>>> which holds 2 billion and 3.5 billion docs respectively.
>>> 
>>> How many shards?
>>> 
>>>> 1.  Our performance team noticed that read operations are pretty
>>>> more than write operations like 100:1 ratio, is this expected during
>>>> indexing or solr nodes are doing any other operations like syncing?
>>> 
>>> Are you saying that there are 100 times more read operations when you
>>> are indexing? That does not sound too unrealistic as the disk cache
>>> might be filled with the data that the writers are flushing.
>>> 
>>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>>> but such massive difference in IO-utilization does indicate that you
>>> are starved for cache.
>>> 
>>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>>> check: How many replicas are each physical box handling? If they are
>>> sharing resources, fewer replicas would probably be better.
>>> 
>>>> 3.  Our client timeout is set to 2mins, can they increase further
>>>> more? Would that help or create any other problems?
>>> 
>>> It does not hurt the server to increase the client timeout as the
>>> initiated query will keep running until it is finished, independent of
>>> whether or not there is a client to receive the result.
>>> 
>>> If you want a better max time for query processing, you should look at 
>>> 
>>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>>> but due to its inherent limitations it might not help in your
>>> situation.
>>> 
>>>> 4.  When we created an empty collection and loaded same data file,
>>>> it loaded fine without any issues so having more documents in a
>>>> collection would create such problems?
>>> 
>>> Solr 7 does have a problem with sparse DocValues and many documents,
>>> leading to excessive IO-activity, which might be what you are seeing. I
>>> can see from an earlier post that you were using streaming expressions
>>> for another collec

Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Mad have
Hi Eric,

The collection has almost 13 billion documents, each around 5kb in 
size, and all of the roughly 150 columns are indexed. Do you think the number of 
documents in the collection is causing this issue? Appreciate your response.

Regards,
Madhava 

Sent from my iPhone

> On 3 Jul 2020, at 12:42, Erick Erickson  wrote:
> 
> If you’re seeing low CPU utilization at the same time, you probably
> just have too much data on too little hardware. Check your
> swapping, how much of your I/O is just because Lucene can’t
> hold all the parts of the index it needs in memory at once? Lucene
> uses MMapDirectory to hold the index and you may well be
> swapping, see:
> 
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> 
> But my guess is that you’ve just reached a tipping point. You say:
> 
> "From last 2-3 weeks we have been noticing either slow indexing or timeout 
> errors while indexing”
> 
> So have you been continually adding more documents to your
> collections for more than the 2-3 weeks? If so you may have just
> put so much data on the same boxes that you’ve gone over
> the capacity of your hardware. As Toke says, adding physical
> memory for the OS to use to hold relevant parts of the index may
> alleviate the problem (again, refer to Uwe’s article for why).
> 
> All that said, if you’re going to keep adding document you need to
> seriously think about adding new machines and moving some of
> your replicas to them.
> 
> Best,
> Erick
> 
>> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
>> 
>>> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>>> We are performing QA performance testing on couple of collections
>>> which holds 2 billion and 3.5 billion docs respectively.
>> 
>> How many shards?
>> 
>>> 1.  Our performance team noticed that read operations are pretty
>>> more than write operations like 100:1 ratio, is this expected during
>>> indexing or solr nodes are doing any other operations like syncing?
>> 
>> Are you saying that there are 100 times more read operations when you
>> are indexing? That does not sound too unrealistic as the disk cache
>> might be filled with the data that the writers are flushing.
>> 
>> In that case, more RAM would help. Okay, more RAM nearly always helps,
>> but such massive difference in IO-utilization does indicate that you
>> are starved for cache.
>> 
>> I noticed you have at least 18 replicas. That's a lot. Just to sanity
>> check: How many replicas are each physical box handling? If they are
>> sharing resources, fewer replicas would probably be better.
>> 
>>> 3.  Our client timeout is set to 2mins, can they increase further
>>> more? Would that help or create any other problems?
>> 
>> It does not hurt the server to increase the client timeout as the
>> initiated query will keep running until it is finished, independent of
>> whether or not there is a client to receive the result.
>> 
>> If you want a better max time for query processing, you should look at 
>> 
>> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
>> but due to its inherent limitations it might not help in your
>> situation.
>> 
>>> 4.  When we created an empty collection and loaded same data file,
>>> it loaded fine without any issues so having more documents in a
>>> collection would create such problems?
>> 
>> Solr 7 does have a problem with sparse DocValues and many documents,
>> leading to excessive IO-activity, which might be what you are seeing. I
>> can see from an earlier post that you were using streaming expressions
>> for another collection: This is one of the things that are affected by
>> the Solr 7 DocValues issue.
>> 
>> More info about DocValues and streaming:
>> https://issues.apache.org/jira/browse/SOLR-13013
>> 
>> Fairly in-depth info on the problem with Solr 7 docValues:
>> https://issues.apache.org/jira/browse/LUCENE-8374
>> 
>> If this is your problem, upgrading to Solr 8 and indexing the
>> collection from scratch should fix it. 
>> 
>> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
>> or you can ensure that there are values defined for all DocValues-
>> fields in all your documents.
>> 
>>> java.net.SocketTimeoutException: Read timed out
>>>   at java.net.SocketInputStream.socketRead0(Native Method) 
>> ...
>>> Remote error message: java.util.concurrent.TimeoutException: Idle
>>> timeout expired: 60/60 ms
>> 
>> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
>> should be able to change it in solr.xml.
>> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
>> 
>> BUT if an update takes > 10 minutes to be processed, it indicates that
>> the cluster is overloaded.  Increasing the timeout is just a band-aid.
>> 
>> - Toke Eskildsen, Royal Danish Library
>> 
>> 
> 


Out of memory errors with Spatial indexing

2020-07-03 Thread Sunil Varma
We are seeing OOM errors  when trying to index some spatial data. I believe
the data itself might not be valid but it shouldn't cause the Server to
crash. We see this on both Solr 7.6 and Solr 8. Below is the input that is
causing the error.

{
"id": "bad_data_1",
"spatialwkt_srpt": "LINESTRING (-126.86037681029909 -90.0
1.000150474662E30, 73.58164711175415 -90.0 1.000150474662E30,
74.52836551959528 -90.0 1.000150474662E30, 74.97006811540834 -90.0
1.000150474662E30)"
}

Above dynamic field is mapped to field type "location_rpt" (
solr.SpatialRecursivePrefixTreeFieldType).

  Any pointers to get around this issue would be highly appreciated.

Thanks!


Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Erick Erickson
If you’re seeing low CPU utilization at the same time, you probably
just have too much data on too little hardware. Check your
swapping, how much of your I/O is just because Lucene can’t
hold all the parts of the index it needs in memory at once? Lucene
uses MMapDirectory to hold the index and you may well be
swapping, see:

https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But my guess is that you’ve just reached a tipping point. You say:

"From last 2-3 weeks we have been noticing either slow indexing or timeout 
errors while indexing”

So have you been continually adding more documents to your
collections for more than the 2-3 weeks? If so you may have just
put so much data on the same boxes that you’ve gone over
the capacity of your hardware. As Toke says, adding physical
memory for the OS to use to hold relevant parts of the index may
alleviate the problem (again, refer to Uwe’s article for why).

All that said, if you’re going to keep adding document you need to
seriously think about adding new machines and moving some of
your replicas to them.

Best,
Erick

> On Jul 3, 2020, at 7:14 AM, Toke Eskildsen  wrote:
> 
> On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
>> We are performing QA performance testing on couple of collections
>> which holds 2 billion and 3.5 billion docs respectively.
> 
> How many shards?
> 
>>  1.  Our performance team noticed that read operations are pretty
>> more than write operations like 100:1 ratio, is this expected during
>> indexing or solr nodes are doing any other operations like syncing?
> 
> Are you saying that there are 100 times more read operations when you
> are indexing? That does not sound too unrealistic as the disk cache
> might be filled with the data that the writers are flushing.
> 
> In that case, more RAM would help. Okay, more RAM nearly always helps,
> but such massive difference in IO-utilization does indicate that you
> are starved for cache.
> 
> I noticed you have at least 18 replicas. That's a lot. Just to sanity
> check: How many replicas are each physical box handling? If they are
> sharing resources, fewer replicas would probably be better.
> 
>>  3.  Our client timeout is set to 2mins, can they increase further
>> more? Would that help or create any other problems?
> 
> It does not hurt the server to increase the client timeout as the
> initiated query will keep running until it is finished, independent of
> whether or not there is a client to receive the result.
> 
> If you want a better max time for query processing, you should look at 
> 
> https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
> but due to its inherent limitations it might not help in your
> situation.
> 
>>  4.  When we created an empty collection and loaded same data file,
>> it loaded fine without any issues so having more documents in a
>> collection would create such problems?
> 
> Solr 7 does have a problem with sparse DocValues and many documents,
> leading to excessive IO-activity, which might be what you are seeing. I
> can see from an earlier post that you were using streaming expressions
> for another collection: This is one of the things that are affected by
> the Solr 7 DocValues issue.
> 
> More info about DocValues and streaming:
> https://issues.apache.org/jira/browse/SOLR-13013
> 
> Fairly in-depth info on the problem with Solr 7 docValues:
> https://issues.apache.org/jira/browse/LUCENE-8374
> 
> If this is your problem, upgrading to Solr 8 and indexing the
> collection from scratch should fix it. 
> 
> Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
> or you can ensure that there are values defined for all DocValues-
> fields in all your documents.
> 
>> java.net.SocketTimeoutException: Read timed out
>>at java.net.SocketInputStream.socketRead0(Native Method) 
> ...
>> Remote error message: java.util.concurrent.TimeoutException: Idle
>> timeout expired: 60/60 ms
> 
> There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
> should be able to change it in solr.xml.
> https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
> 
> BUT if an update takes > 10 minutes to be processed, it indicates that
> the cluster is overloaded.  Increasing the timeout is just a band-aid.
> 
> - Toke Eskildsen, Royal Danish Library
> 
> 



Re: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Toke Eskildsen
On Thu, 2020-07-02 at 11:16 +, Kommu, Vinodh K. wrote:
> We are performing QA performance testing on couple of collections
> which holds 2 billion and 3.5 billion docs respectively.

How many shards?

>   1.  Our performance team noticed that read operations are pretty
> more than write operations like 100:1 ratio, is this expected during
> indexing or solr nodes are doing any other operations like syncing?

Are you saying that there are 100 times more read operations when you
are indexing? That does not sound too unrealistic as the disk cache
might be filled with the data that the writers are flushing.

In that case, more RAM would help. Okay, more RAM nearly always helps,
but such massive difference in IO-utilization does indicate that you
are starved for cache.

I noticed you have at least 18 replicas. That's a lot. Just to sanity
check: How many replicas are each physical box handling? If they are
sharing resources, fewer replicas would probably be better.

>   3.  Our client timeout is set to 2mins, can they increase further
> more? Would that help or create any other problems?

It does not hurt the server to increase the client timeout as the
initiated query will keep running until it is finished, independent of
whether or not there is a client to receive the result.

If you want a better max time for query processing, you should look at 

https://lucene.apache.org/solr/guide/7_7/common-query-parameters.html#timeallowed-parameter
 but due to its inherent limitations it might not help in your
situation.
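
(timeAllowed is just a request parameter; the host, collection and value here are
illustrative:

  curl "http://localhost:8983/solr/TestCollection/select?q=*:*&timeAllowed=5000"

It bounds search time per request, with the caveats described in the linked page.)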

>   4.  When we created an empty collection and loaded same data file,
> it loaded fine without any issues so having more documents in a
> collection would create such problems?

Solr 7 does have a problem with sparse DocValues and many documents,
leading to excessive IO-activity, which might be what you are seeing. I
can see from an earlier post that you were using streaming expressions
for another collection: This is one of the things that are affected by
the Solr 7 DocValues issue.

More info about DocValues and streaming:
https://issues.apache.org/jira/browse/SOLR-13013

Fairly in-depth info on the problem with Solr 7 docValues:
https://issues.apache.org/jira/browse/LUCENE-8374

If this is your problem, upgrading to Solr 8 and indexing the
collection from scratch should fix it. 

Alternatively you can port the LUCENE-8374-patch from Solr 7.3 to 7.7
or you can ensure that there are values defined for all DocValues-
fields in all your documents.

> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method) 
...
> Remote error message: java.util.concurrent.TimeoutException: Idle
> timeout expired: 60/60 ms

There is a default timeout of 10 minutes (distribUpdateSoTimeout?). You
should be able to change it in solr.xml.
https://lucene.apache.org/solr/guide/8_5/format-of-solr-xml.html
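
(A sketch of the relevant solr.xml fragment; these are the standard solrcloud
settings, and 600000 ms matches the 10-minute default mentioned above - raising it
is, as noted, only a band-aid:)

  <solr>
    <solrcloud>
      <!-- socket / connect timeouts (ms) for distributed updates between nodes -->
      <int name="distribUpdateSoTimeout">600000</int>
      <int name="distribUpdateConnTimeout">60000</int>
    </solrcloud>
  </solr>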

BUT if an update takes > 10 minutes to be processed, it indicates that
the cluster is overloaded.  Increasing the timeout is just a band-aid.

- Toke Eskildsen, Royal Danish Library




RE: Time-out errors while indexing (Solr 7.7.1)

2020-07-03 Thread Kommu, Vinodh K.
Anyone has any thoughts or suggestions on this issue?

Thanks & Regards,
Vinodh

From: Kommu, Vinodh K.
Sent: Thursday, July 2, 2020 4:46 PM
To: solr-user@lucene.apache.org
Subject: Time-out errors while indexing (Solr 7.7.1)

Hi,

We are performing QA performance testing on a couple of collections which hold 2 
billion and 3.5 billion docs respectively. Indexing happens from a separate client 
using SolrJ, with 10 threads and a batch size of 1000. For the last 2-3 weeks we have 
been noticing either slow indexing or timeout errors while indexing. As part of 
troubleshooting, we noticed that when peak disk IO utilization reaches the higher 
side, indexing happens slowly, and when disk IO is constantly near 100%, timeout 
issues are observed.

Few questions here:


  1.  Our performance team noticed that read operations far outnumber write 
operations, around a 100:1 ratio. Is this expected during indexing, or are the Solr 
nodes doing other operations like syncing?
  2.  Zookeeper has a latency of around (min/avg/max: 0/0/2205). Can this latency 
create instability issues for the ZK or Solr clusters, or impact indexing or 
searching operations?
  3.  Our client timeout is set to 2 mins; can it be increased further? Would 
that help or create any other problems?
  4.  When we created an empty collection and loaded the same data file, it loaded 
fine without any issues, so would having more documents in a collection create 
such problems?

Any suggestions or feedback would be really appreciated.

Solr version - 7.7.1

Time out error snippet:

ERROR 
(updateExecutor-3-thread-30055-processing-x:TestCollection_shard5_replica_n18 
https:localhost:1122//solr//TestCollection_shard6_replica_n22<https://localhost:1122/solr/TestCollection_shard6_replica_n22>
 r:core_node21 n:localhost:1122_solr c:TestCollection s:shard5) 
[c:TestCollection s:shard5 r:core_node21 x:TestCollection_shard5_replica_n18] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:171) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:141) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.read(InputRecord.java:503) 
~[?:1.8.0_212]
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) 
~[?:1.8.0_212]
at 
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933) 
~[?:1.8.0_212]
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) 
~[?:1.8.0_212]
at 
org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
 ~[solr-core-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 
2019-02-23 02:39:07]
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 
~[httpclient-4.5.6.jar:4.5.6]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 ~[httpclient-4.5.6

Re: Solr 8.5.2 indexing issue

2020-07-02 Thread gnandre
It seems that the issue is not with the reference_url field itself. There is
one copy field which has the reference_url field as source and another
field called url_path as destination.
This destination field url_path has a field type definition whose XML was not
preserved in the archive; per the next paragraph, its index analyzer included a
SynonymGraphFilterFactory followed by a FlattenGraphFilterFactory.

If I remove SynonymGraphFilterFactory and FlattenGraphFilterFactory from the
above field type definition then it works; otherwise it throws the
same error (IndexOutOfBoundsException).
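
Since the archive stripped the XML above, the following is only a sketch of
what a field type with those two filters commonly looks like; the tokenizer
and the remaining filters in the real definition may differ:

<fieldType name="url_path_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- graph-producing filters must be flattened at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>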

On Sun, Jun 28, 2020 at 9:06 AM Erick Erickson 
wrote:

> How are you sending this to Solr? I just tried 8.5, submitting that doc
> through the admin UI and it works fine.
> I defined “asset_id” as the same type as your reference_url field.
>
> And does the log on the Solr node that tries to index this give any more
> info?
>
> Best,
> Erick
>
> > On Jun 27, 2020, at 10:45 PM, gnandre  wrote:
> >
> > {
> >"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",
> >
> >
> "reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}
>
>


Time-out errors while indexing (Solr 7.7.1)

2020-07-02 Thread Kommu, Vinodh K.
Hi,

We are performing QA performance testing on a couple of collections which hold 2 
billion and 3.5 billion docs respectively. Indexing happens from a separate 
client using SolrJ with 10 threads and a batch size of 1000. For the last 2-3 
weeks we have been noticing either slow indexing or timeout errors while 
indexing. As part of troubleshooting, we noticed that when peak disk IO 
utilization is on the higher side, indexing happens slowly, and when 
disk IO is constantly near 100%, timeout issues are observed.

A few questions here:


  1.  Our performance team noticed that read operations far outnumber 
write operations, roughly a 100:1 ratio. Is this expected during indexing, or are the Solr 
nodes doing other operations like syncing?
  2.  Zookeeper latency is around (min/avg/max: 0/0/2205). Can this latency 
cause instability issues for the ZK or Solr clusters, or impact indexing or 
searching operations?
  3.  Our client timeout is set to 2 mins. Can it be increased further? Would 
that help or create any other problems?
  4.  When we created an empty collection and loaded the same data file, it loaded 
fine without any issues, so would having more documents in a collection create 
such problems?

Any suggestions or feedback would be really appreciated.

Solr version - 7.7.1

Time out error snippet:

ERROR 
(updateExecutor-3-thread-30055-processing-x:TestCollection_shard5_replica_n18 
https:localhost:1122//solr//TestCollection_shard6_replica_n22 r:core_node21 
n:localhost:1122_solr c:TestCollection s:shard5) [c:TestCollection s:shard5 
r:core_node21 x:TestCollection_shard5_replica_n18] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient error
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:171) 
~[?:1.8.0_212]
at java.net.SocketInputStream.read(SocketInputStream.java:141) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) 
~[?:1.8.0_212]
at sun.security.ssl.InputRecord.read(InputRecord.java:503) 
~[?:1.8.0_212]
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) 
~[?:1.8.0_212]
at 
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933) 
~[?:1.8.0_212]
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) 
~[?:1.8.0_212]
at 
org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
 ~[httpcore-4.4.10.jar:4.4.10]
at 
org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
 ~[solr-core-7.7.1.jar:7.7.1 5bf96d32f88eb8a2f5e775339885cd6ba84a3b58 - ishan - 
2019-02-23 02:39:07]
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) 
~[httpclient-4.5.6.jar:4.5.6]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) 
~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
 ~[httpclient-4.5.6.jar:4.5.6]
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
 ~[httpclient-4.5.6.jar:4.5.6

Re: Solr 8.5.2 indexing issue

2020-06-28 Thread Erick Erickson
How are you sending this to Solr? I just tried 8.5, submitting that doc through 
the admin UI and it works fine. 
I defined “asset_id” as the same type as your reference_url field.

And does the log on the Solr node that tries to index this give any more info?

Best,
Erick

> On Jun 27, 2020, at 10:45 PM, gnandre  wrote:
> 
> {
>"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",
> 
> "reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}



Solr 8.5.2 indexing issue

2020-06-27 Thread gnandre
Hi,

I have the following document which fails to get indexed.

{
"asset_id":"add-ons:576deefef7453a9189aa039b66500eb2",

"reference_url":"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html"}

I am not sure what is so special about the content in the reference_url
field.

reference_url field is defined as follows in schema:



It throws the following error.

Status: 
{"data":{"responseHeader":{"status":400,"QTime":18},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.IndexOutOfBoundsException"],"msg":"Exception
writing document id add-ons:576deefef7453a9189aa039b66500eb2 to the index;
possible analysis
error.","code":400}},"status":400,"config":{"method":"POST","transformRequest":[null],"transformResponse":[null],"jsonpCallbackParam":"callback","headers":{"Content-type":"application/json","Accept":"application/json,
text/plain, */*","X-Requested-With":"XMLHttpRequest"},"data":"[{\n
\"asset_id\":\"add-ons:576deefef7453a9189aa039b66500eb2\",\n
\"reference_url\":\"modeling-a-high-speed-backplane-part-3-4-port-s-parameters-to-differential-tdr-and-tdt.html\"}]","url":"add-ons/update","params":{"wt":"json","_":1593304427428,"commitWithin":1000,"overwrite":true},"timeout":1},"statusText":"Bad
Request","xhrStatus":"complete","resource":{"0":"[","1":"{","2":"\n","3":"
","4":" ","5":" ","6":" ","7":" ","8":" ","9":" ","10":"
","11":"\"","12":"a","13":"s","14":"s","15":"e","16":"t","17":"_","18":"i","19":"d","20":"\"","21":":","22":"\"","23":"a","24":"d","25":"d","26":"-","27":"o","28":"n","29":"s","30":":","31":"5","32":"7","33":"6","34":"d","35":"e","36":"e","37":"f","38":"e","39":"f","40":"7","41":"4","42":"5","43":"3","44":"a","45":"9","46":"1","47":"8","48":"9","49":"a","50":"a","51":"0","52":"3","53":"9","54":"b","55":"6","56":"6","57":"5","58":"0","59":"0","60":"e","61":"b","62":"2","63":"\"","64":",","65":"\n","66":"
","67":" ","68":" ","69":" ","70":" ","71":" ","72":" ","73":"
","74":"\"","75":"r","76":"e","77":"f","78":"e","79":"r","80":"e","81":"n","82":"c","83":"e","84":"_","85":"u","86":"r","87":"l","88":"\"","89":":","90":"\"","91":"m","92":"o","93":"d","94":"e","95":"l","96":"i","97":"n","98":"g","99":"-","100":"a","101":"-","102":"h","103":"i","104":"g","105":"h","106":"-","107":"s","108":"p","109":"e","110":"e","111":"d","112":"-","113":"b","114":"a","115":"c","116":"k","117":"p","118":"l","119":"a","120":"n","121":"e","122":"-","123":"p","124":"a","125":"r","126":"t","127":"-","128":"3","129":"-","130":"4","131":"-","132":"p","133":"o","134":"r","135":"t","136":"-","137":"s","138":"-","139":"p","140":"a","141":"r","142":"a","143":"m","144":"e","145":"t","146":"e","147":"r","148":"s","149":"-","150":"t","151":"o","152":"-","153":"d","154":"i","155":"f","156":"f","157":"e","158":"r","159":"e","160":"n","161":"t","162":"i","163":"a","164":"l","165":"-","166":"t","167":"d","168":"r","169":"-","170":"a","171":"n","172":"d","173":"-","174":"t","175":"d","176":"t","177":".","178":"h","179":"t","180":"m","181":"l","182":"\"","183":"}","184":"]"}}


Re: Prevent Re-indexing if Doc Fields are Same

2020-06-26 Thread Walter Underwood
If you don’t want to buy disk space for deleted docs, you should not be 
using Solr. That is an essential part of a reliable Solr installation.

To avoid reindexing unchanged documents, use a bookkeeping RDBMS
table. In that table, put the document ID and the most recent successful
update to Solr. You can check if the fields are the same with a checksum
of the data. MD5 is fine for that. Check that database before sending the
document and update it after new documents are indexed.

You may also want to record deletes in the database.
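
A minimal sketch of that check, assuming a hypothetical indexed_docs(doc_id,
checksum) table reachable over JDBC (all names here are invented):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IndexBookkeeper {

    // MD5 over the concatenated field values that make up the Solr document
    static String checksum(String concatenatedFields) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(concatenatedFields.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // true if the document is new or its checksum changed, i.e. it should be sent to Solr
    static boolean needsIndexing(Connection db, String docId, String newChecksum) throws Exception {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT checksum FROM indexed_docs WHERE doc_id = ?")) {
            ps.setString(1, docId);
            try (ResultSet rs = ps.executeQuery()) {
                return !rs.next() || !newChecksum.equals(rs.getString(1));
            }
        }
    }
}

After a successful Solr update, the row would then be inserted or updated with
the new checksum.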

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 26, 2020, at 1:12 AM, Anshuman Singh  wrote:
> 
> I was reading about in-place updates
> https://lucene.apache.org/solr/guide/7_4/updating-parts-of-documents.html,
> In my use case I have to update the field "LASTUPDATETIME", all other
> fields are same. Updates are very frequent and I can't bear the cost of
> deleted docs.
> 
> If I provide all the fields, it deletes the document and re-index it. But
> if I just "set" the "LASTUPDATETIME" field (non-indexed, non-stored,
> docValue field), it does an in-place update without deletion. But the
> problem is I don't know if the document is present or I'm indexing it the
> first time.
> 
> Is there a way to prevent re-indexing if other fields are the same?
> 
> *P.S. I'm looking for a solution that doesn't require looking up if doc is
> present in the Collection or not.*



Prevent Re-indexing if Doc Fields are Same

2020-06-26 Thread Anshuman Singh
I was reading about in-place updates
https://lucene.apache.org/solr/guide/7_4/updating-parts-of-documents.html,
In my use case I have to update the field "LASTUPDATETIME", all other
fields are same. Updates are very frequent and I can't bear the cost of
deleted docs.

If I provide all the fields, it deletes the document and re-index it. But
if I just "set" the "LASTUPDATETIME" field (non-indexed, non-stored,
docValue field), it does an in-place update without deletion. But the
problem is I don't know if the document is present or I'm indexing it the
first time.
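
For reference, the request described above -- sending only a "set" for the
docValues-only field so Solr can apply it in place -- looks roughly like this
(collection name, id and value are invented):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycollection/update?commitWithin=1000' \
  --data-binary '[{"id":"doc-123","LASTUPDATETIME":{"set":1593304427428}}]'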

Is there a way to prevent re-indexing if other fields are the same?

*P.S. I'm looking for a solution that doesn't require looking up if doc is
present in the Collection or not.*


Indexing error when using Category Routed Alias

2020-06-09 Thread Tom Evans
Hi all

1. Setup simple 1 node solrcloud test setup using docker-compose,
solr:8.5.2, zookeeper:3.5.8.
2. Upload a configset
3. Create two collections, one standard collection, one CRA, both
using the same configset

legacy:
action=CREATE=products_old=products=true=1=-1

CRA:

{
  "create-alias": {
"name": "products_20200609",
"router": {
  "name": "category",
  "field": "date_published.year",
  "maxCardinality": 30,
  "mustMatch": "(199[6-9]|20[0,1,2][0-9])"
},
"create-collection": {
  "config": "products",
  "numShards": 1,
  "nrtReplicas": 1,
  "tlogReplicas": 0,
  "maxShardsPerNode": 1,
  "autoAddReplicas": true
}
  }
}

Post a small selection of docs in JSON format using curl to non-CRA
collection -> OK

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_old/update/json/docs
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 11.6M  10071  100 11.6M  5   950k  0:00:14  0:00:12  0:00:02  687k
{
  "responseHeader":{
"rf":1,
"status":0,
"QTime":12541}}

The same documents, sent to the CRA -> boom

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_20200609/update/json/docs
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 11.6M  100   888  100 11.6M366  4913k  0:00:02  0:00:02 --:--:-- 4914k
{
  "responseHeader":{
"status":400,
"QTime":2422},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException",
  
"error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException",
  
"root-error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException"],
"msg":"Async exception during distributed update: Error from
server at 
http://10.20.36.130:8983/solr/products_20200609__CRA__2005_shard1_replica_n1/:
null\n\n\n\nrequest:
http://10.20.36.130:8983/solr/products_20200609__CRA__2005_shard1_replica_n1/\nRemote
error message: Cannot parse provided JSON: JSON Parse Error:
char=\u0002,position=0 AFTER='\u0002'
BEFORE='2update.contentType0applicat'",
"code":400}}

Repeating the request again to the CRA -> OK

> $ docker-compose exec -T solr curl -H 'Content-Type: application/json' 
> -d@/resources/product-json/products-12381742.json 
> http://solr:8983/solr/products_20200609/update/json/docs
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 11.6M  10071  100 11.6M  6  1041k  0:00:11  0:00:11 --:--:--  706k
{
  "responseHeader":{
"rf":1,
"status":0,
"QTime":11446}}

It seems to be related to when a new collection is needed to be
created by the CRA.

The relevant logs:

2020-06-09 02:12:56.107 INFO
(OverseerThreadFactory-9-thread-3-processing-n:10.20.36.130:8983_solr)
[   ] o.a.s.c.a.c.CreateCollectionCmd Create collection
products_20200609__CRA__2005
2020-06-09 02:12:56.232 INFO
(OverseerStateUpdate-72169202568593409-10.20.36.130:8983_solr-n_00)[
  ] o.a.s.c.o.SliceMutator createReplica() {
  "operation":"ADDREPLICA",
  "collection":"products_20200609__CRA__2005",
  "shard":"shard1",
  "core":"products_20200609__CRA__2005_shard1_replica_n1",
  "state":"down",
  "base_url":"http://10.20.36.130:8983/solr;,
  "node_name":"10.20.36.130:8983_solr",
  "type":"NRT",
  "waitForFinalState":"false"}
2020-06-09 02:12:56.444 INFO  (qtp90045638-25) [
x:products_20200609__CRA__2005_shard1_replica_n1]
o.a.s.h.a.CoreAdminOperation core create command
qt=/admin/cores=core_node2=products=true=products_20200609__CRA__2005_shard1_replica_n1=CREATE=1=products_20200609__CRA__2005=shard1=javabin=2=NRT
2020-06-09 02:12:56.476 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.c.SolrConfig
Using Lucene MatchVersion: 8.5.1
2020-06-09 02:12:56.512 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.s.IndexSchema
[products_20200609__CRA__2005_shard1_replica_n1] Schema name=variants
2020-06-09 02:12:56.543 INFO  (qtp90045638-25)
[c:products_20200609__CRA__2005 s:shard1 r:core_node2
x:products_20200609__CRA__2005_shard1_replica_n1] o.a.s.r.RestManager
Registered ManagedResource impl
org.apache.solr.rest.schema.analysis.ManagedSynonymFilterFactory$SynonymManager
for path /schema/analysis/synonyms/default
2020-06-09 02:12:56.543 INFO  

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Erick...

On Sun, Jun 7, 2020 at 1:50 PM Erick Erickson 
wrote:

> https://lucidworks.com/post/indexing-with-solrj/
>
>
> > On Jun 7, 2020, at 3:22 PM, Fiz N  wrote:
> >
> > Thanks Jorn and Erick.
> >
> > Hi Erick, looks like the skeletal SOLRJ program attachment is missing.
> >
> > Thanks
> > Fiz
> >
> > On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
> > wrote:
> >
> >> Here’s a skeletal SolrJ program using Tika as another alternative.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> >>>
> >>> You have to write an external application that creates multiple
> threads,
> >> parses the PDFs and index them in Solr. Ideally you parse the PDFs once
> and
> >> store the resulting text on some file system and then index it. Reason
> is
> >> that if you upgrade to two major versions of Solr you might need to
> reindex
> >> again. Then you can save time because you don’t need to parse the PDFs
> >> again.
> >>> It can be also useful in case you are not sure yet about the final
> >> schema and need to index several times in different schemas etc
> >>>
> >>> You can also use Apache manifoldCF.
> >>>
> >>>
> >>>
> >>>> Am 07.06.2020 um 19:19 schrieb Fiz N :
> >>>>
> >>>> Hello SOLR Experts,
> >>>>
> >>>> I am working on a POC to Index millions of PDF documents present in
> >>>> Multiple Folder in fileshare.
> >>>>
> >>>> Could you please let me the best practices and step to implement it.
> >>>>
> >>>> Thanks
> >>>> Fiz Nadiyal.
> >>
> >>
>
>


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
https://lucidworks.com/post/indexing-with-solrj/


> On Jun 7, 2020, at 3:22 PM, Fiz N  wrote:
> 
> Thanks Jorn and Erick.
> 
> Hi Erick, looks like the skeletal SOLRJ program attachment is missing.
> 
> Thanks
> Fiz
> 
> On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
> wrote:
> 
>> Here’s a skeletal SolrJ program using Tika as another alternative.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
>>> 
>>> You have to write an external application that creates multiple threads,
>> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and
>> store the resulting text on some file system and then index it. Reason is
>> that if you upgrade to two major versions of Solr you might need to reindex
>> again. Then you can save time because you don’t need to parse the PDFs
>> again.
>>> It can be also useful in case you are not sure yet about the final
>> schema and need to index several times in different schemas etc
>>> 
>>> You can also use Apache manifoldCF.
>>> 
>>> 
>>> 
>>>> Am 07.06.2020 um 19:19 schrieb Fiz N :
>>>> 
>>>> Hello SOLR Experts,
>>>> 
>>>> I am working on a POC to Index millions of PDF documents present in
>>>> Multiple Folder in fileshare.
>>>> 
>>>> Could you please let me the best practices and step to implement it.
>>>> 
>>>> Thanks
>>>> Fiz Nadiyal.
>> 
>> 



Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Jorn and Erick.

Hi Erick, looks like the skeletal SOLRJ program attachment is missing.

Thanks
Fiz

On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
wrote:

> Here’s a skeletal SolrJ program using Tika as another alternative.
>
> Best,
> Erick
>
> > On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> >
> > You have to write an external application that creates multiple threads,
> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and
> store the resulting text on some file system and then index it. Reason is
> that if you upgrade to two major versions of Solr you might need to reindex
> again. Then you can save time because you don’t need to parse the PDFs
> again.
> > It can be also useful in case you are not sure yet about the final
> schema and need to index several times in different schemas etc
> >
> > You can also use Apache manifoldCF.
> >
> >
> >
> >> Am 07.06.2020 um 19:19 schrieb Fiz N :
> >>
> >> Hello SOLR Experts,
> >>
> >> I am working on a POC to Index millions of PDF documents present in
> >> Multiple Folder in fileshare.
> >>
> >> Could you please let me the best practices and step to implement it.
> >>
> >> Thanks
> >> Fiz Nadiyal.
>
>


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
Here’s a skeletal SolrJ program using Tika as another alternative.

Best,
Erick

> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> 
> You have to write an external application that creates multiple threads, 
> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and 
> store the resulting text on some file system and then index it. Reason is 
> that if you upgrade to two major versions of Solr you might need to reindex 
> again. Then you can save time because you don’t need to parse the PDFs again. 
> It can be also useful in case you are not sure yet about the final schema and 
> need to index several times in different schemas etc
> 
> You can also use Apache manifoldCF.
> 
> 
> 
>> Am 07.06.2020 um 19:19 schrieb Fiz N :
>> 
>> Hello SOLR Experts,
>> 
>> I am working on a POC to Index millions of PDF documents present in
>> Multiple Folder in fileshare.
>> 
>> Could you please let me the best practices and step to implement it.
>> 
>> Thanks
>> Fiz Nadiyal.



Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Jörn Franke
You have to write an external application that creates multiple threads, parses 
the PDFs and index them in Solr. Ideally you parse the PDFs once and store the 
resulting text on some file system and then index it. Reason is that if you 
upgrade to two major versions of Solr you might need to reindex again. Then you 
can save time because you don’t need to parse the PDFs again. 
It can be also useful in case you are not sure yet about the final schema and 
need to index several times in different schemas etc

You can also use Apache manifoldCF.
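
As a rough single-file sketch of such an external application (paths,
collection and field names are invented, and a real version would add a thread
pool and error handling):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        Path pdf = Paths.get("/data/docs/sample.pdf");

        // extract text and metadata with Tika
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, metadata, new ParseContext());
        }

        // build a Solr document from the extracted content
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pdf.toString());
        String title = metadata.get(TikaCoreProperties.TITLE);
        if (title != null) {
            doc.addField("title_txt", title);
        }
        doc.addField("content_txt", handler.toString());

        // send it to Solr
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/pdfs").build()) {
            solr.add(doc);
            solr.commit();
        }
    }
}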



> Am 07.06.2020 um 19:19 schrieb Fiz N :
> 
> Hello SOLR Experts,
> 
> I am working on a POC to Index millions of PDF documents present in
> Multiple Folder in fileshare.
> 
> Could you please let me the best practices and step to implement it.
> 
> Thanks
> Fiz Nadiyal.


Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Hello SOLR Experts,

I am working on a POC to index millions of PDF documents present in
multiple folders in a fileshare.

Could you please let me know the best practices and steps to implement it.

Thanks
Fiz Nadiyal.


Re: Not all EML files are indexing during indexing

2020-06-03 Thread Charlie Hull
I think the OP is indexing flat files, not web pages (but otherwise, I 
agree with you that Scrapy is great - I know some of the people behind 
it too and they're a good bunch).


Charlie

On 02/06/2020 16:41, Walter Underwood wrote:

On Jun 2, 2020, at 7:40 AM, Charlie Hull  wrote:

If it was me I'd probably build a standalone indexer script in Python that did 
the file handling, called out to a separate Tika service for extraction, posted 
to Solr.

I would do the same thing, and I would base that script on Scrapy (https://scrapy.org). I worked on a Python-based web spider for about ten 
years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Not all EML files are indexing during indexing

2020-06-02 Thread Walter Underwood

> On Jun 2, 2020, at 7:40 AM, Charlie Hull  wrote:
> 
> If it was me I'd probably build a standalone indexer script in Python that 
> did the file handling, called out to a separate Tika service for extraction, 
> posted to Solr.

I would do the same thing, and I would base that script on Scrapy 
(https://scrapy.org). I worked on a Python-based web 
spider for about ten years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Not all EML files are indexing during indexing

2020-06-02 Thread Charlie Hull
Ah OK. I haven't used SimplePostTool myself and I note the docs say 
"View this not as a best-practice code example, but as a standalone 
example built with an explicit purpose of not having external jar 
dependencies."


I'm wondering if it's some kind of synchronisation issue between new 
files arriving in the folder and being picked up by your Powershell 
script. Hard to say really without seeing all the code...perhaps take 
out the Tika & Solr parts for now and verify the rest of your code 
really can spot every new or updated file that arrives?


If it was me I'd probably build a standalone indexer script in Python 
that did the file handling, called out to a separate Tika service for 
extraction, posted to Solr.


Cheers


Charlie





On 02/06/2020 14:48, Zheng Lin Edwin Yeo wrote:

Hi Charlie,

The main code that is doing the indexing is from the Solr's
SimplePostTools, but we have done some modification to it.

The walking through a folder is done by PowerShell script, the extracting
of the content from .eml file is from Tika that comes with Solr, and the
images in the .eml file are done by OCR that comes with Solr.

As we have modified the SimplePostTool code to do the checking if the file
already exists in the index by running a Solr search query of the ID, I'm
thinking if this issue is caused by the PowerShell script or the query in
the SimplePostTool code not being able to keep up with the large number of
files?

Regards,
Edwin


On Mon, 1 Jun 2020 at 17:19, Charlie Hull  wrote:


Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any
code for actually walking a folder, extracting the content from .eml
files and pushing this data into its index, so I'm guessing you've built
something external?

Charlie


On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:

Hi,

I am running this on Solr 7.6.0

Currently I have a situation whereby there's more than 2 million EML file
in a folder, and the folder is constantly updating the EML files with the
latest information and adding new EML files.

When I do the indexing, it is suppose to index the new EML files, and
update those index in which the EML file content has changed. However, I
found that not all new EML files are updated with each run of the

indexing.

Could it be caused by the large number of files in the folder? Or due to
some other reasons?

Regards,
Edwin


--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com




--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Not all EML files are indexing during indexing

2020-06-02 Thread Zheng Lin Edwin Yeo
Hi Charlie,

The main code that is doing the indexing is from the Solr's
SimplePostTools, but we have done some modification to it.

The walking through a folder is done by PowerShell script, the extracting
of the content from .eml file is from Tika that comes with Solr, and the
images in the .eml file are done by OCR that comes with Solr.

As we have modified the SimplePostTool code to do the checking if the file
already exists in the index by running a Solr search query of the ID, I'm
thinking if this issue is caused by the PowerShell script or the query in
the SimplePostTool code not being able to keep up with the large number of
files?

Regards,
Edwin


On Mon, 1 Jun 2020 at 17:19, Charlie Hull  wrote:

> Hi Edwin,
>
> What code is actually doing the indexing? AFAIK Solr doesn't include any
> code for actually walking a folder, extracting the content from .eml
> files and pushing this data into its index, so I'm guessing you've built
> something external?
>
> Charlie
>
>
> On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:
> > Hi,
> >
> > I am running this on Solr 7.6.0
> >
> > Currently I have a situation whereby there's more than 2 million EML file
> > in a folder, and the folder is constantly updating the EML files with the
> > latest information and adding new EML files.
> >
> > When I do the indexing, it is suppose to index the new EML files, and
> > update those index in which the EML file content has changed. However, I
> > found that not all new EML files are updated with each run of the
> indexing.
> >
> > Could it be caused by the large number of files in the folder? Or due to
> > some other reasons?
> >
> > Regards,
> > Edwin
> >
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>
>


Re: Not all EML files are indexing during indexing

2020-06-01 Thread Charlie Hull

Hi Edwin,

What code is actually doing the indexing? AFAIK Solr doesn't include any 
code for actually walking a folder, extracting the content from .eml 
files and pushing this data into its index, so I'm guessing you've built 
something external?


Charlie


On 01/06/2020 02:13, Zheng Lin Edwin Yeo wrote:

Hi,

I am running this on Solr 7.6.0

Currently I have a situation whereby there's more than 2 million EML file
in a folder, and the folder is constantly updating the EML files with the
latest information and adding new EML files.

When I do the indexing, it is suppose to index the new EML files, and
update those index in which the EML file content has changed. However, I
found that not all new EML files are updated with each run of the indexing.

Could it be caused by the large number of files in the folder? Or due to
some other reasons?

Regards,
Edwin



--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Not all EML files are indexing during indexing

2020-05-31 Thread Zheng Lin Edwin Yeo
Hi,

I am running this on Solr 7.6.0

Currently I have a situation whereby there are more than 2 million EML files
in a folder, and the folder is constantly updated with the latest information
in existing EML files and with new EML files.

When I do the indexing, it is supposed to index the new EML files, and
update the index entries for EML files whose content has changed. However, I
found that not all new EML files are indexed with each run of the indexing.

Could it be caused by the large number of files in the folder? Or due to
some other reasons?

Regards,
Edwin


Re: Indexing huge data onto solr

2020-05-26 Thread Erick Erickson
It Depends (tm). Often, you can create a single (albeit, perhaps complex)
SQL query that does this for you and just process the response.

I’ve also seen situations where it’s possible to hold one of the tables 
in memory on the client and just use that rather than a separate query.

It depends on the characteristics of your particular database, your DBA
could probably help.

Best,
Erick

> On May 25, 2020, at 11:56 PM, Srinivas Kashyap 
>  wrote:
> 
> Hi Erick,
> 
> Thanks for the below response. The link which you provided holds good if you 
> have single entity where you can join the tables and index it. But in our 
> scenario, we have nested entities joining different tables as shown below:
> 
> db-data-config.xml:
> 
> 
> 
> (table 1 join table 2)
> (table 3 join table 4)
> (table 5 join table 6)
> (table 7 join table 8)
> 
> 
> 
> Do you have any recommendations for it to run multiple sql’s and make it as 
> single solr document that can be sent over solrJ for indexing?
> 
> Say parent entity has 100 documents, should I iterate over each one of parent 
> tuples and execute the child entity sql’s(with where condition of parent) to 
> create one solr document? Won’t it be more load on database by executing more 
> sqls? Is there an optimum solution?
> 
> Thanks,
> Srinivas
> From: Erick Erickson 
> Sent: 22 May 2020 22:52
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing huge data onto solr
> 
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
> 
> https://lucidworks.com/post/indexing-with-solrj/
> 
> It’s especially instructive to comment out just the call to 
> CloudSolrClient.add(doclist…); If
> that _still_ takes a long time, then your DB query is the root of the 
> problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above 
> test will tell you
> where to go to try to speed things up.
> 
> Best,
> Erick
> 
>> On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
>> mailto:srini...@bamboorose.com.INVALID>> 
>> wrote:
>> 
>> Hi All,
>> 
>> We are runnnig solr 8.4.1. We have a database table which has more than 100 
>> million of records. Till now we were using DIH to do full-import on the 
>> tables. But for this table, when we do full-import via DIH it is taking more 
>> than 3-4 days to complete and also it consumes fair bit of JVM memory while 
>> running.
>> 
>> Are there any speedier/alternates ways to load data onto this solr core.
>> 
>> P.S: Only initial data import is problem, further updates/additions to this 
>> core is being done through SolrJ.
>> 
>> Thanks,
>> Srinivas
>> 



RE: Indexing huge data onto solr

2020-05-25 Thread Srinivas Kashyap
Hi Erick,

Thanks for the below response. The link which you provided holds good if you 
have a single entity where you can join the tables and index it. But in our 
scenario, we have nested entities joining different tables as shown below:

db-data-config.xml:



 (table 1 join table 2)
 (table 3 join table 4)
 (table 5 join table 6)
 (table 7 join table 8)



Do you have any recommendations for running multiple SQLs and building a 
single Solr document that can be sent over SolrJ for indexing?

Say the parent entity has 100 documents; should I iterate over each of the parent 
tuples and execute the child entity SQLs (with a WHERE condition on the parent) to 
create one Solr document? Won't that put more load on the database by executing more 
SQLs? Is there an optimum solution?

Thanks,
Srinivas
From: Erick Erickson 
Sent: 22 May 2020 22:52
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data onto solr

You have a lot more control over the speed and form of importing data if
you just do the initial load in SolrJ. Here’s an example, taking the Tika
parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It’s especially instructive to comment out just the call to 
CloudSolrClient.add(doclist…); If
that _still_ takes a long time, then your DB query is the root of the problem. 
Even with 100M
records, I’d be really surprised if Solr is the bottleneck, but the above test 
will tell you
where to go to try to speed things up.

Best,
Erick

> On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
> mailto:srini...@bamboorose.com.INVALID>> 
> wrote:
>
> Hi All,
>
> We are runnnig solr 8.4.1. We have a database table which has more than 100 
> million of records. Till now we were using DIH to do full-import on the 
> tables. But for this table, when we do full-import via DIH it is taking more 
> than 3-4 days to complete and also it consumes fair bit of JVM memory while 
> running.
>
> Are there any speedier/alternates ways to load data onto this solr core.
>
> P.S: Only initial data import is problem, further updates/additions to this 
> core is being done through SolrJ.
>
> Thanks,
> Srinivas
> 


Re: Indexing huge data onto solr

2020-05-22 Thread matthew sporleder
I can index (without nested entities ofc ;) ) 100M records in about
6-8 hours on a pretty low-powered machine using vanilla DIH -> mysql
so it is probably worth looking at why it is going slow before writing
your own indexer (which we are finally having to do)

On Fri, May 22, 2020 at 1:22 PM Erick Erickson  wrote:
>
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
>
> https://lucidworks.com/post/indexing-with-solrj/
>
> It’s especially instructive to comment out just the call to 
> CloudSolrClient.add(doclist…); If
> that _still_ takes a long time, then your DB query is the root of the 
> problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above 
> test will tell you
> where to go to try to speed things up.
>
> Best,
> Erick
>
> > On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
> >  wrote:
> >
> > Hi All,
> >
> > We are runnnig solr 8.4.1. We have a database table which has more than 100 
> > million of records. Till now we were using DIH to do full-import on the 
> > tables. But for this table, when we do full-import via DIH it is taking 
> > more than 3-4 days to complete and also it consumes fair bit of JVM memory 
> > while running.
> >
> > Are there any speedier/alternates ways to load data onto this solr core.
> >
> > P.S: Only initial data import is problem, further updates/additions to this 
> > core is being done through SolrJ.
> >
> > Thanks,
> > Srinivas
> > 
>


Re: Indexing huge data onto solr

2020-05-22 Thread Erick Erickson
You have a lot more control over the speed and form of importing data if
you just do the initial load in SolrJ. Here’s an example, taking the Tika
parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It’s especially instructive to comment out just the call to 
CloudSolrClient.add(doclist…); If
that _still_ takes a long time, then your DB query is the root of the problem. 
Even with 100M
records, I’d be really surprised if Solr is the bottleneck, but the above test 
will tell you
where to go to try to speed things up.
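
For illustration, a stripped-down version of that approach could look like the
following (JDBC URL, SQL and field names are invented; this is not the code
from the post above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                     Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build();
             Connection db = DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass");
             Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name, description FROM big_table")) {

            solr.setDefaultCollection("mycollection");

            List<SolrInputDocument> batch = new ArrayList<>(1000);
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("name_txt", rs.getString("name"));
                doc.addField("description_txt", rs.getString("description"));
                batch.add(doc);
                if (batch.size() == 1000) {
                    solr.add(batch);   // comment this line out to time the DB side alone
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
        }
    }
}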

Best,
Erick

> On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
>  wrote:
> 
> Hi All,
> 
> We are runnnig solr 8.4.1. We have a database table which has more than 100 
> million of records. Till now we were using DIH to do full-import on the 
> tables. But for this table, when we do full-import via DIH it is taking more 
> than 3-4 days to complete and also it consumes fair bit of JVM memory while 
> running.
> 
> Are there any speedier/alternates ways to load data onto this solr core.
> 
> P.S: Only initial data import is problem, further updates/additions to this 
> core is being done through SolrJ.
> 
> Thanks,
> Srinivas
> 



Indexing huge data onto solr

2020-05-22 Thread Srinivas Kashyap
Hi All,

We are running Solr 8.4.1. We have a database table which has more than 100 
million records. Till now we were using DIH to do a full-import on the tables. 
But for this table, when we do a full-import via DIH it takes more than 3-4 
days to complete and also consumes a fair bit of JVM memory while running.

Are there any speedier/alternate ways to load data onto this Solr core?

P.S: Only the initial data import is a problem; further updates/additions to this 
core are being done through SolrJ.

Thanks,
Srinivas



Re: Different indexing times for two different collections with different data sizes

2020-05-20 Thread Erick Erickson
The easy question first. There is an absolute limit of 2B docs per shard. 
Internally, Lucene assigns an integer internal document ID that overflows after 
2B. That includes deleted docs, so your “maxDoc” on the admin page is the 
limit. Practically, as you are finding, you run into performance issues at counts 
significantly lower than 2B. Note that when segments are merged, the internal IDs get 
reassigned...

Indexing scales pretty linearly with the number of shards, _assuming_ you’re 
adding more hardware. To really answer the question you need to look at what 
the bottleneck is on your current system. IOW, “It Depends(tm)”.

Let’s claim your current system is running all your CPUs flat out. Or I/O is 
maxed out. Adding more shards to the existing hardware won’t help. Perhaps you 
don’t even need more shards, you just need to move some of your replicas to new 
hardware.

OTOH, let’s claim that your indexing isn’t straining your current hardware at 
all, then adding more shards to existing hardware should increase throughput.

Probably the issue is merging. When segments are merged, they’re re-written. My 
guess is that your larger collection is doing more merging than your test 
collection, but that’s a guess. See Mike McCandless’ blog, TieredMergePolicy is 
the default you’re probably using: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Best,
Erick

> On May 20, 2020, at 7:25 AM, Kommu, Vinodh K.  wrote:
> 
> Hi,
> 
> Recently we had noticed that one of the largest collection (shards = 6 ; 
> replication factor =3) which holds up to 1TB of data & nearly 3.2 billion of 
> docs is taking longer time to index than it used to before. To see the 
> indexing time difference, we created another collection using largest 
> collection configs (schema.xml and solrconfig.xml files) and loaded the 
> collection with up to 100 million docs which is ~60G of data. Later we tried 
> to index exactly same 25 million docs data file on these two collections 
> which clearly showed timing difference. BTW, we are running on Solr 7.7.1 
> version.
> 
> Original largest collection has completed indexing in ~100mins
> Newly created collection (which has 100 million docs) has completed in ~70mins
> 
> This indexing time difference is due to the amount of data that each 
> collection hold? If yes, how to increase indexing performance on larger data 
> collection? adding more shards can help here?
> 
> Also, is there any threshold numbers for a single shard can hold in terms of 
> size and number of docs before adding a new shard?
> 
> Any answers would really help!!
> 
> 
> Thanks & Regards,
> Vinodh
> 



Different indexing times for two different collections with different data sizes

2020-05-20 Thread Kommu, Vinodh K.
Hi,

Recently we noticed that one of our largest collections (shards = 6; 
replication factor = 3), which holds up to 1TB of data and nearly 3.2 billion 
docs, is taking longer to index than it used to. To see the indexing 
time difference, we created another collection using the largest collection's configs 
(schema.xml and solrconfig.xml files) and loaded the collection with up to 100 
million docs, which is ~60G of data. Later we tried to index exactly the same 25 
million docs data file on these two collections, which clearly showed the timing 
difference. BTW, we are running on Solr 7.7.1.

The original largest collection completed indexing in ~100 mins
The newly created collection (which has 100 million docs) completed in ~70 mins

Is this indexing time difference due to the amount of data that each collection 
holds? If yes, how do we increase indexing performance on a larger data collection? 
Would adding more shards help here?

Also, is there any threshold number for what a single shard can hold in terms of 
size and number of docs before adding a new shard?

Any answers would really help!!


Thanks & Regards,
Vinodh



Re: nested entities and DIH indexing time

2020-05-14 Thread Shawn Heisey
On 5/14/2020 3:14 PM, matthew sporleder wrote:
> Can a non-nested entity write into existing docs, or do they always
> have to produce document-per-entity?
This is the only thing I found on this topic, and it is on a third-party 
website, so I can't say much about how accurate it is:


https://stackoverflow.com/questions/21006045/can-solr-dih-do-atomic-updates

I have never used a ScriptTransformer, so I do not know how to actually 
do this.


Your schema would have to be compatible with atomic updates.

Thanks,
Shawn



Re: nested entities and DIH indexing time

2020-05-14 Thread matthew sporleder
On Thu, May 14, 2020 at 4:46 PM Shawn Heisey  wrote:
>
> On 5/14/2020 9:36 AM, matthew sporleder wrote:
> > It appears that adding entities to my entities in my data import
> > config is slowing down my import process by a lot.  Is there a good
> > way to speed this up?  I see the ID's are individually queried instead
> > of using IN() or similar normal techniques to make things faster.
> >
> > Just looking for some tips.  I prefer this architecture to the way we
> > currently do it with complex SQL, inserting weird strings, and then
> > splitting on them (gross but faster).
>
> When you have nested entities, this is how DIH works.  A separate SQL
> query for the inner entity is made for each row returned on the outer
> entity.  Nested entities tend to be extremely slow for this reason.
>
> The best way to work around this is to make the database server do the
> heavy lifting -- using JOIN or other methods so that you only need one
> entity and one SQL query.  Doing this will mean that you'll need to
> split the data after import, using either the DIH config or the analysis
> configuration in the schema.
>
> Thanks,
> Shawn

This is too bad because it is very clean and the JOIN/CONCAT/SPLIT
method is very gross.

I was also hoping to use different delta queries for each nested entity.

Can a non-nested entity write into existing docs, or do they always
have to produce document-per-entity?


Re: nested entities and DIH indexing time

2020-05-14 Thread Shawn Heisey

On 5/14/2020 9:36 AM, matthew sporleder wrote:

It appears that adding entities to my entities in my data import
config is slowing down my import process by a lot.  Is there a good
way to speed this up?  I see the ID's are individually queried instead
of using IN() or similar normal techniques to make things faster.

Just looking for some tips.  I prefer this architecture to the way we
currently do it with complex SQL, inserting weird strings, and then
splitting on them (gross but faster).


When you have nested entities, this is how DIH works.  A separate SQL 
query for the inner entity is made for each row returned on the outer 
entity.  Nested entities tend to be extremely slow for this reason.


The best way to work around this is to make the database server do the 
heavy lifting -- using JOIN or other methods so that you only need one 
entity and one SQL query.  Doing this will mean that you'll need to 
split the data after import, using either the DIH config or the analysis 
configuration in the schema.
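
As an illustration of that pattern (table, column and field names are
invented; GROUP_CONCAT is the MySQL spelling, other databases have equivalents
such as STRING_AGG):

<entity name="item"
        transformer="RegexTransformer"
        query="SELECT i.id, i.title,
                      GROUP_CONCAT(t.tag SEPARATOR '|') AS tags
               FROM item i
               LEFT JOIN item_tag t ON t.item_id = i.id
               GROUP BY i.id, i.title">
  <field column="id" name="id"/>
  <field column="title" name="title"/>
  <!-- split the concatenated string into a multi-valued Solr field -->
  <field column="tags" name="tags" splitBy="\|"/>
</entity>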


Thanks,
Shawn


nested entities and DIH indexing time

2020-05-14 Thread matthew sporleder
It appears that adding entities to my entities in my data import
config is slowing down my import process by a lot.  Is there a good
way to speed this up?  I see the ID's are individually queried instead
of using IN() or similar normal techniques to make things faster.

Just looking for some tips.  I prefer this architecture to the way we
currently do it with complex SQL, inserting weird strings, and then
splitting on them (gross but faster).




RE: Indexing Korean

2020-05-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Oh wow, I had no idea this existed. Thank you so much!

Best,
Audrey

On 5/1/20, 12:58 PM, "Markus Jelsma"  wrote:

Hello,

Although it is not mentioned in Solr's language analysis page in the 
manual, Lucene has had support for Korean for quite a while now.


https://lucene.apache.org/core/8_5_0/analyzers-nori/index.html

Regards,
Markus



-Original message-
> From:Audrey Lorberfeld - audrey.lorberf...@ibm.com 

> Sent: Friday 1st May 2020 17:34
> To: solr-user@lucene.apache.org
> Subject: Indexing Korean
> 
>  Hi All,
> 
> My team would like to index Korean, but it looks like Solr OOTB does not 
have explicit support for Korean. If any of you have schema pipelines you could 
share for your Korean documents, I would love to see them! I'm assuming I would 
just use some combination of the OOTB CJK factories
> 
> Best,
> Audrey
> 
> 



RE: Indexing Korean

2020-05-01 Thread Markus Jelsma
Hello,

Although it is not mentioned in Solr's language analysis page in the manual, 
Lucene has had support for Korean for quite a while now.

https://lucene.apache.org/core/8_5_0/analyzers-nori/index.html
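
A schema field type built on it might look roughly like this -- an untested 
sketch, and depending on the Solr version the nori analyzer jar may first need 
to be added to the classpath (e.g. from the analysis-extras contrib):

  <fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- Nori tokenizer splits Korean text into morphemes -->
      <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard"/>
      <!-- drop particles, endings and other parts of speech that rarely help search -->
      <filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
      <!-- fold Hanja characters to their Hangul readings -->
      <filter class="solr.KoreanReadingFormFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>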

Regards,
Markus

 
 
-Original message-
> From:Audrey Lorberfeld - audrey.lorberf...@ibm.com 
> Sent: Friday 1st May 2020 17:34
> To: solr-user@lucene.apache.org
> Subject: Indexing Korean
> 
>  Hi All,
> 
> My team would like to index Korean, but it looks like Solr OOTB does not have 
> explicit support for Korean. If any of you have schema pipelines you could 
> share for your Korean documents, I would love to see them! I'm assuming I 
> would just use some combination of the OOTB CJK factories
> 
> Best,
> Audrey
> 
> 


Indexing Korean

2020-05-01 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
 Hi All,

My team would like to index Korean, but it looks like Solr OOTB does not have 
explicit support for Korean. If any of you have schema pipelines you could 
share for your Korean documents, I would love to see them! I'm assuming I would 
just use some combination of the OOTB CJK factories

Best,
Audrey



Re: Solr indexing with Tika DIH - ZeroByteFileException

2020-04-23 Thread Charlie Hull
If users can upload any PDF, including broken or huge ones, and some 
cause a Tika error, you should decouple Tika from Solr and run it as a 
separate process to extract text before indexing with Solr. Otherwise 
some of what is uploaded *will* break Solr.

https://lucidworks.com/post/indexing-with-solrj/ has some good hints.
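
If the Tika-inside-DIH setup has to stay for now, the entity wrapping the 
TikaEntityProcessor can at least be told to skip failing files instead of 
aborting the whole import -- a rough, untested sketch with invented paths and 
field names:

  <dataConfig>
    <dataSource type="BinFileDataSource"/>
    <document>
      <entity name="files" processor="FileListEntityProcessor" dataSource="null"
              baseDir="/data/docs" fileName=".*\.pdf" recursive="true" rootEntity="false">
        <!-- onError="skip" drops a document that throws (such as a zero-byte file) and carries on -->
        <entity name="tika" processor="TikaEntityProcessor" onError="skip"
                url="${files.fileAbsolutePath}" format="text">
          <field column="text" name="content"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

That only covers the "how do I ignore it" half of the earlier question, not how 
to identify which file a given document number refers to.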

Cheers

Charlie

On 11/06/2019 15:27, neilb wrote:

Hi, while going through solr logs, I found data import error for certain
documents. Here are details about the error.

Exception while processing: file document :
null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
to read content Processing Document # 7866
at
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
at
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.ZeroByteFileException: InputStream must
have > 0 bytes
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:122)
at
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)


How do I know which document (document name with path) is #7866? And how do I
ignore ZeroByteFileException, since the document network share is not in my control?
Users can upload PDFs of any size to it.

Thanks!






--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Solr indexing with Tika DIH - ZeroByteFileException

2020-04-22 Thread ravi kumar amaravadi
Hi,
I am also facing the same issue. Does anyone have an update/solution on how to fix
this issue as part of DIH?

Thanks.

Regards,
Ravi kumar





Re: Indexing data from multiple data sources

2020-04-20 Thread Charlie Hull
The link you quote is Sematext's mirror of the Apache solr-user mailing 
list. There are others also providing copies of this list. As the cat is 
very much out of the bag your best course of action is to change all the 
logins and passwords that have been leaked and review your security 
procedures.


Cheers

Charlie

On 18/04/2020 13:27, RaviKiran Moola wrote:

Hi,
Greetings of the day!!!

Unfortunately we have enclosed our database source details in the Solr 
community post while sending our queries to solr support as mentioned 
in the below mail.


We find that it has been posted with this link 
https://sematext.com/opensee/m/Solr/eHNlswSd1vD6AF?subj=RE+Indexing+data+from+multiple+data+sources


As it is open to the world, what we are requesting here is, could you 
please remove that post as soon as possible before it creates any 
security issues for us.


Your help is very much appreciated!!!

FYI.
Here I'm attaching the below screenshot




Thanks & Regards,

Ravikiran Moola



*From:* RaviKiran Moola
*Sent:* Friday, April 17, 2020 9:13 PM
*To:* solr-user@lucene.apache.org 
*Subject:* RE: Indexing data from multiple data sources
Hi,

Greetings!!!

We are working on indexing data from multiple data sources (MySQL & 
MSSQL) in a single collection. We specified data source details like 
connection details along with the required fields for both data 
sources in a single data config file, along with specified required 
fields details in the managed schema and here fetching the same 
columns from both data sources by specifying the common “unique key”.


Unable to index the data from the data sources using solr.

Here I’m attaching the data config file and screenshot.

Data config file:

 url="jdbc:mysql://182.74.133.92:3306/ra_dev" user="devuser" 
password="Welcome_009" batchSize="1" />
 driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" 
url="jdbc:sqlserver://182.74.133.92;databasename=BB_SOLR" 
user="matuser" password="MatDev:07"/>

  
  

   
   
  

   
   
  

 



Thanks & Regards,

Ravikiran Moola

+91-9494924492




--
Charlie Hull
OpenSource Connections, previously Flax

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.o19s.com



Re: Indexing data from multiple data sources

2020-04-18 Thread RaviKiran Moola
Hi,
Greetings of the day!!!

Unfortunately we have enclosed our database source details in the Solr 
community post while sending our queries to solr support as mentioned in the 
below mail.

We find that it has been posted with this link 
https://sematext.com/opensee/m/Solr/eHNlswSd1vD6AF?subj=RE+Indexing+data+from+multiple+data+sources

As it is open to the world, what we are requesting here is, could you please 
remove that post as soon as possible before it creates any security issues for 
us.

Your help is very much appreciated!!!

FYI.
Here I'm attaching the below screenshot

[inline screenshot attachment omitted]



Thanks & Regards,

Ravikiran Moola



From: RaviKiran Moola
Sent: Friday, April 17, 2020 9:13 PM
To: solr-user@lucene.apache.org 
Subject: RE: Indexing data from multiple data sources

Hi,

Greetings!!!

We are working on indexing data from multiple data sources (MySQL & MSSQL) in a 
single collection. We specified data source details like connection details 
along with the required fields for both data sources in a single data config 
file, along with specified required fields details in the managed schema and 
here fetching the same columns from both data sources by specifying the common 
“unique key”.

Unable to index the data from the data sources using solr.

Here I’m attaching the data config file and screenshot.

Data config file:

 
 
  
  
   
   
  
   
   
  

 




Thanks & Regards,

Ravikiran Moola

+91-9494924492



Indexing data from multiple data sources(CSV, RDBMS)

2020-04-18 Thread Shravan Kumar Bolla
Hi,

I am working on indexing data from multiple data sources into a single 
collection. I specified the data source information in the data-config file and 
also updated the managed schema by adding the fields from all the data sources, 
specifying a common unique key across all the sources.

Here is a sample config file.

 
>   url="jdbc:mysql://localhost/aaa" user="***" password="***" batchSize="1" />
>   driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" 
> url="jdbc:sqlserver://localhost;databasename=aaa" user="***" password="**"/>
>   
>   
>
>
>   
>
>
>   
> 
>  
> 

Error Details:
Full Import failed:java.lang.RuntimeException:java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException: Invalid type for 
data source: Jdbc-2
Processing Document #1

Thanks,
Shravan
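
That error message usually means DIH could not instantiate the class named in a 
dataSource type attribute -- for example when a name like "Jdbc-2" ends up in 
the type attribute. For comparison, a minimal two-source data-config (with 
placeholder drivers, URLs, credentials and tables) would normally look something 
like this:

  <dataConfig>
    <!-- each source gets type="JdbcDataSource" plus a distinct name -->
    <dataSource name="mysql-ds" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/aaa" user="***" password="***"/>
    <dataSource name="mssql-ds" type="JdbcDataSource" driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost;databaseName=aaa" user="***" password="***"/>
    <document>
      <!-- entities select their source by name, not by type -->
      <entity name="fromMysql" dataSource="mysql-ds" query="SELECT id, title FROM table_a">
        <field column="id" name="id"/>
        <field column="title" name="title"/>
      </entity>
      <entity name="fromMssql" dataSource="mssql-ds" query="SELECT id, title FROM table_b">
        <field column="id" name="id"/>
        <field column="title" name="title"/>
      </entity>
    </document>
  </dataConfig>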


Re: Indexing data from multiple data sources

2020-04-17 Thread Jörn Franke
What does your solr.log say? Any errors?

> Am 17.04.2020 um 20:22 schrieb RaviKiran Moola 
> :
> 
> 
> Hi,
> 
> Greetings!!!
> 
> We are working on indexing data from multiple data sources (MySQL & MSSQL) in 
> a single collection. We specified data source details like connection details 
> along with the required fields for both data sources in a single data config 
> file, along with specified required fields details in the managed schema and 
> here fetching the same columns from both data sources by specifying the 
> common “unique key”.
> 
> Unable to index the data from the data sources using solr.
> 
> Here I’m attaching the data config file and screenshot.
> 
> Data config file:
>  
>   url="jdbc:mysql://182.74.133.92:3306/ra_dev" user="devuser" 
> password="Welcome_009" batchSize="1" />  
>   driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" 
> url="jdbc:sqlserver://182.74.133.92;databasename=BB_SOLR" user="matuser" 
> password="MatDev:07"/>   
>   
>   
>
> 
>   
>  
>   
>   
>   
>  
> 
> 
> 
> Thanks & Regards,
> Ravikiran Moola
> +91-9494924492
> 


RE: Indexing data from multiple data sources

2020-04-17 Thread RaviKiran Moola
Hi,

Greetings!!!

We are working on indexing data from multiple data sources (MySQL & MSSQL) in a 
single collection. We specified data source details like connection details 
along with the required fields for both data sources in a single data config 
file, along with specified required fields details in the managed schema and 
here fetching the same columns from both data sources by specifying the common 
“unique key”.

Unable to index the data from the data sources using solr.

Here I’m attaching the data config file and screenshot.

Data config file:

 
 
  
  
   
   
  
   
   
  

 




Thanks & Regards,

Ravikiran Moola

+91-9494924492



Re: Inconsistent / confusing documentation on indexing nested documents.

2020-04-03 Thread Chris Hostetter


: Is the documentation wrong or have I misunderstood it?

The documentation is definitely wrong, thanks for pointing this out...

https://issues.apache.org/jira/browse/SOLR-14383


-Hoss
http://www.lucidworks.com/


Inconsistent / confusing documentation on indexing nested documents.

2020-04-03 Thread Peter Pimley
Hi,

The page "Indexing Nested Documents" has an XML example showing two
different ways of adding nested documents:

https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html#xml-examples

The text says:

  "It illustrates two styles of adding child documents: the first is
associated via a field "comment" (preferred), and the second is done
in the classic way now referred to as an "anonymous" or "unlabelled"
child document."

However in the XML directly below there is no field named "comment".
There is one named "content" and another named "comments" (plural),
but no field named "comment".  In fact, looking at the Json example
immediately below, I wonder if the XML element currently named
"content" should be named "comments", and what is currently marked
"comments" should be "content"?

Secondly, in the Json example it says:

  "The labelled relationship here is one child document but could have
been wrapped in array brackets."

However in the actual Json, the parent document (ID=1) with a labelled
relationship has two child documents (IDs 2 and 3), and they are
already in array brackets.

Is the documentation wrong or have I misunderstood it?

Thanks,
Peter


Debugging indexing timeouts

2020-03-02 Thread fredsearch157
Hi all,

A couple of months ago, I migrated my solr deployment off of some legacy 
hardware (old spinning disks), and onto much newer hardware (SSD's, newer 
processors). While I am seeing much improved search performance since this 
move, I am also seeing intermittent indexing timeouts for 10-15 min periods 
about once a day or so (both from my indexing code and between replicas), which 
were not happening before. I have been scratching my head trying to figure out 
why, but have thus far been unsuccessful. I was hoping someone on here could 
maybe offer some thoughts as to how to further debug.

Some information about my setup:
-Solr Cloud 8.3, running on linux
-2 nodes, 1 shard (2 replicas) per collection
-Handful of collections, maxing out in the 10s of millions of docs per 
collection. Less than 100 million docs total
-Nodes are 8 CPU cores with SSD storage. 64 GB of RAM on server, heap size of 
26 GB.
-Relatively aggressive NRT tuning (hard commit 60 sec, soft commit 15 sec).
-Multi-threaded indexing process using SolrJ CloudSolrClient, sending updates 
in batches of ~1000 docs
-Indexing and querying is done constantly throughout the day

The indexing process, heap sizes, and soft/hard commit intervals were carefully 
tuned for my original setup, and were working flawlessly until the hardware 
change. It's only since the move to faster hardware/SSDs that I am now seeing 
timeouts during indexing (maybe counter-intuitively).

My first thought was that I was having stop the world GC pauses which were 
causing the timeouts, but when I captured GC logs during one of the timeout 
windows and ran it through a log analyzer, there were no issues detected. 
Largest GC pause was under 1 second. I monitor the heap continuously, and I 
always sit between 15-20 GB of 26 GB used...so I don't think that my heap is 
too small necessarily.

My next thought was that maybe it had to do with segment merges happening in 
the background, causing indexing to block. I am using the dynamic defaults for 
the merge scheduler, which almost certainly changed when I moved hardware 
(since now it is detecting a non-spinning disk, and my understanding is that 
the max concurrent merges is set based on this). I have been unable to confirm 
this though. I do not see any merge warnings or errors in the logs, and I have 
thus far been unable to catch it in action to try and confirm via a thread dump.
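
For reference, the dynamic defaults can be taken out of the equation by 
configuring the scheduler explicitly in the indexConfig section of 
solrconfig.xml -- the numbers below are only illustrative, not a recommendation:

  <indexConfig>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- explicit values override Lucene's spinning-disk vs SSD auto-detection -->
      <int name="maxMergeCount">9</int>
      <int name="maxThreadCount">4</int>
    </mergeScheduler>
  </indexConfig>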

Interestingly, when I did take a thread dump during normal execution, I noticed 
that one of my nodes has a huge number of running threads (~1700) compared to 
the other node (~150). Most of the threads are updateExecutor threads that 
appear to be permanently in a waiting state. I'm not sure what causes the node 
to get into this state, or if it is related to the timeouts at all.

I have thus far been unable to replicate the issue in a test environment, so 
it's hard to trial and error possible solutions. Does anyone have any 
suggestions on what could be causing these timeouts all of a sudden, or tips on 
how to debug further?

Thanks!
