[ 
https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715414#action_12715414
 ] 

Jukka Zitting commented on TIKA-235:
------------------------------------

Some comments:

Is there a chance to boost selected parts of the Tika site? For example, 
queries like "Office" and "OpenDocument" return lots of dev stuff like mailing 
list traffic, apidocs and Jira issues, while the main "Supported Formats" page 
is buried deep within the search results. It would be nice if we could control 
the boosting, for example by setting some specific <meta/> header tags in the 
HTML source. Alternatively, can we set a "Source" criterion that only covers the 
web site?

I have somewhat mixed feelings about whether documentation from the Lucid web 
site should be included in the search results. It's useful stuff, but it may 
well become a problem if another similar company comes up in this space. In any 
case, the links to the Lucid web site are broken; s/search/www/ should fix that.
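
As a rough illustration of the s/search/www/ fix, here is a minimal Python 
sketch. The function name fix_lucid_link is hypothetical, and it assumes the 
broken links differ from the working ones only in the leading host label; it is 
not taken from the patch.

```python
from urllib.parse import urlsplit, urlunsplit


def fix_lucid_link(url: str) -> str:
    """Rewrite search.lucidimagination.com links to www.lucidimagination.com."""
    parts = urlsplit(url)
    # Assumption: only the host is wrong. The path may legitimately contain
    # "search" (e.g. /search/document/...), so it is left untouched.
    if parts.hostname == "search.lucidimagination.com":
        netloc = parts.netloc.replace("search.", "www.", 1)
        return urlunsplit(parts._replace(netloc=netloc))
    return url
```

Rewriting only the host, rather than substituting across the whole URL, keeps 
path segments such as /search/document/ intact.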

The Jira crawl contains multiple copies of the same documents, with links like 
http://issues.apache.org/jira/browse/TIKA-120?focusedCommentId=NNN#action_NNN, 
where just http://issues.apache.org/jira/browse/TIKA-120 would have been 
sufficient. This is IMHO a design mistake in Jira, but since we can't do much 
about that, it would be nice to work around this issue in the crawler.
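
One possible workaround, sketched in Python: normalize each Jira link before 
the crawler queues it, so that all comment-focused variants collapse to the 
plain issue URL. The function name canonical_jira_url is hypothetical, and the 
sketch assumes that the query string and fragment on /jira/browse/ pages only 
select a comment or tab; I don't know how the actual crawler is structured.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_jira_url(url: str) -> str:
    """Collapse comment-focused Jira links to the plain issue URL."""
    parts = urlsplit(url)
    # Assumption: every /jira/browse/<KEY> path identifies one issue, and the
    # query string and fragment only pick a comment or tab on that same page.
    if parts.path.startswith("/jira/browse/"):
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    # Anything else is returned unchanged.
    return url
```

With this in place, the crawler could deduplicate on the canonical URL, so 
each issue would be fetched and indexed only once.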


> Site search powered by Lucene/Solr
> ----------------------------------
>
>                 Key: TIKA-235
>                 URL: https://issues.apache.org/jira/browse/TIKA-235
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: TIKA-235.patch
>
>
> For a number of years now, the Lucene community has been criticized for not 
> eating our own "dog food" when it comes to search. My company has built and 
> hosts a site search (http://search.lucidimagination.com/) that is powered by 
> Apache Solr and Lucene and we'd like to donate its use to the Lucene 
> community. Additionally, it allows one to search all of the Tika content from 
> a single place, including web, wiki, JIRA and mail archives. See also 
> http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org
> A sample of what it _might_ look like is at 
> http://people.apache.org/~gsingers/tika/ Note, however, I am not entirely 
> sure how Tika deploys just yet, so there are a few issues w/ the display.
> Lucid has a fault tolerant setup with replication and fail over as well as 
> monitoring services in place. We are committed to maintaining and expanding 
> the search capabilities on the site.
> The following patch adds the basics to Tika to support the search, but isn't 
> entirely done yet b/c I'm not sure what Look and Feel Tika wants.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
