[ https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715414#action_12715414 ]
Jukka Zitting commented on TIKA-235:
------------------------------------

Some comments:

Is there a chance to boost selected parts of the Tika site? For example, queries like "Office" and "OpenDocument" return lots of dev stuff like mailing list traffic, apidocs and Jira issues, while the main "Supported Formats" page is buried deep within the search results. It would be nice if we could control the boosting, for example by setting some specific <meta/> header tags in the HTML source. Alternatively, can we set a "Source" criterion that only covers the web site?

I have somewhat mixed feelings about whether documentation from the Lucid web site should be included in the search results. It's useful stuff, but it may well become a problem if another similar company comes up in this space. In any case, the links to the Lucid web site are broken; s/search/www/ should fix that.

The Jira crawl contains multiple copies of the same documents, with links like http://issues.apache.org/jira/browse/TIKA-120?focusedCommentId=NNN#action_NNN, where just http://issues.apache.org/jira/browse/TIKA-120 would have been sufficient. This is IMHO a design mistake in Jira, but since we can't do much about that, it would be nice to work around this issue in the crawler.

> Site search powered by Lucene/Solr
> ----------------------------------
>
>                 Key: TIKA-235
>                 URL: https://issues.apache.org/jira/browse/TIKA-235
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: TIKA-235.patch
>
>
> For a number of years now, the Lucene community has been criticized for not
> eating our own "dog food" when it comes to search. My company has built and
> hosts a site search (http://search.lucidimagination.com/) that is powered by
> Apache Solr and Lucene and we'd like to donate its use to the Lucene
> community. Additionally, it allows one to search all of the Tika content from
> a single place, including web, wiki, JIRA and mail archives. See also
> http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org
> A sample of what it _might_ look like is at
> http://people.apache.org/~gsingers/tika/ Note, however, I am not entirely
> sure how Tika deploys just yet, so there are a few issues w/ the display.
> Lucid has a fault-tolerant setup with replication and failover as well as
> monitoring services in place. We are committed to maintaining and expanding
> the search capabilities on the site.
> The following patch adds the basics to Tika to support the search, but isn't
> entirely done yet b/c I'm not sure what Look and Feel Tika wants.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
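
As a rough illustration of the crawler-side workaround for the duplicate Jira URLs described in the comment, here is a minimal sketch. It is not how the Lucid crawler actually works. The function names and the set of parameters to drop (`focusedCommentId` and `page`) are assumptions based only on the URLs quoted in this message.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters only select a view of an issue
# (a focused comment or a tab panel), not a different document.
VIEW_ONLY_PARAMS = {"focusedCommentId", "page"}


def canonical_jira_url(url):
    """Drop view-only query parameters and the #fragment from a Jira URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VIEW_ONLY_PARAMS
    ]
    # An empty fragment argument removes anchors such as "#action_NNN".
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def dedupe_urls(urls):
    """Return the canonical URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in urls:
        canonical = canonical_jira_url(url)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique
```

A crawler could apply something like `canonical_jira_url` before queueing or indexing a link, so that every comment permalink for TIKA-120 collapses to the single issue page.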