document support for file system crawling

2006-08-30 Thread Bruno

Hi there,

Browsing through the message threads, I tried to find a trail addressing file
system crawls. I want to implement an enterprise search over a networked
filesystem, crawling all sorts of documents, such as html, doc, ppt and pdf.
Nutch provides plugins enabling it to read proprietary formats.
Is there support for the same functionality in Solr?

Bruno



Can't use SnowballAnalyzer

2006-08-30 Thread Diogo Matos

Hi All,

I'm trying to use the SnowballAnalyzer and for some strange reason I cannot.
I got the following error message in the log file:

org.apache.solr.core.SolrException: Error instantiating class class org.apache.lucene.analysis.snowball.SnowballAnalyzer
   at org.apache.solr.core.Config.newInstance(Config.java:213)
   at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:466)
   at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:294)
   at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:67)
   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:189)
   at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:170)
   at org.apache.solr.servlet.SolrServlet.init(SolrServlet.java:74)
   at javax.servlet.GenericServlet.init(GenericServlet.java:211)
   at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1105)
   at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:932)
   at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:3917)
   at org.apache.catalina.core.StandardContext.start(StandardContext.java:4201)
   at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:759)
   at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:739)
   at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:524)
   at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:904)
   at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:867)
   at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:474)
   at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1122)
   at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:310)
   at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
   at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1021)
   at org.apache.catalina.core.StandardHost.start(StandardHost.java:718)
   at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013)
   at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:442)
   at org.apache.catalina.core.StandardService.start(StandardService.java:450)
   at org.apache.catalina.core.StandardServer.start(StandardServer.java:709)
   at org.apache.catalina.startup.Catalina.start(Catalina.java:551)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:585)
   at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:294)
   at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:432)
Caused by: java.lang.InstantiationException: org.apache.lucene.analysis.snowball.SnowballAnalyzer
   at java.lang.Class.newInstance0(Class.java:335)
   at java.lang.Class.newInstance(Class.java:303)
   at org.apache.solr.core.Config.newInstance(Config.java:211)
   ... 33 more


Has anyone used it before?
Thank you
Diogo


Re: document support for file system crawling

2006-08-30 Thread Erik Hatcher


On Aug 30, 2006, at 2:42 AM, Bruno wrote:

Browsing through the message threads, I tried to find a trail addressing file system crawls. I want to implement an enterprise search over a networked filesystem, crawling all sorts of documents, such as html, doc, ppt and pdf.

Nutch provides plugins enabling it to read proprietary formats.
Is there support for the same functionality in Solr?


No.  Solr is strictly a search server that takes plain text for the fields of documents added to it.  The client is responsible for parsing the text out of these types of documents.  You could borrow the document parsing pieces from Lucene's contrib and Nutch and glue them together into your client that speaks to Solr, or perhaps Solr isn't the right approach for your needs?  It certainly is possible to add these capabilities into Solr, but it would be awkward to have to stream binary data into XML documents such that Solr could parse them on the server side.
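
A hedged sketch of the client-side approach described above: extract the text yourself (the extractText method here is a hypothetical placeholder for whatever parser you borrow from Lucene's contrib or Nutch), then post a plain-text add document to Solr's update handler over HTTP. The URL and the id/text field names assume the stock example schema and Jetty setup.

import java.io.File;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FileIndexer {

    // Hypothetical placeholder: plug in whatever parser you borrow
    // (Lucene contrib, Nutch parse plugins, ...) to pull plain text
    // out of the binary document.
    static String extractText(File f) {
        return "...extracted text...";
    }

    // Escape the characters that are significant in XML element content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        File doc = new File(args[0]);
        String xml = "<add><doc>"
                + "<field name=\"id\">" + escapeXml(doc.getPath()) + "</field>"
                + "<field name=\"text\">" + escapeXml(extractText(doc)) + "</field>"
                + "</doc></add>";

        // Post the add command to Solr's update handler.
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        System.out.println("Solr responded: " + conn.getResponseCode());
    }
}

A separate <commit/> post would still be needed afterward to make the document searchable.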


Erik




Re: Add doc limit - Follow Up

2006-08-30 Thread Yonik Seeley

On 8/29/06, sangraal aiken [EMAIL PROTECTED] wrote:

The problem only occurs when adding docs that contain <![CDATA[ ]]> sections in
the body of the field tag. The problem also only seems to impose an add
limit on an individual post. I limited the size of my HTTP posts to 5000
documents per post, and the problem never showed up. You do not need to do a
commit after each batch as I previously thought.


That's very interesting... it sounds like perhaps an XPP (the XML
parser) bug that Tomcat manages to tickle.
I looked through the XPP changelogs quickly - no mention of a problem
like this being fixed.

-Yonik
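
Below is a rough sketch of the batching approach sangraal describes: buffer documents and send an <add> post every 5,000 docs, with a single commit at the end. It escapes field text instead of wrapping it in CDATA (a deliberate substitution here, since CDATA is what appears to tickle the parser bug); the class, helper names, and field names are illustrative, not from the thread.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

public class BatchedPoster {
    // Post size that avoided the problem reported in this thread.
    private static final int BATCH_SIZE = 5000;

    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static void post(String xml) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("Solr update failed: " + conn.getResponseCode());
        }
    }

    // Index (id, body) pairs in posts of at most BATCH_SIZE documents each.
    static void index(List<String[]> docs) throws Exception {
        StringBuilder buf = new StringBuilder("<add>");
        int inBatch = 0;
        for (String[] doc : docs) {
            buf.append("<doc>")
               .append("<field name=\"id\">").append(escapeXml(doc[0])).append("</field>")
               .append("<field name=\"body\">").append(escapeXml(doc[1])).append("</field>")
               .append("</doc>");
            if (++inBatch == BATCH_SIZE) {   // flush the current post
                post(buf.append("</add>").toString());
                buf = new StringBuilder("<add>");
                inBatch = 0;
            }
        }
        if (inBatch > 0) {
            post(buf.append("</add>").toString());
        }
        post("<commit/>");                   // one commit at the end is enough
    }
}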


Re: Can't use SnowballAnalyzer

2006-08-30 Thread Chris Hostetter

: constructor requires the language parameter.  I see SnowballAnalyzer
: mentioned in a comment in the example schema.xml, but there is no
: specification for language.   My guess is you'll need to construct

Whooops ... I just changed that example so as not to mislead people.

FYI: the SnowballFilter uses reflection, so it's not recommended for
performance reasons...

http://incubator.apache.org/solr/docs/api/org/apache/solr/analysis/SnowballPorterFilterFactory.html



-Hoss
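
The underlying problem in the stack trace above is that Solr instantiates the analyzer class through its no-argument constructor, and SnowballAnalyzer requires a language argument. A sketch of how the SnowballPorterFilterFactory that Hoss links to might be wired into schema.xml instead; the field type name and the surrounding tokenizer/filter choices are illustrative, and the language attribute shown assumes English:

<fieldtype name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldtype>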



Re: acts_as_solr

2006-08-30 Thread Kevin Lewandowski

You might want to look at acts_as_searchable for Ruby:
http://rubyforge.org/projects/ar-searchable

That's a similar plugin for the Hyperestraier search engine using its
REST interface.

On 8/28/06, Erik Hatcher [EMAIL PROTECTED] wrote:

I've spent a few hours tinkering with a Ruby ActiveRecord plugin to
index, delete, and search models fronted by a database, using Solr.


Re: acts_as_solr

2006-08-30 Thread Erik Hatcher


On Aug 28, 2006, at 10:25 PM, Erik Hatcher wrote:
I'd like to commit this to the Solr repository.  Any objections?  Once committed, folks will be able to use script/plugin install ... to install the Ruby side of things, and using a binary distribution of Solr's example application and a custom solr/conf directory (just for schema.xml) they'd be up and running quite quickly.  If ok to commit, what directory should I put things under?  How about just ruby?


Ok, /client/ruby it is.  I'll get this committed in the next day or so.

I have to admit that the stuff Seth did with Searchable (linked to from http://wiki.apache.org/solr/SolRuby) is very well done, so hopefully he can work with us to perhaps integrate that work into what lives in Solr's repository.  Having the Searchable abstraction is interesting, but it might be a bit limiting in terms of leveraging fancier return values from Solr, like the facets and highlighting - or maybe it's just an unnecessary abstraction for those always working with Solr.  I like it though, and will certainly borrow ideas from it on how to do slick stuff with Ruby.


While I'm at it, I'd be happy to commit the Java client into /client/java.  I'll check the status of that contribution when I can.


Erik