Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Adam Estrada
+1 on Nutch! On Fri, Jan 21, 2011 at 4:11 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Please take a look at Apache Nutch. It can crawl through a file system over FTP. After crawling, it can use Tika to extract the content from your PDF files and others. Finally you can then send

Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Gora Mohanty
On Mon, Jan 24, 2011 at 8:15 AM, Adam Estrada estrada.a...@gmail.com wrote: +1 on Nutch! [...] Would it be possible for Markus and you to clarify what the advantages of Nutch are in crawling a well-defined filesystem hierarchy? A simple shell script that POSTs to Solr works fine for this, so
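
As a rough illustration of the "script that POSTs to Solr" approach mentioned above, here is a minimal Python sketch. It assumes that Solr's ExtractingRequestHandler (Solr Cell) is mapped at /update/extract, as in the stock example configuration; the thread does not confirm how the poster's Solr is set up. The function names, the base URL, and the document ID are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, commit=False):
    """Build the URL for one extract request.

    Assumption: the extracting handler is mapped at /update/extract.
    literal.id sets the document's unique key field.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def post_file(solr_base, path, doc_id):
    """POST the raw bytes of one local file to Solr and return the HTTP status."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_base, doc_id, commit=True),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "application/octet-stream"},
    )
    with urlopen(request) as response:
        return response.status
```

Committing on every request keeps the sketch short; a real script would more likely commit once at the end of the run.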

Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Mattmann, Chris A (388J)
I'd be happy to comment: A simple shell script doesn't provide URL filtering and control of how you crawl those documents on the local file system. Nutch has several levels of URL filtering based on regex, MIME type, and others. Also, if there are any outlinks in those local files that point
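
To make the idea of regex-based URL filtering concrete, here is a small Python sketch of an ordered include/exclude rule list. It is an illustration only, not Nutch's implementation: it assumes that rules are tried in order, that the first matching pattern decides, and that a URL matching no pattern is rejected. The rule set and the host name ftp.example.org are made up for the example.

```python
import re

# Hypothetical rules: "-" excludes, "+" includes; order matters.
RULES = [
    ("-", r"\.(gif|jpg|png|css|js)$"),         # skip images and page assets
    ("+", r"^ftp://ftp\.example\.org/docs/"),  # keep everything under /docs/
]


def accept(url, rules=RULES):
    """Return True if the first rule whose pattern matches is an include rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected
```

With these rules, a PDF under /docs/ is accepted, a PNG in the same directory is rejected by the first rule, and anything on another host falls through and is rejected.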

Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Gora Mohanty
On Mon, Jan 24, 2011 at 11:07 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: I'd be happy to comment: A simple shell script doesn't provide URL filtering and control of how you crawl those documents on the local file system. Nutch has several levels of URL filtering based

Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread pankaj bhatt
Hi Gora, Thanks for the answer. I want to index all the PDF and HTML documents lying within a tree hierarchy on an FTP server. In addition, can I add an attribute "location" whose value is the file's FTP location? If you can give me a sample configuration, that would be great. /
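
One possible way to approach the request above, sketched in Python under several assumptions: the FTP server supports the MLSD command (not all do), the Solr schema defines a field named "location", and documents are sent through Solr's extracting handler, which accepts literal.<field> request parameters. The helper names and the choice of the FTP URL as the document ID are hypothetical.

```python
import posixpath
from urllib.parse import quote

WANTED = (".pdf", ".html", ".htm")  # the document types asked about


def walk_ftp(ftp, top):
    """Yield paths of wanted files below `top`.

    `ftp` is a connected ftplib.FTP instance (or any object with a
    compatible mlsd method). Assumption: the server supports MLSD.
    """
    for name, facts in ftp.mlsd(top):
        kind = facts.get("type")
        path = posixpath.join(top, name)
        if kind == "dir":
            yield from walk_ftp(ftp, path)  # descend into subdirectories
        elif kind == "file" and name.lower().endswith(WANTED):
            yield path


def location_params(host, path):
    """Build request parameters that store the FTP URL in a "location" field."""
    url = "ftp://" + host + quote(path)
    # Assumption: the FTP URL also serves as the unique document ID.
    return {"literal.id": url, "literal.location": url}
```

Each path from walk_ftp could be downloaded with ftplib's retrbinary and posted to Solr together with the parameters from location_params.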

Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread Gora Mohanty
On Fri, Jan 21, 2011 at 1:31 PM, pankaj bhatt panbh...@gmail.com wrote: Hi Gora, Thanks for the answer. I want to index all the PDF and HTML documents lying within a tree hierarchy on an FTP server. In addition, can I add an attribute "location" whose value is the file's FTP location?

Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread pankaj bhatt
Hi Gora, Thanks, however I think it would be a cumbersome process to do all this manually. Isn't there any plugin or extractor that does this automatically? Has anyone in the group done this previously? / Pankaj Bhatt. On Fri, Jan 21, 2011 at 1:41 PM, Gora Mohanty

Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread Gora Mohanty
On Fri, Jan 21, 2011 at 1:47 PM, pankaj bhatt panbh...@gmail.com wrote: Hi Gora, Thanks, however I think it would be a cumbersome process to do all this manually. Isn't there any plugin or extractor that does this automatically? Has anyone in the group done this previously?

Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread Markus Jelsma
Hi, Please take a look at Apache Nutch. It can crawl through a file system over FTP. After crawling, it can use Tika to extract the content from your PDF files and others. Finally you can then send the data to your Solr server for indexing. http://nutch.apache.org/ Hi All, Is there any

Indexing FTP Documents through SOLR??

2011-01-20 Thread pankaj bhatt
Hi All, Is there any way in SOLR, or any plug-in, through which the folders and documents in an FTP location can be indexed? / Pankaj Bhatt.

Re: Indexing FTP Documents through SOLR??

2011-01-20 Thread Gora Mohanty
On Fri, Jan 21, 2011 at 12:21 PM, pankaj bhatt panbh...@gmail.com wrote: Hi All, Is there any way in SOLR, or any plug-in, through which the folders and documents in an FTP location can be indexed? [...] What format are these documents in? Which parts of the documents do you want to index? In