Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Adam Estrada
+1 on Nutch!

On Fri, Jan 21, 2011 at 4:11 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Hi,

 Please take a look at Apache Nutch. It can crawl through a file system over
 FTP. After crawling, it can use Tika to extract the content from your PDF
 files and other formats. Finally, you can send the data to your Solr server
 for indexing.

 http://nutch.apache.org/

 Hi All,
   Is there any way in Solr, or any plug-in, through which the folders and
 documents in an FTP location can be indexed?

 / Pankaj Bhatt.



Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Gora Mohanty
On Mon, Jan 24, 2011 at 8:15 AM, Adam Estrada estrada.a...@gmail.com wrote:
 +1 on Nutch!
[...]

Would it be possible for Markus and you to clarify what
the advantages of Nutch are in crawling a well-defined
filesystem hierarchy? A simple shell script that POSTs
to Solr works fine for this, so why would one choose
the extra configuration and maintenance that Nutch
requires?

Regards,
Gora


Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Mattmann, Chris A (388J)
I'd be happy to comment:

A simple shell script doesn't provide URL filtering and control of how you 
crawl those documents on the local file system. Nutch has several levels of URL 
filtering based on regex, MIME type, and others. Also, if there are any 
outlinks in those local files that point to remote content, Nutch will go and 
crawl it for you, something that a simple shell script doesn't take care of.

Also, it would be great if you could elaborate on what the extra configuration
and maintenance issues are regarding Nutch. If you had something specific in
mind, patches or issue comments are welcome :)

Cheers,
Chris

On Jan 23, 2011, at 8:56 PM, Gora Mohanty wrote:

 On Mon, Jan 24, 2011 at 8:15 AM, Adam Estrada estrada.a...@gmail.com wrote:
 +1 on Nutch!
 [...]
 
 Would it be possible for Markus and you to clarify what
 the advantages of Nutch are in crawling a well-defined
 filesystem hierarchy? A simple shell script that POSTs
 to Solr works fine for this, so why would one choose
 the extra configuration and maintenance that Nutch
 requires?
 
 Regards,
 Gora


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
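
The regex-based URL filtering mentioned in the message above can be illustrated with a small sketch. This is not Nutch code: it is a minimal Python illustration, loosely modeled on rule files in which each line starts with `+` (include) or `-` (exclude) and the first matching rule decides. The host `ftp.example.org`, the rule set, and the function names `parse_rules` and `accepts` are assumptions made up for the example.

```python
import re

# Hypothetical rule set: '+' includes, '-' excludes, first match wins.
RULES_TEXT = r"""
# skip archives and images
-\.(zip|gz|png|jpg)$
# accept PDF and HTML documents on the (made-up) FTP host
+^ftp://ftp\.example\.org/.*\.(pdf|html?)$
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule that matches the URL is an include rule."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # no rule matched: reject the URL


rules = parse_rules(RULES_TEXT)
```

A plain script that POSTs every file it finds would need a check like `accepts(rules, url)` added by hand, which is the kind of control being described as built into a crawler.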



Re: Indexing FTP Documents through SOLR??

2011-01-23 Thread Gora Mohanty
On Mon, Jan 24, 2011 at 11:07 AM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
 I'd be happy to comment:

 A simple shell script doesn't provide URL filtering and control of how you 
 crawl those documents on the local file system. Nutch has several levels of 
 URL filtering based on regex, MIME type, and others. Also, if there are any 
 outlinks in those local files that point to remote content, Nutch will go and 
 crawl it for you, something that a simple shell script doesn't take care of.


OK, thanks, those are good points. What we have dealt with,
and what I believe that the original poster in this thread wanted,
was a requirement just to dump the contents of documents in a
filesystem hierarchy.

 Also, it would be great if you could elaborate on what the extra configuration
 and maintenance issues are regarding Nutch. If you had something specific in
 mind, patches or issue comments are welcome :)

Didn't mean it in that way. Nutch is indeed quite easy to set up
and run. Nevertheless, if one's use case does not require the
features it provides, learning how to do that and maintaining an
instance of Nutch are unnecessary overhead.

Regards,
Gora


Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread pankaj bhatt
Hi Gora,
   Thanks for the answer. I want to index all the PDF and HTML documents
lying within a directory tree on an FTP server.
   In addition, can I add an attribute "location" whose value is the FTP
file location?

  If you can give me a sample configuration, that would be great.

/ Pankaj Bhatt.

On Fri, Jan 21, 2011 at 12:57 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Fri, Jan 21, 2011 at 12:21 PM, pankaj bhatt panbh...@gmail.com wrote:
  Hi All,
   Is there any way in Solr, or any plug-in, through which the folders and
  documents in an FTP location can be indexed?
 [...]

 What format are these documents in? Which parts of the documents
 do you want to index?

 In general, this can be done through Solr, but the details will depend
 on the above.

 Regards,
 Gora



Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread Gora Mohanty
On Fri, Jan 21, 2011 at 1:31 PM, pankaj bhatt panbh...@gmail.com wrote:
 Hi Gora,
   Thanks for the answer. I want to index all the PDF and HTML documents
 lying within a directory tree on an FTP server.
   In addition, can I add an attribute "location" whose value is the FTP
 file location?

   If you can give me a sample configuration, that would be great.
[...]

From Solr 1.4 onwards, you can use the ExtractingRequestHandler
built into Solr, and simply POST such files to a Solr Server.

Please see http://wiki.apache.org/solr/ExtractingRequestHandler

Regards,
Gora
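
A minimal sketch of the POST described above, in Python. It assumes the handler is mapped at `/update/extract` (the path used on the wiki page) and that the schema has a `location` field; the field name, the base URL, and the function names are assumptions, not a confirmed configuration. The `literal.<field>` request parameters are how a fixed value, such as the FTP file location asked about earlier, can be attached to the extracted document.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, location):
    """Build an extract URL that carries literal field values."""
    params = {
        "literal.id": doc_id,          # unique key of the Solr document
        "literal.location": location,  # assumed field; must exist in the schema
        "commit": "true",              # commit right away (fine for a demo only)
    }
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def post_document(solr_base, doc_id, location, data,
                  content_type="application/pdf"):
    """POST raw file bytes to Solr and return the HTTP status code."""
    request = Request(
        build_extract_url(solr_base, doc_id, location),
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status


# Example URL for a hypothetical local Solr and FTP path.
example_url = build_extract_url(
    "http://localhost:8983/solr", "doc1", "ftp://host/docs/a.pdf"
)
```

Committing on every request is slow for large batches; a real script would more likely commit once at the end.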


Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread pankaj bhatt
Hi Gora,
Thanks, however I think it would be a cumbersome process to do all
this manually.
 Isn't there any plugin or extractor that does this automatically?

Has anyone in the group done this previously?

/ Pankaj Bhatt.

On Fri, Jan 21, 2011 at 1:41 PM, Gora Mohanty g...@mimirtech.com wrote:

 On Fri, Jan 21, 2011 at 1:31 PM, pankaj bhatt panbh...@gmail.com wrote:
  Hi Gora,
  Thanks for the answer. I want to index all the PDF and HTML documents
  lying within a directory tree on an FTP server.
  In addition, can I add an attribute "location" whose value is the FTP
  file location?
 
   If you can give me a sample configuration, that would be great.
 [...]

 From Solr 1.4 onwards, you can use the ExtractingRequestHandler
 built into Solr, and simply POST such files to a Solr Server.

 Please see http://wiki.apache.org/solr/ExtractingRequestHandler

 Regards,
 Gora



Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread Gora Mohanty
On Fri, Jan 21, 2011 at 1:47 PM, pankaj bhatt panbh...@gmail.com wrote:
 Hi Gora,
   Thanks, however I think it would be a cumbersome process to do all
 this manually.
   Isn't there any plugin or extractor that does this automatically?

   Has anyone in the group done this previously?
[...]

Um, write a script to handle the POSTing of documents. That
is how we have handled indexing several million PDF, HTML,
ODF, and DOC documents.

Regards,
Gora
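
As a rough illustration of such a script, the Python sketch below walks an FTP directory tree and collects the PDF and HTML paths that would then be POSTed to Solr. It is not the script referred to above. It assumes the server supports the MLSD command (not all do) and allows anonymous login; the function names are made up for the example. The directory lister is passed in as a function so the walk can be tried without a live server.

```python
from ftplib import FTP
import posixpath


def iter_files(list_dir, root, extensions=(".pdf", ".html")):
    """Yield paths under root whose names end with one of the extensions.

    list_dir(path) must return (name, is_dir) pairs for that directory.
    """
    pending = [root]
    while pending:
        current = pending.pop()
        for name, is_dir in sorted(list_dir(current)):
            path = posixpath.join(current, name)
            if is_dir:
                pending.append(path)  # descend into it later
            elif name.lower().endswith(tuple(extensions)):
                yield path


def ftp_lister(ftp):
    """Adapt an ftplib.FTP connection to the list_dir interface via MLSD."""
    def list_dir(path):
        entries = []
        for name, facts in ftp.mlsd(path):
            kind = facts.get("type", "").lower()
            if kind in ("dir", "file"):  # skips "cdir"/"pdir" entries
                entries.append((name, kind == "dir"))
        return entries
    return list_dir


def list_ftp_documents(host, root):
    """Connect anonymously (an assumption) and list matching documents."""
    with FTP(host) as ftp:
        ftp.login()
        return list(iter_files(ftp_lister(ftp), root))
```

Each returned path can be downloaded with `FTP.retrbinary` and sent to Solr, with the path itself stored as the document's location.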


Re: Indexing FTP Documents through SOLR??

2011-01-21 Thread Markus Jelsma
Hi,

Please take a look at Apache Nutch. It can crawl through a file system over FTP.
After crawling, it can use Tika to extract the content from your PDF files and
other formats. Finally, you can send the data to your Solr server for indexing.

http://nutch.apache.org/

 Hi All,
   Is there any way in Solr, or any plug-in, through which the folders and
 documents in an FTP location can be indexed?
 
 / Pankaj Bhatt.


Indexing FTP Documents through SOLR??

2011-01-20 Thread pankaj bhatt
Hi All,
  Is there any way in Solr, or any plug-in, through which the folders and
documents in an FTP location can be indexed?

/ Pankaj Bhatt.


Re: Indexing FTP Documents through SOLR??

2011-01-20 Thread Gora Mohanty
On Fri, Jan 21, 2011 at 12:21 PM, pankaj bhatt panbh...@gmail.com wrote:
 Hi All,
  Is there any way in Solr, or any plug-in, through which the folders and
 documents in an FTP location can be indexed?
[...]

What format are these documents in? Which parts of the documents
do you want to index?

In general, this can be done through Solr, but the details will depend
on the above.

Regards,
Gora