Re: Indexing FTP Documents through SOLR??
+1 on Nutch!

On Fri, Jan 21, 2011 at 4:11 PM, Markus Jelsma markus.jel...@openindex.io wrote:

Hi, Please take a look at Apache Nutch. It can crawl through a file system over FTP. After crawling, it can use Tika to extract the content from your PDF files and other documents. Finally, you can send the data to your Solr server for indexing. http://nutch.apache.org/

Hi All, Is there any way in SOLR, or any plug-in, through which the folders and documents in an FTP location can be indexed? / Pankaj Bhatt.
Re: Indexing FTP Documents through SOLR??
On Mon, Jan 24, 2011 at 8:15 AM, Adam Estrada estrada.a...@gmail.com wrote: +1 on Nutch! [...]

Would it be possible for Markus and you to clarify what the advantages of Nutch are in crawling a well-defined filesystem hierarchy? A simple shell script that POSTs to Solr works fine for this, so why would one choose the extra configuration and maintenance that Nutch requires?

Regards, Gora
Re: Indexing FTP Documents through SOLR??
I'd be happy to comment: A simple shell script doesn't provide URL filtering or control over how you crawl those documents on the local file system. Nutch has several levels of URL filtering based on regex, MIME type, and others. Also, if there are any outlinks in those local files that point to remote content, Nutch will go and crawl it for you, something that a simple shell script doesn't take care of.

Also, it would be great if you could elaborate on what the extra configuration and maintenance issues are regarding Nutch. If you had something specific in mind, patches or issue comments are welcome :)

Cheers, Chris

On Jan 23, 2011, at 8:56 PM, Gora Mohanty wrote:

On Mon, Jan 24, 2011 at 8:15 AM, Adam Estrada estrada.a...@gmail.com wrote: +1 on Nutch! [...]

Would it be possible for Markus, and you to clarify on what the advantages of Nutch are in crawling a well-defined filesystem hierarchy? A simple shell script that POSTs to Solr works fine for this, so why would one choose the extra configuration, and maintenance issues required for Nutch. Regards, Gora

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
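For readers unfamiliar with the regex URL filtering mentioned above, here is a rough Python sketch of the idea. It is not Nutch code. It only mimics the rule style of a Nutch regex URL filter file as I understand it: each rule is a `+` (accept) or `-` (reject) sign with a pattern, the first matching rule decides, and a URL that matches no rule is rejected. The host `ftp.example.org` and the rule list are made up for illustration.

```python
import re

# Hypothetical rules, checked top to bottom. "-" rejects, "+" accepts.
RULES = [
    ("-", r"\.(gif|jpg|png|css|js)$"),          # skip images and page assets
    ("+", r"^ftp://ftp\.example\.org/docs/"),   # crawl only this FTP subtree
]


def accept(url, rules=RULES):
    """Return True if the first rule matching the URL is an accept rule.

    A URL that matches no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False
```

A plain shell script would need to reimplement this kind of check by hand, which is the point being made about filtering.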
Re: Indexing FTP Documents through SOLR??
On Mon, Jan 24, 2011 at 11:07 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:

I'd be happy to comment: A simple shell script doesn't provide URL filtering and control of how you crawl those documents on the local file system. Nutch has several levels of URL filtering based on regex, MIME type, and others. Also, if there are any outlinks in those local files that point to remote content, Nutch will go and crawl it for you, something that a simple shell script doesn't take care of.

OK, thanks, those are good points. What we have dealt with, and what I believe the original poster in this thread wanted, was a requirement just to dump the contents of documents in a filesystem hierarchy.

Also, it would be great if you could elaborate what the extra configuration and maintenance issues are regarding Nutch? If you had something specific in mind, patches or issue comments, welcome :)

Didn't mean it in that way. Nutch is indeed quite easy to set up and run. Nevertheless, if one's use case does not require the features it provides, learning how to do that and maintaining an instance of Nutch are unnecessary overhead.

Regards, Gora
Re: Indexing FTP Documents through SOLR??
Hi Gora, Thanks for the answer. I want to index all the PDF and HTML documents lying within a tree hierarchy on an FTP server. In addition, can I add an attribute "location" whose value is the FTP file location? If you can give me a sample configuration, that would be great. / Pankaj Bhatt.

On Fri, Jan 21, 2011 at 12:57 PM, Gora Mohanty g...@mimirtech.com wrote:

On Fri, Jan 21, 2011 at 12:21 PM, pankaj bhatt panbh...@gmail.com wrote: Hi All, Is there any way in SOLR, or any plug-in, through which the folders and documents in an FTP location can be indexed? [...]

What format are these documents in? Which parts of the documents do you want to index? In general, this can be done through Solr, but the details will depend on the above. Regards, Gora
Re: Indexing FTP Documents through SOLR??
On Fri, Jan 21, 2011 at 1:31 PM, pankaj bhatt panbh...@gmail.com wrote: Hi Gora, Thanks for the answer. I want to index all the PDF and HTML documents lying within a tree hierarchy on an FTP server. In addition, can I add an attribute "location" whose value is the FTP file location? If you can give me a sample configuration, that would be great. [...]

From Solr 1.4 onwards, you can use the ExtractingRequestHandler built into Solr, and simply POST such files to a Solr server. Please see http://wiki.apache.org/solr/ExtractingRequestHandler

Regards, Gora
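As a minimal sketch of what such a POST could look like, including the "location" attribute asked about: the ExtractingRequestHandler accepts `literal.<fieldname>` request parameters, which set a field to a fixed value on the indexed document. The snippet below assumes the handler is mapped at `/update/extract`, that the schema has `id` and `location` fields, and that Solr runs at whatever base URL you pass in. The function names are hypothetical helpers, not part of Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, doc_id, ftp_location, commit=False):
    """Build an /update/extract URL carrying the literal field values."""
    params = {
        "literal.id": doc_id,                # unique key (assumed field name)
        "literal.location": ftp_location,    # assumed "location" schema field
    }
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


def post_document(solr_base, doc_id, ftp_location, body, content_type):
    """POST the raw file bytes to Solr and return the HTTP status code."""
    url = build_extract_url(solr_base, doc_id, ftp_location)
    # Supplying data makes urllib issue a POST request.
    request = Request(url, data=body, headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.status
```

For example, `post_document("http://localhost:8983/solr", "doc1", "ftp://ftp.example.org/docs/a.pdf", pdf_bytes, "application/pdf")` would send one PDF, where the host names and `pdf_bytes` are placeholders.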
Re: Indexing FTP Documents through SOLR??
Hi Gora, Thanks. However, I think it would be a cumbersome process to do all this manually. Isn't there any plugin or extractor that does this automatically? Has anyone in the group done this previously? / Pankaj Bhatt.

On Fri, Jan 21, 2011 at 1:41 PM, Gora Mohanty g...@mimirtech.com wrote:

On Fri, Jan 21, 2011 at 1:31 PM, pankaj bhatt panbh...@gmail.com wrote: Hi Gora, Thanks for the answer. I want to index all the PDF and HTML documents lying within a tree hierarchy on an FTP server. In addition, can I add an attribute "location" whose value is the FTP file location? If you can give me a sample configuration, that would be great. [...]

From Solr 1.4 onwards, you can use the ExtractingRequestHandler built into Solr, and simply POST such files to a Solr server. Please see http://wiki.apache.org/solr/ExtractingRequestHandler Regards, Gora
Re: Indexing FTP Documents through SOLR??
On Fri, Jan 21, 2011 at 1:47 PM, pankaj bhatt panbh...@gmail.com wrote: Hi Gora, Thanks. However, I think it would be a cumbersome process to do all this manually. Isn't there any plugin or extractor that does this automatically? Has anyone in the group done this previously? [...]

Um, write a script to handle the POSTing of documents. That is how we have handled indexing several million PDF, HTML, ODF, and DOC documents.

Regards, Gora
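To make the "write a script" suggestion concrete, here is one possible shape for it in Python, using the standard `ftplib` module. This is only a sketch, not the script referred to above. It assumes the FTP server supports the MLSD listing command (older servers may not), and it leaves the actual Solr POST to a `post` callable that you supply. The helper names `walk_ftp`, `fetch`, and `index_tree` are made up for this example.

```python
import posixpath
from io import BytesIO

# File types to index, matched case-insensitively on the file name.
WANTED = (".pdf", ".html", ".htm")


def walk_ftp(ftp, root="/"):
    """Yield paths of wanted files under root, descending into directories.

    `ftp` is a connected ftplib.FTP (or any object with a compatible mlsd()).
    """
    for name, facts in ftp.mlsd(root):
        if name in (".", ".."):
            continue
        path = posixpath.join(root, name)
        kind = facts.get("type")
        if kind == "dir":
            yield from walk_ftp(ftp, path)
        elif kind == "file" and name.lower().endswith(WANTED):
            yield path


def fetch(ftp, path):
    """Download one file into memory and return its bytes."""
    buffer = BytesIO()
    ftp.retrbinary("RETR " + path, buffer.write)
    return buffer.getvalue()


def index_tree(ftp, host, root, post):
    """Send every wanted file to post(location, body); return the count.

    `post` is a caller-supplied function, for example one that POSTs the
    bytes to Solr with the FTP location attached as a field.
    """
    count = 0
    for path in walk_ftp(ftp, root):
        location = "ftp://" + host + path
        post(location, fetch(ftp, path))
        count += 1
    return count
```

A typical run would connect with `ftplib.FTP(host)`, call `login()`, and then call `index_tree(ftp, host, "/docs", post)`. Holding whole files in memory is a simplification; for very large documents you would likely want to stream them instead.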
Re: Indexing FTP Documents through SOLR??
Hi, Please take a look at Apache Nutch. It can crawl through a file system over FTP. After crawling, it can use Tika to extract the content from your PDF files and other documents. Finally, you can send the data to your Solr server for indexing. http://nutch.apache.org/

Hi All, Is there any way in SOLR, or any plug-in, through which the folders and documents in an FTP location can be indexed? / Pankaj Bhatt.
Indexing FTP Documents through SOLR??
Hi All, Is there any way in SOLR, or any plug-in, through which the folders and documents in an FTP location can be indexed? / Pankaj Bhatt.
Re: Indexing FTP Documents through SOLR??
On Fri, Jan 21, 2011 at 12:21 PM, pankaj bhatt panbh...@gmail.com wrote: Hi All, Is there any way in SOLR, or any plug-in, through which the folders and documents in an FTP location can be indexed? [...]

What format are these documents in? Which parts of the documents do you want to index? In general, this can be done through Solr, but the details will depend on the above.

Regards, Gora