Re: DIH From various File system locations
I take that back... Use version 1.2, which is what I am currently using, and make sure that the latest versions of Tika and PDFBox are in the contrib folder. Version 1.3 is structured a bit differently, and it doesn't look like there is a contrib directory. Maybe one of the Nutch contributors can comment on this?

Adam

On Tue, Jan 25, 2011 at 3:21 PM, Adam Estrada wrote:
> There are a few tutorials out there.
>
> 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
> 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
> 3. Build the latest from branch
>    http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
>    this one:
>    http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/
>
> but add the "solr" parameter at the end:
>
>     bin/nutch crawl urls -depth 5 -topN 100 -solr http://localhost:8983/solr
>
> This will automatically add the data Nutch collected to Solr. For
> larger files I would also increase your JAVA_OPTS env to something
> like JAVA_OPTS='-Xmx2048m'
>
> Adam
Re: DIH From various File system locations
There are a few tutorials out there.

1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
3. Build the latest from branch http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read this one:
   http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

but add the "solr" parameter at the end:

    bin/nutch crawl urls -depth 5 -topN 100 -solr http://localhost:8983/solr

This will automatically add the data Nutch collected to Solr. For larger files I would also increase your JAVA_OPTS env to something like JAVA_OPTS='-Xmx2048m'

Adam

On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt wrote:
> Thanks Adam, it seems like Nutch would solve most of my concerns.
> It would be great if you could share some Nutch resources with us.
>
> / Pankaj Bhatt.
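
For reference, the crawl command from the message above can be wrapped in a small script. This is only a sketch: the variable names SOLR_URL, SEED_DIR and CRAWL_CMD are made up for illustration, the script assumes it is run from the Nutch install directory, and it assumes that your bin/nutch script actually honors JAVA_OPTS (check the script in your release if the heap setting seems to have no effect).

```shell
#!/bin/sh
# Sketch of the crawl-and-index run described in the message above.
# Assumption: run from the Nutch install directory, where bin/nutch lives.

# Heap setting from the message; note the leading dash on -Xmx.
# Assumption: bin/nutch in your release reads JAVA_OPTS.
JAVA_OPTS='-Xmx2048m'
export JAVA_OPTS

# Hypothetical helper variables, not part of Nutch itself.
SOLR_URL='http://localhost:8983/solr'   # Solr instance that receives the documents
SEED_DIR='urls'                         # directory holding the seed URL list

# Assemble the command first so it can be inspected before running.
CRAWL_CMD="bin/nutch crawl $SEED_DIR -depth 5 -topN 100 -solr $SOLR_URL"
echo "$CRAWL_CMD"

# Run the crawl only if the Nutch launcher is present and executable.
if [ -x bin/nutch ]; then
  $CRAWL_CMD
fi
```

Printing the command before running it makes it easy to confirm that the -solr URL points at the intended Solr instance.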
Re: DIH From various File system locations
Thanks Adam, it seems like Nutch would solve most of my concerns. It would be great if you could share some Nutch resources with us.

/ Pankaj Bhatt.

On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups <estrada.adam.gro...@gmail.com> wrote:
> I would just use Nutch and specify the -solr param on the command line.
> That will add the extracted content to your instance of Solr.
>
> Adam
Re: DIH From various File system locations
I would just use Nutch and specify the -solr param on the command line. That will add the extracted content to your instance of Solr.

Adam

On Jan 25, 2011, at 5:29 AM, pankaj bhatt wrote:
> Hi All,
> I need to index the documents present in my file system at various
> locations (e.g. C:\docs , d:\docs ).
> Is there any way I can specify this in my DIH configuration?
> Here is my configuration:
>
>     processor="FileListEntityProcessor"
>     fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
>     *baseDir="G:\\Desktop\\"*
>     recursive="false"
>     rootEntity="true"
>     transformer="DateFormatTransformer"
>     onerror="continue">
>
>     processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
>     url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
>
> / Pankaj Bhatt.