The FAQ has the following:
http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
Niall
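For anyone who cannot reach the FAQ: the approach it describes boils down to enabling Nutch's file protocol plugin and letting file: URLs through the URL filters. A sketch of the relevant nutch-site.xml overrides follows; the property names come from the standard Nutch configuration, but the exact plugin list is an assumption and should be checked against your version's nutch-default.xml:

```xml
<!-- nutch-site.xml: hedged sketch for local file system crawling -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-file replaces protocol-http so file: URLs can be fetched;
       the rest of the list must match the plugins shipped with your release -->
  <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
<property>
  <name>file.content.limit</name>
  <!-- -1 removes the default truncation limit for fetched files -->
  <value>-1</value>
</property>
```

You will also need to edit the regex URL filter file so its default `-^(file|ftp|mailto):` rule no longer rejects file: URLs, and seed the crawl with file:/// URLs.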
On Sat, Oct 3, 2009 at 2:48 PM, jkimathi wrote:
Hi,
I have installed nutch on Ubuntu 8.04 and I need it to search the local file
system. How can I configure nutch to achieve this? Also, how can I map the
local file system to the http interface?
Regards,
John.
--
View this message in context:
http://www.nabble.com/crawling-local-file-system
If you are talking about Nutch Content objects, which are stored in the segments
during fetching of pages, then you would need to write a MapReduce job to
read in the Content objects and do whatever processing you desire.
Dennis
oSilvio wrote:
Very useful information, thanks!
But in order to extract t
>> files and writable formats.
>>
>> Dennis
>>
>> oSilvio wrote:
>>> Does somebody know how the file structure works, briefly?
>>> It seems that the data are compressed or something; it's not possible to
>>> understand what's recorded in the data or index files.
>>> Thanks
>>> Silvio
>
>
--
View this message in context:
http://www.nabble.com/File-system-tp21022587p21032357.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
The nutch databases are either SequenceFile or MapFile formats, which
store key and value pairs. Their keys and values are Writable
implementations, which translate an object into its byte equivalent and
vice versa.
Data and index files make up the MapFile format. Data is a SequenceFile; the
index is a smaller SequenceFile mapping a sample of the keys to their positions
in the data file.
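To make the byte translation concrete, here is a minimal stand-in for the Writable pattern using only the JDK. The UrlScore record and its fields are invented for illustration; real Nutch/Hadoop classes implement the same write(DataOutput)/readFields(DataInput) pair from the org.apache.hadoop.io.Writable interface:

```java
import java.io.*;

// Illustrative record: serializes itself to bytes and back,
// in the spirit of Hadoop's Writable contract.
public class UrlScore {
    String url;
    float score;

    UrlScore() {}
    UrlScore(String url, float score) { this.url = url; this.score = score; }

    // Like Writable.write(DataOutput): object -> bytes
    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
    }

    // Like Writable.readFields(DataInput): bytes -> object
    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        score = in.readFloat();
    }

    static byte[] toBytes(UrlScore r) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        r.write(new DataOutputStream(buf));
        return buf.toByteArray();
    }

    static UrlScore fromBytes(byte[] b) throws IOException {
        UrlScore r = new UrlScore();
        r.readFields(new DataInputStream(new ByteArrayInputStream(b)));
        return r;
    }
}
```

A SequenceFile is essentially a long run of such serialized key/value byte pairs, which is why the raw data and index files look opaque in a text editor.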
Does somebody know how the file structure works, briefly?
It seems that the data are compressed or something; it's not possible to
understand what's recorded in the data or index files.
Thanks
Silvio
--
View this message in context:
http://www.nabble.com/File-system-tp21022587p21022587.html
Hi guys,
I am very new to Nutch. I did follow the Nutch web site for configuration
and it works fine eventually.
However, the web site didn't mention how to configure Nutch to search
the local file system. If anyone has experience with it, please help.
Thanks
Benson
> > > > property in nutch-default.xml)
> > > > 3. create a dump with the "readseg" command and the "-dump" option
> > > > 4. process the dump file and cut out what is necessary
> > > >
> > > > Just interested if that could work . . .
> > > however:
> > > I had a look at the class implementing the readseg command and found
> > > that
> > > the dump file is created with a "PrintWriter". This will create
> > > trouble I
> > > think. Maybe you can modify the SegmentReader (use an OutputStream).
> > Regarding the fetcher - it's using a binary stream to store the content
> > (FSDataOutputStream).
> >
> >
> > Cheers,
> >
> > Martin
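Martin's worry about PrintWriter is easy to demonstrate: a character Writer pushes every byte through a charset encoder, which silently replaces bytes it cannot map, while a raw OutputStream copies bytes verbatim. A self-contained sketch, with class and method names of my own (not Nutch's):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WriterCorruption {
    // Simulates dumping binary content through a character Writer:
    // bytes become chars, then get re-encoded as US-ASCII, so anything
    // above 0x7F is silently replaced with '?'.
    static byte[] throughWriter(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintWriter pw = new PrintWriter(
                new OutputStreamWriter(buf, StandardCharsets.US_ASCII));
        pw.print(new String(raw, StandardCharsets.ISO_8859_1));
        pw.flush();
        return buf.toByteArray();
    }

    // Dumping through a raw OutputStream keeps every byte intact.
    static byte[] throughStream(byte[] raw) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(raw);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // High bytes like 0x89 and 0x9C appear in any binary format
        byte[] binary = {(byte) 0x47, (byte) 0x49, (byte) 0x46,
                         (byte) 0x89, (byte) 0x9C};
        System.out.println(Arrays.equals(binary, throughWriter(binary))); // corrupted
        System.out.println(Arrays.equals(binary, throughStream(binary))); // intact
    }
}
```

This is why the fetcher's use of a binary FSDataOutputStream is the right model for dumping raw content.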
> >
> >
> > On 9/11/07, eyal edri <[EMAIL PROTECTED]> wrote:
Hi,
I've asked this question before on a different mailing list, with no real
response.
I hope someone sees the need for these actions and can help.
I'm trying to configure Nutch to download certain file types (exe/zip) to the
file system while crawling.
I know nutch doesn't have a p
a map/reduce job to generate fetch list, fetch, update, etc.
- name node (master node?) would be notified of the change to the file
system and index is updated
I don't really know how well that would work, though. Can slave nodes
start map/reduce jobs? Should they? Would the ta
content and a
simpler solution, IMO, would be to monitor file system events and just
recrawl the necessary pages each time something changes. That way our index
would always be up to date and there would be no reason to do a brute force
recrawl every night. I am willing to write this functionality and contribute
it
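The event-driven recrawl idea can be prototyped with the JDK's java.nio.file.WatchService before wiring anything into Nutch. Everything below (the class name, the idea of collecting changed paths to feed a recrawl list) is my sketch under that assumption, not existing Nutch code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class RecrawlWatcher {
    // Registers for create/modify events on dir, runs the given mutation,
    // and returns the names of entries the OS reported as changed.
    // A real watcher would loop forever and push these names into a
    // recrawl queue instead of returning.
    static List<String> collectChanges(Path dir, Runnable mutation, long timeoutSeconds)
            throws IOException, InterruptedException {
        List<String> changed = new ArrayList<>();
        try (WatchService ws = dir.getFileSystem().newWatchService()) {
            dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE,
                             StandardWatchEventKinds.ENTRY_MODIFY);
            mutation.run();
            WatchKey key = ws.poll(timeoutSeconds, TimeUnit.SECONDS);
            if (key != null) {
                for (WatchEvent<?> ev : key.pollEvents()) {
                    changed.add(ev.context().toString());
                }
                key.reset();
            }
        }
        return changed;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("recrawl-demo");
        List<String> changed = collectChanges(dir, () -> {
            try {
                Files.createFile(dir.resolve("changed-page.html"));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, 10);
        System.out.println(changed);
    }
}
```

One caveat for the proposal above: WatchService only watches a single directory per registration, so a full crawl root would need recursive registration, and event delivery latency varies by platform.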
<[EMAIL PROTECTED]> wrote:
> Stefan Groschupf wrote:
>
> > Hi geeks,
> >
> > I do not have much deep knowledge about the unix file systems,
> > so my question is: what would be the best file system for nutch
> > distributed file systems data nodes?
> >
On Tue, 2005-12-13 at 21:43 +0100, Andrzej Bialecki wrote:
>
> Most of the time we deal with very large files, with sequential access.
> Only in a few places do we deal with a lot of small files (e.g. indexing).
> So, I think the best would be an FS optimized for efficient sequential
> write/rea
Hi geeks,
I do not have much deep knowledge about the unix file systems,
so my question is: what would be the best file system for nutch
distributed file systems data nodes?
Does it make any difference using the one or the other file system?
Would reiserFS be a good choice?
Thanks for any
NDFS is not recommended in 0.7. The version of NDFS in the mapred
branch is much improved. Note however that the mapred branch is
substantially different than 0.7 and is still incomplete.
Doug
Transbuerg Tian wrote:
hi, all friends,
I downloaded nutch 0.7 and want to use NDFS independently.
So, first I started the NameNode. It successfully started; the console message is
below:
--
D:\workspace\nutch_src\bin>java -cp D:\ja