I think you need to specify the individual segment:
bin/nutch readseg -dump crawl-20100420112025/segments/20100422092816 dumpSegmentDirectory
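For reference, the general form of the command (this should hold for 0.9 and
1.x alike) is:

  bin/nutch readseg -dump <segment_dir> <output_dir>

where <output_dir> is where the plain-text dump of the segment gets written.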
On Wed, Apr 21, 2010 at 9:38 PM, nachonieto3 wrote:
>
> Thank you a lot! Now I'm working on that, but I have some more doubts... I'm
> not able to run the readseg command...
Thanks Julien.
I have changed nutch-site.xml to use only parse-(tika) instead of
parse-(text | html | js | tika) in the plugin.includes property.
It works now, as it no longer picks up any parser besides tika.
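For anyone following along, the property in conf/nutch-site.xml ends up
looking roughly like this; the plugins around parse-(tika) are illustrative
and depend on your setup:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(tika)|index-basic|query-(basic|site|url)|urlnormalizer-(pass|regex|basic)</value>
  </property>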
On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>
Hey there! Wanted to let you all know about our next meetup, April
28th. We've got a killer new venue thanks to Amazon.
Check out the details at the link:
http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/calendar/13072272/
Our Speakers this month:
1. Nick Dimiduk, Drawn to Scale: Intro to
Hi Tim,
I would think that this parameter is related to the problem you describe, but
the default value should allow indexing pages of the size you mention. Did you
change this parameter?
Regards,
Arkadi
indexer.max.tokens
10000
The maximum number of tokens that will be indexed for a single field in a
document.
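If it did get changed somewhere, restoring it in conf/nutch-site.xml would
look something like this (10000 being the stock default from
nutch-default.xml, as far as I can tell):

  <property>
    <name>indexer.max.tokens</name>
    <value>10000</value>
  </property>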
YES - I forgot to include that... robots.txt is fine. It is wide open:
###
#
# sample robots.txt file for this website
#
# addresses all robots by using wild card *
User-agent: *
#
# list folders robots are not allowed to index
#Disallow: /tutorials/404redirect/
Disall
Using 1.1, it looks like the same error at first:
threads = 10
depth = 5
indexer=lucene
Injector: starting
Injector: crawlDb: crawl-20100421175011/crawldb
Injector: urlDir: /projects/events/search/apache-nutch-1.1/cmrolg-even/urls
Injector: Converting injected urls to crawl db entries.
Exception i
Hi,
How do I set up a specific crawldb in my own Java app?
I tried to do it with the following snippet:
Configuration nutchConf = NutchConfiguration.create();
//nutchConf.addResource(new Path(prop.getProperty("nutchPath")));
Path configPath = new Path("/cygdrive/f/Workspaces/Nu
Hey,
We have a long page that appears in the search results but the summary
never contains the search terms. Why is this?
If we move the text containing the search terms up the page, they get
displayed in the summary, so it's obviously related to some limit imposed
somewhere. I've looked through
Joshua,
Could you try using Nutch 1.1 RC1 (see
http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/)?
Could you also try separating the fetching and parsing steps? E.g. fetch
first as you already do, then parse the fetched segment (instead of parsing
while refetching).
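For reference, the two-step version would look roughly like this; the segment
path is a placeholder, and -noParsing assumes the stock Fetcher options:

  bin/nutch fetch crawl/segments/20100421175011 -noParsing
  bin/nutch parse crawl/segments/20100421175011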
Your crawl is fairly small
I get the same error on a filesystem with 10 GB (disk space is a commodity
here). The final crawl when it succeeds on my Windows machine is 93 MB, so
I really hope it doesn't need more than 10 GB to even pull down and parse
the first URL. Is there something concerning threading that could
intro
Thank you a lot! Now I'm working on that, but I have some more doubts... I'm
not able to run the readseg command... I've been consulting some help forums,
and the basic syntax is:
readseg
I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments
The file named crawl-20100420112025
Hi Harry,
Could you try using parse-tika instead and see if you are getting the same
problem? I gather from your email that you are using Nutch 1.1 or the SVN
version, so parse-tika should be used by default. Have you deactivated it?
Thanks
Julien
On 21 April 2010 11:58, Harry Nutch wrote:
>
To convert Nutch's crawled data, which is stored in segments, into
human-readable and interpretable forms, you will have to look at the
'readseg' command (which was earlier 'segread'). It reads and exports the
segment data.
Details at Nutch Wiki:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutc
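Besides -dump, the tool has a couple of other modes worth knowing; assuming
the 0.8+ SegmentReader, something like:

  bin/nutch readseg -list <segment_dir>          (summary statistics)
  bin/nutch readseg -get <segment_dir> <url>     (dump the record for one URL)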
Replacing the current xercesimpl.jar with the one from nutch 1.0 seems to
fix the problem.
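Concretely that swap is just a jar replacement under lib/; the file names
below are illustrative, so check what your checkout actually ships:

  rm $NUTCH_HOME/lib/xercesImpl*.jar
  cp apache-nutch-1.0/lib/xercesImpl.jar $NUTCH_HOME/lib/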
On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch wrote:
> Hi,
>
> I am running the latest version of Nutch. While crawling one particular
> site I get an AbstractMethodError in the cyberneko plugin for all of
House Less wrote:
>
>
> Hello everyone,
>
> I am quite new to development with Nutch, so you must forgive my question
> if it is amateurish. I asked it at the Lucene Java user mailing list and
> Grant Ingersoll referred me to this list.
>
> After some reading of Luke's source code, I found
Hi,
I am running the latest version of Nutch. While crawling one particular
site I get an AbstractMethodError in the cyberneko plugin for all of its
pages when doing a fetch.
As I understand it, this happens because of a difference between the runtime
and compile-time versions. However, I am running it afre