Thanks Julien and Ferdy, appreciated. I will look into a custom MapReduce job
for MySQL first, as I don't really have the search size yet to justify HBase.
Depending on how difficult that turns out to be, I may try HBase.
-Original Message-
From: Julien Nioche [mailto:lists.digitalpeb...@g
I faced a similar problem while crawling an online shopping website to
gather the catalog of all available products. There were many products for
a given category and it was messy to follow all the "next" links.
Analyze the pattern of the next links. Define tighter regexes so that the
unwanted links are filtered out.
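Tighter patterns like these would typically go into Nutch's conf/regex-urlfilter.txt, where rules are tried top to bottom and the first match wins. The host and URL shapes below are illustrative, assuming pagination links of the form .../category/<name>?page=N:

```
# Allow pagination links within the catalog (illustrative pattern)
+^http://shop\.example\.com/category/[^/]+\?page=\d+$

# Reject session-id and sort-order variants that multiply the crawl space
-[?&](sid|sessionid|sort)=

# Accept everything else under the site
+^http://shop\.example\.com/
```

Because the reject rule sits above the catch-all accept, the noisy URL variants are dropped before the broad `+` rule can match them.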
> I should mention, that I'm using Nutch in a Web-Application.
It's possible though it's hard.
> While debugging I came across the runParser method in ParseUtil class in
> which the task.get(MAX_PARSE_TIME, TimeUnit.SECONDS); returns null.
See http://wiki.apache.org/nutch/RunNutchInEclipse#Debuggi
I checked the directory permissions. They should be OK, set to read/write
access.
It's just hard to debug, as I can't get the Hadoop logs to work. I only see
warnings and infos in the console.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Error-parsing-html-tp3994699p4012808.h
If I want to keep a cache of the crawled websites, something similar to the
Google cached view, would going with HBase be the best option, or storing
them in a filesystem?
- Original Message -
From: "Julien Nioche"
To: user@nutch.apache.org
Sent: Tuesday, 9 October 2012 16:
Mike
If you haven't done so yet maybe ask on the GORA mailing list. Would be
interested to know the answer as well.
Thanks
Julien
On 9 October 2012 02:50, Mike Baranczak wrote:
> What's the difference between those two data stores? I've read the
> javadocs, and I'm still confused.
>
> -MB
>
>
Good point Ferdy, thanks!
On 9 October 2012 18:10, Ferdy Galema wrote:
> Hi,
>
> HBase with multiple versions is certainly an option, however the current
> HBaseStore implementation is implemented with a single version in mind. (I
> have not really tested what happens with multiple versions, I g
Depth is a misleading term and should be replaced by round. Why don't you
write an HTMLParser to extract the total number of pages and generate
outlinks to all the pages beyond the first one, i.e. the whole range from 2
to 30? That's assuming that the total number of pages is expressed in a
consistent format.
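A minimal sketch of the outlink generation such a parser would do, assuming a hypothetical "?page=N" URL pattern (real forums vary, and the HtmlParseFilter plumbing Nutch would need around this is omitted):

```java
import java.util.ArrayList;
import java.util.List;

public class PaginationOutlinks {
    // Build outlinks for pages 2..totalPages from the first page's URL.
    // "template" is the URL up to the page number, e.g.
    // "http://forum.example.com/thread-123?page=" (a hypothetical pattern).
    static List<String> pageOutlinks(String template, int totalPages) {
        List<String> links = new ArrayList<>();
        for (int page = 2; page <= totalPages; page++) {
            links.add(template + page);
        }
        return links;
    }

    public static void main(String[] args) {
        // With 30 pages total, this emits outlinks for pages 2..30 in one round.
        for (String url : pageOutlinks("http://forum.example.com/thread-123?page=", 30)) {
            System.out.println(url);
        }
    }
}
```

Generating the whole range at once means every content page is discovered in a single round, instead of one "next" hop per round.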
How high did you set the depth? And why do you think it can't go any higher?
On Oct 9, 2012, at 5:15 AM, Jiang Fung Wong wrote:
> Hi All,
>
> I am setting up nutch to crawl forum pages and index the posts in the
> content pages (threads). I face a problem: nutch could not discover
> all conten
I checked the url you provided with parsechecker and it is parsed correctly.
You can check yourself by running bin/nutch parsechecker yoururl. In your
implementation, can you check whether the segment dir has the correct
permissions?
Alex.
-Original Message-
From: CarinaBambina
To: user
Se
Hi,
HBase with multiple versions is certainly an option; however, the current
HBaseStore implementation was written with a single version in mind. (I
have not really tested what happens with multiple versions; I guess you get
unexpected/undefined results.) The exception to this case would be to
I now also tried using all the source files themselves instead of the
nutch.jar, but nothing changed.
Is there anyone who has an idea what the reason for this error might be? Or
at least where and what I should look for? Any hint?
Thanks in advance!
Hi James
You could have a custom MapReduce job to copy the documents with a custom
ID as you just described. Another option would be to use Nutch 2 + HBase
and set a large number of versions (
http://hbase.apache.org/book/schema.versions.html) in the HBase schema.
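In HBase the version count is set per column family. A sketch in the HBase shell, assuming a table named 'webpage' with a content family 'f' (both names are illustrative, not necessarily what Gora's HBaseStore creates):

```
# Keep up to 10 versions of each cell in family 'f' (names are examples)
alter 'webpage', {NAME => 'f', VERSIONS => 10}

# Read back several stored versions of one page's content
get 'webpage', 'com.example.www:http/', {COLUMN => 'f:cnt', VERSIONS => 3}
```

Each put with a newer timestamp then becomes a new cell version instead of replacing the old one, up to the configured limit.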
Julien
On 9 October 2012 11:17,
I agree that doing it at the Solr level is the most straightforward, easy way.
However, if possible I would like to do it at the webpage table level. That way
I would have the original data and I would be able to reindex the data at a
later date and retroactively apply any improvements to the indexing.
Are you pushing it into a search index of some sort?
As I mostly push things into Solr, I would modify the key to take the
signature into account.
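A minimal sketch of that idea: compose the document ID from the URL plus the page's content signature, so a re-fetch with changed content indexes as a new document instead of overwriting the old one. The separator and method name are assumptions for illustration, not Nutch's actual scheme:

```java
public class VersionedKey {
    // Combine URL and content signature into one document ID. When the
    // content changes, the signature changes, so the same URL yields a
    // distinct ID and the previous version stays in the index.
    // The "::" separator is arbitrary; pick one that cannot occur in a URL.
    static String versionedId(String url, String signatureHex) {
        return url + "::" + signatureHex;
    }

    public static void main(String[] args) {
        System.out.println(versionedId("http://example.com/page", "9f2c4a"));
    }
}
```

The trade-off is that deduplication by URL no longer works out of the box, since each version is now a separate Solr document.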
On 9 Oct 2012, at 11:17, wrote:
> Hi
>
> Rather than a wide crawl of the web keeping track of the current state of
> sites (as I understand Nutch is
Hi
Rather than a wide crawl of the web keeping track of the current state of sites
(as I understand Nutch is currently optimized for) I am interested in keeping
copies of a more modest number of sites over time as they change. In other
words keeping copies of both the old webpages and the new p
Hi All,
I am setting up Nutch to crawl forum pages and index the posts in the
content pages (threads). I face a problem: Nutch could not discover
all the content pages, despite me setting a very high depth.
This is because, typically, a thread can have many posts that span
several pages. Suppose I a