RE: Keeping History/Archive with Nutch 2.x

2012-10-09 Thread j.sullivan
Thanks Julien and Ferdy, appreciated. I will look into a custom MapReduce job for MySQL first, as I don't really have the search size yet to justify HBase. Depending on how difficult that turns out to be, I may try HBase. -Original Message- From: Julien Nioche [mailto:lists.digitalpeb...@g

Re: crawling forum pages

2012-10-09 Thread Tejas Patil
I had faced a similar problem while crawling an online shopping website to gather the catalog of all available products. There were many products for a given category and it was messy to follow all the "next" links. Analyze the pattern for the next links. Define tighter regexes so that the unwanted
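Tejas's suggestion of tighter regexes would typically go in Nutch's conf/regex-urlfilter.txt. A hypothetical fragment (the site, paths, and "page" parameter below are made-up placeholders, not from the original thread):

```
# Hypothetical regex-urlfilter.txt fragment: accept only paginated category
# listings and product pages on one shop, reject everything else.
+^http://shop\.example\.com/category/[a-z]+\?page=\d+$
+^http://shop\.example\.com/product/\d+$
-.
```

Rules are evaluated top to bottom; the final `-.` rejects any URL not matched by an earlier `+` rule.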

Re: Error parsing html

2012-10-09 Thread Sebastian Nagel
> I should mention that I'm using Nutch in a web application. It's possible, though it's hard. > While debugging I came across the runParser method in the ParseUtil class, in > which task.get(MAX_PARSE_TIME, TimeUnit.SECONDS) returns null. See http://wiki.apache.org/nutch/RunNutchInEclipse#Debuggi
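For context, the timed-parse pattern Sebastian refers to can be sketched with plain java.util.concurrent primitives. This is an illustrative reconstruction, not the actual ParseUtil code; the MAX_PARSE_TIME value and the helper's name are assumptions:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch of a timed parse: run the parser in a separate thread and give up
// after MAX_PARSE_TIME seconds. If the timeout expires, get() throws
// TimeoutException; a null return means the task completed but produced
// no result -- which is the case being debugged in the thread above.
public class TimedParse {
    static final int MAX_PARSE_TIME = 30; // seconds; illustrative default

    public static String parseWithTimeout(Callable<String> parser) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> task = executor.submit(parser);
        try {
            return task.get(MAX_PARSE_TIME, TimeUnit.SECONDS);
        } finally {
            executor.shutdownNow(); // interrupt a parser that is still running
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithTimeout(() -> "parsed content"));
    }
}
```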

Re: Error parsing html

2012-10-09 Thread CarinaBambina
I checked the directory permissions. They should be OK, set to read/write access. It's just hard to debug, as I can't make the Hadoop logs work. I only see warnings and infos in the console. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-parsing-html-tp3994699p4012808.h

Re: Keeping History/Archive with Nutch 2.x

2012-10-09 Thread Jorge Luis Betancourt Gonzalez
If I want to keep a cache of the crawled websites, something similar to Google's cached view, would going with HBase be the best option, or storing them in a filesystem? - Original Message - From: "Julien Nioche" To: user@nutch.apache.org Sent: Tuesday, 9 October 2012 16:

Re: DataFileAvroStore vs. AvroStore

2012-10-09 Thread Julien Nioche
Mike, if you haven't done so yet, maybe ask on the GORA mailing list. I would be interested to know the answer as well. Thanks, Julien. On 9 October 2012 02:50, Mike Baranczak wrote: > What's the difference between those two data stores? I've read the > javadocs, and I'm still confused. > > -MB > >

Re: Keeping History/Archive with Nutch 2.x

2012-10-09 Thread Julien Nioche
Good point Ferdy, thanks! On 9 October 2012 18:10, Ferdy Galema wrote: > Hi, > > HBase with multiple versions is certainly an option, however the current > HBaseStore implementation is written with a single version in mind. (I > have not really tested what happens with multiple versions, I guess

Re: crawling forum pages

2012-10-09 Thread Julien Nioche
Depth is a misleading term which should be replaced by "round". Why don't you write an HTMLParser to extract the total number of pages and generate outlinks to all the pages beyond the first one, i.e. the whole range from 2 to 30? That's assuming that the total number of pages is expressed in a consistent
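Julien's suggestion can be sketched as a small helper that, given the first page's URL and the total page count a custom parser has extracted, emits the remaining pagination outlinks. The URL scheme (a "page" query parameter) is a made-up assumption; real forums vary:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the approach above: once a custom parser has read
// the total page count from page 1, generate outlinks for pages 2..total
// directly instead of waiting for successive crawl rounds to follow each
// "next" link one at a time.
public class PaginationOutlinks {
    public static List<String> generate(String firstPageUrl, int totalPages) {
        List<String> outlinks = new ArrayList<>();
        for (int page = 2; page <= totalPages; page++) {
            // Assumes the forum encodes the page number in a "page" parameter.
            String sep = firstPageUrl.contains("?") ? "&" : "?";
            outlinks.add(firstPageUrl + sep + "page=" + page);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        for (String link : generate("http://forum.example.com/thread/123", 4)) {
            System.out.println(link);
        }
    }
}
```

In a real plugin these outlinks would be appended to the parse result so the generator schedules them in the very next round.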

Re: crawling forum pages

2012-10-09 Thread Mike Baranczak
How high did you set the depth? And why do you think it can't go any higher? On Oct 9, 2012, at 5:15 AM, Jiang Fung Wong wrote: > Hi All, > > I am setting up nutch to crawl forum pages and index the posts in the > content pages (threads). I face a problem: nutch could not discover > all content pages

Re: Error parsing html

2012-10-09 Thread alxsss
I checked the URLs you provided with parsechecker and they are parsed correctly. You can check yourself by running bin/nutch parsechecker yoururl. In your implementation, can you check whether the segment dir has the correct permissions? Alex. -Original Message- From: CarinaBambina To: user Se
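The checks Alex describes, roughly (the URL and segment path below are illustrative placeholders):

```
# Parse a single URL and print the extracted text, metadata and outlinks
# (run from the Nutch install directory).
bin/nutch parsechecker http://www.example.com/page.html

# Verify the segment directories are readable/writable by the user
# running Nutch.
ls -ld crawl/segments/*/
```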

Re: Keeping History/Archive with Nutch 2.x

2012-10-09 Thread Ferdy Galema
Hi, HBase with multiple versions is certainly an option; however, the current HBaseStore implementation is written with a single version in mind. (I have not really tested what happens with multiple versions; I guess you get unexpected/undefined results.) The exception to this case would be to

Re: Error parsing html

2012-10-09 Thread CarinaBambina
I now also tried using all the source files themselves instead of the nutch.jar, but nothing changed. Does anyone have an idea what the reason for this error might be? Or at least where and what I should look for? Any hint? Thanks in advance! -- View this message in context: http://lucene.472

Re: Keeping History/Archive with Nutch 2.x

2012-10-09 Thread Julien Nioche
Hi James, You could have a custom MapReduce job to copy the documents with a custom ID, as you just described. Another option would be to use Nutch 2 + HBase and set a large number of versions (http://hbase.apache.org/book/schema.versions.html) in the HBase schema. Julien On 9 October 2012 11:17,
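Raising the version count would be done in the HBase shell against the Nutch 2.x table. A sketch, assuming the table is named 'webpage' and the content family is 'f' (check gora-hbase-mapping.xml for the actual names in your setup):

```
# HBase shell: keep up to 100 versions of each cell in the content family,
# so older crawled copies of a page are retained rather than overwritten.
alter 'webpage', NAME => 'f', VERSIONS => 100
```

Older versions can then be read back with a timestamped Get or Scan.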

RE: Keeping History/Archive with Nutch 2.x

2012-10-09 Thread j.sullivan
I agree that doing it at the Solr level is the most straightforward way. However, if possible I would like to do it at the webpage table level. That way I would have the original data and would be able to reindex the data at a later date and retroactively apply any improvements to the indexing

Re: Keeping History/Archive with Nutch 2.x

2012-10-09 Thread Dave Stuart
Are you pushing it into a search index of some sort? As I mostly push things into Solr, I would modify the key to take the signature into account. On 9 Oct 2012, at 11:17, wrote: > Hi > > Rather than a wide crawl of the web keeping track of the current state of > sites (as I understand Nutch is
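Dave's idea of taking the signature into account amounts to composing the Solr document ID from the URL plus the page's content signature, so each crawled version of a page lands in the index as a distinct document instead of overwriting the previous one. A minimal sketch; the separator and method name are assumptions:

```java
// Sketch: a versioned document ID built from the page URL and its content
// signature (e.g. the MD5/TextProfile digest Nutch computes per fetch).
// Two fetches with identical content share a signature and thus an ID,
// so unchanged pages are deduplicated while changed pages accumulate.
public class VersionedDocId {
    public static String docId(String url, String signature) {
        return url + "#" + signature; // '#' as separator is an arbitrary choice
    }

    public static void main(String[] args) {
        System.out.println(docId("http://example.com/page", "a3f9c2"));
    }
}
```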

Keeping History/Archive with Nutch 2.x

2012-10-09 Thread j.sullivan
Hi, Rather than a wide crawl of the web keeping track of the current state of sites (as I understand Nutch is currently optimized for), I am interested in keeping copies of a more modest number of sites over time as they change. In other words, keeping copies of both the old web pages and the new pages

crawling forum pages

2012-10-09 Thread Jiang Fung Wong
Hi All, I am setting up Nutch to crawl forum pages and index the posts in the content pages (threads). I face a problem: Nutch could not discover all content pages, despite me setting a very high depth. This is because typically a thread could have many posts that span several pages. Suppose I a