Re: ParseText contains newline

2010-02-18 Thread Ken Krugler
Hi Ted, This is a Tika issue, and one that's been on my list for a while to file/fix - thanks for the reminder :) -- Ken On Feb 18, 2010, at 4:31pm, Ted Yu wrote: Hi, We use Nutch 1.0. I found that for certain web pages, e.g. http://www.funnycorner.net/funny-pictures/4060/funny-people-pictu

Query: Local webpage caching using Nutch Java API

2010-02-18 Thread Amit Agarwal
Hi, I am a newbie to Nutch and Lucene. I have a task to build a framework for webpage caching on the local system (i.e. download and store webpages in the local filesystem), indexing (index pages on keywords), and search (search the local webpage cache using the keywords). The preference would be to build framew

ParseText contains newline

2010-02-18 Thread Ted Yu
Hi, We use Nutch 1.0. I found that for certain web pages, e.g. http://www.funnycorner.net/funny-pictures/4060/funny-people-pictures/thieves-snort-dogs-ashes-in-cocaine-bungle.html,

Re: Is there a comprehensive guide to Nutch->Solr migration.

2010-02-18 Thread Aaron Binns
BTW, I've already read: Using Nutch with Solr http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ and it touches upon some of the issues, but are there any more articles/studies/etc. like this? Regards, Aaron -- Aaron Binns Senior Software Engineer, Web Group, Internet Archive

Is there a comprehensive guide to Nutch->Solr migration.

2010-02-18 Thread Aaron Binns
Hello all, Is there a comprehensive guide to Nutch->Solr migration, at least with respect to search serving? At the Internet Archive, we've been using Nutch in various capacities for quite some time, and are evaluating Solr for doing the search serving/query execution. On the surface, there seem

Help needed for NutchBean.getContent(HitDetails) returning null

2010-02-18 Thread Bruno Adam Osiek
Hi, I'm new to Nutch, and when performing a search in an embedded Java application I get the expected results, i.e., NutchBean returns Hits. For each hit I manage to get ONLY the following HitDetails: boost, digest, segment, title, tstamp and url. Both methods NutchBean.getContent(HitDetails) and Nu

Re: convert segment dump into text for data mining.

2010-02-18 Thread Hannes Carl Meyer
Hi Felix, usually there is a lot of data stored in the segments. Which data do you need? The webpage content or term freqs only? You should also consider running this command to fetch the whole contents: ./bin/nutch readseg. For example: ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder

How to add sitemap attribute to crawldb while fetching

2010-02-18 Thread Pravin Karne
Hi, Sitemap.xml contains URL info for "updatefrequency" and "lastmodify". So, while fetching the URLs, can we update the crawldatum with the above values? That way a long-running crawl will have updated information every time, with no need to re-crawl for updated links. By default this value is the 30 days(my underst
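
To make the question above concrete, here is a minimal Python sketch of reading the per-URL hints out of a sitemap. It assumes the standard sitemaps.org element names (`lastmod`, `changefreq`), which appear to be what the message calls "lastmodify" and "updatefrequency". The function name `read_sitemap`, the `CHANGEFREQ_SECONDS` mapping and the 30-day fallback (taken from the poster's understanding of the default) are illustrative assumptions. It does not touch Nutch or the crawldb; it only shows which values would be available to store.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical mapping from <changefreq> values to a re-fetch interval in seconds.
# "always" and "never" are deliberately left out so they fall back to the default.
CHANGEFREQ_SECONDS = {
    "hourly": 3600,
    "daily": 86400,
    "weekly": 7 * 86400,
    "monthly": 30 * 86400,
    "yearly": 365 * 86400,
}

def read_sitemap(xml_text, default_interval=30 * 86400):
    """Return one dict per <url> entry with its URL, lastmod and fetch interval."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        changefreq = url.findtext("sm:changefreq", default="", namespaces=NS)
        # Unknown or missing changefreq values use the assumed default interval.
        interval = CHANGEFREQ_SECONDS.get(changefreq.strip().lower(), default_interval)
        entries.append({
            "url": loc,
            "lastmod": lastmod.strip() if lastmod else None,
            "interval": interval,
        })
    return entries
```

Each returned entry carries the interval that a crawler could use instead of a single global default; how those values would be written into a crawl database is left open here.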

convert segment dump into text for data mining.

2010-02-18 Thread Felix Zimmermann
Hi, I would like to convert the segment dumpfile into a textfile in order to analyse it later with a data mining program. I am not very familiar with file formats and so did not succeed using commands like od/uuencode. I use Ubuntu 9.10. Thanks for help, Felix.
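
For readers with the same question, here is a minimal Python sketch that turns a plain-text segment dump into one (URL, text) record per page. It assumes the dump is text in which every record starts with a `Recno:: N` line, carries a `URL:: ...` line, and holds the extracted page text after a `ParseText::` marker. That layout is an assumption to check against your own dump file; the function name `split_records` is made up for illustration.

```python
import re

# Assumed layout: every record in the text dump begins with a line like "Recno:: 0".
RECORD_START = re.compile(r"^Recno:: \d+[ \t]*$", re.MULTILINE)
URL_LINE = re.compile(r"^URL:: (.*)$", re.MULTILINE)
TEXT_MARKER = "ParseText::"

def split_records(dump_text):
    """Split a text dump into a list of {"url", "text"} dicts, one per record."""
    starts = [m.start() for m in RECORD_START.finditer(dump_text)]
    records = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(dump_text)
        block = dump_text[start:end]
        url_match = URL_LINE.search(block)
        pos = block.find(TEXT_MARKER)
        raw_text = block[pos + len(TEXT_MARKER):] if pos != -1 else ""
        records.append({
            "url": url_match.group(1).strip() if url_match else None,
            # Collapse newlines and runs of whitespace so each page is one line.
            "text": " ".join(raw_text.split()),
        })
    return records
```

One limitation of this sketch: a page whose own text contains a line that looks like `Recno:: 5` would be split incorrectly. Writing each record out as `url<TAB>text` gives a file that most data mining tools can read directly.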