Re: parseStatus not updated after parsing some files

2014-11-12 Thread coldboy128
I had the same problem as you. In ParseUtil.java and ParseJob.java I clone
the ParseStatus object with this code:
ParseStatus pstatus = null;
if (page.getParseStatus() != null) {
  pstatus = (ParseStatus) page.getParseStatus().clone();
}
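
The null-checked clone above is a defensive copy. Here is a minimal,
self-contained sketch of that pattern. The class Status and the helper copyOf
are hypothetical stand-ins for illustration only, not Nutch's ParseStatus or
any Nutch API. The thread does not say why the copy fixes the problem; the
sketch only shows that a copy is unaffected by later changes to the original.

```java
/** Hypothetical stand-in for illustration; this is not Nutch's ParseStatus. */
class StatusCopyDemo {

    static final class Status implements Cloneable {
        int majorCode;

        Status(int majorCode) {
            this.majorCode = majorCode;
        }

        @Override
        public Status clone() {
            try {
                return (Status) super.clone();
            } catch (CloneNotSupportedException e) {
                // Cannot happen because Status implements Cloneable.
                throw new AssertionError(e);
            }
        }
    }

    /** Null-safe copy, mirroring the null check in the snippet above. */
    static Status copyOf(Status original) {
        return original == null ? null : original.clone();
    }

    public static void main(String[] args) {
        Status shared = new Status(1);
        Status copy = copyOf(shared);
        shared.majorCode = 2;               // later change to the original
        System.out.println(copy.majorCode); // the copy keeps its own value
    }
}
```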




--
View this message in context: 
http://lucene.472066.n3.nabble.com/parseStatus-not-updated-after-parsing-some-files-tp4118570p4169014.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch configuration - V1 vs V2 differences

2014-11-12 Thread mikejf12

I'll add a little more detail; it might shed some light.

When I used apache-nutch-1.8-src, I implicitly used HDFS for storage by
running the crawl script.

When I used apache-nutch-2.2.1, I used Gora and HBase, so HBase was the
storage layer. I wondered whether that would have an impact.

I'm just wondering why it worked in the later Gora/HBase setup without the
need for the Hadoop config files?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-configuration-V1-vs-V2-differences-tp4168893p4168967.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch configuration - V1 vs V2 differences

2014-11-12 Thread Meraj A. Khan
I installed it by copying the files to the conf directory. I never tried it
without that step, so I can't confirm whether the copying is really needed.
On Nov 12, 2014 6:24 AM, "mikejf12"  wrote:

>
> Hi
>
> I installed two versions of Nutch on a CentOS 6 Linux Hadoop V1.2.1
> cluster. I didn't have any issues using them, but I noticed a
> difference.
>
> I installed the source release, apache-nutch-1.8-src; the instructions that
> I followed advised that the Hadoop configuration files be copied to the
> Nutch conf directory.
>
> I also installed the non-source release, apache-nutch-2.2.1, which didn't
> require this.
>
> It's been a while since I did this, and I wondered whether the step to copy
> the Hadoop config files was necessary for the source release?
>
> cheers


Nutch configuration - V1 vs V2 differences

2014-11-12 Thread mikejf12

Hi 

I installed two versions of Nutch on a CentOS 6 Linux Hadoop V1.2.1
cluster. I didn't have any issues using them, but I noticed a difference.

I installed the source release, apache-nutch-1.8-src; the instructions that
I followed advised that the Hadoop configuration files be copied to the
Nutch conf directory.

I also installed the non-source release, apache-nutch-2.2.1, which didn't
require this.

It's been a while since I did this, and I wondered whether the step to copy
the Hadoop config files was necessary for the source release?

cheers



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-configuration-V1-vs-V2-differences-tp4168893.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Removing Common Web Page Header and Footer from content

2014-11-12 Thread Jigal van Hemert | alterNET internet BV
On 11 November 2014 09:12, Moumita Dhar01 wrote:

> Hi,
>
> I am using Nutch 1.9 and Solr 4.6 to index a web application with
> approximately 100 distinct URLs and their contents.
>
> Nutch is used to fetch the URLs and links and to crawl the entire web
> application, extracting the content of all pages and sending it to
> Solr.
>
> The problem that I have now is that the first 1000 or so characters and
> the last 400 or so characters of the pages, which are the common header
> and footer, are showing up in the search results.
>
> Is there a way to ignore the links or keep only the static text in the
> content?
>

You can exclude parts of the page before it's added to the index. In
nutch-site.xml you can put (example configuration, adjust to your situation)

  
<property>
  <name>parser.html.NodesToExclude</name>
  <value></value>
  <description>
  A list of nodes whose content will not be indexed separated by "|".
  Use this to tell the HTML parser to ignore, for example, site
  navigation text.

  Each node has three elements, separated by semi-colon:
  the first one is the tag name,
  the second one the attribute name,
  the third one the value of the attribute.

  Example: table;summary;header|div;id;navigation

  Note that nodes with these attributes, and their children, will be
  silently ignored by the parser so verify the indexed content
  with Luke to confirm results.
  </description>
</property>


In the value part you add your configuration. The description part is just
to explain.
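
To make the value format concrete, here is a minimal sketch of how a string
such as the example above could be split into (tag, attribute, value) rules.
The class NodeExcludeRules and its parse method are hypothetical names made up
for this illustration; this is not the code the Nutch HTML parser uses, only
a reading of the format given in the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Nutch. */
class NodeExcludeRules {

    /** One rule: tag name, attribute name, attribute value. */
    static final class Rule {
        final String tag;
        final String attribute;
        final String value;

        Rule(String tag, String attribute, String value) {
            this.tag = tag;
            this.attribute = attribute;
            this.value = value;
        }

        @Override
        public String toString() {
            return tag + "[" + attribute + "=" + value + "]";
        }
    }

    /** Splits "tag;attr;value|tag;attr;value" into a list of rules. */
    static List<Rule> parse(String configValue) {
        List<Rule> rules = new ArrayList<>();
        if (configValue == null || configValue.trim().isEmpty()) {
            return rules;
        }
        // "|" is a regex metacharacter, so it must be escaped for split().
        for (String node : configValue.split("\\|")) {
            String[] parts = node.trim().split(";");
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                        "Expected tag;attribute;value but got: " + node);
            }
            rules.add(new Rule(parts[0].trim(), parts[1].trim(), parts[2].trim()));
        }
        return rules;
    }

    public static void main(String[] args) {
        for (Rule rule : parse("table;summary;header|div;id;navigation")) {
            System.out.println(rule);
        }
    }
}
```

With the example value from the description, this yields two rules: one for
table elements whose summary attribute is "header" and one for div elements
whose id is "navigation".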


-- 


Kind regards,

Jigal van Hemert | Developer


