Re: how can I index only a portion of html content?

2009-10-10 Thread winz


Jayant Kumar Gandhi wrote:
 
 It is possible in many ways. One of the ways to do it without using
 the HTML parser plugin is to do cloaking for your bot.
 
 

Hi,
Could I please know all the possible methods for achieving this?
This seems to be a common problem, but I failed to find decent answers on
this forum.
I'm using a content management system named Infoglue to create my website.
Most of the pages on my site have an identical navigation bar, header and
footer, and the content of these sections shows up in the search results.

As a related question, what does de-duplication in Nutch mean and how does
it work? Is it possible to configure Nutch to remove duplicated content,
such as the navigation bar, during its de-duplication process?

Regards,
Winz






Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread winz


Venkateshprasanna wrote:
 
 Hi,
 
 You can very well think of doing that if you know that you would crawl and
 index only a selected set of web pages which follow the same design.
 Otherwise, it would turn out to be a never-ending process - i.e., finding
 out the sections, frames, divs, spans, CSS classes and the like - from
 each of the web pages. Scalability would obviously be an issue.
 

Hi,
Could I please know how we can ignore template items like the header, footer
and menu/navigation while crawling and indexing pages that follow the same
design?
I'm using a content management system called Infoglue to develop my website.
A standard template is applied to all the pages on the website.

The search results from Nutch show content from the menu/navigation bar
multiple times, and I need to get rid of that content from the results.

Please advise.

Thanks,
Vinay




Re: indexing just certain content

2009-10-10 Thread Andrzej Bialecki

MilleBii wrote:

Andrzej,

The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.

I have a different use case: I want to keep everything as in standard
indexing _AND_ also extract a part to be indexed in a dedicated field (which
will be boosted at search time). In my case, certain parts of a document
carry more importance than others.

So I would like to either
1. access the HTML representation at indexing time... not possible, or I did
not find how, or
2. create a dual representation of the document: the plain, standard one and
a filtered one.

I think option 2 is much better because it fits the model better and allows
for a lot of other use cases.


Actually, the creativecommons plugin provides hints on how to do this, but
to be more explicit:

* in your HtmlParseFilter you extract the parts that you want from the DOM
tree and put them inside ParseData.metadata. This way you preserve both the
original text and the special parts that you extracted.

* in your IndexingFilter you retrieve the parts from ParseData.metadata and
add them as additional index fields (don't forget to specify the indexing
backend options).

* in your QueryFilter's plugin.xml you declare that QueryParser should pass
your special fields through without treating them as terms, and in the
implementation you create a BooleanClause to be added to the translated
query.
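
To make the first two steps concrete, here is a minimal, untested sketch
against the Nutch 1.0 plugin interfaces (check the exact signatures in your
version). The div id "main", the metadata key and the "important" field name
are all invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class ImportantPartFilter implements HtmlParseFilter {

  public static final String META_KEY = "x-important-text"; // invented key
  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    StringBuffer important = new StringBuffer();
    collectText(doc, false, important);
    Parse parse = parseResult.get(content.getUrl());
    // Preserved alongside the normal parse text, as described above.
    parse.getData().getParseMeta().add(META_KEY, important.toString().trim());
    return parseResult;
  }

  // Collects text, but only once we are inside <div id="main">.
  private void collectText(Node node, boolean inside, StringBuffer sb) {
    boolean nowInside = inside || "main".equals(getId(node));
    if (nowInside && node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()).append(' ');
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectText(c, nowInside, sb);
    }
  }

  private String getId(Node node) {
    NamedNodeMap attrs = node.getAttributes();
    Node id = (attrs == null) ? null : attrs.getNamedItem("id");
    return (id == null) ? null : id.getNodeValue();
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class ImportantPartIndexer implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                              CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    String important =
        parse.getData().getParseMeta().get(ImportantPartFilter.META_KEY);
    if (important != null && important.length() > 0) {
      doc.add("important", important); // the extra, boostable field
    }
    return doc;
  }

  public void addIndexBackendOptions(Configuration conf) {
    // Declare here how the "important" field is stored/indexed by the
    // backend (e.g. via LuceneWriter.addFieldOptions in Nutch 1.0).
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

The extra "important" field is what you would then boost at search time,
once the QueryFilter from the last step is in place.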



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki

winz wrote:


Venkateshprasanna wrote:

[...]

Hi,
Could I please know how we can ignore template items like the header, footer
and menu/navigation while crawling and indexing pages that follow the same
design?
I'm using a content management system called Infoglue to develop my website.
A standard template is applied to all the pages on the website.

The search results from Nutch show content from the menu/navigation bar
multiple times, and I need to get rid of that content from the results.


If all you index is this particular site, then you know the positions of 
navigation items, right? Then you can remove these elements in your 
HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these 
elements.
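
For example, here is a minimal, untested sketch of that idea, assuming the
template marks its navigation blocks with known div ids (the ids below are
invented); a DOMContentUtils-style text walk simply skips those subtrees:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class TemplateSkippingUtils {

  // Hypothetical ids - use whatever your template actually emits.
  private static final Set<String> SKIPPED_IDS = new HashSet<String>(
      Arrays.asList("header", "top_menu", "left_menu", "right_menu"));

  /** Collects text like DOMContentUtils.getText, but skips template blocks. */
  public static void getText(StringBuffer sb, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE && isSkipped(node)) {
      return; // drop navigation/header subtrees entirely
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()).append(' ');
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      getText(sb, c);
    }
  }

  private static boolean isSkipped(Node node) {
    NamedNodeMap attrs = node.getAttributes();
    Node id = (attrs == null) ? null : attrs.getNamedItem("id");
    return id != null && SKIPPED_IDS.contains(id.getNodeValue());
  }
}

The same check can be dropped into DOMContentUtils itself, so the navigation
text never makes it into the parse text at all.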




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

Hi,
You said: '... in your HtmlParseFilter you need to extract from the DOM tree
the parts that you want ...', but my problem is that I don't know what to
extract, because I don't know all the pages I'm indexing; I only know what
NOT to index.

1. All pages have some sections that I won't index. Since I know those
sections, I want to take them out of the document and keep the rest of the
important content. The sections are headers, top menus, right menus, left
menus and some others:

<div id="header"> bla bla </div>
<div id="top_menu"> bla bla </div>
<div id="left_menu"> bla bla </div>
<div id="right_menu"> bla bla </div>

Maybe I could find some Java classes that can delete such sections from an
HTML page? If I find one, I guess it will be easier to use.

2. You said don't forget the indexing backend options: could you tell me
what they are?

3. We are using Solr, so I just have to index the important content. The
search will be performed with Solr, so I guess I don't need the QueryFilter.

best regards




 Date: Sat, 10 Oct 2009 16:04:10 +0200
 From: a...@getopt.org
 To: nutch-user@lucene.apache.org
 Subject: Re: indexing just certain content
 
 [...]

RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

Yes.

MilleBii,

I told you before that I created a Dublin Core metadata parser and indexer,
so I parsed my HTML and created fields to get my DC metadata. My missing
piece is how to delete sections from an HTML page :( If I find this piece,
the rest will be a piece of cake :)




 Date: Sat, 10 Oct 2009 16:41:44 +0200
 Subject: Re: indexing just certain content
 From: mille...@gmail.com
 To: nutch-user@lucene.apache.org
 
 Andrzej,
 
 Great !!!
 I did not realize you could put your own content in ParseData.metadata and
 read it back in the IndexingFilter... this was my missing piece in the
 puzzle; for the rest I knew what to do.
 
 Thanks,
 
 
 
 2009/10/10 Andrzej Bialecki a...@getopt.org
 
  [...]
 
 -- 
 -MilleBii-
  
_
New! Faster Messenger access on the new MSN homepage
http://go.microsoft.com/?linkid=9677406

RE: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread BELLINI ADAM

Hi guys,
This is just what I'm talking about in my post 'indexing just certain
content'; reading it may help you. I was asking how to get rid of the garbage
sections in a document and parse only the important data, so I guess you will
create your own parser and indexer. But the problem is how to delete those
garbage sections from the HTML. Try to read my post; maybe we can merge our
two threads (I don't know if that's possible on this mailing list) so we only
track one.

best regards




 Date: Sat, 10 Oct 2009 17:31:57 +0200
 From: a...@getopt.org
 To: nutch-user@lucene.apache.org
 Subject: Re: How to ignore search results that don't have related keywords in 
 main body?
 
 [...]
 
  
_
New! Faster Messenger access on the new MSN homepage
http://go.microsoft.com/?linkid=9677406

RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

What I want is exactly what is explained in this other thread: How to ignore
search results that don't have related keywords in main body?




 From: mbel...@msn.com
 To: nutch-user@lucene.apache.org
 Subject: RE: indexing just certain content
 Date: Sat, 10 Oct 2009 15:35:31 +
 
 
 [...]
  
_
New! Get to Messenger faster: Sign-in here now!
http://go.microsoft.com/?linkid=9677407

Re: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread Andrzej Bialecki

BELLINI ADAM wrote:

Hi guys,
This is just what I'm talking about in my post 'indexing just certain
content'; reading it may help you. I was asking how to get rid of the
garbage sections in a document and parse only the important data, so I
guess you will create your own parser and indexer. But the problem is how
to delete those garbage sections from the HTML. Try to read my post; maybe
we can merge our two threads (I don't know if that's possible on this
mailing list) so we only track one.


What is garbage? Can you define it in terms of a regex pattern or an XPath
expression that points to specific elements in the DOM tree? If you crawl a
single site (or a few) with well-defined templates, then you can hardcode
some rules for removing unwanted parts of the page.

If you can't do this, then there are some heuristic methods to solve this.
They fall into two groups:

* page at a time (local): these methods consider only the current page that
you analyze. The quality of filtering is usually limited.

* groups of pages (e.g. per site): these methods consider many pages at a
time and try to find a recurring theme among them. Since you first need to
accumulate some pages, this can't be done on the fly, i.e. it requires a
separate post-processing step.

The easiest to implement in Nutch is the first approach (page at a time).
There are many possible implementations, e.g. based on text patterns, on
the visual position of elements, on DOM tree patterns, on block-of-content
characteristics, etc.

Here's, for example, a simple method (a rough sketch follows below):

* collect the text of the page in blocks, where each block fits within
structural tags (div and table tags). Also collect the number of <a> links
in each block.

* remove a percentage of the smallest blocks where the link count is high -
these are likely navigational elements.

* reconstruct the whole page from the remaining blocks.
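
A minimal, untested sketch of that method over a parsed org.w3c.dom tree;
the cut-off values are arbitrary placeholders, and a real implementation
would rank blocks by size and drop a percentage, as described above:

import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class LinkDensityCleaner {

  /** Removes small, link-heavy blocks (likely navigation) in place. */
  public static void clean(Document doc) {
    Element root = doc.getDocumentElement();
    if (root == null) return;
    List<Element> navBlocks = new ArrayList<Element>();
    findNavBlocks(root, navBlocks);
    for (Element e : navBlocks) {
      e.getParentNode().removeChild(e);
    }
  }

  private static void findNavBlocks(Element e, List<Element> out) {
    String tag = e.getTagName().toLowerCase();
    if (tag.equals("div") || tag.equals("table")) {
      int textLen = e.getTextContent().trim().length();
      // Short block, mostly links: probably a navigation element.
      if (textLen < 200 && countLinks(e) >= 5) {
        out.add(e);
        return; // no need to descend into a block we are dropping
      }
    }
    for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (c.getNodeType() == Node.ELEMENT_NODE) {
        findNavBlocks((Element) c, out);
      }
    }
  }

  private static int countLinks(Node n) {
    int count = 0;
    if (n.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(((Element) n).getTagName())) {
      count++;
    }
    for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
      count += countLinks(c);
    }
    return count;
  }
}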

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: How to ignore search results that don't have related keywords in main body?

2009-10-10 Thread BELLINI ADAM

As I explained in my post, the sections I don't want to index are headers,
top menus, right menus and left menus; this is what I mean by garbage:

<div id="header"> bla bla </div>
<div id="top_menu"> bla bla </div>
<div id="left_menu"> bla bla </div>
<div id="right_menu"> bla bla </div>

Each page contains the same header and menu sections, and I don't want to
index them because they are the same everywhere. On each page I still want
to parse those sections to get the outlinks, but not index them, so I have
to create a filtered content (without those sections). But how do I
construct this content, since I don't know all the blocks and tags the
pages will contain, and I don't even know if they are well formed (it's
just HTML)?
The only thing I'm sure about is that there is a template which applies to
all pages, and this template consists of the div sections described above
(menus, left menus, etc.).
So I guess the easiest solution is to find a Java class which takes an HTML
file and certain sections (like div id="header") as parameters, deletes
those sections from the HTML file, and produces the new, cleaned HTML.
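
Something like the following untested sketch could do that, assuming the
NekoHTML parser (the same family of parsers that parse-html can use) to
cope with malformed markup; the class and method names are invented:

import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HtmlSectionRemover {

  /** Parses (possibly malformed) HTML, drops the elements with the given
      ids, and returns the cleaned HTML. */
  public static String removeSections(String html, String... ids)
      throws Exception {
    DOMParser parser = new DOMParser(); // NekoHTML: tolerant of bad HTML
    parser.parse(new InputSource(new StringReader(html)));
    Document doc = parser.getDocument();

    Set<String> drop = new HashSet<String>(Arrays.asList(ids));
    NodeList all = doc.getElementsByTagName("*");
    List<Element> toRemove = new ArrayList<Element>();
    for (int i = 0; i < all.getLength(); i++) {
      Element e = (Element) all.item(i);
      if (drop.contains(e.getAttribute("id"))) {
        // Collect first, remove later: the NodeList is live and would
        // change under us if we removed nodes while iterating.
        toRemove.add(e);
      }
    }
    for (Element e : toRemove) {
      e.getParentNode().removeChild(e);
    }

    StringWriter out = new StringWriter();
    Transformer t = TransformerFactory.newInstance().newTransformer();
    t.setOutputProperty(OutputKeys.METHOD, "html");
    t.transform(new DOMSource(doc), new StreamResult(out));
    return out.toString();
  }
}

e.g. removeSections(html, "header", "top_menu", "left_menu", "right_menu")
would give you the page with the four template divs gone.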







 Date: Sat, 10 Oct 2009 18:21:47 +0200
 From: a...@getopt.org
 To: nutch-user@lucene.apache.org
 Subject: Re: How to ignore search results that don't have related keywords in 
 main body?
 
 [...]
 
  
_
New! Faster Messenger access on the new MSN homepage
http://go.microsoft.com/?linkid=9677406

OutOfMemoryError: Java heap space

2009-10-10 Thread Fadzi Ushewokunze
Hi all,

I am getting the JVM error below during a recrawl, specifically during the
execution of

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

I am running on a single machine:
Linux 2.6.24-23-xen x86_64
4 GB RAM
java-6-sun
nutch-1.0
JAVA_HEAP_MAX=-Xmx1000m

Any suggestions? I am about to raise the heap max to -Xmx2000m.

I haven't encountered this before when running with the above specs, so I
am not sure what could have changed. Any suggestions will be greatly
appreciated.

Thanks.


 
 
 2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] - 
 reduce  reduce
 2009-10-11 14:30:15,801 INFO [org.apache.hadoop.mapred.LocalJobRunner] - 
 reduce  reduce
 2009-10-11 14:31:19,197 INFO [org.apache.hadoop.mapred.TaskRunner] - 
 Communication exception: java.lang.OutOfMemoryError: Java heap space
   at 
 java.util.ResourceBundle$Control.getCandidateLocales(ResourceBundle.java:2220)
   at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1229)
   at java.util.ResourceBundle.getBundle(ResourceBundle.java:715)
   at 
 org.apache.hadoop.mapred.Counters$Group.getResourceBundle(Counters.java:218)
   at org.apache.hadoop.mapred.Counters$Group.<init>(Counters.java:202)
   at org.apache.hadoop.mapred.Counters.getGroup(Counters.java:410)
   at org.apache.hadoop.mapred.Counters.incrAllCounters(Counters.java:491)
   at org.apache.hadoop.mapred.Counters.sum(Counters.java:506)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:222)
   at org.apache.hadoop.mapred.Task$1.run(Task.java:418)
   at java.lang.Thread.run(Thread.java:619)
 
 2009-10-11 14:31:22,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - 
 reduce  reduce
 2009-10-11 14:31:25,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - 
 reduce  reduce
 2009-10-11 14:31:40,002 WARN [org.apache.hadoop.mapred.LocalJobRunner] - 
 job_local_0001
 java.lang.OutOfMemoryError: Java heap space
   at 
 java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
   at 
 java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
   at 
 java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
   at 
 java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
   at 
 org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
   at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
   at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
   at 
 org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
   at 
 org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:101)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
   at 
 org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
   at 
 org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
   at 
 org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
   at 
 org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
   at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:377)
   at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:113)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)