PRUNE : need some help on pruning syntax.

2009-11-09 Thread Annappa

Hi,

I am using Nutch-0.9 to crawl a web application which has a
header part, a menu part, left navigation and a main content area.

When I search for a particular keyword and it appears in the main
menu, the results repeat as many times as there are pages, because the menu
is included in every page. So I need to exclude the content of a
particular div from the search,

e.g. <div class=menu> ... </div>.


How do I remove the content of a div from the search?

-- 
View this message in context: 
http://old.nabble.com/PRUNE-%3A-need-some-help-on-pruning-syntax.-tp26268447p26268447.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Simple vertical search engine question

2009-11-09 Thread Carlos Vera
I have looked into a few vertical search engines like indeed.com and
simplyhired.com.  Does anyone know how vertical search engines like indeed.com
and simplyhired.com display relevant Google ads for the searched keywords on
their sites?


Re: PRUNE : need some help on pruning syntax.

2009-11-09 Thread Fadzi Ushewokunze
One option is to extend the HTML parser so that it looks for these
elements and ignores them.

You might also want to look at this forum posting:

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg13969.html
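
A rough sketch of the first option, using only the JDK's DOM classes and not
the Nutch parser plugin API. The class and method names are made up for
illustration, and the input is assumed to be well-formed XHTML; wiring this
into an actual Nutch parse plugin is left out. It walks the DOM and collects
text, skipping any div whose class attribute contains the given class name:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: extracts page text while ignoring menu divs.
class MenuSkippingExtractor {

    // Appends the text below 'node' to 'out', skipping <div> subtrees marked with skipClass.
    static void collectText(Node node, String skipClass, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName()) && hasClass(el, skipClass)) {
                return; // ignore the whole subtree, e.g. the menu
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, skipClass, out);
        }
    }

    // True if the element's class attribute lists 'cls' as one of its space-separated names.
    static boolean hasClass(Element el, String cls) {
        for (String name : el.getAttribute("class").trim().split("\\s+")) {
            if (name.equals(cls)) {
                return true;
            }
        }
        return false;
    }

    // Parses well-formed XHTML and returns its text without the skipped divs.
    static String extract(String xhtml, String skipClass) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), skipClass, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div class=\"menu\"><a href=\"/a\">Products</a></div>"
                + "<div class=\"content\"><p>Main article text</p></div>"
                + "</body></html>";
        System.out.println(extract(page, "menu"));
    }
}
```

Only the text that survives this step would be handed to the indexer, so a
keyword that appears only in the menu would no longer match every page.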





RE: Simple vertical search engine question

2009-11-09 Thread Fuad Efendi
Premium Google publishers (20 million page views per month) may use more
features of AdSense, such as explicit keywords in a query (to Google).






Nutch near future - strategic directions

2009-11-09 Thread Andrzej Bialecki

Hi all,

ApacheCon is over and our 1.0 release has been out for some
time, so I think it's a good moment to discuss the next steps
in Nutch development.


Let me share with you the topics I identified and presented in the 
ApacheCon slides, and some topics that are worth discussing based on 
various conversations I had there, and the discussions we had on our 
mailing list:


1. Avoid duplication of effort
------------------------------
Currently we spend significant effort on implementing functionality that 
other projects are dedicated to. Instead of doing the same work, and 
sometimes poorly, we should concentrate on delegating and reusing:


* Use Tika for content parsing: this will require some effort and 
collaboration with the Tika project, to improve Tika's ability to handle 
more complex formats well (e.g. hierarchical compound documents such as 
archives, mailboxes, RSS), and to contribute any missing parsers (e.g. 
parse-swf).


* Use Solr for indexing & search: it is hard to justify the effort of 
developing and maintaining our own search server - Solr offers much more 
functionality, configurability, performance and ease of integration than 
our relatively primitive search server. Our integration with Solr needs 
to be improved so that it's easier to set up and operate.


* Use database-like storage abstraction: this may seem like a serious 
departure from the current architecture, but I don't mean that we should 
switch to an SQL DB ... what this means is that we should provide an 
option to use HBase, as well as the current plain MapFile-s (and perhaps 
other types of DBs, such as Berkeley DB or SQL, if it makes sense) as 
our storage. There is a very promising initial port of Nutch to HBase, 
which is currently closely integrated with HBase API (which is both good 
and bad) - it provides several improvements over our current storage, so 
I think it's worth using as the new default, but let's see if we can 
make it more abstract.


* Plugins: the initial OSGI port looks good, but I'm not sure yet at 
this moment if the benefits of OSGI outweigh the cost of this change ...


* Shard management: this is currently an Achilles' heel of Nutch, where 
users are left on their own ... If we switch to using HBase then at 
least on the crawling side the shard management will become much easier. 
This still leaves the problem of deploying new content to search 
server(s). The candidate framework for this side of the shard management 
is Katta + patches provided by Ted Dunning (see ???). If we switch to 
using Solr we would have to also use the Katta / Solr integration, and 
perhaps Solr/Hadoop integration as well. This is a complex mix of 
half-ready components that needs to be well thought-through ...


* Crawler Commons: during our Crawler MeetUp all representatives agreed 
that we should collect a few components that are nearly the same across 
all projects and collaborate on their development, and use them as an 
external dependency. The candidate components are:


 - robots.txt parsing
 - URL filtering and normalization
 - page signature (fingerprint) implementations
 - page template detection & removal (aka main content extraction)
 - possibly others, like URL redirection tracking, PageRank 
calculation, crawler trap detection etc.
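
To make one of the candidate components concrete, here is a minimal sketch of
a page signature: an MD5 digest over whitespace- and case-normalized page
text, so that trivially different copies of a page get the same fingerprint.
The class name and the normalization rules are assumptions made for this
example; it does not reproduce the implementation of any of the crawlers
involved.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical example class: exact-match fingerprint of normalized page text.
class PageSignature {

    // Lower-cases the text and collapses every run of whitespace to one space.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    // Returns the MD5 digest of the normalized text as a 32-character hex string.
    static String signature(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Both calls print the same signature because the texts normalize identically.
        System.out.println(signature("Hello   World"));
        System.out.println(signature("  hello world\n"));
    }
}
```

A digest like this only detects copies that are identical after
normalization; near-duplicate detection would need a fuzzier fingerprint.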


2. Make Nutch easier to use
---------------------------
This, as you may remember from our earlier discussions, raises the question: 
who is the target audience of Nutch?


In my opinion, the main users of Nutch are vertical search engines, and 
this is the audience that we should cater to. There are many reasons for 
this:


- Nutch is too complex and too heavy for those that need to crawl up to 
a few thousand pages. Now that the Droids project exists it's probably 
not worth the effort to attempt a complete re-design of Nutch so that it 
fits the need of this group - Nutch is based on map-reduce, and it's not 
likely we would want to change that, so this means there will always be 
a significant overhead for small jobs. I'm not saying we should not make 
Nutch easier to use, but for small crawls Nutch is overkill. Also, in 
many cases these users don't realize that they don't do any frontier 
discovery and expansion, and what they really need is Solr.


- At the other end of the spectrum, there are very few companies 
that want to do wide, web-scale crawling - this is costly, and 
requires a solid business plan and serious funding. These users are 
prepared anyway to spend significant effort on customizations and 
problem-solving, or they may want to use only some parts of Nutch. Often 
they are also not too eager to contribute back to the project - either 
because of their proprietary nature or because their customizations are 
not useful for a general audience.


The remaining group is interested in medium-size, high quality crawling 
(focused, with good spam & junk controls), which is either enterprise 
search or vertical search. 

Re: changing/addding field in existing index

2009-11-09 Thread Andrzej Bialecki

fa...@butterflycluster.net wrote:

> Hi all,
>
> I have an existing index - we have a custom field that needs to be added
> or changed in every currently indexed document.
>
> What's the best way to go about this without recreating the index?


There are ways to do it directly on the index, but this is complicated 
and involves hacking the low-level Lucene format. Alternatively, you 
could build a parallel index with just these fields, but synchronized 
internal docId-s, open both indexes with ParallelReader, and then create 
a new index using IndexWriter.addIndexes().
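
The idea behind the parallel-index approach can be modelled in plain Java
without touching Lucene at all. This is only a sketch under stated
assumptions: each "index" is a list of field maps, the list position stands in
for the internal docId, and the class and method names are invented for the
example. It does not use ParallelReader or IndexWriter. In this model a field
present in both lists takes its value from the second list; how the real
ParallelReader resolves duplicate field names is not covered here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of merging two parallel indexes whose docIds line up.
class ParallelMergeSketch {

    // Builds one document from alternating field names and values.
    static Map<String, String> doc(String... keyValues) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            fields.put(keyValues[i], keyValues[i + 1]);
        }
        return fields;
    }

    // Combines the fields of base.get(n) and extra.get(n) into one new document per position.
    static List<Map<String, String>> merge(List<Map<String, String>> base,
                                           List<Map<String, String>> extra) {
        if (base.size() != extra.size()) {
            throw new IllegalArgumentException("Parallel indexes must hold the same number of documents");
        }
        List<Map<String, String>> merged = new ArrayList<>();
        for (int docId = 0; docId < base.size(); docId++) {
            Map<String, String> fields = new LinkedHashMap<>(base.get(docId));
            fields.putAll(extra.get(docId)); // add new fields, overwrite changed ones
            merged.add(fields);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, String>> base = new ArrayList<>();
        base.add(doc("url", "http://a/", "title", "A"));
        base.add(doc("url", "http://b/", "title", "B"));

        List<Map<String, String>> extra = new ArrayList<>();
        extra.add(doc("category", "news"));
        extra.add(doc("category", "blog"));

        System.out.println(merge(base, extra));
    }
}
```

The sketch also shows why the approach is fragile: if the two lists ever get
out of step by a single document, every later document receives the wrong
fields.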


I suggest recreating the index.

--
Best regards,
Andrzej Bialecki 



Re: changing/addding field in existing index

2009-11-09 Thread Fadzi Ushewokunze
That seems to work, thanks. It was a bit more fiddly than I
expected, but I got the index sorted out.

I found an issue with sorting, though: most fields cannot be sorted by,
and trying to do so throws a

java.lang.RuntimeException: Unknown sort value type!
    at org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:159)
    at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:98)
    at org.apache.nutch.searcher.LuceneSearchBean.search(LuceneSearchBean.java:84)
    at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:231)



