Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Marco Scalone
Nutch also has an adaptive strategy:

> This class implements an adaptive re-fetch algorithm. This works as
> follows:
>
> - for pages that have changed since the last fetchTime, decrease their
>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> - for pages that haven't changed since the last fetchTime, increase
>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>   If SYNC_DELTA property is true, then:
>    - calculate a delta = fetchTime - modifiedTime
>    - try to synchronize with the time of change, by shifting the next
>      fetchTime by a fraction of the difference between the last
>      modification time and the last fetch time. I.e. the next fetch time
>      will be set to fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
>    - if the adjusted fetch interval is bigger than the delta, then
>      fetchInterval = delta.
> - the minimum value of fetchInterval may not be smaller than
>   MIN_INTERVAL (default is 1 minute).
> - the maximum value of fetchInterval may not be bigger than
>   MAX_INTERVAL (default is 365 days).
>
> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> the algorithm, so that the fetch interval either increases or decreases
> infinitely, with little relevance to the page changes. Please use the
> main(String[]) method to test the values before applying them in a
> production system.

From:
https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
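
To make the quoted description easier to follow, here is a rough Python sketch of the rules as I read them from the text above. It is not the Nutch source: the function name next_schedule, the use of seconds, the order in which the rules are applied, and the 0.3 value for sync_delta_rate are my own assumptions for illustration.

```python
# Sketch of the adaptive re-fetch rules as described in the quoted Javadoc.
# Assumptions (not from Nutch itself): the function name, times in seconds,
# the order of the steps, and the sync_delta_rate value.

MIN_INTERVAL = 60.0                 # "default is 1 minute"
MAX_INTERVAL = 365 * 24 * 3600.0    # "default is 365 days"


def next_schedule(fetch_time, modified_time, interval, changed,
                  inc_factor=0.2, dec_factor=0.2,
                  sync_delta=False, sync_delta_rate=0.3):
    """Return (new_interval, next_fetch_time), both in seconds."""
    # Changed pages are fetched more often, unchanged pages less often.
    if changed:
        interval *= (1.0 - dec_factor)
    else:
        interval *= (1.0 + inc_factor)

    shift = 0.0
    if sync_delta and modified_time is not None:
        # Time between the last modification and the last fetch.
        delta = fetch_time - modified_time
        # "if the adjusted fetch interval is bigger than the delta,
        # then fetchInterval = delta"
        if interval > delta:
            interval = delta
        # Pull the next fetch towards the time of change.
        shift = delta * sync_delta_rate

    # Keep the interval within [MIN_INTERVAL, MAX_INTERVAL].
    interval = max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
    return interval, fetch_time + interval - shift
```

For example, calling next_schedule(10000.0, None, 1000.0, changed=True) shortens a 1000 second interval by the 0.2 factor, and the same call with changed=False lengthens it by the same factor. Any real tuning should still be checked against the class itself, as the NOTE in the Javadoc suggests.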


2016-08-03 14:45 GMT-03:00 Walter Underwood :

> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> in Ultraseek.
>
> I think we were the only people who built an adaptive crawler for
> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> to Mike Lynch. He looked at me like I had three heads and didn’t even
> answer me.
>
> Ultraseek also has great support for sites that need login. If you use
> that, you’ll need to find a way to do that with another crawler.
>
> wunder
> Walter Underwood
> Former Ultraseek Principal Engineer
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>  wrote:
> >
> > CLASSIFICATION: UNCLASSIFIED
> >
> > We are currently using ultraseek and looking to deprecate it in favor
> > of solr/nutch.
> > Ultraseek runs all the time and auto detects when pages have changed
> > and automatically reindexes them.
> > Is this possible with SOLR/nutch?
> >
> > Thanks,
> > Kris
> >
> > ~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor - Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn@mail.mil
> > ~~
> >
> >
> >
> > CLASSIFICATION: UNCLASSIFIED
>
>


Marco Scalone is out of the office.

2012-09-07 Thread Marco Scalone

I will be out of the office from Fri 07/09/2012 and will not return until
Thu 20/09/2012.

I will reply to your message when I return.



Re: Problems with strange data appended to body field [SOLVED]

2012-07-09 Thread Marco Scalone
Problem solved. The problem was on the Drupal side: the Drupal core 
search interfered with the apachesolr module and added extra information.


After deleting the core search tables, setting the number of items to 
index per cron run to 0, and reindexing the site, the problem was solved.


Thanks.
Marco

On 09/07/12 11:59, Marco Scalone wrote:

Hello,
I'm new to this list. I have been using Solr for many months and am 
trying to roll it out more widely across the organization. While doing 
some tests, I noticed a problem in the result snippets it generates.


When I run a search, the result snippet shows a stemmed version of the 
title repeated many times. I ran the search from the Solr admin query 
page and saw that the body was full of these strings. This example 
shows what I mean:


---


<str name="title">Contralor de la Edificación</str>
<str name="body">
4110
Espacios Públicos Habitat y Edificaciones
SERVICIO
Otorgamiento de permisos de construcción, de locales industriales y 
comerciales. Recepción de denuncias de obras sin permiso. Coordinación 
con Bomberos sobre temas de seguridad edilicia.Visite nuestra página: 
www.montevideo.gub.uy/ciudadania/contralor-de-la-edificacion

Interna

(contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif, contralor edificacion, contralor edif, contralor 
edificacion, contralor edif, contralor edif, contralor edif, contralor 
edif, contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edificacion, contralor edif, contralor edificacion, 
contralor edif, contralor edificacion, contralor edificacion, 
contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif, contralor edif, contralor edif, contralor edif, 
contralor edif)

</str>
---



As you can see, the last part of the body contains the stemmed version 
of the title many times. The same happens with the spell field (because 
it is a copy). The problem is that this is visible in the results and 
gets highlighted.


I need help because I don't know why this is happening. It seems to be 
something automatic in Solr, maybe some configuration.

I'm using Drupal 6 with standard modules. I'm attaching the config files.

Thanks a lot
Marco



--
Ing. Marco Scalone
División Tecnología de la Información
Intendencia de Montevideo
Tel.: 1950 int. 4426