Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
Ah, the difference between open source and a product. With Ultraseek, we chose 
a solid, stable algorithm that worked well for 3000 customers. In open source, 
it is a research project for every single customer.

I love open source. I’ve brought Solr into Netflix and Chegg. But there is a 
clear difference between developer-driven and customer-driven software.

I first learned about bounded binary exponential backoff in the 
Digital/Intel/Xerox (“DIX”) Ethernet spec in 1980. It is a solid algorithm for 
events with a Poisson distribution, like packet arrival times or web page next 
change times. There is no need for configuring algorithms here, especially 
configurations that lead to an unstable estimate. The only meaningful choices 
are the minimum revisit time, the maximum revisit time, and the number of bins. 
Those will be different for CNN (a launch customer for Ultraseek) or Sun 
documentation (another launch customer). CNN news articles changed minute by 
minute, while new Sun documentation appeared weekly or monthly.
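
For intuition, the scheme fits in a few lines. This is a minimal sketch, not Ultraseek's actual code, and the one-hour/one-week bounds are invented for illustration:

```python
def next_interval(interval, changed, min_interval=3600.0, max_interval=7 * 86400.0):
    """Bounded (binary) exponential backoff for revisit scheduling.

    Reset the interval estimate when the page changed; double it when it
    did not. Both directions are clamped to [min_interval, max_interval]
    (seconds). The one-hour/one-week bounds are illustrative only.
    """
    if changed:
        interval = min_interval        # reset the estimate on change
    else:
        interval = interval * 2.0      # binary exponential backoff
    return min(max(interval, min_interval), max_interval)

# An unchanged page drifts toward max_interval; one change snaps it back.
interval = 3600.0
for _ in range(10):
    interval = next_interval(interval, changed=False)
print(interval)  # clamped at max_interval
```

The only tunables are exactly the ones named above: the minimum, the maximum, and (in a binned implementation) the number of bins between them.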

Sorry for the rant, but “you can fix the algorithm yourself” almost always 
means a bad installation, an unhappy admin, and another black eye for open 
source.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 4:07 PM, Markus Jelsma  wrote:
> 
> Depending on your settings, Nutch does this as well. It is even possible to 
> set up different incremental/decremental values per MIME type. 
> The algorithms are pluggable and overridable at any point of interest. You 
> can go all the way.  
> 
> -Original message-
>> From:Walter Underwood 
>> Sent: Wednesday 3rd August 2016 20:03
>> To: solr-user@lucene.apache.org
>> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
>> 
>> That’s good news.
>> 
>> It should reset the interval estimate on page change instead of slowly 
>> shortening it.
>> 
>> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
>> page had not changed.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
>>> 
>>> Nutch also has adaptive strategy:
>>> 
>>> This class implements an adaptive re-fetch algorithm. This works as
>>>> follows:
>>>> 
>>>>  - for pages that has changed since the last fetchTime, decrease their
>>>>  fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>>>  - for pages that haven't changed since the last fetchTime, increase
>>>>  their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>>>  If SYNC_DELTA property is true, then:
>>>> - calculate a delta = fetchTime - modifiedTime
>>>> - try to synchronize with the time of change, by shifting the next
>>>> fetchTime by a fraction of the difference between the last modification
>>>> time and the last fetch time. I.e. the next fetch time will be set to 
>>>> fetchTime
>>>> + fetchInterval - delta * SYNC_DELTA_RATE
>>>> - if the adjusted fetch interval is bigger than the delta, then 
>>>> fetchInterval
>>>> = delta.
>>>>  - the minimum value of fetchInterval may not be smaller than
>>>>  MIN_INTERVAL (default is 1 minute).
>>>>  - the maximum value of fetchInterval may not be bigger than
>>>>  MAX_INTERVAL (default is 365 days).
>>>> 
>>>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>>>> the algorithm, so that the fetch interval either increases or decreases
>>>> infinitely, with little relevance to the page changes. Please use
>>>> main(String[])
>>>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>>>> method to test the values before applying them in a production system.
>>>> 
>>> 
>>> From:
>>> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
>>> 
>>> 
>>> 2016-08-03 14:45 GMT-03:00 Walter Underwood :
>>> 
>>>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>>>> in Ultraseek.
>>>> 
>>>> I think we were the only people who built an adaptive crawler for
>>>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>>>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>>>> answer me.

RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Markus Jelsma
No, just run it continuously, always! By default everything is refetched (if 
possible) every 30 days. Just read the descriptions for the adaptive schedule 
and its javadoc. It is simple to use, but sometimes hard to predict its 
outcome, just because you never know what changes, or at what time.

You will be fine with the defaults if you have a small site. Just set the 
interval to a few days, or more if your site is somewhat larger.
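
For reference, the relevant nutch-site.xml overrides look roughly like this. Property names are as found in Nutch 1.x nutch-default.xml; the values shown are illustrative, so check your version's defaults before relying on them:

```xml
<configuration>
  <!-- Use the adaptive schedule instead of the default fixed interval. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <!-- Starting interval for new pages: 3 days, in seconds. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>259200</value>
  </property>
  <!-- Bounds the adaptive schedule may not cross. -->
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <value>60</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <value>31536000</value>
  </property>
  <!-- Keep these at or below 0.4, or the interval estimate destabilizes. -->
  <property>
    <name>db.fetch.schedule.adaptive.inc_rate</name>
    <value>0.2</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.dec_rate</name>
    <value>0.2</value>
  </property>
</configuration>
```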

M.

 
 
-Original message-
> From:Musshorn, Kris T CTR USARMY RDECOM ARL (US) 
> 
> Sent: Wednesday 3rd August 2016 20:08
> To: solr-user@lucene.apache.org
> Subject: RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> CLASSIFICATION: UNCLASSIFIED
> 
> Shall I assume that, even though nutch has adaptive capability, I would still 
> have to figure out how to trigger it to go look for content that needs update?
> 
> Thanks,
> Kris
> 
> ~~
> Kris T. Musshorn
> FileMaker Developer - Contractor – Catapult Technology Inc.  
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> kris.t.musshorn@mail.mil
> ~~
> 
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Wednesday, August 03, 2016 2:03 PM
> To: solr-user@lucene.apache.org
> Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> All active links contained in this email were disabled.  Please verify the 
> identity of the sender, and confirm the authenticity of all links contained 
> within the message prior to copying and pasting the address to a Web browser. 
>  
> 
> 
> 
> 
> 
> 
> That’s good news.
> 
> It should reset the interval estimate on page change instead of slowly 
> shortening it.
> 
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
> page had not changed.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> Caution-http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
> > 
> > Nutch also has adaptive strategy:
> > 
> > This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >> 
> >>   - for pages that has changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >>   - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   If SYNC_DELTA property is true, then:
> >>  - calculate a delta = fetchTime - modifiedTime
> >>  - try to synchronize with the time of change, by shifting the next
> >>  fetchTime by a fraction of the difference between the last 
> >> modification
> >>  time and the last fetch time. I.e. the next fetch time will be set to 
> >> fetchTime
> >>  + fetchInterval - delta * SYNC_DELTA_RATE
> >>  - if the adjusted fetch interval is bigger than the delta, then 
> >> fetchInterval
> >>  = delta.
> >>   - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >>   - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >> 
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may 
> >> destabilize the algorithm, so that the fetch interval either 
> >> increases or decreases infinitely, with little relevance to the page 
> >> changes. Please use
> >> main(String[])
> >> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutc
> >> h/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> >> method to test the values before applying them in a production system.
> >> 
> > 
> > From:
> > Caution-https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/
> > crawl/AdaptiveFetchSchedule.html
> > 
> > 
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood :
> > 
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
> >> crawler in Ultraseek.
> >> 
> >> I think we were the only people who built an adaptive crawler for 
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the 
> >> argument to Mike Lynch. He looked at me like I had three heads and 
> >> didn’t even answer me.
> >> 
> >> Ultraseek also has great support for sites that need login. If you 
> >> use that, you’ll need to find a way to do that with another crawler.

RE: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Markus Jelsma
Depending on your settings, Nutch does this as well. It is even possible to set 
up different incremental/decremental values per MIME type.
The algorithms are pluggable and overridable at any point of interest. You can 
go all the way.  
 
-Original message-
> From:Walter Underwood 
> Sent: Wednesday 3rd August 2016 20:03
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> That’s good news.
> 
> It should reset the interval estimate on page change instead of slowly 
> shortening it.
> 
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
> page had not changed.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
> > 
> > Nutch also has adaptive strategy:
> > 
> > This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >> 
> >>   - for pages that has changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >>   - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   If SYNC_DELTA property is true, then:
> >>  - calculate a delta = fetchTime - modifiedTime
> >>  - try to synchronize with the time of change, by shifting the next
> >>  fetchTime by a fraction of the difference between the last 
> >> modification
> >>  time and the last fetch time. I.e. the next fetch time will be set to 
> >> fetchTime
> >>  + fetchInterval - delta * SYNC_DELTA_RATE
> >>  - if the adjusted fetch interval is bigger than the delta, then 
> >> fetchInterval
> >>  = delta.
> >>   - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >>   - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >> 
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> >> the algorithm, so that the fetch interval either increases or decreases
> >> infinitely, with little relevance to the page changes. Please use
> >> main(String[])
> >> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> >> method to test the values before applying them in a production system.
> >> 
> > 
> > From:
> > https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> > 
> > 
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood :
> > 
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> >> in Ultraseek.
> >> 
> >> I think we were the only people who built an adaptive crawler for
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> >> to Mike Lynch. He looked at me like I had three heads and didn’t even
> >> answer me.
> >> 
> >> Ultraseek also has great support for sites that need login. If you use
> >> that, you’ll need to find a way to do that with another crawler.
> >> 
> >> wunder
> >> Walter Underwood
> >> Former Ultraseek Principal Engineer
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >> 
> >> 
> >>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> >>  wrote:
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >>> 
> >>> We are currently using ultraseek and looking to deprecate it in favor of
> >> solr/nutch.
> >>> Ultraseek runs all the time and auto detects when pages have changed and
> >> automatically reindexes them.
> >>> Is this possible with SOLR/nutch?
> >>> 
> >>> Thanks,
> >>> Kris
> >>> 
> >>> ~~
> >>> Kris T. Musshorn
> >>> FileMaker Developer - Contractor - Catapult Technology Inc.
> >>> US Army Research Lab
> >>> Aberdeen Proving Ground
> >>> Application Management & Development Branch
> >>> 410-278-7251
> >>> kris.t.musshorn@mail.mil
> >>> ~~
> >>> 
> >>> 
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >> 
> >> 
> 
> 


RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Musshorn, Kris T CTR USARMY RDECOM ARL (US)
CLASSIFICATION: UNCLASSIFIED

Shall I assume that, even though nutch has adaptive capability, I would still 
have to figure out how to trigger it to go look for content that needs update?

Thanks,
Kris

~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.  
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn@mail.mil
~~


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, August 03, 2016 2:03 PM
To: solr-user@lucene.apache.org
Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

All active links contained in this email were disabled.  Please verify the 
identity of the sender, and confirm the authenticity of all links contained 
within the message prior to copying and pasting the address to a Web browser.  






That’s good news.

It should reset the interval estimate on page change instead of slowly 
shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page 
had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
Caution-http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
> 
> Nutch also has adaptive strategy:
> 
> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that has changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>  - calculate a delta = fetchTime - modifiedTime
>>  - try to synchronize with the time of change, by shifting the next
>>  fetchTime by a fraction of the difference between the last modification
>>  time and the last fetch time. I.e. the next fetch time will be set to 
>> fetchTime
>>  + fetchInterval - delta * SYNC_DELTA_RATE
>>  - if the adjusted fetch interval is bigger than the delta, then 
>> fetchInterval
>>  = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may 
>> destabilize the algorithm, so that the fetch interval either 
>> increases or decreases infinitely, with little relevance to the page 
>> changes. Please use
>> main(String[])
>> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutc
>> h/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> method to test the values before applying them in a production system.
>> 
> 
> From:
> Caution-https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/
> crawl/AdaptiveFetchSchedule.html
> 
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood :
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
>> crawler in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for 
>> enterprise use. I tried to get Ultraseek open-sourced. I made the 
>> argument to Mike Lynch. He looked at me like I had three heads and 
>> didn’t even answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you 
>> use that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wun...@wunderwood.org
>> Caution-http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL 
>>> (US)
>>  wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using ultraseek and looking to deprecate it in 
>>> favor of
>> solr/nutch.
>>> Ultraseek runs all the time and auto detects when pages have changed 
>>> and
>> automatically reindexes them.
>>> Is this possible with SOLR/nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn@mail.mil
>>> ~~
>>> 
>>> 
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>> 
>> 


CLASSIFICATION: UNCLASSIFIED


Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
That’s good news.

It should reset the interval estimate on page change instead of slowly 
shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page 
had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone  wrote:
> 
> Nutch also has adaptive strategy:
> 
> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that has changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>  - calculate a delta = fetchTime - modifiedTime
>>  - try to synchronize with the time of change, by shifting the next
>>  fetchTime by a fraction of the difference between the last modification
>>  time and the last fetch time. I.e. the next fetch time will be set to 
>> fetchTime
>>  + fetchInterval - delta * SYNC_DELTA_RATE
>>  - if the adjusted fetch interval is bigger than the delta, then 
>> fetchInterval
>>  = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>> the algorithm, so that the fetch interval either increases or decreases
>> infinitely, with little relevance to the page changes. Please use
>> main(String[])
>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> method to test the values before applying them in a production system.
>> 
> 
> From:
> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> 
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood :
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>> in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for
>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>> answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you use
>> that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>>  wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using ultraseek and looking to deprecate it in favor of
>> solr/nutch.
>>> Ultraseek runs all the time and auto detects when pages have changed and
>> automatically reindexes them.
>>> Is this possible with SOLR/nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn@mail.mil
>>> ~~
>>> 
>>> 
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>> 
>> 



Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Marco Scalone
Nutch also has adaptive strategy:

This class implements an adaptive re-fetch algorithm. This works as
> follows:
>
>- for pages that has changed since the last fetchTime, decrease their
>fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>- for pages that haven't changed since the last fetchTime, increase
>their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>If SYNC_DELTA property is true, then:
>   - calculate a delta = fetchTime - modifiedTime
>   - try to synchronize with the time of change, by shifting the next
>   fetchTime by a fraction of the difference between the last modification
>   time and the last fetch time. I.e. the next fetch time will be set to 
> fetchTime
>   + fetchInterval - delta * SYNC_DELTA_RATE
>   - if the adjusted fetch interval is bigger than the delta, then 
> fetchInterval
>   = delta.
>- the minimum value of fetchInterval may not be smaller than
>MIN_INTERVAL (default is 1 minute).
>- the maximum value of fetchInterval may not be bigger than
>MAX_INTERVAL (default is 365 days).
>
> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> the algorithm, so that the fetch interval either increases or decreases
> infinitely, with little relevance to the page changes. Please use
> main(String[])
> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> method to test the values before applying them in a production system.
>

From:
https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
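
A rough Python transcription of the update rule quoted above, for intuition only (this is not the actual Nutch Java code; the inc/dec factors and interval bounds are the documented defaults, while sync_delta_rate's default here is an assumption):

```python
def adaptive_next_fetch(fetch_time, modified_time, interval, changed,
                        inc_factor=0.2, dec_factor=0.2,
                        sync_delta=False, sync_delta_rate=0.3,
                        min_interval=60.0, max_interval=365 * 86400.0):
    """Sketch of the adaptive re-fetch rule described in the javadoc.

    All times and intervals are in seconds. Returns the next fetch time
    and the updated fetch interval.
    """
    if changed:
        interval *= 1.0 - dec_factor   # revisit changed pages sooner
    else:
        interval *= 1.0 + inc_factor   # back off on unchanged pages
    next_fetch = fetch_time + interval
    if sync_delta:
        # Shift the next fetch toward the observed modification time.
        delta = fetch_time - modified_time
        next_fetch = fetch_time + interval - delta * sync_delta_rate
        if interval > delta:
            interval = delta
    # Clamp to [MIN_INTERVAL, MAX_INTERVAL].
    interval = min(max(interval, min_interval), max_interval)
    return next_fetch, interval
```

Each unchanged fetch compounds the interval by (1 + INC_FACTOR), which is why the javadoc warns that factors above 0.4 can run away toward one bound or the other regardless of actual page changes.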


2016-08-03 14:45 GMT-03:00 Walter Underwood :

> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> in Ultraseek.
>
> I think we were the only people who built an adaptive crawler for
> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> to Mike Lynch. He looked at me like I had three heads and didn’t even
> answer me.
>
> Ultraseek also has great support for sites that need login. If you use
> that, you’ll need to find a way to do that with another crawler.
>
> wunder
> Walter Underwood
> Former Ultraseek Principal Engineer
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>  wrote:
> >
> > CLASSIFICATION: UNCLASSIFIED
> >
> > We are currently using ultraseek and looking to deprecate it in favor of
> solr/nutch.
> > Ultraseek runs all the time and auto detects when pages have changed and
> automatically reindexes them.
> > Is this possible with SOLR/nutch?
> >
> > Thanks,
> > Kris
> >
> > ~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor - Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn@mail.mil
> > ~~
> >
> >
> >
> > CLASSIFICATION: UNCLASSIFIED
>
>


Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler in 
Ultraseek.

I think we were the only people who built an adaptive crawler for enterprise 
use. I tried to get Ultraseek open-sourced. I made the argument to Mike Lynch. 
He looked at me like I had three heads and didn’t even answer me.

Ultraseek also has great support for sites that need login. If you use that, 
you’ll need to find a way to do that with another crawler.

wunder
Walter Underwood
Former Ultraseek Principal Engineer
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) 
>  wrote:
> 
> CLASSIFICATION: UNCLASSIFIED
> 
> We are currently using ultraseek and looking to deprecate it in favor of 
> solr/nutch.
> Ultraseek runs all the time and auto detects when pages have changed and 
> automatically reindexes them.
> Is this possible with SOLR/nutch?
> 
> Thanks,
> Kris
> 
> ~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.  
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> kris.t.musshorn@mail.mil
> ~~
> 
> 
> 
> CLASSIFICATION: UNCLASSIFIED



SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Musshorn, Kris T CTR USARMY RDECOM ARL (US)
CLASSIFICATION: UNCLASSIFIED

We are currently using ultraseek and looking to deprecate it in favor of 
solr/nutch.
Ultraseek runs all the time and auto-detects when pages have changed and 
automatically reindexes them.
Is this possible with SOLR/nutch?

Thanks,
Kris

~~
Kris T. Musshorn
FileMaker Developer - Contractor - Catapult Technology Inc.  
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn@mail.mil
~~



CLASSIFICATION: UNCLASSIFIED

Re: Solr & Nutch

2014-01-28 Thread Koji Sekiguchi

1. Nutch follows the links within HTML web pages to crawl the full graph of a 
web of pages.


In addition, I think Nutch has a PageRank-like scoring function, as opposed to
Lucene/Solr, which are based on vector space model scoring.

koji
--
http://soleami.com/blog/mahout-and-machine-learning-training-course-is-here.html


Re: Solr & Nutch

2014-01-28 Thread rashmi maheshwari
Thanks Markus and Alexei.


On Wed, Jan 29, 2014 at 12:08 AM, Alexei Martchenko <
ale...@martchenko.com.br> wrote:

> Well, not even Google parses those. I'm not sure about Nutch, but in some
> crawlers (jsoup, I believe) there's an option to try to get full URLs from
> plain text, so you can capture some URLs in the form of someClickFunction('
> http://www.someurl.com/whatever') or even if they are in the middle of some
> paragraph. Sometimes it works beautifully, sometimes it misleads you into
> parsing URLs shortened with an ellipsis in the middle.
>
>
>
> alexei martchenko
> Facebook <http://www.facebook.com/alexeiramone> |
> Linkedin<http://br.linkedin.com/in/alexeimartchenko>|
> Steam <http://steamcommunity.com/id/alexeiramone/> |
> 4sq<https://pt.foursquare.com/alexeiramone>| Skype: alexeiramone |
> Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |
>
>
> 2014-01-28 rashmi maheshwari 
>
> > Thanks All for quick response.
> >
> > Today I crawled a webpage using nutch. This page has many links. But all
> > anchor tags have "href=#" and JavaScript is written on the onClick event of
> > each anchor tag to open a new page.
> >
> > So the crawler didn't crawl any of those links, which were opened via the
> > onClick event and have the "#" href value.
> >
> > How are these links crawled using nutch?
> >
> >
> >
> >
> > On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko <
> > ale...@martchenko.com.br> wrote:
> >
> > > 1) Plus, those files are binaries sometimes with metadata, specific
> > > crawlers need to understand them. html is a plain text
> > >
> > > 2) Yes, different data schemas. Sometimes I replicate the same core and
> > > make some A-B tests with different weights, filters, etc., and some people
> > > like to create CoreA and CoreB with the same schema and hammer CoreA with
> > > updates, commits, and optimizes; they make it available for searches while
> > > hammering CoreB. Then swap again. This produces faster searches.
> > >
> > >
> > > alexei martchenko
> > > Facebook <http://www.facebook.com/alexeiramone> |
> > > Linkedin<http://br.linkedin.com/in/alexeimartchenko>|
> > > Steam <http://steamcommunity.com/id/alexeiramone/> |
> > > 4sq<https://pt.foursquare.com/alexeiramone>| Skype: alexeiramone |
> > > Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |
> > >
> > >
> > > 2014-01-28 Jack Krupansky 
> > >
> > > > 1. Nutch follows the links within HTML web pages to crawl the full
> > graph
> > > > of a web of pages.
> > > >
> > > > 2. Think of a core as an SQL table - each table/core has a different
> > type
> > > > of data.
> > > >
> > > > 3. SolrCloud is all about scaling and availability - multiple shards
> > for
> > > > larger collections and multiple replicas for both scaling of query
> > > response
> > > > and availability if nodes go down.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > -Original Message- From: rashmi maheshwari
> > > > Sent: Tuesday, January 28, 2014 11:36 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Solr & Nutch
> > > >
> > > >
> > > > Hi,
> > > >
> > > > Question1 --> When Solr could parse html, documents like doc, excel
> pdf
> > > > etc, why do we need nutch to parse html files? what is different?
> > > >
> > > > Questions 2: When do we use multiple cores in Solr? Any practical
> > > business
> > > > case when we need multiple cores?
> > > >
> > > > Question 3: When do we go for cloud? What is meaning of implementing
> > solr
> > > > cloud?
> > > >
> > > >
> > > > --
> > > > Rashmi
> > > > Be the change that you want to see in this world!
> > > > www.minnal.zor.org
> > > > disha.resolve.at
> > > > www.artofliving.org
> > > >
> > >
> >
> >
> >
> > --
> > Rashmi
> > Be the change that you want to see in this world!
> > www.minnal.zor.org
> > disha.resolve.at
> > www.artofliving.org
> >
>



-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org


Re: Solr & Nutch

2014-01-28 Thread Alexei Martchenko
Well, not even Google parses those. I'm not sure about Nutch, but in some
crawlers (jsoup, I believe) there's an option to try to get full URLs from
plain text, so you can capture some URLs in the form of someClickFunction('
http://www.someurl.com/whatever') or even if they are in the middle of some
paragraph. Sometimes it works beautifully, sometimes it misleads you into
parsing URLs shortened with an ellipsis in the middle.
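
The "full URLs from plain text" option described above can be approximated with a simple regex over the raw markup. A rough sketch (not how jsoup or Nutch actually implement it), which also shows why href="#" onClick links are otherwise invisible to a crawler:

```python
import re

# Naive pattern: grab absolute http(s) URLs wherever they appear in the
# raw markup, including inside onclick="..." handlers. Stops at quotes,
# whitespace, angle brackets, and closing parens.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+")

def urls_from_text(html):
    """Return every absolute URL found anywhere in the raw text."""
    return URL_RE.findall(html)

html = '''<a href="#" onclick="someClickFunction('http://www.someurl.com/whatever')">go</a>
see http://example.com/page for details'''
print(urls_from_text(html))
```

In practice this also picks up truncated or ellipsized URLs and mislabels them, exactly as described above.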



alexei martchenko
Facebook <http://www.facebook.com/alexeiramone> |
Linkedin<http://br.linkedin.com/in/alexeimartchenko>|
Steam <http://steamcommunity.com/id/alexeiramone/> |
4sq<https://pt.foursquare.com/alexeiramone>| Skype: alexeiramone |
Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |


2014-01-28 rashmi maheshwari 

> Thanks All for quick response.
>
> Today I crawled a webpage using nutch. This page has many links. But all
> anchor tags have "href=#" and JavaScript is written on the onClick event of
> each anchor tag to open a new page.
>
> So the crawler didn't crawl any of those links, which were opened via the
> onClick event and have the "#" href value.
>
> How are these links crawled using nutch?
>
>
>
>
> On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko <
> ale...@martchenko.com.br> wrote:
>
> > 1) Plus, those files are binaries sometimes with metadata, specific
> > crawlers need to understand them. html is a plain text
> >
> > 2) Yes, different data schemas. Sometimes I replicate the same core and
> > make some A-B tests with different weights, filters, etc., and some people
> > like to create CoreA and CoreB with the same schema and hammer CoreA with
> > updates, commits, and optimizes; they make it available for searches while
> > hammering CoreB. Then swap again. This produces faster searches.
> >
> >
> > alexei martchenko
> > Facebook <http://www.facebook.com/alexeiramone> |
> > Linkedin<http://br.linkedin.com/in/alexeimartchenko>|
> > Steam <http://steamcommunity.com/id/alexeiramone/> |
> > 4sq<https://pt.foursquare.com/alexeiramone>| Skype: alexeiramone |
> > Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |
> >
> >
> > 2014-01-28 Jack Krupansky 
> >
> > > 1. Nutch follows the links within HTML web pages to crawl the full
> graph
> > > of a web of pages.
> > >
> > > 2. Think of a core as an SQL table - each table/core has a different
> type
> > > of data.
> > >
> > > 3. SolrCloud is all about scaling and availability - multiple shards
> for
> > > larger collections and multiple replicas for both scaling of query
> > response
> > > and availability if nodes go down.
> > >
> > > -- Jack Krupansky
> > >
> > > -Original Message- From: rashmi maheshwari
> > > Sent: Tuesday, January 28, 2014 11:36 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Solr & Nutch
> > >
> > >
> > > Hi,
> > >
> > > Question 1: When Solr can parse HTML and documents like doc, excel, pdf,
> > > why do we need Nutch to parse HTML files? What is different?
> > >
> > > Question 2: When do we use multiple cores in Solr? Is there a practical
> > > business case where we need multiple cores?
> > >
> > > Question 3: When do we go for cloud? What does it mean to implement
> > > SolrCloud?
> > >
> > >
> > > --
> > > Rashmi
> > > Be the change that you want to see in this world!
> > > www.minnal.zor.org
> > > disha.resolve.at
> > > www.artofliving.org
> > >
> >
>
>
>
> --
> Rashmi
> Be the change that you want to see in this world!
> www.minnal.zor.org
> disha.resolve.at
> www.artofliving.org
>


Re: Solr & Nutch

2014-01-28 Thread Markus Jelsma
Short answer: you can't.

rashmi maheshwari schreef:

Thanks, all, for the quick response.

Today I crawled a webpage using Nutch. This page has many links, but all
anchor tags have href="#", and JavaScript is attached to the onClick event of
each anchor tag to open a new page.

So the crawler didn't crawl any of the links that open via the onClick
event and have a # href value.

How can these links be crawled using Nutch?




On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko <
ale...@martchenko.com.br> wrote:

> 1) Plus, those files are sometimes binaries with metadata; specific
> crawlers need to understand them. HTML is plain text.
>
> 2) Yes, different data schemes. Sometimes I replicate the same core and
> run some A-B tests with different weights, filters, etc., and some people
> like to create CoreA and CoreB with the same schema, hammer CoreA with
> updates, commits, and optimizes, and keep CoreB available for searches
> while hammering CoreA. Then swap again. This produces faster searches.
>
>
> alexei martchenko
> Facebook <http://www.facebook.com/alexeiramone> |
> Linkedin<http://br.linkedin.com/in/alexeimartchenko>|
> Steam <http://steamcommunity.com/id/alexeiramone/> |
> 4sq<https://pt.foursquare.com/alexeiramone>| Skype: alexeiramone |
> Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |
>
>
> 2014-01-28 Jack Krupansky 
>
> > 1. Nutch follows the links within HTML web pages to crawl the full graph
> > of a web of pages.
> >
> > 2. Think of a core as an SQL table - each table/core has a different type
> > of data.
> >
> > 3. SolrCloud is all about scaling and availability - multiple shards for
> > larger collections and multiple replicas for both scaling of query
> response
> > and availability if nodes go down.
> >
> > -- Jack Krupansky
> >
> > -Original Message- From: rashmi maheshwari
> > Sent: Tuesday, January 28, 2014 11:36 AM
> > To: solr-user@lucene.apache.org
> > Subject: Solr & Nutch
> >
> >
> > Hi,
> >
> > Question 1: When Solr can parse HTML and documents like doc, excel, pdf,
> > why do we need Nutch to parse HTML files? What is different?
> >
> > Question 2: When do we use multiple cores in Solr? Is there a practical
> > business case where we need multiple cores?
> >
> > Question 3: When do we go for cloud? What does it mean to implement
> > SolrCloud?
> >
> >
> > --
> > Rashmi
> > Be the change that you want to see in this world!
> > www.minnal.zor.org
> > disha.resolve.at
> > www.artofliving.org
> >
>



-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org


Re: Solr & Nutch

2014-01-28 Thread rashmi maheshwari
Thanks, all, for the quick response.

Today I crawled a webpage using Nutch. This page has many links, but all
anchor tags have href="#", and JavaScript is attached to the onClick event of
each anchor tag to open a new page.

So the crawler didn't crawl any of the links that open via the onClick
event and have a # href value.

How can these links be crawled using Nutch?




On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko <
ale...@martchenko.com.br> wrote:

> 1) Plus, those files are sometimes binaries with metadata; specific
> crawlers need to understand them. HTML is plain text.
>
> 2) Yes, different data schemes. Sometimes I replicate the same core and
> run some A-B tests with different weights, filters, etc., and some people
> like to create CoreA and CoreB with the same schema, hammer CoreA with
> updates, commits, and optimizes, and keep CoreB available for searches
> while hammering CoreA. Then swap again. This produces faster searches.
>
>
> alexei martchenko
> Facebook <http://www.facebook.com/alexeiramone> |
> Linkedin<http://br.linkedin.com/in/alexeimartchenko>|
> Steam <http://steamcommunity.com/id/alexeiramone/> |
> 4sq<https://pt.foursquare.com/alexeiramone>| Skype: alexeiramone |
> Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |
>
>
> 2014-01-28 Jack Krupansky 
>
> > 1. Nutch follows the links within HTML web pages to crawl the full graph
> > of a web of pages.
> >
> > 2. Think of a core as an SQL table - each table/core has a different type
> > of data.
> >
> > 3. SolrCloud is all about scaling and availability - multiple shards for
> > larger collections and multiple replicas for both scaling of query
> response
> > and availability if nodes go down.
> >
> > -- Jack Krupansky
> >
> > -Original Message- From: rashmi maheshwari
> > Sent: Tuesday, January 28, 2014 11:36 AM
> > To: solr-user@lucene.apache.org
> > Subject: Solr & Nutch
> >
> >
> > Hi,
> >
> > Question 1: When Solr can parse HTML and documents like doc, excel, pdf,
> > why do we need Nutch to parse HTML files? What is different?
> >
> > Question 2: When do we use multiple cores in Solr? Is there a practical
> > business case where we need multiple cores?
> >
> > Question 3: When do we go for cloud? What does it mean to implement
> > SolrCloud?
> >
> >
> > --
> > Rashmi
> > Be the change that you want to see in this world!
> > www.minnal.zor.org
> > disha.resolve.at
> > www.artofliving.org
> >
>



-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org


Re: Solr & Nutch

2014-01-28 Thread Alexei Martchenko
1) Plus, those files are sometimes binaries with metadata; specific
crawlers need to understand them. HTML is plain text.

2) Yes, different data schemes. Sometimes I replicate the same core and
run some A-B tests with different weights, filters, etc., and some people
like to create CoreA and CoreB with the same schema, hammer CoreA with
updates, commits, and optimizes, and keep CoreB available for searches
while hammering CoreA. Then swap again. This produces faster searches.
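The CoreA/CoreB pattern above is typically driven by Solr's CoreAdmin SWAP action, which atomically exchanges two cores so the freshly indexed one starts serving searches. A minimal sketch of building that request (host, port, and core names are placeholder assumptions):

```python
from urllib.parse import urlencode

def swap_cores_url(solr_base, core_a, core_b):
    """Build the CoreAdmin SWAP request URL that exchanges two cores."""
    params = urlencode({"action": "SWAP", "core": core_a, "other": core_b})
    return f"{solr_base}/admin/cores?{params}"

# After hammering one core with updates/commits/optimizes, swap it live:
print(swap_cores_url("http://localhost:8983/solr", "CoreA", "CoreB"))
# http://localhost:8983/solr/admin/cores?action=SWAP&core=CoreA&other=CoreB
```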


alexei martchenko
Facebook <http://www.facebook.com/alexeiramone> |
Linkedin<http://br.linkedin.com/in/alexeimartchenko>|
Steam <http://steamcommunity.com/id/alexeiramone/> |
4sq<https://pt.foursquare.com/alexeiramone>| Skype: alexeiramone |
Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |


2014-01-28 Jack Krupansky 

> 1. Nutch follows the links within HTML web pages to crawl the full graph
> of a web of pages.
>
> 2. Think of a core as an SQL table - each table/core has a different type
> of data.
>
> 3. SolrCloud is all about scaling and availability - multiple shards for
> larger collections and multiple replicas for both scaling of query response
> and availability if nodes go down.
>
> -- Jack Krupansky
>
> -Original Message- From: rashmi maheshwari
> Sent: Tuesday, January 28, 2014 11:36 AM
> To: solr-user@lucene.apache.org
> Subject: Solr & Nutch
>
>
> Hi,
>
> Question 1: When Solr can parse HTML and documents like doc, excel, pdf,
> why do we need Nutch to parse HTML files? What is different?
>
> Question 2: When do we use multiple cores in Solr? Is there a practical
> business case where we need multiple cores?
>
> Question 3: When do we go for cloud? What does it mean to implement
> SolrCloud?
>
>
> --
> Rashmi
> Be the change that you want to see in this world!
> www.minnal.zor.org
> disha.resolve.at
> www.artofliving.org
>


Re: Solr & Nutch

2014-01-28 Thread Jorge Luis Betancourt Gonzalez
Q1: Nutch doesn't only handle the parsing of HTML files; it also uses Hadoop to
achieve large-scale crawling across multiple nodes. It fetches the content of the
HTML file, and yes, it also parses its content.

Q2: In our case we crawl some websites and store the content in one
"main" Solr core. We also have a web app with the typical "search box", and we
use a separate core to store the queries made by our users.

Q3: We're not currently using SolrCloud, so I'm going to let this one pass to a
more experienced fellow.
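The two-core setup described in Q2 (one core for crawled pages, one for user queries) just means posting a small log document to a second core on every search. A hedged sketch; the core name, dynamic field names, and endpoint are illustrative assumptions, not from this thread:

```python
import json
import time
from urllib.parse import urlencode

def query_log_doc(user_query, num_found):
    """Build the JSON update body for a hypothetical 'querylog' core that
    records each search run against the main crawled-pages core."""
    return json.dumps([{
        "q_s": user_query,          # raw query string (dynamic string field)
        "numfound_i": num_found,    # result count (dynamic int field)
        "ts_l": int(time.time()),   # unix timestamp (dynamic long field)
    }])

# POSTing this body to the second core's update handler would log the search:
url = "http://localhost:8983/solr/querylog/update?" + urlencode({"commit": "true"})
print(url)
print(query_log_doc("nutch crawling", 42))
```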

On Jan 28, 2014, at 11:36 AM, rashmi maheshwari  
wrote:

> Hi,
> 
> Question 1: When Solr can parse HTML and documents like doc, excel, pdf,
> why do we need Nutch to parse HTML files? What is different?
>
> Question 2: When do we use multiple cores in Solr? Is there a practical
> business case where we need multiple cores?
>
> Question 3: When do we go for cloud? What does it mean to implement
> SolrCloud?
> 
> 
> -- 
> Rashmi
> Be the change that you want to see in this world!
> www.minnal.zor.org
> disha.resolve.at
> www.artofliving.org


III International Winter School at UCI, February 17 to 28, 2014.
See www.uci.cu


Re: Solr & Nutch

2014-01-28 Thread Jack Krupansky
1. Nutch follows the links within HTML web pages to crawl the full graph of 
a web of pages.


2. Think of a core as an SQL table - each table/core has a different type of 
data.


3. SolrCloud is all about scaling and availability - multiple shards for 
larger collections and multiple replicas for both scaling of query response 
and availability if nodes go down.


-- Jack Krupansky

-Original Message- 
From: rashmi maheshwari

Sent: Tuesday, January 28, 2014 11:36 AM
To: solr-user@lucene.apache.org
Subject: Solr & Nutch

Hi,

Question 1: When Solr can parse HTML and documents like doc, excel, pdf,
why do we need Nutch to parse HTML files? What is different?

Question 2: When do we use multiple cores in Solr? Is there a practical business
case where we need multiple cores?

Question 3: When do we go for cloud? What does it mean to implement
SolrCloud?


--
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org 



Solr & Nutch

2014-01-28 Thread rashmi maheshwari
Hi,

Question 1: When Solr can parse HTML and documents like doc, excel, pdf,
why do we need Nutch to parse HTML files? What is different?

Question 2: When do we use multiple cores in Solr? Is there a practical business
case where we need multiple cores?

Question 3: When do we go for cloud? What does it mean to implement
SolrCloud?


-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org


AjaxSolr + Solr + Nutch question

2012-07-14 Thread praful
I referred to https://github.com/evolvingweb/ajax-solr/wiki/reuters-tutorial
for the Ajax-Solr setup.

Ajax-Solr is running, but it only searches the Reuters data. If I want to
crawl the web using Nutch and integrate it with Solr, then I have to replace
Solr's schema.xml file with Nutch's schema.xml file, which will not match the
Ajax-Solr configuration. By replacing the schema.xml files, Ajax-Solr won't
work (correct me if I am wrong)!

How would I integrate Solr with Nutch along with Ajax-Solr, so that Ajax-Solr
can search other data on the web as well?

Thanks
Regards
Praful Bagai

--
View this message in context: 
http://lucene.472066.n3.nabble.com/AjaxSolr-Solr-Nutch-question-tp3995030.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellcheck in solr-nutch integration

2011-02-05 Thread Anurag

First go thru the schema.xml file . Look at the different components.
On Sat, Feb 5, 2011 at 1:01 PM, 666 [via Lucene] <
ml-node+2429702-1399813783-146...@n3.nabble.com
> wrote:

> Hello Anurag, I'm facing the same problem. Will you please elaborate on how
> you solved the problem? It would be great if you could give me a step-by-step
> description, as I'm new to Solr.
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p2429702.html
>  To unsubscribe from Spellcheck in solr-nutch integration, click 
> here<http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1953232&code=YW51cmFnLml0LmpvbGx5QGdtYWlsLmNvbXwxOTUzMjMyfC0yMDk4MzQ0MTk2>.
>
>



-- 
Kumar Anurag


-
Kumar Anurag

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p2429782.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellcheck in solr-nutch integration

2011-02-05 Thread 666

Hello Anurag, I'm facing the same problem. Will you please elaborate on how
you solved the problem? It would be great if you could give me a step-by-step
description, as I'm new to Solr.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p2429702.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Spellcheck in solr-nutch integration

2010-11-29 Thread Anurag

I solved the problem. All we need to do is modify the schema file.

Also, the spellcheck index is created first when spellcheck.build=true.
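A minimal sketch of triggering that first-time build and then querying with spellcheck enabled. The parameter names follow standard Solr spellcheck usage; the handler path and core name are assumptions:

```python
from urllib.parse import urlencode

def spellcheck_url(base, q, build=False):
    """Build a Solr query URL with spellchecking enabled; pass build=True
    once so the spellcheck index gets (re)built from the indexed content."""
    params = {"q": q, "spellcheck": "true"}
    if build:
        params["spellcheck.build"] = "true"
    return f"{base}/select?{urlencode(params)}"

# First request builds the spellcheck index; later requests just query it.
print(spellcheck_url("http://localhost:8983/solr/nutch", "whitehouse", build=True))
```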

-
Kumar Anurag

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p1988252.html
Sent from the Solr - User mailing list archive at Nabble.com.


Spellcheck in solr-nutch integration

2010-11-23 Thread Anurag

I have integrated Solr and Nutch using
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

As the tutorial says, Solr's schema.xml and solrconfig.xml have to be
modified, and I did so. I am using Solr 1.3.
But my problem is that I am not able to implement spellcheck in this
Solr-Nutch integration.

I have a separate Solr 1.4 where there are options available for
spellcheck.

What I want to ask is:
1. Is indexing for spellcheck done at the same time as indexing the
content? What are the steps to follow?

2. How can I implement spellcheck in the Solr-Nutch integration?

Please help.


-
Kumar Anurag

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Spellcheck-in-solr-nutch-integration-tp1953232p1953232.html
Sent from the Solr - User mailing list archive at Nabble.com.


Seeking Solr/Nutch consultant in San Jose, CA

2009-09-30 Thread Leann Pereira
Hi,

I am working with a SaaS vendor whose product is integrated with Nutch 0.9 and
Solr.  We are looking for some help migrating this to Nutch 1.0.  The work involves:


1)  We made changes to Nutch 0.9; these need to be ported to Nutch 1.0.

2)  Configure Solr integration with Nutch 1.0.

3)  Configure Solr to do Japanese indexing; expose this configuration as
part of the Baynote configuration.

4)  Check whether indexes are portable between Nutch 0.9 and Nutch 1.0; should
we re-index?

Please email me if there is interest.  The work is in San Jose, CA.  Duration 
and rate are not yet known.

Best regards,

Leann


Leann Pereira | o: +1 650.425.7950 | le...@1sourcestaffing.com | Senior 
Technical Recruiter




Re: solr nutch url indexing

2009-08-26 Thread Uri Boness

Do you mean the schema or the solrconfig.xml?

The request handler is configured in the solrconfig.xml, and you can find
out more about this particular configuration at
http://wiki.apache.org/solr/DisMaxRequestHandler



To understand the schema better, you can read 
http://wiki.apache.org/solr/SchemaXml


Uri

last...@gmail.com wrote:

Uri Boness wrote:
Well... yes, it's a tool the Nutch ships with. It also ships with an 
example Solr schema which you can use. 


hi,
is there any documentation to understand what is going on in the schema?


   
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^5.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^5.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>




Re: solr nutch url indexing

2009-08-25 Thread last...@gmail.com

Uri Boness wrote:
Well... yes, it's a tool the Nutch ships with. It also ships with an 
example Solr schema which you can use. 


hi,
is there any documentation to understand what is going on in the schema?


   
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^5.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^5.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>



Re: solr nutch url indexing

2009-08-25 Thread Uri Boness
Well... yes, it's a tool the Nutch ships with. It also ships with an 
example Solr schema which you can use.


Fuad Efendi wrote:

Thanks for the link, so, SolrIndex is NOT plugin, it is an application... I
use similar approach...

-Original Message-
From: Uri Boness 
Hi,


Nutch comes with support for Solr out of the box. I suggest you follow 
the steps as described here: 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/


Cheers,
Uri

Fuad Efendi wrote:
  
Is SolrIndex plugin for Nutch? 
Thanks!



  





  


RE: solr nutch url indexing

2009-08-25 Thread Fuad Efendi
Thanks for the link, so, SolrIndex is NOT plugin, it is an application... I
use similar approach...

-Original Message-
From: Uri Boness 
Hi,

Nutch comes with support for Solr out of the box. I suggest you follow 
the steps as described here: 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

Cheers,
Uri

Fuad Efendi wrote:
> Is SolrIndex plugin for Nutch? 
> Thanks!
>
>
>   




Re: solr nutch url indexing

2009-08-25 Thread Uri Boness
It seems to me that this configuration actually does what you want:
it queries mostly on "title". The default search field doesn't influence a
dismax query. I would suggest including the debugQuery=true
parameter; it will help you figure out how the matching is performed.

You can read more about dismax queries here:
http://wiki.apache.org/solr/DisMaxRequestHandler
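A quick sketch of adding debugQuery to such a dismax request; the handler path, boosts, and query term are placeholders taken from the configuration discussed in this thread:

```python
from urllib.parse import urlencode

def dismax_debug_url(base, q):
    """Build a dismax query URL with scoring explanations enabled, so you can
    see which fields (title, content, anchor) matched and with what boost."""
    params = {"q": q, "defType": "dismax",
              "qf": "content^0.5 anchor^1.0 title^5.2",
              "debugQuery": "true"}
    return f"{base}/select?{urlencode(params)}"

print(dismax_debug_url("http://localhost:8983/solr", "bush"))
```

The response's debug section explains each document's score term by term, which makes boost problems like "content outweighs title" easy to spot.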




Thibaut Lassalle wrote:

Thanks for your help.

I use the default Nutch configuration and I use solrindex to feed the Nutch
results to Solr. I get results when I query, therefore Nutch works properly
(it gives a url, title, content, ...).

I would like my Solr queries to emphasize the "title" field rather than the
"content" field.

Here is a sample of my schema.xml:

..
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
...


Here is a sample of my "solrconfig.xml"



<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^5.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^5.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>



This configuration queries on "content" only.
How do I change it to query mostly on "title"?

I tried changing "defaultSearchField" to "title", but it doesn't work.

Where can I find docs on "solr.SearchHandler"?

Thanks
t.

  


Re: solr nutch url indexing

2009-08-25 Thread Thibaut Lassalle
Thanks for your help.

I use the default Nutch configuration and I use solrindex to feed the Nutch
results to Solr. I get results when I query, therefore Nutch works properly
(it gives a url, title, content, ...).

I would like my Solr queries to emphasize the "title" field rather than the
"content" field.

Here is a sample of my schema.xml:

..
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
...


Here is a sample of my "solrconfig.xml"



<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^5.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^5.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>



This configuration queries on "content" only.
How do I change it to query mostly on "title"?

I tried changing "defaultSearchField" to "title", but it doesn't work.

Where can I find docs on "solr.SearchHandler"?

Thanks
t.


Re: solr nutch url indexing

2009-08-24 Thread Uri Boness

Hi,

Nutch comes with support for Solr out of the box. I suggest you follow 
the steps as described here: 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/


Cheers,
Uri

Fuad Efendi wrote:
Is SolrIndex plugin for Nutch? 
Thanks!



-Original Message-
From: Uri Boness [mailto:ubon...@gmail.com] 
Sent: August-24-09 4:42 PM

To: solr-user@lucene.apache.org
Subject: Re: solr nutch url indexing

How did you configure nutch?

Make sure you have the "parse-html" and "index-basic" plugins configured. The
HtmlParser should by default extract the page title and add it to the
parsed data, and the BasicIndexingFilter by default adds this title to
the NutchDocument and stores it in the "title" field. All SolrIndex
(actually the SolrWriter) does is convert the NutchDocument to a
SolrInputDocument. So having these plugins configured in Nutch and
having a field in the schema named "title" should work. (I'm assuming
you're using the "solrindex" tool.)


Cheers,
Uri

Lassalle, Thibaut wrote:
  

Hi,

 


I would like to crawl intranets with nutch and index them with solr.

 


I would like to search mostly on the title of the pages (the one in
<title>This is a title</title>)

 


I tried to tweak the schema.xml to do that but nothing is working. I
just have the content indexed.

 


How do I index on title ?

 


Thanks

t.


  





  


RE: solr nutch url indexing

2009-08-24 Thread Fuad Efendi
Is SolrIndex plugin for Nutch? 
Thanks!


-Original Message-
From: Uri Boness [mailto:ubon...@gmail.com] 
Sent: August-24-09 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: solr nutch url indexing

How did you configure nutch?

Make sure you have the "parse-html" and "index-basic" plugins configured. The
HtmlParser should by default extract the page title and add it to the
parsed data, and the BasicIndexingFilter by default adds this title to
the NutchDocument and stores it in the "title" field. All SolrIndex
(actually the SolrWriter) does is convert the NutchDocument to a
SolrInputDocument. So having these plugins configured in Nutch and
having a field in the schema named "title" should work. (I'm assuming
you're using the "solrindex" tool.)

Cheers,
Uri

Lassalle, Thibaut wrote:
> Hi,
>
>  
>
> I would like to crawl intranets with nutch and index them with solr.
>
>  
>
> I would like to search mostly on the title of the pages (the one in
> <title>This is a title</title>)
>
>  
>
> I tried to tweak the schema.xml to do that but nothing is working. I
> just have the content indexed.
>
>  
>
> How do I index on title ?
>
>  
>
> Thanks
>
> t.
>
>
>   




Re: solr nutch url indexing

2009-08-24 Thread Uri Boness

How did you configure nutch?

Make sure you have the "parse-html" and "index-basic" plugins configured. The
HtmlParser should by default extract the page title and add it to the
parsed data, and the BasicIndexingFilter by default adds this title to
the NutchDocument and stores it in the "title" field. All SolrIndex
(actually the SolrWriter) does is convert the NutchDocument to a
SolrInputDocument. So having these plugins configured in Nutch and
having a field in the schema named "title" should work. (I'm assuming
you're using the "solrindex" tool.)


Cheers,
Uri
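For reference, a plugin set like the one Uri describes is enabled through the plugin.includes property in nutch-site.xml. This fragment is a hedged sketch; the exact plugin list varies by Nutch version and setup:

```xml
<property>
  <name>plugin.includes</name>
  <!-- parse-html extracts the <title>; index-basic copies it into the
       NutchDocument "title" field that solrindex sends to Solr -->
  <value>protocol-http|urlfilter-regex|parse-html|index-basic|query-basic|scoring-opic</value>
</property>
```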

Lassalle, Thibaut wrote:

Hi,

 


I would like to crawl intranets with nutch and index them with solr.

 


I would like to search mostly on the title of the pages (the one in
<title>This is a title</title>)

 


I tried to tweak the schema.xml to do that but nothing is working. I
just have the content indexed.

 


How do I index on title ?

 


Thanks

t.


  


solr nutch url indexing

2009-08-24 Thread Lassalle, Thibaut
Hi,

 

I would like to crawl intranets with nutch and index them with solr.

 

I would like to search mostly on the title of the pages (the one in
<title>This is a title</title>)

 

I tried to tweak the schema.xml to do that but nothing is working. I
just have the content indexed.

 

How do I index on title ?

 

Thanks

t.



NYC Apache Lucene/Solr/Nutch/etc. Meetup

2009-07-03 Thread Grant Ingersoll

Hi All, (sorry for the cross-post)

For those in NYC, there will be a Lucene ecosystem (Lucene/Solr/Mahout/ 
Nutch/Tika/Droids/Lucene ports) Meetup on July 22, hosted by MTV  
Networks and co-sponsored with Lucid Imagination.


For more info and to RSVP, see http://www.meetup.com/NYC-Apache-Lucene-Solr-Meetup/ 
.  There is limited seating, so get your spot early.   Note, you must  
register with your first and last name so that security badges can be  
printed ahead of time for access.


Cheers,
Grant


Re: Snipets Solr/nutch

2008-04-15 Thread Mike Klaas

On 15-Apr-08, at 1:37 PM, khirb7 wrote:


Thank you a lot, you are helpful. Concerning my Solr, I am using the 1.2.0
version; I downloaded it from the Apache download mirror
http://www.apache.org/dyn/closer.cgi/lucene/solr/ . I haven't quite
understood you when you said:

you're trying to apply a patch that has long since been
applied to Solr.


Hi khirb,

You could try looking at "trunk" (the development version of Solr that
hasn't yet been released). It contains all the features you were
trying to add manually to your version.


You can download a "nightly" build of Solr here:

http://people.apache.org/builds/lucene/solr/nightly/

regards,
-Mike


Re: Snipets Solr/nutch

2008-04-15 Thread khirb7



Mike Klaas wrote:
> 
> On 13-Apr-08, at 3:25 AM, khirb7 wrote:
>>
>> it doesn't work; Solr still uses the default value fragsize=100. Also, I am
>> not able to specify the regex fragmenter, due to this problem of version, I
>> suppose, or the way I am declaring <highlighting>...</highlighting>,
>> because both of:
> 
> Hi khirb,
> 
> It might be easier for people to help you if you keep things in one  
> thread.
> 
> I notice that you're trying to apply a patch that has long since been  
> applied to Solr (another thread).  What version of Solr are you  
> using?  How did you acquire it?
> 
> -Mike
> 
hi mike

Thank you a lot, you are helpful. Concerning my Solr, I am using the 1.2.0
version; I downloaded it from the Apache download mirror
http://www.apache.org/dyn/closer.cgi/lucene/solr/ . I haven't quite
understood you when you said:

you're trying to apply a patch that has long since been
applied to Solr.

thank you mike.


-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16708645.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Snipets Solr/nutch

2008-04-14 Thread Mike Klaas

On 13-Apr-08, at 3:25 AM, khirb7 wrote:


it doesn't work; Solr still uses the default value fragsize=100. Also,
I am not able to specify the regex fragmenter, due to this problem of
version, I suppose, or the way I am declaring
<highlighting>...</highlighting>, because both of:


Hi khirb,

It might be easier for people to help you if you keep things in one  
thread.


I notice that you're trying to apply a patch that has long since been  
applied to Solr (another thread).  What version of Solr are you  
using?  How did you acquire it?


-Mike

Re: Snipets Solr/nutch

2008-04-13 Thread khirb7

hello,
Mike advised me last time to use:

>This is done by the fragmenting stage of highlighting.  Solr (trunk)
>ships with a fragmenter that looks for sentence-like snippets using
>regular expressions: try hl.fragmenter=regex (see config in
>solrconfig.xml).

The problem is I wasn't able either to do that or to specify the fragsize
from solrconfig.xml. I think it is due to the version of Solr I use and which
class and package I specify, i.e. I put this in solrconfig.xml:



<highlighting>
  <fragmenter name="gap" default="true" class="org.apache.solr.highlight.GapFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">400</int>
    </lst>
  </fragmenter>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">70</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
  </fragmenter>
</highlighting>




So whether I specify the fragmenter class as
org.apache.solr.util.GapFragmenter (specific to Solr 1.2) or as the newer
highlight-package class, it doesn't work: Solr still uses the default value
fragsize=100. I am also not able to specify the regex fragmenter, due to this
problem of version, I suppose, or the way I am declaring the <highlighting>
section, because both declarations still use fragsize=100 while I am
specifying 400 as shown above.

thank you.
-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16656960.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Snipets Solr/nutch

2008-04-10 Thread Mike Klaas

On 10-Apr-08, at 12:26 AM, khirb7 wrote:


hello every body

just one other question: to analyse and modify Solr's snippets, I want to
know if org.apache.solr.util.HighlightingUtils
is the class generating the snippets, and which method generates them.
Could you please explain how they are generated in that class and where
exactly to modify it, all in order not to return the first word
encountered highlighted but to return another one, because of the
problem I explained in my previous messages.


Unfortunately, I am not familiar with Nutch's snippet generation.

Solr's highlighting is located in
org.apache.solr.util.HighlightingUtils in version 1.2; in the current
(trunk) version, it is located in the
org.apache.solr.highlight.* package.

Your use case is a little tricky.  The best way to deal with it, in my
opinion, is to strip out the header before sending the data to Solr.
This will improve your highlighting _and_ your search relevance.


-Mike


Re: Snipets Solr/nutch(maxFragSize?)

2008-04-10 Thread khirb7



khirb7 wrote:
> 
> hello every body
>  
> just one other question: to analyse and modify Solr's snippets, I want to
> know if org.apache.solr.util.HighlightingUtils
> is the class generating the snippets, and which method generates them.
> Could you please explain how they are generated in that class and where
> exactly to modify it, all in order not to return the first word encountered
> highlighted but to return another one, because of the problem
> I explained in my previous messages.
> 
> Cheers
> 
I have done a deep search and I found that Lucene provides this method:
getBestFragments
highlighter.getBestFragments(tokenStream, text, maxNumFragment, "...");

So with this method we can tell Lucene to return maxNumFragment
fragments (with the highlighted word) of fragsize characters each, but there
is no maxFragSize parameter in Solr. This would be useful in my case if I
want to highlight not only the first occurrence of a searched word but
further occurrences of the same word.

cheers




-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16608806.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Snipets Solr/nutch

2008-04-10 Thread khirb7

hello every body
 
just one other question: to analyse and modify Solr's snippets, I want to
know if org.apache.solr.util.HighlightingUtils
is the class generating the snippets, and which method generates them.
Could you please explain how they are generated in that class and where
exactly to modify it, all in order not to return the first word
encountered highlighted but to return another one, because of the
problem I explained in my previous messages.

Cheers
-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16603642.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Snipets Solr/nutch

2008-04-09 Thread khirb7

thank you for your response.

I have another problem with snippets. Here is the problem:
I transform the HTML code into text, then I index all this generated text
into one field called myText. Many pages have a common header with common
information (example: a web site about President Bush), and the word "bush"
appears in this header. If I highlight the field myText while searching for
the word "bush", I get the same sentence containing "bush" highlighted (the
sentence of the common header containing the word "bush"), because I have
set fragsize to 150 and Solr returns, from the whole text, the first word
encountered ("bush") highlighted. How can I deal with that? I was told that
NutchWAX handles this problem; is that true? If true, how can I integrate
Nutch classes into Solr?

thank you in advance.
-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16585594.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Snipets Solr/nutch

2008-04-07 Thread Mike Klaas

On 7-Apr-08, at 7:12 AM, khirb7 wrote:

> khirb7 wrote:
>> hello everybody
>>
>> I am using Solr in my project, and I want to use the Solr snippets
>> generated by the highlighting.
>> The problem is that these snippets aren't really well displayed; they
>> are truncated and not really meaningful.
>> I heard that Nutch provides good snippets. Is it possible, and how, to
>> integrate them into my Solr?
>>
>> thank you in advance.
>
> hi everybody,
> I am digging in the Solr classes and looking for a solution to the
> generated snippets. First of all, I want to know in which class, and
> where, these snippets are generated.
> My snippets look like this:
> " project, and I want to use solr snipets generated by the highlighting"
> i.e., do you see, starting with "project" makes no sense. I think the
> best way is to show the whole sentence, like this:
> "I am using solr in my project, and I want to use solr snipets generated
> by the highlighting"
> and not to truncate it, maybe by paying attention to the punctuation
> (the comma or the capital letter).


This is done by the fragmenting stage of highlighting. Solr (trunk) ships
with a fragmenter that looks for sentence-like snippets using regular
expressions: try hl.fragmenter=regex (see the config in solrconfig.xml).


regards,
-Mike
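(Mike's hl.fragmenter=regex suggestion corresponds to a fragmenter entry in
the highlighting section of solrconfig.xml. A sketch modeled on the example
config shipped with Solr trunk around that time; the exact class names,
parameter names, and default values are assumptions that may vary by
version:)

```xml
<highlighting>
  <!-- default fragmenter: fixed-size gaps, ignores sentence structure -->
  <fragmenter name="gap" default="true"
              class="org.apache.solr.highlight.GapFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">100</int>
    </lst>
  </fragmenter>
  <!-- regex fragmenter: tries to break fragments on sentence-like
       boundaries; select it per request with hl.fragmenter=regex -->
  <fragmenter name="regex"
              class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">70</int>
      <!-- how far the fragment size may deviate to find a boundary -->
      <float name="hl.regex.slop">0.5</float>
      <!-- pattern describing what an acceptable fragment looks like -->
      <str name="hl.regex.pattern">[-\w ,/\n\&quot;&apos;]{20,200}</str>
    </lst>
  </fragmenter>
</highlighting>
```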


Re: Snipets Solr/nutch

2008-04-07 Thread khirb7



khirb7 wrote:
> 
> hello everybody
> 
> I am using Solr in my project, and I want to use the Solr snippets
> generated by the highlighting.
> The problem is that these snippets aren't really well displayed; they are
> truncated and not really meaningful.
> I heard that Nutch provides good snippets. Is it possible, and how, to
> integrate them into my Solr?
> 
> thank you in advance.
> 
hi everybody,
I am digging in the Solr classes and looking for a solution to the generated
snippets. First of all, I want to know in which class, and where, these
snippets are generated.
My snippets look like this:
" project, and I want to use solr snipets generated by the highlighting"
i.e., do you see, starting with "project" makes no sense. I think the best
way is to show the whole sentence, like this:
"I am using solr in my project, and I want to use solr snipets generated by
the highlighting"
and not to truncate it, maybe by paying attention to the punctuation (the
comma or the capital letter).

thank you in advance.




-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16537460.html
Sent from the Solr - User mailing list archive at Nabble.com.



Snipets Solr/nutch

2008-04-07 Thread khirb7

hello everybody,

I am using Solr in my project, and I want to use the Solr snippets generated
by the highlighting.
The problem is that these snippets aren't really well displayed; they are
truncated and not really meaningful.
I heard that Nutch provides good snippets. Is it possible, and how, to
integrate them into my Solr?

thank you in advance.
-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16537216.html
Sent from the Solr - User mailing list archive at Nabble.com.