Re: Can Solr be used to search public websites(Newbie).

2008-09-17 Thread George Everitt

Dear Con,

Searching the entire Internet is a non-trivial computer science problem.  It's kind of like asking a brain surgeon the best way to remove a tumor.  The answer would be "First, spend 16 years becoming a neurosurgeon."  My point is, there is a whole lot you need to know beyond "is Solr the correct tool for the job?"


However, the short answer is that Nutch is probably better suited for  
what you want to do, when you get the funding, hardware and expertise  
to do it.


I'm not mocking or denigrating you in any way, but I think you need to do a bit more basic research into how search engines work.


I found this very readable and accurate site the other day:

http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

Regards,
George


On Sep 17, 2008, at 8:39 AM, convoyer wrote:



Hi all.
I am quite new to Solr. I am just checking whether this tool suits my application.
I am developing a search application that searches all publicly available websites and also some selective websites. Can I use Solr for this purpose?

If yes, how can I get started?
All the tutorials point to loading data from an XML file and searching those values. :-( :-(  Instead, how can I give the URL of a website and search the contents of that site (just like in Nutch)?

Expecting reply
thanks in advance
con

--
View this message in context: 
http://www.nabble.com/Can-Solr-be-used-to-search-public-websites%28Newbie%29.-tp19531227p19531227.html
Sent from the Solr - User mailing list archive at Nabble.com.






Inverted Search Engine

2008-01-23 Thread George Everitt
Verity had a function called profiler which was essentially an  
inverted search engine.  Instead of evaluating a single query at a  
time against a large corpus of documents, the profiler evaluated a  
single document at a time against a large number of queries.   This  
kind of functionality is used for alert notifications, where a large  
number of users can have their own queries and as documents are  
indexed into the system,  the queries are matched and some kind of  
notification is made to the owner of the query (e-mail, SMS, etc).  
Think Google Alerts.


I'm wondering if anybody has implemented this kind of functionality  
with Solr, and if so what strategy did you use?  If you haven't  
implemented something like that I would still be interested in ideas  
on how to do it with Solr, or how to perhaps use Lucene to patch that  
functionality into Solr?  I have my own thoughts, but they are still a  
bit primitive, and I'd like to throw it over the transom and see who  
bites...
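
A minimal sketch of the general idea on top of Lucene's MemoryIndex (not a finished design - the subscriber names and queries are invented, and exact constructor signatures vary by Lucene version): index the one incoming document in memory, then run every stored query against it and notify the owners of the queries that match.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

import java.util.LinkedHashMap;
import java.util.Map;

public class AlertMatcher {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Stored queries, one per subscriber (hypothetical examples).
        Map<String, String> alerts = new LinkedHashMap<>();
        alerts.put("alice", "solr AND facets");
        alerts.put("bob", "\"inverted search\"");

        // The newly arrived document.
        String doc = "Notes on building an inverted search engine with Solr facets";

        // Index just this one document in memory...
        MemoryIndex index = new MemoryIndex();
        index.addField("content", doc, analyzer);

        // ...then evaluate every stored query against it.
        QueryParser parser = new QueryParser("content", analyzer);
        for (Map.Entry<String, String> alert : alerts.entrySet()) {
            Query q = parser.parse(alert.getValue());
            if (index.search(q) > 0.0f) {
                // A real system would send e-mail/SMS here instead of printing.
                System.out.println("notify " + alert.getKey() + " for: " + alert.getValue());
            }
        }
    }
}

At scale you would cache the parsed queries (or pre-filter them) rather than reparsing per document, but the shape of the loop stays the same.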


George Everitt
Applied Relevance LLC







Re: Inverted Search Engine

2008-01-23 Thread George Everitt

Wow, that's spooky.

Thanks for the heads-up - looks like a good list to subscribe to as well.

George Everitt
Applied Relevance LLC
[EMAIL PROTECTED]
Tel: +1 (727) 641-4660
Fax: +1 (727) 233-0672
Skype: geverit4
AIM: [EMAIL PROTECTED]




On Jan 23, 2008, at 2:30 PM, Erick Erickson wrote:


As chance would have it, this was just discussed over on the lucene
user's list. See the thread..

Inverted search / Search on profile net

Best
Erick


On Jan 23, 2008 1:38 PM, George Everitt [EMAIL PROTECTED] wrote:


Verity had a function called profiler which was essentially an
inverted search engine.  Instead of evaluating a single query at a
time against a large corpus of documents, the profiler evaluated a
single document at a time against a large number of queries.   This
kind of functionality is used for alert notifications, where a large
number of users can have their own queries and as documents are
indexed into the system,  the queries are matched and some kind of
notification is made to the owner of the query (e-mail, SMS, etc).
Think Google Alerts.

I'm wondering if anybody has implemented this kind of functionality
with Solr, and if so what strategy did you use?  If you haven't
implemented something like that I would still be interested in ideas
on how to do it with Solr, or how to perhaps use Lucene to patch that
functionality into Solr?  I have my own thoughts, but they are still a
bit primitive, and I'd like to throw it over the transom and see who
bites...

George Everitt
Applied Relevance LLC










Re: does solr handle hierarchical facets?

2007-12-17 Thread George Everitt



On Dec 13, 2007, at 1:56 AM, Chris Hostetter wrote:


ie, if this is your hierarchy...

   Products/
   Products/Computers/
   Products/Computers/Laptops
   Products/Computers/Desktops
   Products/Cases
   Products/Cases/Laptops
   Products/Cases/CellPhones

Then this trick won't work (because Laptops appears twice), but if you have
numeric IDs that correspond with each of those categories (so that the two
instances of Laptops are unique)...

   1/
   1/2/
   1/2/3
   1/2/4
   1/5/
   1/5/6
   1/5/7


Why not just use the whole path as the unique identifying token for a  
given node on the hierarchy?   That way, you don't need to map nodes  
to unique numbers, just use a prefix query.


taxonomy:Products/Computers/Laptops* or taxonomy:Products/Cases/Laptops*

Sorry - that may be bogus query syntax, but you get the idea.

Products/Computers/Laptops* and Products/Cases/Laptops* are two unique  
identifiers.  You just need to make sure they are tokenized properly -  
which is beyond my current off-the-cuff expertise.
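
For concreteness, a minimal SolrJ sketch of that prefix idea, assuming a multiValued, untokenized string field named taxonomy that stores the full path for each node a document belongs to (the URL, core name and field name are only placeholders):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HierarchyFacet {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            q.addFacetField("taxonomy");
            // Only count paths under the node the user has drilled into.
            q.setFacetPrefix("Products/Computers/");

            QueryResponse rsp = solr.query(q);
            for (FacetField ff : rsp.getFacetFields()) {
                for (FacetField.Count c : ff.getValues()) {
                    System.out.println(c.getName() + " (" + c.getCount() + ")");
                }
            }
        }
    }
}

Because the facet values are whole paths, "Products/Computers/Laptops" and "Products/Cases/Laptops" stay distinct without any numeric ID mapping.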


At least that is the way I've been doing it with IDOL lately.  I  
dearly hope I can do the same in Solr when the time comes.


I have a whole mess of Java code which parses out arbitrary path  
separated values into real tree structures.  I think it would be a  
useful addition to Solr, or maybe Solrj.  It's been knocking around my  
hard drives for the better part of a decade.   If I get enough  
interest, I'll clean it up and figure out how to offer it up as a part  
of the code base.  I'm pretty naive when it comes to FLOSS, so any  
authoritative non-condescending hints on how to go about this would be  
greatly appreciated.
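
Just to illustrate the indexing side of that idea (this is not the code mentioned above, only a minimal sketch): each delimited path is expanded into its ancestor paths, and all of them go into the multiValued taxonomy field.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PathExpander {
    // "Products/Computers/Laptops" -> ["Products", "Products/Computers", "Products/Computers/Laptops"]
    public static List<String> ancestorPaths(String path, String separator) {
        List<String> result = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (String part : path.split(Pattern.quote(separator))) {
            if (prefix.length() > 0) {
                prefix.append(separator);
            }
            prefix.append(part);
            result.add(prefix.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ancestorPaths("Products/Computers/Laptops", "/"));
    }
}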


Regards,
George


Heritrix and Solr

2007-11-22 Thread George Everitt
I'm looking for a web crawler to use with Solr.  The objective is to  
crawl about a dozen public web sites regarding a specific topic.


After a lot of googling, I came across Heritrix, which seems to be the most robust, well-supported open source crawler out there.  Heritrix has an integration with Nutch (NutchWax), but not with Solr.  I'm wondering if anybody can share any experience using Heritrix with Solr.


It seems that there are three options for integration:

1. Write a custom Heritrix Writer class which submits documents to Solr for indexing.
2. Write an ARC-to-Solr input XML format converter to import the ARC files.
3. Use the filesystem mirror writer and then another program to walk the downloaded files (roughly sketched below).
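
A rough sketch of option 3, assuming Heritrix's mirror writer has put files under a local directory, SolrJ is on the classpath, and the target schema has simple id and content fields (all names, paths and the URL below are placeholders; real HTML would deserve proper text extraction rather than raw bytes):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class MirrorIndexer {
    public static void main(String[] args) throws Exception {
        Path mirrorRoot = Paths.get("/data/heritrix/mirror");   // hypothetical crawl output
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/crawl").build();
             Stream<Path> files = Files.walk(mirrorRoot)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", mirrorRoot.relativize(file).toString());
                    doc.addField("content", new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
                    solr.add(doc);
                } catch (Exception e) {
                    System.err.println("skipping " + file + ": " + e);
                }
            });
            solr.commit();
        }
    }
}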


Has anybody looked into this or have any suggestions on an alternative approach?  The optimal answer would be "You dummy, just use XXX to crawl your web sites - there's no 'integration' required at all."  Can you believe the temerity?  What a poltroon.


Yours in Revolution,
George










Re: Heritrix and Solr

2007-11-22 Thread George Everitt

Otis:

There are many reasons I prefer Solr to Nutch:

1. I actually tried to do some of the crawling with Nutch, but found  
the crawling options less flexible than I would have liked.
2. I prefer the Solr approach in general.  I have a long background in  
Verity and Autonomy search, and Solr is a bit closer to them than Nutch.

3. I really like the schema support in Solr.
4. I really really like the facets/parametric search in Solr.
5. I really really really like the REST interface in Solr.
6. Finally, and not to put too fine a point on it, Hadoop frightens the bejeebers out of me.  I've skimmed some of the papers, and it looks like a lot of study before I will fully understand it.  I'm not saying I'm stupid and lazy, but if the map-reduce algorithm fits, I'll wear it.  Plus, I'm trying to get a mental handle on Jeff Hawkins' HTM and its application to the real world.  It all makes my cerebral cortex itchy.


Thanks for the suggestion, though.  I'll probably revisit Nutch if Heritrix lets me down.  I had no luck getting the Nutch crawler Solr patch to work, either.  Sadly, I'm the David Lee Roth of Java programmers - I may think that I'm hard-core, but I'm not, really.  And my groupies are getting a bit saggy.


BTW - add my voice to the paeans of praise for Lucene in Action.  You and Erik did a bang-up job, and I surely appreciate all the feedback you give on this forum, especially over the past few months as I feel my way through Solr and Lucene.




On Nov 22, 2007, at 10:10 PM, Otis Gospodnetic wrote:


The answer to that question, Norberto, would depend on versions.

George: why not just use straight Nutch and forget about Heritrix?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Norberto Meijome [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Thursday, November 22, 2007 5:54:32 PM
Subject: Re: Heritrix and Solr

On Thu, 22 Nov 2007 10:41:41 -0500, George Everitt [EMAIL PROTECTED] wrote:

After a lot of googling, I came across Heritrix, which seems to be the most robust, well-supported open source crawler out there.  Heritrix has an integration with Nutch (NutchWax), but not with Solr.  I'm wondering if anybody can share any experience using Heritrix with Solr.

Out on a limb here... both Nutch and SOLR use Lucene for the actual indexing / searching. Would the indexes generated with Nutch be compatible / readable with SOLR?

_
{Beto|Norberto|Numard} Meijome

Why do you sit there looking like an envelope without any address on it?
  - Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.








Re: Can you parse the contents of a field to populate other fields?

2007-11-07 Thread George Everitt
I'm not sure I fully understand your ultimate goal or Yonik's  
response.  However, in the past I've been able to represent  
hierarchical data as a simple enumeration of delimited paths:


<field name="taxonomy">root</field>
<field name="taxonomy">root/region</field>
<field name="taxonomy">root/region/north america</field>
<field name="taxonomy">root/region/south america</field>

Then, at response time, you can walk the result facet and build a  
hierarchy with counts that can be put into a tree view.  The tree can  
be any arbitrary depth, and documents can live in any combination of  
nodes on the tree.
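
For illustration, a minimal sketch of that walk (the paths and counts below are invented stand-ins for what a facet on the field above might return):

import java.util.LinkedHashMap;
import java.util.Map;

public class FacetTree {
    final Map<String, FacetTree> children = new LinkedHashMap<>();
    long count;

    // Insert one facet value like "root/region/north america" with its count.
    void add(String path, long count) {
        FacetTree node = this;
        for (String part : path.split("/")) {
            node = node.children.computeIfAbsent(part, k -> new FacetTree());
        }
        node.count = count;  // ancestor counts arrive as their own facet values
    }

    void print(String indent) {
        for (Map.Entry<String, FacetTree> e : children.entrySet()) {
            System.out.println(indent + e.getKey() + " (" + e.getValue().count + ")");
            e.getValue().print(indent + "  ");
        }
    }

    public static void main(String[] args) {
        FacetTree tree = new FacetTree();
        tree.add("root", 120);
        tree.add("root/region", 80);
        tree.add("root/region/north america", 50);
        tree.add("root/region/south america", 30);
        tree.print("");
    }
}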


In addition, you can represent any arbitrary name/value pair (attribute/tuple) as a two-level tree.  That way, you can put any combination of attributes in the facet and parse them out at results list time.  For example, you might be indexing computer hardware. Memory, Bus Speed and Resolution may be valid for some objects but not for others.  Just put them in a facet and specify a separator:


<field name="attribute">memory:1GB</field>
<field name="attribute">busspeed:133Mhz</field>
<field name="attribute">voltage:110/220</field>
<field name="attribute">manufacturer:Shiangtsu</field>


When you do a facet query, you can easily display the categories appropriate to the object, and do facet selections like "show me all green things" and "show me all size 4 things".
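
A hedged sketch of that results-time parsing, assuming the facet values come back exactly as indexed above and the first ':' separates the attribute name from its value:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AttributeFacets {
    public static void main(String[] args) {
        // Facet values as they might appear in a facet on the attribute field.
        String[] facetValues = {"memory:1GB", "busspeed:133Mhz", "voltage:110/220", "manufacturer:Shiangtsu"};

        Map<String, List<String>> byName = new LinkedHashMap<>();
        for (String v : facetValues) {
            int sep = v.indexOf(':');  // split on the first ':' only, in case values contain ':' themselves
            String name = sep < 0 ? v : v.substring(0, sep);
            String value = sep < 0 ? "" : v.substring(sep + 1);
            byName.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        byName.forEach((name, values) -> System.out.println(name + " -> " + values));
    }
}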



Even if that's not your goal, this might help someone else.


George Everitt







On Nov 7, 2007, at 3:15 PM, Kristen Roth wrote:

So, I think I have things set up correctly in my schema, but it doesn't appear that any logic is being applied to my Category_# fields - they are being populated with the full string copied from the Category field (facet1::facet2::facet3...facetn) instead of just facet1, facet2, etc.

I have several different field types, each with a different regex to
match a specific part of the input string.  In this example, I'm
matching facet1 in input string facet1::facet2::facet3...facetn

<fieldtype name="cat1str" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="^([^:]+)" group="1"/>
  </analyzer>
</fieldtype>

I have copyfields set up for each Category_# field.  Anything obviously wrong?

Thanks!
Kristen

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Wednesday, November 07, 2007 9:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Can you parse the contents of a field to populate other
fields?

On 11/6/07, Kristen Roth [EMAIL PROTECTED] wrote:

Yonik - thanks so much for your help!  Just to clarify: where should the regex go for each field?


Each field should have a different FieldType (referenced by the type XML attribute).  Each fieldType can have its own analyzer.  You can use a different PatternTokenizer (which specifies a regex) for each analyzer.

-Yonik
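
As a side note, a quick way to sanity-check per-level patterns like these outside of Solr before wiring them into PatternTokenizerFactory (a sketch only; the "::" separator and three levels come from the example above, and group=1 is what keeps just the captured segment):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CategoryRegexCheck {
    public static void main(String[] args) {
        String input = "facet1::facet2::facet3";
        String[] patterns = {
            "^([^:]+)",                // Category_1
            "^[^:]+::([^:]+)",         // Category_2
            "^[^:]+::[^:]+::([^:]+)"   // Category_3
        };
        for (String p : patterns) {
            Matcher m = Pattern.compile(p).matcher(input);
            System.out.println(p + " -> " + (m.find() ? m.group(1) : "(no match)"));
        }
    }
}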





Re: [slightly ot] Looking for Lucene/Solr consultant in Germany

2007-08-15 Thread George Everitt

Dear Jan,

I just saw your post on the SOLR mailing list.  I hope I'm not too late.

First off, I don't exactly match your required qualifications.  I do have 9 years at Verity and 1 year at Autonomy in enterprise search, however.  I'm in the middle of coming up to speed on SOLR and applying my considerable expertise in general Enterprise Search to the SOLR/Lucene platform.  So, your specific requirements for a Lucene/SOLR expert are not quite met.  But, I've been in the business of enterprise search for 10 years.  Think of it as asking an Oracle expert to look at your MySQL implementation.


My normal rate is USD 200/hour, and I do command that rate more often  
than not.  I'd be interested in taking on the challenge in my spare  
time, free of charge, just to get my bearings and to see how my  
consulting skills translate from the closed-source Verity/IDOL world  
to the open source world.  I think this could be beneficial to both  
of us:   I would get some expertise in specific SOLR idiosyncrasies,  
and you would get the benefit of 10 years of general enterprise  
search experience.


I've been studying SOLR and Lucene, and even developing my own  
project using them as a basis.  That being said, I expect to make  
some mistakes as I try to match my existing skill set with what's  
available in SOLR.  Fortunately, I found that with the transition  
from Verity K2 to Autonomy IDOL the underlying concepts of full-text  
search are pretty much universal.


Another fly in the ointment is that I live in the USA (St. Pete  
Beach, Florida to be exact), so there would be some time zone  
issues.  Also, I don't speak German, which will be a handicap when it  
comes to analyzing stemming options.   If you can live with those  
limitations, I'd be happy to help.


Let me know if you're interested.

George Everitt
Applied Relevance LLC
[EMAIL PROTECTED]
Tel: +1 (727) 641-4660
Fax: +1 (727) 233-0672






On Aug 8, 2007, at 12:43 PM, Jan Miczaika wrote:


Hello,

we are looking for a Lucene/Solr consultant in Germany. We have set  
up a Lucene/Solr server (currently live at http://www.hitflip.de).  
It returns search results, but the results are not really very  
good. We have been tweaking the parameters a bit, following  
suggestions from the mailing list, but are unsure of the effects  
this has.


We are looking for someone to do the following:
- analyse the search patterns on our website
- define a methodology for defining the quality of search
- analyse the data we have available
- specify which data is required in the index
- modify the search patterns used to query the data
- test and evaluate the results

The requirements: deep knowledge of Lucene/Solr, examples of  
implemented working search engines, theoretical knowledge


Is anyone interested? Please feel free to circulate this offer.

Thanks in advance

Jan

--
Geschäftsführer / Managing Director
Hitflip Media Trading GmbH
Gürzenichstr. 7, 50667 Köln
http://www.hitflip.de - new: http://www.hitflip.co.uk

Tel. +49-(0)221-272407-27
Fax. 0221-272407-22 (that's so 1990s)
HRB 59046, Amtsgericht Köln

Geschäftsführer: Andre Alpar, Jan Miczaika, Gerald Schönbucher