Re: Can Solr be used to search public websites(Newbie).
Dear Con,

Searching the entire Internet is a non-trivial computer science problem. It's kind of like asking a brain surgeon the best way to remove a tumor: the answer would be "First, spend 16 years becoming a neurosurgeon." My point is, there is a whole lot you need to know beyond whether Solr is the correct tool for the job. The short answer, however, is that Nutch is probably better suited for what you want to do, once you have the funding, hardware, and expertise to do it.

I'm not mocking or denigrating you in any way, but I think you need to do a bit more basic research into how search engines work. I found this very readable and accurate site the other day: http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

Regards,
George

On Sep 17, 2008, at 8:39 AM, convoyer wrote:

Hi all. I am quite new to Solr, and I am checking whether this tool suits my application. I am developing a search application that searches all publicly available websites and also some selected websites. Can I use Solr for this purpose? If yes, how can I get started? All the tutorials show how to load data from an XML file and search those values. Instead, how can I give the URL of a website and search the contents of that site (just like in Nutch)? Expecting a reply; thanks in advance.

con
Inverted Search Engine
Verity had a function called Profiler, which was essentially an inverted search engine. Instead of evaluating a single query at a time against a large corpus of documents, the Profiler evaluated a single document at a time against a large number of queries. This kind of functionality is used for alert notifications: a large number of users each have their own saved queries, and as documents are indexed into the system, matching queries trigger some kind of notification to the query's owner (e-mail, SMS, etc.). Think Google Alerts.

I'm wondering if anybody has implemented this kind of functionality with Solr, and if so, what strategy did you use? If you haven't implemented something like that, I would still be interested in ideas on how to do it with Solr, or how to use Lucene to patch that functionality into Solr. I have my own thoughts, but they are still a bit primitive, and I'd like to throw it over the transom and see who bites...

George Everitt
Applied Relevance LLC
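[Editor's note] A toy sketch of the inversion described above, in Python. Real implementations would index each incoming document into an in-memory Lucene index (e.g. MemoryIndex) and run every saved query against it; this sketch only shows the control flow, using naive all-terms-present matching. The class name and storage scheme are made up for illustration.

```python
def tokenize(text):
    """Crude tokenizer: lowercase whitespace split."""
    return set(text.lower().split())

class AlertProfiler:
    """Matches each incoming document against many stored queries."""

    def __init__(self):
        # owner -> set of required terms (hypothetical storage)
        self.queries = {}

    def register(self, owner, query):
        self.queries[owner] = tokenize(query)

    def match(self, document):
        """Return owners whose saved query matches this one document."""
        doc_terms = tokenize(document)
        return [owner for owner, terms in self.queries.items()
                if terms <= doc_terms]  # all query terms present in the doc

profiler = AlertProfiler()
profiler.register("alice", "solr faceting")
profiler.register("bob", "nutch crawler")
hits = profiler.match("New release adds faceting support to Solr")
# hits -> ["alice"]: only alice's query terms all appear in the document
```

The key point is the reversed loop: the outer iteration is over saved queries, not over documents, which is what makes per-user alerting cheap at index time.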
Re: Inverted Search Engine
Wow, that's spooky. Thanks for the heads up - looks like a good list to subscribe to as well.

George Everitt
Applied Relevance LLC
[EMAIL PROTECTED]
Tel: +1 (727) 641-4660
Fax: +1 (727) 233-0672
Skype: geverit4
AIM: [EMAIL PROTECTED]

On Jan 23, 2008, at 2:30 PM, Erick Erickson wrote:

As chance would have it, this was just discussed over on the Lucene users' list. See the thread "Inverted search / Search on profile net".

Best,
Erick

On Jan 23, 2008 1:38 PM, George Everitt [EMAIL PROTECTED] wrote:

Verity had a function called Profiler, which was essentially an inverted search engine. Instead of evaluating a single query at a time against a large corpus of documents, the Profiler evaluated a single document at a time against a large number of queries. [remainder of the original message quoted in full]
Re: does solr handle hierarchical facets?
On Dec 13, 2007, at 1:56 AM, Chris Hostetter wrote:

ie, if this is your hierarchy...

Products/
Products/Computers/
Products/Computers/Laptops
Products/Computers/Desktops
Products/Cases
Products/Cases/Laptops
Products/Cases/CellPhones

Then this trick won't work (because Laptops appears twice), but if you have numeric IDs that correspond with each of those categories (so that the two instances of Laptops are unique)...

1/
1/2/
1/2/3
1/2/4
1/5/
1/5/6
1/5/7

Why not just use the whole path as the unique identifying token for a given node in the hierarchy? That way, you don't need to map nodes to unique numbers; just use a prefix query:

taxonomy:Products/Computers/Laptops*
or
taxonomy:Products/Cases/Laptops*

Sorry - that may be bogus query syntax, but you get the idea. Products/Computers/Laptops and Products/Cases/Laptops are two unique identifiers. You just need to make sure they are tokenized properly - which is beyond my current off-the-cuff expertise. At least that is the way I've been doing it with IDOL lately, and I dearly hope I can do the same in Solr when the time comes.

I have a whole mess of Java code which parses arbitrary path-separated values into real tree structures. I think it would be a useful addition to Solr, or maybe Solrj. It's been knocking around my hard drives for the better part of a decade. If I get enough interest, I'll clean it up and figure out how to offer it up as a part of the code base. I'm pretty naive when it comes to FLOSS, so any authoritative, non-condescending hints on how to go about this would be greatly appreciated.

Regards,
George
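[Editor's note] The full-path-as-identifier idea above can be checked with a few lines of Python. The documents and field name below are invented for the example; the prefix match stands in for what a taxonomy:Products/Cases/Laptops* prefix query would do in Solr.

```python
# Each doc carries its full taxonomy paths; "Laptops" under Computers and
# under Cases remain distinct because the whole path is the identifier.
docs = [
    {"id": 1, "taxonomy": ["Products/Computers/Laptops"]},
    {"id": 2, "taxonomy": ["Products/Cases/Laptops"]},
    {"id": 3, "taxonomy": ["Products/Computers/Desktops"]},
]

def prefix_search(docs, prefix):
    """Return ids of docs with at least one taxonomy path under `prefix`."""
    return [d["id"] for d in docs
            if any(path.startswith(prefix) for path in d["taxonomy"])]

prefix_search(docs, "Products/Cases/Laptops")  # matches only doc 2
prefix_search(docs, "Products/Computers")      # matches docs 1 and 3
```

Note that a plain startswith match also sweeps in sibling nodes whose names share a prefix (e.g. "Products/Cases" vs. "Products/CasesDeluxe"), which is why the post's caveat about proper tokenization matters in a real index.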
Heritrix and Solr
I'm looking for a web crawler to use with Solr. The objective is to crawl about a dozen public web sites on a specific topic. After a lot of googling, I came across Heritrix, which seems to be the most robust, well-supported open source crawler out there. Heritrix has an integration with Nutch (NutchWax), but not with Solr. I'm wondering if anybody can share any experience using Heritrix with Solr. It seems that there are three options for integration:

1. Write a custom Heritrix Writer class which submits documents to Solr for indexing.
2. Write an ARC-to-Solr input XML format converter to import the ARC files.
3. Use the filesystem mirror writer and then another program to walk the downloaded files.

Has anybody looked into this, or have any suggestions on an alternative approach? The optimal answer would be "You dummy, just use XXX to crawl your web sites - there's no 'integration' required at all." Can you believe the temerity? What a poltroon.

Yours in Revolution,
George
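[Editor's note] Options 1 and 2 above both come down to turning crawled pages into Solr's update XML (<add><doc><field .../></doc></add>) and POSTing it to the update handler. A minimal builder in Python; the field names (id, url, text) and the localhost URL in the comment are assumptions that would have to match your actual schema and deployment.

```python
from xml.etree import ElementTree as ET

def to_solr_add_xml(pages):
    """Build a Solr <add> update document from a list of field dicts."""
    add = ET.Element("add")
    for page in pages:
        doc = ET.SubElement(add, "doc")
        for name, value in page.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

xml = to_solr_add_xml([
    {"id": "1", "url": "http://example.com/", "text": "hello world"},
])
# POST `xml` with Content-Type: text/xml to your Solr update handler,
# e.g. http://localhost:8983/solr/update, then send a <commit/>.
```

Using an XML library rather than string concatenation matters here because crawled page text routinely contains characters (&, <, >) that must be escaped.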
Re: Heritrix and Solr
Otis:

There are many reasons I prefer Solr to Nutch:

1. I actually tried to do some of the crawling with Nutch, but found the crawling options less flexible than I would have liked.
2. I prefer the Solr approach in general. I have a long background in Verity and Autonomy search, and Solr is a bit closer to them than Nutch.
3. I really like the schema support in Solr.
4. I really, really like the facets/parametric search in Solr.
5. I really, really, really like the REST interface in Solr.
6. Finally, and not to put too fine a point on it, Hadoop frightens the bejeebers out of me. I've skimmed some of the papers, and it looks like a lot of study before I will fully understand it. I'm not saying I'm stupid and lazy, but if the map-reduce algorithm fits, I'll wear it. Plus, I'm trying to get a mental handle on Jeff Hawkins' HTM and its application to the real world. It all makes my cerebral cortex itchy.

Thanks for the suggestion, though. I'll probably revisit Nutch if Heritrix lets me down. I had no luck getting the Nutch crawler Solr patch to work, either. Sadly, I'm the David Lee Roth of Java programmers - I may think that I'm hard-core, but I'm not, really. And my groupies are getting a bit saggy.

BTW - add my voice to the paeans of praise for Lucene in Action. You and Erik did a bang-up job, and I surely appreciate all the feedback you give on this forum, especially over the past few months as I feel my way through Solr and Lucene.

On Nov 22, 2007, at 10:10 PM, Otis Gospodnetic wrote:

The answer to that question, Norberto, would depend on versions. George: why not just use straight Nutch and forget about Heritrix?
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Norberto Meijome [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Thursday, November 22, 2007 5:54:32 PM
Subject: Re: Heritrix and Solr

On Thu, 22 Nov 2007 10:41:41 -0500, George Everitt [EMAIL PROTECTED] wrote:

After a lot of googling, I came across Heritrix, which seems to be the most robust, well-supported open source crawler out there. Heritrix has an integration with Nutch (NutchWax), but not with Solr. I'm wondering if anybody can share any experience using Heritrix with Solr.

Going out on a limb here... both Nutch and Solr use Lucene for the actual indexing / searching. Would the indexes generated with Nutch be compatible with / readable by Solr?

_
{Beto|Norberto|Numard} Meijome

"Why do you sit there looking like an envelope without any address on it?" - Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been warned.
Re: Can you parse the contents of a field to populate other fields?
I'm not sure I fully understand your ultimate goal or Yonik's response. However, in the past I've been able to represent hierarchical data as a simple enumeration of delimited paths:

<field name="taxonomy">root</field>
<field name="taxonomy">root/region</field>
<field name="taxonomy">root/region/north america</field>
<field name="taxonomy">root/region/south america</field>

Then, at response time, you can walk the result facet and build a hierarchy with counts that can be put into a tree view. The tree can be any arbitrary depth, and documents can live in any combination of nodes on the tree.

In addition, you can represent any arbitrary name/value pair (attribute/tuple) as a two-level tree. That way, you can put any combination of attributes in the facet and parse them out at results-list time. For example, you might be indexing computer hardware: memory, bus speed, and resolution may be valid for some objects but not for others. Just put them in a facet and specify a separator:

<field name="attribute">memory:1GB</field>
<field name="attribute">busspeed:133Mhz</field>
<field name="attribute">voltage:110/220</field>
<field name="attribute">manufacturer:Shiangtsu</field>

When you do a facet query, you can easily display the categories appropriate to the object, and do facet selections like "show me all green things" and "show me all size 4 things". Even if that's not your goal, this might help someone else.

George Everitt

On Nov 7, 2007, at 3:15 PM, Kristen Roth wrote:

So, I think I have things set up correctly in my schema, but it doesn't appear that any logic is being applied to my Category_# fields - they are being populated with the full string copied from the Category field (facet1::facet2::facet3...facetn) instead of just facet1, facet2, etc. I have several different field types, each with a different regex to match a specific part of the input string.
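[Editor's note] The response-time step George describes - walking a facet of delimited paths and folding it into a nested tree with counts - can be sketched in a few lines of Python. The facet values and counts below are invented for the example; the separator and node layout are one possible design, not Solr output format.

```python
def build_tree(facet_counts, sep="/"):
    """Fold (path, count) pairs into a nested tree of counts."""
    root = {}
    for path, count in facet_counts:
        node = root
        for part in path.split(sep):
            entry = node.setdefault(part, {"_count": 0, "_children": {}})
            entry["_count"] += count  # every ancestor accumulates the count
            node = entry["_children"]
    return root

facets = [
    ("root/region/north america", 10),
    ("root/region/south america", 4),
]
tree = build_tree(facets)
# tree["root"]["_count"] is 14: both leaf counts roll up through "root"
```

Because every path segment becomes a node, the same code handles trees of arbitrary depth, and the two-level attribute:value trick from the post is just a depth-2 special case with sep=":".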
In this example, I'm matching facet1 in input string facet1::facet2::facet3...facetn:

<fieldtype name="cat1str" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="^([^:]+)" group="1"/>
  </analyzer>
</fieldtype>

I have copyFields set up for each Category_# field. Anything obviously wrong?

Thanks!
Kristen

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley
Sent: Wednesday, November 07, 2007 9:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Can you parse the contents of a field to populate other fields?

On 11/6/07, Kristen Roth [EMAIL PROTECTED] wrote:

Yonik - thanks so much for your help! Just to clarify; where should the regex go for each field?

Each field should have a different field type (referenced by the type XML attribute). Each fieldType can have its own analyzer, and you can use a different PatternTokenizer (which specifies a regex) for each analyzer.

-Yonik
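[Editor's note] The regex in the schema snippet above can be sanity-checked outside Solr. In plain Python, ^([^:]+) with group 1 captures everything up to the first colon, i.e. facet1 out of the delimited string (a later Category_# field would use a different group or pattern):

```python
import re

def first_facet(value):
    """Capture the leading segment before the first colon, per ^([^:]+)."""
    m = re.match(r"^([^:]+)", value)
    return m.group(1) if m else None

first_facet("facet1::facet2::facet3")  # -> "facet1"
```

If the regex tests out but the field still shows the whole string, the problem lies elsewhere in the pipeline rather than in the pattern itself.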
Re: [slightly ot] Looking for Lucene/Solr consultant in Germany
Dear Jan,

I just saw your post on the Solr mailing list. I hope I'm not too late.

First off, I don't exactly match your required qualifications. I do have 9 years at Verity and 1 year at Autonomy in enterprise search, however. I'm in the middle of coming up to speed on Solr and applying my considerable expertise in general enterprise search to the Solr/Lucene platform. So, your specific requirements for a Lucene/Solr expert are not quite met, but I've been in the business of enterprise search for 10 years. Think of it as asking an Oracle expert to look at your MySQL implementation.

My normal rate is USD 200/hour, and I do command that rate more often than not. But I'd be interested in taking on this challenge in my spare time, free of charge, just to get my bearings and to see how my consulting skills translate from the closed-source Verity/IDOL world to the open source world. I think this could be beneficial to both of us: I would get some expertise in specific Solr idiosyncrasies, and you would get the benefit of 10 years of general enterprise search experience.

I've been studying Solr and Lucene, and even developing my own project using them as a basis. That being said, I expect to make some mistakes as I try to match my existing skill set with what's available in Solr. Fortunately, I found with the transition from Verity K2 to Autonomy IDOL that the underlying concepts of full-text search are pretty much universal.

Another fly in the ointment is that I live in the USA (St. Pete Beach, Florida, to be exact), so there would be some time zone issues. Also, I don't speak German, which will be a handicap when it comes to analyzing stemming options. If you can live with those limitations, I'd be happy to help. Let me know if you're interested.

George Everitt
Applied Relevance LLC
[EMAIL PROTECTED]
Tel: +1 (727) 641-4660
Fax: +1 (727) 233-0672

On Aug 8, 2007, at 12:43 PM, Jan Miczaika wrote:

Hello, we are looking for a Lucene/Solr consultant in Germany.
We have set up a Lucene/Solr server (currently live at http://www.hitflip.de). It returns search results, but the results are not really very good. We have been tweaking the parameters a bit, following suggestions from the mailing list, but are unsure of the effects this has. We are looking for someone to do the following:

- analyse the search patterns on our website
- define a methodology for assessing the quality of search
- analyse the data we have available
- specify which data is required in the index
- modify the search patterns used to query the data
- test and evaluate the results

The requirements: deep knowledge of Lucene/Solr, examples of implemented working search engines, theoretical knowledge. Is anyone interested? Please feel free to circulate this offer.

Thanks in advance,
Jan

--
Geschäftsführer / Managing Director
Hitflip Media Trading GmbH
Gürzenichstr. 7, 50667 Köln
http://www.hitflip.de - new: http://www.hitflip.co.uk
Tel. +49-(0)221-272407-27
Fax +49-(0)221-272407-22
HRB 59046, Amtsgericht Köln
Geschäftsführer: Andre Alpar, Jan Miczaika, Gerald Schönbucher