Overall
Hi! Some questions:
1) Is it possible to make Solr use, for example, a MySQL database, or does it only support *.xml files as a database?
2) Is there a way to add data to the search database using some online interface, or is the only way to manually add the data to the *.xml files?
3) Is there any guide on how to integrate Solr into a web site?
Best regards, Mihails
Re: Overall
2008/6/9 Mihails Agafonovs [EMAIL PROTECTED]:
> 1) Is it possible to make Solr to use, for example, MySQL database, or it only supports *.xml files as a database?
You can use DataImportHandler to index from MySQL (or other databases).
> 2) Is there a way to add data in the search database using some online interface, or the only way is manually adding the data in the *.xml files?
You can generate the XML from a program that reads from the data source, or use one of the Solr clients (Java, Python, Ruby) to update the index through the provided APIs.
> 3) Is there any guide on how to implement Solr to the web-site?
Whatever documentation exists is in the wiki and the mailing list archives. If you can't find it there, I am afraid it is not available.
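The "generate the XML from a program" route mentioned above can be sketched roughly as follows. This is only an illustration, not part of Solr: the rows and the field names `id` and `title` are made up (real field names have to match your schema.xml), and the function only builds the `<add>` message. Sending it is a separate step, normally an HTTP POST to Solr's /update URL followed by a `<commit/>`.

```python
import xml.etree.ElementTree as ET

def rows_to_add_xml(rows):
    """Build a Solr <add> message from an iterable of dicts (one dict per document)."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")

# Hypothetical rows, shaped like what a MySQL cursor might hand back.
rows = [
    {"id": 1, "title": "First post"},
    {"id": 2, "title": "Tom & Jerry"},
]
xml_message = rows_to_add_xml(rows)
print(xml_message)
```

Building the message with an XML library rather than string concatenation means special characters in the database values are escaped for you.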
Re: Overall
1) ok
2) This means developing some custom program, so there is no such functionality in Solr :(
3) I have some connection problems and I really can't load these mailing list archives at all! Anyway, I want to understand: how can I use Solr on my site, or in any other way?
Best regards, Mihails
Quoting Umar Shah:
> you can use DataImportHandler to index from MySql (or other databases)
> you can generate the XMLs from a program that can read from the data source or use some of the solr clients (java, python, ruby) to update the index using the provided APIs.
> whatever you have is in the wiki and the mailing archives, if you cant find it there, I am afraid it is not available.
Re: Overall
Hi Mihails,
I don't know about points 1 and 2, as I'm just starting with Solr, but for point 3 you need to understand that Solr just returns XML for your queries, so you can use any web language to parse the XML results. It might return other formats such as JSON as well (I haven't figured this out yet), but it's not intended to give you a full-blown web page. That part is up to you; Solr will just give you the data.
- d
On 9 Jun 2008, at 11:06, Mihails Agafonovs wrote:
> 1) ok
> 2) This means developing some custom program, so there is no such functionality in Solr :(
> 3) I have some connection problems and I really can't load these mailing list archives at all! Anyway, I want to understand, how can I use Solr in my site or any other usage?
-- Dominic Stockdale
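To make the "Solr returns XML, your site renders it" point concrete, here is a minimal sketch of turning a Solr XML response into plain Python dictionaries that a web page template could then display. The sample response is hand-written for illustration (field names `id` and `title` are assumptions), and the parser only handles single-valued fields; multi-valued `<arr>` fields would need extra handling.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of a Solr XML response.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="id">1</str><str name="title">First post</str></doc>
    <doc><str name="id">2</str><str name="title">Second post</str></doc>
  </result>
</response>"""

def parse_docs(xml_text):
    """Return one dict per <doc>, mapping field name to its text value."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.findall("./result/doc"):
        docs.append({field.get("name"): field.text for field in doc})
    return docs

docs = parse_docs(SAMPLE)
for d in docs:
    print(d["id"], d["title"])
```

In a real site the XML text would come from an HTTP request to Solr's /select URL instead of a constant.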
setAllowLeadingWildcard
Hello list,
I really need to set setAllowLeadingWildcard to true and I'm wondering if you can advise me on the best way to do this. I am a newbie, so forgive me if I'm being a dummy. I've established that it's not settable in version 1.2.0, which seems to be quite old, so I've been looking through trunk to see what's there. It now seems to be an option, but the current version in trunk appears to be broken. Does anyone know an svn revision number that works and where setAllowLeadingWildcard is settable? Or, if I'm going about this the wrong way, is there another way to set it to true?
Thanks - Dom
DataImport
Looked through the tutorial on data import, section Full Import Example.
1) Where is this dataimport.jar? There is no such file in the extracted example-solr-home.jar.
2) "Use the solr folder inside example-data-config folder as your solr home." What does this mean? Anyway, there is no folder example-data-config.
Best regards, Mihails
Re: DataImport
1. Correct, there is no jar. You can use the solr.war file. If you really need a jar, you'll need to use the SOLR-469.patch at http://issues.apache.org/jira/browse/SOLR-469 and build solr from source after applying that patch.
2. The jar contains a folder named example-solr-home. Please check again.
Please let me know if you run into any problems.
2008/6/9 Mihails Agafonovs [EMAIL PROTECTED]:
> Looked through the tutorial on data import, section Full Import Example.
> 1) Where is this dataimport.jar? There is no such file in the extracted example-solr-home.jar.
> 2) Use the solr folder inside example-data-config folder as your solr home. What does this mean? Anyway, there is no folder example-data-config.
-- Regards, Shalin Shekhar Mangar.
Solr system and numbers
Hello experts,
How does Solr deal with numbers or phone numbers? For example, if you have 1234 and 12 34 or 1 234, with spaces between the digits. Or is this handled by Lucene? Is there any documentation or a tutorial on this?
Many thanks, ak
Re: DataImport
I've placed the solr.war under the tomcat directory and restarted tomcat to deploy it. But still... there is no .jar, no folder named example-data-config, and hitting http://localhost:8983/solr/dataimport doesn't work. Do I need the original Solr instance to use this .war with?
Best regards, Mihails
Quoting Shalin Shekhar Mangar:
> 1. Correct, there is no jar. You can use the solr.war file. If you really need a jar, you'll need to use the SOLR-469.patch at http://issues.apache.org/jira/browse/SOLR-469 and build solr from source after applying that patch.
> 2. The jar contains a folder named example-solr-home. Please check again.
> Please let me know if you run into any problems.
RE: Solr system and numbers
Great info, thanks a lot all.
> Date: Mon, 9 Jun 2008 05:58:50 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: Solr system and numbers
> To: solr-user@lucene.apache.org
> Hi, Solr/Lucene can treat phone numbers as strings. If you want to clean them up and normalize them outside of Solr, you can do that and feed them into Solr as pure numbers. How the phone numbers will be treated after you pump them into Solr depends on the analyzer you choose to use for this data. If you don't need to search on subsets of phone numbers, then just don't tokenize them (i.e. use string type if the phone numbers contain any non-numeric characters, sint otherwise).
> Otis
> -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> ----- Original Message -----
> From: dudes dudes
> To: solr-user@lucene.apache.org
> Sent: Monday, June 9, 2008 2:10:20 PM
> Subject: Solr system and numbers
>> Hello experts, How does Solr deal with numbers or phone numbers .. For example if you have 1234 and 12 34 or 1 234... with spaces between the numbers .. Or this is dealt by lucene ? any documentations or tutorial on this ? many thanks, ak
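The suggestion to clean up and normalize the numbers outside of Solr could look roughly like the sketch below. It assumes the simplest possible rule, which is to drop every non-digit character, so it also discards a leading "+" or an extension marker; whether that is acceptable depends on your data.

```python
import re

def normalize_number(raw):
    """Drop everything that is not a digit, so '12 34' and '1 234' both become '1234'."""
    return re.sub(r"\D", "", raw)

for raw in ["1234", "12 34", "1 234"]:
    print(repr(raw), "->", normalize_number(raw))
```

With the values normalized before indexing (and the same function applied to user queries), all three spellings from the original question end up as the same indexed value.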
Re: Problems in solrJ trunk
Well, there is a simple case here. I tried to update SolrJ to the latest version and it broke the application I had selected for testing. So I developed an alternative interface for SolrServer and a wrapper around CommonsHttpSolrServer, altered my application to use it, and everything is working nicely.
When you use good practices like IoC or AOP, it is preferable to program against interfaces. Sorry, but I find it really bad to go from an interface to an abstract class. In my modest opinion, SolrServer should have both: SolrServer being an interface and AbstractSolrServer implementing it. CommonsHttpSolrServer would extend AbstractSolrServer.
If the dev team really thinks interfaces are an ass, I think we will have problems using Solr with other advanced OO features.
2008/6/7 Ryan McKinley [EMAIL PROTECTED]:
> solrj was not released in 1.2, so the change is not incompatible...
> The rationale for abstract class vs interface is more to do with usage and future maintenance. If SolrServer is an interface and solr 1.4 adds methods, there is no way to make it backwards compatible -- as an abstract class, we can add a reasonable default behavior. Since SolrServer is a rather involved action, it seems like it will tend to be a standalone class. Interfaces are great for OO clarity, but very difficult to maintain.
> Is there a good usage case we are not thinking of before this gets baked into 1.3?
> ryan
> On Jun 7, 2008, at 2:13 PM, Alexander Ramos Jardim wrote:
>> Hello, Shouldn't SolrServer be an interface that externalizes the signatures for classes like CommonsHttpSolrServer, like it was in solr-1.2? Why did it become an abstract class? I can't see any benefit from it, as now I need to type the object as CommonsHttpSolrServer directly. I think it is really bad.
>> -- Alexander Ramos Jardim
-- Alexander Ramos Jardim
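The "have both" arrangement proposed here, together with the backwards-compatibility argument from the quoted reply, can be illustrated with a small sketch. It is written in Python rather than Java and is only an analogue of the design under discussion, not SolrJ code: `FakeHttpSolrServer`, `count_hits` and the `ping` method are hypothetical names made up for the example.

```python
from abc import ABC, abstractmethod

class SolrServer(ABC):
    """Plays the role of the interface: signatures only, no behavior."""
    @abstractmethod
    def query(self, q):
        ...

class AbstractSolrServer(SolrServer):
    """Base class that can gain new methods with a default behavior later."""
    def ping(self):
        # Hypothetical method added in a later release. Because it has a
        # default built on query(), existing subclasses keep working.
        return self.query("*:*") is not None

class FakeHttpSolrServer(AbstractSolrServer):
    """Stand-in for CommonsHttpSolrServer; returns canned data, does no HTTP."""
    def query(self, q):
        return {"q": q, "numFound": 0}

def count_hits(server: SolrServer, q):
    # Callers depend only on the interface type, which suits IoC containers.
    return server.query(q)["numFound"]

server = FakeHttpSolrServer()
print(count_hits(server, "title:solr"), server.ping())
```

The subclass was written without knowing about `ping`, yet it inherits a working default, which is the maintenance benefit argued for the abstract class; code that only needs `query` can still be typed against the interface.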
Re: DataImport
No, the steps are as follows:
1. Download the example-solr-home.jar from the DataImportHandler wiki page.
2. Extract it. You'll find a folder named example-solr-home and a solr.war file after extraction.
3. Copy the solr.war to tomcat_home/webapps. You don't need any other solr instance. This war is self-sufficient.
4. You need to set the example-solr-home/solr folder as the solr home folder. For instructions on how to do that, look at http://wiki.apache.org/solr/SolrTomcat
From the port number of the URL you are trying, it seems that you're using the Jetty supplied with Solr instead of Tomcat.
2008/6/9 Mihails Agafonovs [EMAIL PROTECTED]:
> I've placed the solr.war under the tomcat directory, restarted tomcat to deploy the solr.war. But still... there is no .jar, no folder named example-data-config, and hitting http://localhost:8983/solr/dataimport doesn't work. Do I need the original Solr instance to use this .war with?
-- Regards, Shalin Shekhar Mangar.
Re: Problems in solrJ trunk
Hi,
This interface vs. abstract class and maintenance/backwards compatibility question comes up pretty often. I suggest using markmail.org and searching for things like:
    interface abstract solr -jira
    interface abstract lucene -jira
I think that will lead to some explanations without anyone having to go into this discussion again.
Otis
-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message -----
From: Alexander Ramos Jardim [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, June 9, 2008 3:19:36 PM
Subject: Re: Problems in solrJ trunk
> In my modest opinion, SolrServer should have both. Like SolrServer being an interface and AbstractSolrServer implementing it. CommonsHttpSolrServer would extend AbstractSolrServer.
Re: Problems in solrJ trunk
Exactly. And add the new methods to the abstract class in minor releases, and to the interface in major releases.
Regards, Lucas
Lucas Frare A. Teixeira
Alexander Ramos Jardim wrote:
> In my modest opinion, SolrServer should have both. Like SolrServer being an interface and AbstractSolrServer implementing it. CommonsHttpSolrServer would extend AbstractSolrServer.
Re: Problems in solrJ trunk
Thank you Lucas, you caught my point nicely and even had a clearer idea of what to do.
Sorry, Solr dev team, but I don't think there is any reasonable excuse for framing this as interface vs. abstract class, as they complement each other and don't have the same role in OOP.
Anyway, Solr is a great app. I just don't think it follows the best programming practices, but that is just me. Let's not turn this into a flame war or a big discussion.
2008/6/9 Lucas F. A. Teixeira [EMAIL PROTECTED]:
> Exactly, And adding the methods in the abstract class in the minor releases, and in the interface in major releases.
-- Alexander Ramos Jardim
solrj client in maven repository?
Hello all,
I'm new to Solr and have a question about the Java client. Is it going to be available from the central Maven repository? I had a look and saw that it is under development (1.3 dev), but someone may have the answer. I built the trunk, and the solrj code seems to be separated from the Solr server's code.
Best regards, Zsolt
Re: NullPointerException at lucene.analysis.StopFilter with 1.3
> : I'm just looking into transitioning from solr 1.2 to 1.3 (trunk). I
> : have some legacy handler code (called AdvancedRequestHandler) that
> : used to work with 1.2 but now throws an exception using 1.3 (latest
> : nightly build).
> This is an interesting use case that wasn't really considered when we switched away from using the SolrCore singleton ... When I have some more time, I'll spin up a thread on solr-dev to discuss what we should do about this -- in the meantime feel free to file a bug that StopFilter isn't backwards compatible.
Created SOLR-594 for this issue.
> FWIW: constructing a new TokenizerChain inside your RequestHandler's handleRequest method seems unnecessary. If nothing else, you could do this in your init method and reuse the TokenizerChain on every request. But if it were me, I'd just use the schema.xml to declare a fieldtype that had the behavior I want, and then use schema.getFieldType(specialType).getQueryAnalyzer().tokenStream(...)
I actually had a single reusable version, but flattened it back out in the code snippet for clarity. But thanks for the tactful suggestion. :-) I didn't know that you could fetch the tokenizer chain directly from the schema (how cool), which was what was originally desired -- the constructed tokenizer was just mirroring an existing field. I appreciate the tip, Hoss -- much cleaner!
r
XSL scripting
This started out in the num-docs thread, but deserves its own. And a wiki page.
There is a more complex and general way to get the number of documents in the index. I run a query against solr and postprocess the output with an XSL script.
Install this xsl script as home/conf/xslt/numfound.xsl.
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <xsl:template match="/">
        <xsl:value-of select="response/result/@numFound"/>
        <xsl:text>&#x0A;</xsl:text>
      </xsl:template>
    </xsl:stylesheet>
Make sure 'curl' is installed, and add numfound.sh, a unix shell script.
    SHARD=localhost:8080/solr
    QUERY="$1"
    LINK="http://$SHARD/select?indent=on&version=2.2&q=$QUERY&start=0&rows=0&fl=*&wt=xslt&tr=numfound.xsl"
    curl --silent "$LINK" -H 'Content-Type:text' -X GET
Run it as
    sh numfound.sh '*:*'
How to install the XSLT script is to be found on the Wiki. Star-colon-star is magic for 'all records'. XSL is appalling garbage. Cheers!
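For anyone who would rather not use XSL, the same lookup can be sketched on the client side in Python. This is an alternative to the script above, not a transcription of it: it asks for zero rows, reads the plain XML response, and pulls out the numFound attribute. The host and port are copied from the shell script; the sample response is hand-written so the parsing can be tried without a running Solr.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def select_url(base, query):
    """Build a /select URL that asks for zero rows; urlencode escapes the query."""
    params = {"q": query, "start": 0, "rows": 0, "fl": "*", "version": "2.2"}
    return base + "/select?" + urlencode(params)

def num_found(xml_text):
    """Read the numFound attribute from a Solr XML response."""
    result = ET.fromstring(xml_text).find("./result")
    return int(result.get("numFound"))

url = select_url("http://localhost:8080/solr", "*:*")
print(url)

# Hand-written sample response. Against a live server you would fetch the
# text with urllib.request.urlopen(url).read() instead.
SAMPLE = '<response><result name="response" numFound="42" start="0"/></response>'
print(num_found(SAMPLE))
```

Letting `urlencode` build the query string avoids the quoting problems that arise when characters such as `&` or `*` pass through a shell.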
Re: solrj client in maven repository?
It is not in a central repo yet, though this has been requested. See the issue I filed here: https://issues.apache.org/jira/browse/SOLR-586
If you follow the outline there, you can build/install into your own repo pretty easily.
Zsolt Czinkos-2 wrote:
> Hello all I'm new to solr, and have a question about the java client. Is it going to be available from central maven repository? I had a look, and saw that it is under development (1.3 dev), but someone may have the answer. I built the trunk and solrj code seems to be separated from solr server's code. Best regards Zsolt
html to text based on some sort of uniqueness metric
Hello,
I am indexing newspaper articles as an exercise in Solr. When dealing with newspaper articles in the past, I always tried to get the div or the table that contains the actual news, using nekohtml to traverse the DOM tree and extract the text from that div or table.
When dealing with many newspapers, it is a hassle to write custom code to extract the relevant information. There is usually a lot of garbage in the HTML, from categories to ads, and furthermore the layouts change, so static coding is problematic.
I have been wondering whether I could measure the frequency or uniqueness of each node and find the news automatically, but I have not come up with an implementation. Has anyone done, contemplated, or used something similar? Maybe there is already a way, using Lucene or even Hadoop.
Best Regards, -C.A.
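One possible reading of the "uniqueness per node" idea, sketched under strong assumptions: fetch several pages from the same newspaper, split each into text blocks, and keep only the blocks that occur on a single page, since menus, category lists and ads tend to repeat across pages. This is not an existing Lucene or Solr feature. The class and function names are made up, the list of block-level tags is arbitrary, and real pages would need fuzzier matching than exact string equality.

```python
from collections import Counter
from html.parser import HTMLParser

class BlockText(HTMLParser):
    """Collect the text found between block-level tags as separate chunks."""
    BLOCKS = {"div", "p", "td", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def flush(self):
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCKS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)

def text_blocks(html):
    parser = BlockText()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks

def unique_blocks(pages):
    """For each page, keep the blocks that appear on no other page."""
    per_page = [text_blocks(html) for html in pages]
    page_count = Counter(b for blocks in per_page for b in set(blocks))
    return [[b for b in blocks if page_count[b] == 1] for blocks in per_page]

pages = [
    "<div>Menu</div><div>Story one</div><div>Ads</div>",
    "<div>Menu</div><div>Story two</div><div>Ads</div>",
]
result = unique_blocks(pages)
print(result)
```

The more pages from the same site go into `pages`, the better the repeated navigation and advertising blocks separate from the article text.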
Re: solrj client in maven repository?
I have done mine already. It is really simple.
2008/6/9 spencer.c [EMAIL PROTECTED]:
> It is not in a central repo yet, though this has been requested. See the issue I filed here: https://issues.apache.org/jira/browse/SOLR-586 If you follow the outline there, you can build/install into your own repo pretty easily.
-- Alexander Ramos Jardim
Re: Solr system and numbers
I have a similar question: how would one normalize, or even detect, whether a string is a phone number?
On Mon, Jun 9, 2008 at 4:17 PM, dudes dudes [EMAIL PROTECTED] wrote:
> great info ,,, thanks a lot all
Re: XSL scripting
Lance,
Thanks, want to put it up on the Wiki?
Otis
-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message -----
From: Lance Norskog [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, June 9, 2008 1:12:35 PM
Subject: XSL scripting
> This started out in the num-docs thread, but deserves its own. And a wiki page. There is a more complex and general way to get the number of documents in the index. I run a query against solr and postprocess the output with an XSL script.
Re: Solr system and numbers
Not sure. Perhaps it can be done by training a language model and treating phone numbers as named entities? Not sure if it would work. But I know there are a few NLP people subscribed, maybe they'll have some good ideas.
Otis
-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message -----
From: Cam Bazz [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, June 9, 2008 4:24:48 PM
Subject: Re: Solr system and numbers
> I got a similar question: how would one normalize or even detect if a string is a phone number?
Re: html to text based on some sort of uniqueness metric
I have not looked at the code yet, but look for NovelAnalyzer in Lucene JIRA. I believe it's supposed to do something similar.
Otis
-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message -----
From: Cam Bazz [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, June 9, 2008 3:55:16 PM
Subject: html to text based on some sort of uniqueness metric
> I have been thinking if I could measure the frequency or uniqueness for each node, and find the news automatically - but I have not come up with an implementation. Has anyone did/contemplated/used something similar? Maybe there is already a way - using lucene, or even hadoop.
Re: Solr system and numbers
Doh, I forgot. Regular expressions worked well for me when I dealt with that problem many years ago. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Otis Gospodnetic [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Monday, June 9, 2008 5:36:34 PM Subject: Re: Solr system and numbers Not sure. Perhaps it can be done by training a language model and treating phone numbers as named entities? Not sure if it would work. But I know there are a few NLP people subscribed, maybe they'll have some good ideas. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Cam Bazz To: solr-user@lucene.apache.org Sent: Monday, June 9, 2008 4:24:48 PM Subject: Re: Solr system and numbers I got a similar question: how would one normalize or even detect if a string is a phone number? On Mon, Jun 9, 2008 at 4:17 PM, dudes dudes wrote: great info ,,, thanks a lot all Date: Mon, 9 Jun 2008 05:58:50 -0700 From: [EMAIL PROTECTED] Subject: Re: Solr system and numbers To: solr-user@lucene.apache.org Hi, Solr/Lucene can treat phone numbers as strings. If you want to clean them up and normalize them outside of Solr, you can do that and feed them into Solr as pure numbers. How the phone numbers will be treated after you pump them into Solr depends on the analyzer you choose to use for this data. If you don't need to search on subsets of phone numbers, then just don't tokenize them (i.e. use string type if the phone numbers contain any non-numeric characters, sint otherwise). Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: dudes dudes To: solr-user@lucene.apache.org Sent: Monday, June 9, 2008 2:10:20 PM Subject: Solr system and numbers Hello experts, How does Solr deal with numbers or phone numbers .. For example if you have 1234 and 12 34 or 1 234... with spaces between the numbers .. Or this is dealt by lucene ? any documentations or tutorial on this ? 
many thanks, ak
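The regular-expression approach mentioned in this thread could look roughly like the sketch below: it normalizes a phone-like string to bare digits before the value is fed to Solr, and returns None when the input does not look like a phone number. The function name normalize_phone, the set of accepted separator characters, and the 4 to 15 digit length bounds are all assumptions made for illustration; nothing here comes from Solr itself.

```python
import re

# Accept digits plus common separators (spaces, dots, dashes, parentheses)
# and an optional leading plus sign. This character set is an assumption.
PHONE_LIKE = re.compile(r"\+?[\d\s().-]+")


def normalize_phone(raw):
    """Return only the digits of a phone-like string, or None if it does not look like one."""
    candidate = raw.strip()
    if not PHONE_LIKE.fullmatch(candidate):
        return None
    digits = re.sub(r"\D", "", candidate)
    # Hypothetical length bounds; adjust them to the numbers you expect.
    if not 4 <= len(digits) <= 15:
        return None
    return digits


if __name__ == "__main__":
    # The variants from the original question all normalize to the same value.
    for value in ["1234", "12 34", "1 234", "not a number"]:
        print(repr(value), "->", normalize_phone(value))
```

Normalizing outside Solr like this keeps the indexed field untokenized, which matches the advice above for the case where you do not need to search on parts of a number.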
Re: Overall
2008/6/9 Mihails Agafonovs [EMAIL PROTECTED]:

Hi! Some questions:

1) Is it possible to make Solr use, for example, a MySQL database, or does it only support *.xml files as a database?

If you do that, use MySQL's own full-text search capabilities and not Solr, since Solr is built on Lucene.

2) Is there a way to add data to the search database using some online interface, or is the only way to add the data manually in the *.xml files?

You should develop your own.

3) Is there any guide on how to integrate Solr into a web site?

Solr is easy to get going. Choose your client API and begin toying with it. Good things will come fast. :-)

Best regards, Mihails

-- Alexander Ramos Jardim
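For question 2, "developing your own" can be quite small. The sketch below builds a Solr XML update message (an add element containing doc and field elements) from rows that a program might have read from a database, so the *.xml files do not have to be written by hand. The function name build_add_xml and the field names id and title are hypothetical; the fields must match whatever your schema.xml defines, and the resulting string would still have to be POSTed to Solr's update handler and followed by a commit.

```python
import xml.etree.ElementTree as ET


def build_add_xml(rows):
    """Build a Solr <add> update message from an iterable of dict-like rows."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            # One <field name="..."> element per column; ElementTree escapes
            # characters such as & and < in the text for us.
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical rows, as they might come back from a database query.
    rows = [
        {"id": 1, "title": "First document"},
        {"id": 2, "title": "Cats & dogs"},
    ]
    print(build_add_xml(rows))
```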
Re: solrj client in maven repository?
I've already installed the jars into my local repo, but the pom files are very useful. Thank you

zsolt

On Mon, Jun 9, 2008 at 10:02 PM, Alexander Ramos Jardim [EMAIL PROTECTED] wrote:

I have done mine already. It is really simple.

2008/6/9 spencer.c [EMAIL PROTECTED]:

It is not in a central repo yet, though this has been requested. See the issue I filed here: https://issues.apache.org/jira/browse/SOLR-586 If you follow the outline there, you can build/install into your own repo pretty easily.

Zsolt Czinkos-2 wrote:

Hello all, I'm new to Solr and have a question about the Java client. Is it going to be available from the central Maven repository? I had a look and saw that it is under development (1.3 dev), but someone may have the answer. I built the trunk, and the solrj code seems to be separated from the Solr server's code.

Best regards
Zsolt

-- Alexander Ramos Jardim
Re: Num docs
Exactly. I think I mentioned this once before several months ago. One can take various hardware specs (# cores, CPU speed, FSB, RAM, etc.), performance numbers, etc. and come up with a number for each server's overall capacity. As a matter of fact, I think this would be useful to have right in Solr, primarily for use when allocating and sizing shards for Distributed Search. JIRA enhancement/feature issue?

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Original Message
From: Alexander Ramos Jardim [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, June 9, 2008 6:42:17 PM
Subject: Re: Num docs

I even think that such a decision should be based on the overall machine performance at a given time, and not on the index size, unless you are talking solely about HD space and are not having any performance issues.

2008/6/7 Otis Gospodnetic :

Marcus, for that you can rely on du, vmstat, iostat, top and such, too. :)

Otis

Original Message
From: Marcus Herou
To: solr-user@lucene.apache.org
Sent: Saturday, June 7, 2008 12:33:10 PM
Subject: Re: Num docs

Thanks, I wanna ask the indices how much more each shard can handle before they're considered full and scream for a budget to get a new machine :)

/M

On Sat, Jun 7, 2008 at 3:07 PM, Otis Gospodnetic wrote:

Marcus, check out the Luke request handler. You can get it from its output. It may also be possible to get *just* that number, but I'm not looking at docs/code right now to know for sure.

Otis

Original Message
From: Marcus Herou
To: solr-user@lucene.apache.org
Sent: Saturday, June 7, 2008 5:09:20 AM
Subject: Num docs

Hi. Is there a way to retrieve IndexWriter.numDocs() in Solr?
Kindly //Marcus -- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 [EMAIL PROTECTED] http://www.tailsweep.com/ http://blogg.tailsweep.com/ -- Alexander Ramos Jardim
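As a rough illustration of the Luke request handler suggestion in this thread, the sketch below builds a request URL and pulls numDocs out of the handler's XML response. Several things are assumptions: that the handler is registered at /admin/luke (as in the example solrconfig.xml), that numTerms=0 keeps the response small by skipping per-field term statistics, and that the response has an "index" section with a numDocs entry. The SAMPLE_RESPONSE string is hand-written and heavily trimmed for illustration, not captured from a real server, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def luke_url(base_url):
    """Build a Luke request URL for a Solr instance rooted at base_url."""
    # numTerms=0 is assumed to skip the (expensive) top-terms computation.
    return base_url.rstrip("/") + "/admin/luke?" + urlencode({"numTerms": 0})


def num_docs(luke_xml):
    """Extract numDocs from a Luke XML response, or None if it is missing."""
    root = ET.fromstring(luke_xml)
    node = root.find("./lst[@name='index']/int[@name='numDocs']")
    return int(node.text) if node is not None else None


# Hand-written, trimmed sample of the assumed response shape.
SAMPLE_RESPONSE = """<response>
  <lst name="index">
    <int name="numDocs">1234</int>
    <int name="maxDoc">1300</int>
  </lst>
</response>"""

if __name__ == "__main__":
    print(luke_url("http://localhost:8983/solr"))
    print(num_docs(SAMPLE_RESPONSE))
```

Against a live server you would fetch the URL (for example with urllib.request.urlopen) and pass the response body to num_docs.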
searching only within allowed documents
Hi, I'm new to Solr (and Lucene) and I'm trying to work out just how I could fit this technology into my app (I'm moving over from using MySQL fulltext indexes).

Things are actually going really well - the facet functionality fits in just perfectly, and the basic full-text searching is working very well for me as well, especially considering that I'm trying to index several languages at once. It's really much, much faster than MySQL. Somehow, I thought that would be the hard part! Unfortunately, I'm getting tripped up on something that seems far more complicated...

So, there are two kinds of searches you can do in this application. There's an Advanced Search and a basic Text Search. For the Advanced Search, users pick out one or more sets of documents which they are allowed to see, and some set of tags to filter by, and they get a list of documents. This part is easy, I can do all of this with the functionality I picked up reading the docs and tutorials, and since my application is handling what sets of documents my users can choose, Solr doesn't need to know anything about the permissions model.

The text search is where I'm running into trouble. Right now, the application automatically filters the documents to search through with a join in MySQL. In order to do this through Solr, I need to figure out a good way for Solr to know what sets of documents in which to search. Here's what I have so far:

1) Each document has a field folder_id, which contains one value, which is the ID of the folder to which the document belongs. There are right now about 6000 different folders altogether.

2) Each user is permitted to see documents from a particular subset of folders. Some users can see only 100-200 folders, some users can see 4000-5000 folders (all depends on what they have subscribed to).

In the advanced search, in order to restrict the available documents, I use a filter query: fq=folder_id:1 OR folder_id:2 etc...

In the advanced search, the user is only ever searching through a max of 80 or 90 folders (and usually more like 1 or 2), so this seems quite workable. However, in the plain text search, the user automatically searches through *all* of the folders to which they have subscribed. This means, for (good!) users who have subscribed to a large (1000+) number of folders, the filter query would be quite long, and would exceed the default number of boolean clauses allowed. Of course, I could just increase the limit, but the fact that a limit is there in the first place leads me to believe this is probably not the most scalable solution.

Now, I'm reading on this tutorial page for Lucene: http://www.lucenetutorial.com/techniques/permission-filtering.html that the best way to do this would involve some combination of HitCollector and FieldCache. From what the author is saying, this sounds like exactly what I need. Unfortunately, I am almost completely Java-illiterate, and on top of that, I'm not really finding any explanation of:

a) What exactly I would do with the HitCollector and FieldCache objects that would help me achieve this goal - even just at the level of Lucene, there's no real explanation in the tutorial, or

b) Where exactly these classes fit in to Solr (if they do at all)

So far I have already written my own (tiny, tiny) Tokenizer and TokenizerFactory for correctly parsing the tags that come in from the database, and that works great, so I'm thinking, if there's something I can sub-class or modify somewhere to get this working, even with my meager Java knowledge I could do it... But I have no clue even where to start with this. Do I need my own custom version of SolrIndexSearcher, or SolrRequestHandler... or some other class I haven't even gotten to yet?

If it helps, I am using version 1.2, and trying to integrate this with a LAMP-based application. I already have hooks set up to allow PHP to index documents, query Solr, and parse responses.

Since everything else is already working so well, and it's just a matter of getting permissions working, I would really, really like to stick with Solr. Has anyone done anything like this or can point me in the right direction? I can figure out the mechanics of getting the list of allowed folder_ids to Solr; all I really need to know is what kind of modifications I would need to make, and where, to get Solr to limit the search to a particular subset of documents without using a gigantic filter query.

Many thanks for any advice. My apologies if this has been asked a million times before. I am new to the list; however, I did read and search through the archives and didn't really find anything on this subject.

Best regards, Steve
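To make the filter-query approach described in this message concrete, here is a minimal client-side sketch that builds the fq value from a user's allowed folder IDs and refuses to build it once the clause count would pass the boolean-clause limit. It only illustrates the problem Steve describes and does not replace the HitCollector/FieldCache idea. The names folder_filter and search_params are hypothetical, and 1024 is assumed to be the limit in effect (it is Lucene's default maximum clause count, but your solrconfig.xml may set a different value).

```python
from urllib.parse import urlencode

# Assumed limit: Lucene's default maximum number of boolean clauses.
MAX_BOOLEAN_CLAUSES = 1024


def folder_filter(folder_ids, limit=MAX_BOOLEAN_CLAUSES):
    """Build a filter query such as 'folder_id:(1 OR 2 OR 3)' from folder IDs."""
    # De-duplicate and sort so the same set of folders always yields the
    # same string, which should help the filter be reused from the cache.
    ids = sorted({int(i) for i in folder_ids})
    if not ids:
        raise ValueError("user has no readable folders")
    if len(ids) > limit:
        raise ValueError(
            "%d folders exceed the boolean clause limit of %d" % (len(ids), limit)
        )
    return "folder_id:(" + " OR ".join(str(i) for i in ids) + ")"


def search_params(text, folder_ids):
    """Return URL-encoded query parameters for a permission-filtered search."""
    return urlencode({"q": text, "fq": folder_filter(folder_ids)})


if __name__ == "__main__":
    print(folder_filter([3, 1, 2, 1]))
    print(search_params("solr", [1, 2]))
```

A user subscribed to 4000-5000 folders would hit the ValueError here, which is exactly the case where either the limit has to be raised or a different filtering mechanism is needed.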
Re: Overall
2) Take a look at DataImportHandler for indexing data at http://wiki.apache.org/solr/DataImportHandler

2008/6/10 Alexander Ramos Jardim [EMAIL PROTECTED]:

2008/6/9 Mihails Agafonovs [EMAIL PROTECTED]:

Hi! Some questions:

1) Is it possible to make Solr use, for example, a MySQL database, or does it only support *.xml files as a database?

If you do that, use MySQL's own full-text search capabilities and not Solr, since Solr is built on Lucene.

2) Is there a way to add data to the search database using some online interface, or is the only way to add the data manually in the *.xml files?

You should develop your own.

3) Is there any guide on how to integrate Solr into a web site?

Solr is easy to get going. Choose your client API and begin toying with it. Good things will come fast. :-)

Best regards, Mihails

-- Alexander Ramos Jardim

-- Regards, Shalin Shekhar Mangar.