LSH in Solr/Lucene
Hi folks, have any of you successfully implemented LSH (MinHash) in Solr? If so, could you share some details of how you went about it? I know LSH is available in Mahout, but I was hoping someone has a Solr or Lucene implementation. Thanks
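For anyone else looking: the core MinHash idea is small enough to prototype outside Solr first. A minimal sketch in plain Python (illustrative only — not a Solr/Lucene integration; a real deployment would band the signature for LSH bucketing):

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value seen over the token set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the
    Jaccard similarity of the underlying token sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

a = minhash_signature({"solr", "lucene", "search", "index"})
b = minhash_signature({"solr", "lucene", "search", "query"})
print(estimated_jaccard(a, a))  # identical sets -> 1.0
print(estimated_jaccard(a, b))  # roughly the true Jaccard (3/5)
```

In Solr terms, the signature values could be indexed as terms in a multivalued field, so that term overlap approximates set similarity.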
Searching Numeric Data
Hi all, I have a use-case where I would need to search a set of numeric values, using a query set. My business case is: 1. I have various rock samples from various locations {R1...Rn}, with multiple measurements like Porosity [255] - an array of values, Conductivity [1028] - also an array of numbers, and several other such metrics. They are arrays because measurements are taken under various ambient conditions. 2. For a new rock sample Rn+1 I would like to query Solr and get a ranked list of samples, ordered by their multidimensional similarity. I was thinking of using Solr to perform this query, by representing the numeric arrays as text and creating a document for each sample with fields for each of the measurements. Has anyone approached a problem in such a fashion? If so, could you share some details about your approach. Regards Shashi -- sk...@alum.mit.edu (604) 446-2460
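One way to make "numeric arrays as text" concrete is to quantize each measurement into bin tokens, so that ordinary term matching (tf-idf/BM25) approximates similarity: samples whose measurements fall into the same bins share terms. A rough sketch (field names, ranges and bin counts are invented for illustration):

```python
def to_tokens(field, values, lo, hi, bins=32):
    """Discretize numeric measurements into bin tokens, e.g. 'porosity_b12'.
    Documents sharing many bin tokens score as similar under plain
    term matching; bin width trades precision against recall."""
    width = (hi - lo) / bins
    tokens = []
    for v in values:
        # clamp out-of-range values into the first/last bin
        b = min(bins - 1, max(0, int((v - lo) / width)))
        tokens.append(f"{field}_b{b}")
    return tokens

# e.g. two porosity readings near 0.12 land in the same bin
print(to_tokens("porosity", [0.12, 0.13, 0.91], 0.0, 1.0, bins=10))
# ['porosity_b1', 'porosity_b1', 'porosity_b9']
```

The token lists for each measurement would then go into per-metric multivalued fields of the Solr document, and the new sample Rn+1 becomes an OR query over its own tokens.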
Re: Solr Patent
You can ask on this site http://patents.stackexchange.com/ On Sat, Sep 14, 2013 at 10:03 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 9/13/2013 9:14 PM, Zaizen Ushio wrote: Hello, I have a question about patents. I believe the Apache license protects Solr developers from patent issues within the Solr community. But are there any cases where a Solr developer or Solr user has been accused by someone outside the Solr community? Are there any cases somebody has experienced? Any advice is appreciated. Thanks, Zaizen Zaizen - I doubt you will get legal advice from this community. If you do get any other advice than to consult a lawyer, you should ignore it and consult a lawyer. Or move to New Zealand - I hear they outlawed software patents there. See, that's just the sort of unhelpful legal advice you're likely to get here :) -Mike -- sk...@alum.mit.edu (604) 446-2460
Re: Document Similarity Algorithm at Solr/Lucene
Here is a paper that I found useful: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI furkankam...@gmail.com wrote: Thanks for your comments. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com if you need a specialized algorithm for detecting blogpost plagiarism / quotations (which are different tasks IMHO) I think you have 2 options: 1. implement a dedicated one based on your features / metrics / domain 2. try to fine tune an existing algorithm that is flexible enough If I were to do it with Solr I'd probably do something like: 1. index original blogposts in Solr (possibly using Jack's suggestion about ngrams / shingles) 2. do MLT queries with candidate blogpost copies' text 3. get the first, say, 2-3 hits 4. mark it as quote / plagiarism 5. eventually train a classifier to help you mark other texts as quote / plagiarism HTH, Tommaso 2013/7/23 Furkan KAMACI furkankam...@gmail.com Actually I need a specialized algorithm. I want to use that algorithm to detect duplicate blog posts. 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com Hi, I think you may leverage and/or improve the MLT component [1]. HTH, Tommaso [1] : http://wiki.apache.org/solr/MoreLikeThis 2013/7/23 Furkan KAMACI furkankam...@gmail.com Hi; Sometimes a huge part of a document may exist in another document. As in student plagiarism, or quotation of a blog post in another blog post. Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) have any class to detect it?
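Tommaso's step 1 (ngrams/shingles) is the crux: plagiarized passages share long runs of identical word n-grams even when the surrounding text differs. A toy illustration of shingle overlap (plain Python, not Solr's ShingleFilter, and ignoring tokenization details like punctuation):

```python
def shingles(text, n=3):
    """Set of word n-grams ('shingles') from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shingle_overlap(a, b):
    """Jaccard overlap of shingle sets; near-duplicates score close to 1."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

original = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the sleepy dog"
print(shingle_overlap(original, original))  # 1.0
print(shingle_overlap(original, copied))    # high, but below 1
```

This is essentially what the MLT-over-shingles pipeline exploits: candidates returned in step 2 are the documents sharing the most shingle terms.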
Re: Search for misspelled words in corpus
n-grams might help, followed by an edit distance metric such as Jaro-Winkler or Smith-Waterman-Gotoh to filter further. On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Interesting problem. The first thing that comes to mind is to do word expansion during indexing. Kind of like synonym expansion, but maybe a bit more dynamic. If you can have a dictionary of correctly spelled words, then for each token emitted by the tokenizer you could look up the dictionary and expand the token to all other words that are similar/close enough. This would not be super fast, and you'd likely have to add some custom heuristic for figuring out what similar/close enough means, but it might work. I'd love to hear other ideas... Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల kamesh...@gmail.com wrote: Hi, I have a problem where our text corpus on which we need to do search contains many misspelled words. The same word could also be misspelled in several different ways. It could also have documents that have correct spellings. However, the search term that we give in the query would always be the correct spelling. Now when we search on a term, we would like to get all the documents that contain both correct and misspelled forms of the search term. We tried fuzzy search, but it doesn't work as per our expectations. It returns any close match, not specifically misspelled words. For example, if I'm searching for a word like fight, I would like to return the documents that have words like figth and feight, not documents with words like sight and light. Is there any suggested approach for doing this? regards, Kamesh
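To see why Jaro-Winkler fits this case better than plain edit distance: "sight" is only one edit from "fight", while the transposed "figth" is two, so raw edit distance prefers exactly the wrong match. Jaro-Winkler instead rewards the shared prefix and tolerates transpositions. A self-contained sketch of the standard formula:

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, penalized
    by half the number of transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions among the matched characters
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boost the Jaro score for a common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro_winkler("fight", "figth") > jaro_winkler("fight", "sight"))  # True
```

So a pipeline of n-gram recall followed by a Jaro-Winkler threshold keeps "figth"/"feight" while dropping "sight"/"light".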
Re: How apache solr stores indexes
Better still start here: http://en.wikipedia.org/wiki/Inverted_index http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html And there are several books on search engines and related algorithms. On Tue, May 28, 2013 at 10:41 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: And you need to know this why? If you are really trying to understand how this all works under the covers, you need to look at Lucene's inverted index as a start. Start here: http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description Might take you a couple of weeks to put it all together. Or you could try asking the actual business-level question that you need an answer to. :-) Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, May 28, 2013 at 10:13 PM, Kamal Palei palei.ka...@gmail.com wrote: Dear All, I have a basic doubt about how the data is stored in Apache Solr indexes. Say I have a thousand registered users on my site. Let's say I want to store the skills of each user as a multivalued string index. Say user 1 has skill set - Java, MySql, PHP user 2 has skill set - C++, MySql, PHP user 3 has skill set - Java, Android, iOS ... and so on. You can see users 1 and 2 have two common skills, MySql and PHP. In an actual case there might be millions of repetitions of words. Now the question is: does Apache Solr store them as just words, OR convert each word to a unique number and store only the number? Best Regards Kamal Net Cloud Systems Bangalore, India
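To answer the question directly in miniature: an inverted index stores each distinct term once (in a term dictionary) and maps it to a postings list of document ids, so repeated words across millions of documents are not duplicated. A toy version using Kamal's own example:

```python
from collections import defaultdict

docs = {
    1: ["Java", "MySql", "PHP"],
    2: ["C++", "MySql", "PHP"],
    3: ["Java", "Android", "iOS"],
}

# Each distinct term is stored once; documents are referenced by id
# in the term's postings list -- repeated words are not re-stored.
index = defaultdict(list)
for doc_id, skills in docs.items():
    for term in skills:
        index[term].append(doc_id)

print(sorted(index["MySql"]))  # [1, 2]
print(sorted(index["Java"]))   # [1, 3]
```

Lucene goes much further (sorted term dictionaries, compressed postings, positions, skip lists), but this is the core shape the linked articles describe.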
Re: Could I use Solr to index multiple applications?
Look up multicore solr. Another choice could be ElasticSearch - which is more straightforward in managing multiple indexes IMO. On Tue, Jul 17, 2012 at 7:53 PM, Zhang, Lisheng lisheng.zh...@broadvision.com wrote: Hi, We have an application where we index data into many different directories (each directory is corresponding to a different lucene IndexSearcher). Looking at Solr config it seems that Solr expects there is only one indexed data directory, can we use Solr for our application? Thanks very much for helps, Lisheng
Re: Could I use Solr to index multiple applications?
My suggestion would be to look into Multi Tenancy http://www.elasticsearch.org/. It is easy to setup and use for multiple indexes. On Tue, Jul 17, 2012 at 9:26 PM, Zhang, Lisheng lisheng.zh...@broadvision.com wrote: Thanks very much for quick help! Multicore sounds interesting, I roughly read the doc, so we need to put each core name into Solr config XML, if we add another core and change XML, do we need to restart Solr? Best regards, Lisheng -Original Message- From: shashi@gmail.com [mailto:shashi@gmail.com]On Behalf Of Shashi Kant Sent: Tuesday, July 17, 2012 5:46 PM To: solr-user@lucene.apache.org Subject: Re: Could I use Solr to index multiple applications? Look up multicore solr. Another choice could be ElasticSearch - which is more straightforward in managing multiple indexes IMO. On Tue, Jul 17, 2012 at 7:53 PM, Zhang, Lisheng lisheng.zh...@broadvision.com wrote: Hi, We have an application where we index data into many different directories (each directory is corresponding to a different lucene IndexSearcher). Looking at Solr config it seems that Solr expects there is only one indexed data directory, can we use Solr for our application? Thanks very much for helps, Lisheng
Re: Does Solr fit my needs?
We have used both Solr and graph databases for our XML file indexing. Both are equivalent in terms of performance, but a graph db (such as Neo4j) offers a lot more flexibility in joining across the nodes and traversing. If your data is strictly hierarchical, Solr might do it; alternately I suggest looking at a graph database such as Neo4j. On Fri, Apr 27, 2012 at 10:36 AM, Bob Sandiford bob.sandif...@sirsidynix.com wrote: Besides indexing and searching of the specific fields, it is certainly possible to retrieve the xml file. While Solr isn't a DB, it does allow a binary field to be associated with an index document. We store a GZipped XML file in a binary field and retrieve that under certain conditions to get at original document information. We've found that Solr can handle these much faster than our DB can. (We regularly index a large portion of our documents, and the XML files are prone to frequent changes). If you DO keep such a blob in your Solr index, make sure you retrieve that field ONLY when you really want it...
Re: How to Sort By a PageRank-Like Complicated Strategy?
You can update the document in the index quite frequently. I don't know what your requirement is; another option would be to boost at query time. On Sun, Jan 22, 2012 at 5:51 AM, Bing Li lbl...@gmail.com wrote: Dear Shashi, Thanks so much for your reply! However, I think the value of PageRank is not a static one. It must be updated on the fly. As far as I know, a Lucene index is not suited to very frequent updates. If so, how do I deal with that? Best regards, Bing On Sun, Jan 22, 2012 at 12:43 PM, Shashi Kant sk...@sloan.mit.edu wrote: Lucene has a mechanism to boost documents up/down using your custom ranking algorithm. So if you come up with something like PageRank you might do something like doc.SetBoost(myboost) before writing to the index. On Sat, Jan 21, 2012 at 5:07 PM, Bing Li lbl...@gmail.com wrote: Hi, Kai, Thanks so much for your reply! If the retrieval is done on a string field, not a text field, a complete-matching approach should be used, according to my understanding, right? If so, how does Lucene rank the retrieved data? Best regards, Bing On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu lukai1...@gmail.com wrote: Solr is a kind of retrieval step; you can customize the score formula in Lucene. But it shouldn't be too complicated - ideally it can be factorized. It also depends on the stored information, like TF, DF, position, etc. You can do a 2nd-phase rerank of the top N results you have got. Sent from my iPad On Jan 21, 2012, at 1:33 PM, Bing Li lbl...@gmail.com wrote: Dear all, I am using SolrJ to implement a system that needs to provide users with searching services. I have some questions about Solr searching as follows. As I know, Lucene retrieves data according to the degree of keyword matching on a text field (partial matching). But if I search data by string field (complete matching), how does Lucene sort the retrieved data? If I want to add new sorting ways, Solr's function query seems to support this feature. 
However, for a complicated ranking strategy, such as PageRank, can Solr provide an interface for me to do that? My ranking methods are more complicated than PageRank. Now I have to load all of the matched data from Solr by keyword first and rank it again my own way before showing it to users. Is that correct? Thanks so much! Bing
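The load-everything-and-rerank approach can usually be narrowed to the second-phase rerank Kai suggested: let Solr return only the top N keyword hits, then re-sort just those with the application-side score. A minimal sketch (field names invented for illustration):

```python
def rerank(hits, custom_score, top_n=100):
    """Second-phase reranking: take the retrieval engine's top-N hits
    and re-sort them with an application-side scoring function,
    instead of loading and ranking the entire match set."""
    head = hits[:top_n]
    return sorted(head, key=custom_score, reverse=True)

# 'pagerank' here stands in for whatever complicated score is
# computed offline and attached to each hit.
hits = [{"id": "a", "pagerank": 0.2},
        {"id": "b", "pagerank": 0.9},
        {"id": "c", "pagerank": 0.5}]
print([h["id"] for h in rerank(hits, lambda h: h["pagerank"])])  # ['b', 'c', 'a']
```

The custom score can also blend Solr's own relevance score with the offline value (e.g. a weighted sum), so keyword relevance is not discarded entirely.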
Re: How to Sort By a PageRank-Like Complicated Strategy?
Lucene has a mechanism to boost documents up/down using your custom ranking algorithm. So if you come up with something like PageRank you might do something like doc.SetBoost(myboost) before writing to the index. On Sat, Jan 21, 2012 at 5:07 PM, Bing Li lbl...@gmail.com wrote: Hi, Kai, Thanks so much for your reply! If the retrieval is done on a string field, not a text field, a complete-matching approach should be used, according to my understanding, right? If so, how does Lucene rank the retrieved data? Best regards, Bing On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu lukai1...@gmail.com wrote: Solr is a kind of retrieval step; you can customize the score formula in Lucene. But it shouldn't be too complicated - ideally it can be factorized. It also depends on the stored information, like TF, DF, position, etc. You can do a 2nd-phase rerank of the top N results you have got. Sent from my iPad On Jan 21, 2012, at 1:33 PM, Bing Li lbl...@gmail.com wrote: Dear all, I am using SolrJ to implement a system that needs to provide users with searching services. I have some questions about Solr searching as follows. As I know, Lucene retrieves data according to the degree of keyword matching on a text field (partial matching). But if I search data by string field (complete matching), how does Lucene sort the retrieved data? If I want to add new sorting ways, Solr's function query seems to support this feature. However, for a complicated ranking strategy, such as PageRank, can Solr provide an interface for me to do that? My ranking methods are more complicated than PageRank. Now I have to load all of the matched data from Solr by keyword first and rank it again my own way before showing it to users. Is that correct? Thanks so much! Bing
Re: Solr, SQL Server's LIKE
for a simple, hackish (albeit inefficient) approach, look up wildcard searches, e.g. foo*, *bar On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten dbaumgar...@nationalcorp.com wrote: I have been tinkering with Solr for a few weeks, and I am convinced that it could be very helpful in many of my upcoming projects. I am trying to decide whether Solr is appropriate for this one, and I haven't had luck looking for answers on Google. I need to search a list of names of companies and individuals pretty exactly. T-SQL's LIKE operator does this with decent performance, but I have a feeling there is a way to configure Solr to do this better. I've tried using an edge N-gram tokenizer, but it feels like it might be more complicated than necessary. What would you suggest? I know this sounds kind of 'Golden Hammer,' but there has been talk of other, more complicated (magic) searches that I don't think SQL Server can handle, since its tokens (as far as I know) can't be smaller than one word. Thanks, Devon Baumgarten
Re: How to run the solr dedup for the document which match 80% or match almost.
You can also look at cosine similarity (or related metrics) to measure document similarity. On Tue, Dec 27, 2011 at 6:51 AM, vibhoreng04 vibhoren...@gmail.com wrote: Hi iorixxx, Thanks for the quick update.I hope I can take it from here ! Regards, Vibhor -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-run-the-solr-dedup-for-the-document-which-match-80-or-match-almost-tp3614239p3614253.html Sent from the Solr - User mailing list archive at Nabble.com.
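For the 80%-match threshold, cosine similarity over term-frequency vectors is a common choice: compute it between candidate document pairs and treat scores above ~0.8 as duplicates. A small sketch (term counts only; a real pipeline would typically use tf-idf weights):

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = {"solr": 3, "dedup": 2, "index": 1}
d2 = {"solr": 3, "dedup": 2, "query": 1}
print(round(cosine(d1, d1), 6))  # 1.0
print(cosine(d1, d2) > 0.8)      # True: near-duplicates clear an 80% bar
```

Comparing every pair is O(n²), so in practice one first narrows candidates (e.g. via Solr's dedup signatures or shared rare terms) and only then applies the cosine threshold.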
Re: Score
https://wiki.apache.org/lucene-java/ScoresAsPercentages On Mon, Aug 15, 2011 at 8:13 PM, Bill Bell billnb...@gmail.com wrote: How do I change the score to scale it between 0 and 100 regardless of the score? q.alt=*:*&bq=lang:Spanish&defType=dismax Bill Bell Sent from mobile
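The page linked above explains why score percentages are misleading (scores are only comparable within one result set). If a 0-100 display is still required, a per-result-set min-max rescale is the usual client-side compromise; an illustrative sketch:

```python
def scale_scores(scores, lo=0.0, hi=100.0):
    """Min-max rescale of ONE result set's scores to [0, 100].
    Caveat (the whole point of the ScoresAsPercentages page): these
    percentages are not comparable across different queries."""
    mn, mx = min(scores), max(scores)
    if mx == mn:
        return [hi for _ in scores]
    return [lo + (s - mn) * (hi - lo) / (mx - mn) for s in scores]

print(scale_scores([2.0, 6.0, 10.0]))  # [0.0, 50.0, 100.0]
```

This is purely cosmetic; it should never be used to filter "documents above X% relevance" across queries.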
Re: Multiple Cores on different machines?
Betamax VCR? really ? :-) On Tue, Aug 9, 2011 at 3:38 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : A quick question - is it possible to have 2 cores in Solr on two different : machines? your question is a little vague ... like asking is it possible to have two betamax VCRs in two different rooms of my house ... sure, if you want ... but why are you asking the question? are you expecting those VCRs to be doing something special that makes you wonder if that special thing will work when there are two of them? https://people.apache.org/~hossman/#xyproblem XY Problem Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
Re: Solr can not index F**K!
Check your stopwords list On Jul 31, 2011 6:25 PM, François Schiettecatte fschietteca...@gmail.com wrote: That seems a little far-fetched, have you checked your analysis? François On Jul 31, 2011, at 4:58 PM, randohi wrote: One of our clients (a hot girl!) brought this to our attention: In this document there are many f* words: http://sec.gov/Archives/edgar/data/1474227/00014742271032/d424b3.htm and we have indexed it with the latest version of Solr (ver 3.3). But if we search F**K, it does not return the document! We have tried to index it with different text types, but it is still not working. Any idea why F* cannot be indexed - is it being censored by the government? :D -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-can-not-index-F-K-tp3214246p3214246.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: searching a subset of SOLR index
Range query. On Tue, Jul 5, 2011 at 4:37 AM, Jame Vaalet jvaa...@capitaliq.com wrote: Hi, Let's say I have 10^10 documents in an index, with a unique id (the document id) assigned to each of them from 1 to 10^10. Now I want to search for a particular query string in a subset of these documents, say document ids 100 to 1000. The question here is: will Solr be able to search just in this set of documents rather than the entire index? If yes, what should the query be to limit the search to this subset? Regards, JAME VAALET Software Developer EXT :8108 Capital IQ
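Concretely, assuming the document id is indexed as a numeric field, a range filter query restricts matching to the subset while the main query searches normally. A sketch of the request parameters (built in Python only to show the encoding; any HTTP client works):

```python
from urllib.parse import urlencode

# fq restricts matching to the id range; q still searches as usual.
# Assumes 'id' is a numeric/sortable field type so the range is numeric,
# not lexicographic.
params = {
    "q": "your query string",
    "fq": "id:[100 TO 1000]",
}
print(urlencode(params))  # q=your+query+string&fq=id%3A%5B100+TO+1000%5D
```

Using fq (rather than ANDing the range into q) also lets Solr cache the range filter independently of the query terms.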
Re: Solr vs ElasticSearch
Here is a very interesting comparison http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/ -Original Message- From: Mark Sent: May-31-11 10:33 PM To: solr-user@lucene.apache.org Subject: Solr vs ElasticSearch I've been hearing more and more about ElasticSearch. Can anyone give me a rough overview of how these two technologies differ? What are the strengths/weaknesses of each? Why would one choose one over the other? Thanks
Re: I need an available solr lucene consultant
You might be better off looking for freelancers on sites such as odesk.com, guru.com, rentacoder.com, elance.com and many more On Tue, May 17, 2011 at 4:09 PM, Markus Jelsma markus.jel...@openindex.io wrote: Check this out: http://wiki.apache.org/solr/Support Hi, I am looking for an experienced and skilled Solr Lucene developer/consultant to work on a software project incorporating natural language processing and machine learning algorithms. As part of a larger NLP/AI project that is under way, we need someone to install, refine and optimize Solr and Lucene for our website. The data being analyzed will be from user-generated textual discussions around a multitude of topics that will continuously be updated. You must be able to work in a LAMP environment with other developers, be smart, reliable, and a self-starter with excellent problem solving and analytical abilities. You must have a solid grasp of English – written and verbal. Please note that I am a start-up and I am not going to be able to pay what a large established company can pay. Thank you, Lance
Re: Looking for help with Solr implementation
Have you tried posting on odesk.com? I have had decent success finding Solr/Lucene resources there. On Thu, Nov 11, 2010 at 7:52 PM, AC acanuc...@yahoo.com wrote: Hi, Not sure if this is the correct place to post but I'm looking for someone to help finish a Solr install on our LAMP based website. This would be a paid project. The programmer that started the project got too busy with his full-time job to finish the project. Solr has been installed and a basic search is working, but we need to configure it to work across the site and also set up faceted search. I tried posting on some popular freelance sites but haven't been able to find anyone with real Solr expertise / experience. If you think you can help me with this project please let me know and I can supply more details. Regards
Re: Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields
On Fri, Oct 29, 2010 at 6:00 PM, Ron Mayer r...@0ape.com wrote: I have some documents with a bunch of attachments (images, thumbnails for them, audio clips, word docs, etc); and am currently dealing with them by just putting a path on a filesystem to them in solr; and then jumping through hoops of keeping them in sync with solr. Not sure why that is an issue. Keeping them in sync with Solr would be the same as storing within a file system. Why would storing within Solr be any different? Would it be nuts to stick the image data itself in solr? More specifically - if I have a bunch of large stored fields, would it significantly impact search performance in the cases when those fields aren't fetched. Hard to say. I assume you mean storing by converting into base64 format. If you do not retrieve the field when fetching, AFAIK it should not affect performance significantly, if at all. So if you manage your retrieval, you should be fine. Searches are very common in this system, and it's very rare that someone actually opens up one of these attachments so I'm not really worried about the time it takes to fetch them when someone does actually want one.
Re: Color search for images
What I am envisioning (at least to start) is have all this add two fields in the index. One would be for color information for the color similarity search. The other would be a simple multivalued text field that we put keywords into based on what OpenCV can detect about the image. If it detects faces, we would put face into this field. Other things that it can detect would result in other keywords. For the color search, I have a few inter-related hurdles. I've got to figure out what form the color data actually takes and how to represent it in Solr. I need Java code for Solr that can take an input color value and find similar values in the index. Then I need some code that can go in our feed processing scripts for new content. That code would also go into a crawler script to handle existing images. You are on the right track. You can create a set of representative keywords from the image. OpenCV gets a color histogram from the image - you can set the bin values to be as granular as you need, and create a look-up list of color names to generate a MVF representative of the image. If you want to get more sophisticated, represent the colors with payloads in correlation with the distribution of the color in the image. Another approach would be to segment the image and extract colors from each. So if you have a red rose with all white background, the textual representation would be something like: white, white...red...white, white Play around and see which works best. HTH
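The "repeat color names in proportion to the histogram" idea can be prototyped without OpenCV at all; the bin ids and color names below are invented placeholders for whatever the real histogram code produces:

```python
# Hypothetical lookup from coarse histogram bins to color names.
COLOR_NAMES = {0: "red", 1: "green", 2: "blue", 3: "white", 4: "black"}

def histogram_keywords(histogram, min_fraction=0.05):
    """Turn a per-bin color histogram (bin id -> pixel count) into
    repeated keyword tokens, roughly proportional to each color's
    share of the image; tf then reflects color dominance."""
    total = sum(histogram.values())
    tokens = []
    for bin_id, count in sorted(histogram.items()):
        share = count / total
        if share >= min_fraction:  # drop negligible colors
            tokens.extend([COLOR_NAMES[bin_id]] * round(share * 10))
    return tokens

# e.g. a red rose on a mostly white background
print(histogram_keywords({0: 200, 3: 800}))
# ['red', 'red', 'white', 'white', 'white', 'white', 'white', 'white', 'white', 'white']
```

These tokens would feed the multivalued text field described above; a query for "red" then naturally ranks strongly red images higher via term frequency.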
Re: Color search for images
Lire looks promising, but how hard is it to integrate the content-based search into Solr as opposed to Lucene? I myself am not a Java developer. I have access to people who are, but their time is scarce. Lire is a nascent effort and, based on a cursory overview a while back, IMHO was an over-simplified version of what a CBIR engine should be. They use CEDD (color and edge directivity descriptors). It wouldn't work for the kind of applications I am working on - which need, among other things, Color, Shape, Orientation, Pose, Edge/Corner etc. OpenCV has a steep learning curve but, having been through it, it is a very powerful toolkit - the best there is by far! BTW the code is in C++, but it has both Java and .NET bindings. This is a fabulous book to get hold of: http://www.amazon.com/Learning-OpenCV-Computer-Vision-Library/dp/0596516134, if you are seriously into OpenCV. Please feel free to reach out if you need any help with OpenCV + Solr/Lucene. I have spent quite a bit of time on this.
Re: Get all results from a solr query
q=*:* On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com wrote: I have some queries that I'm running against a solr instance (older, 1.2 I believe), and I would like to get *all* the results back (and not have to put an absurdly large number as a part of the rows parameter). Is there a way that I can do that? Any help would be appreciated. -- Chris
Re: Get all results from a solr query
Start with *:*, then the “numFound” attribute of the result element should give you the rows to fetch in a 2nd request. On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com wrote: That will still just return 10 rows for me. Is there something else in the configuration of solr to have it return all the rows in the results? -- Chris On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu wrote: q=*:* On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com wrote: I have some queries that I'm running against a solr instance (older, 1.2 I believe), and I would like to get *all* the results back (and not have to put an absurdly large number as a part of the rows parameter). Is there a way that I can do that? Any help would be appreciated. -- Chris
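To make the two-request pattern concrete: probe with rows=0, read numFound, then ask for exactly that many rows (and beware memory if numFound is huge — paging with start/rows is safer at scale). A sketch using a hypothetical parsed response dict in place of a live HTTP call:

```python
def second_request_params(first_response, base_params):
    """After a probe query with rows=0, read numFound and build the
    follow-up request that asks for exactly that many rows."""
    num_found = first_response["response"]["numFound"]
    params = dict(base_params)   # don't mutate the caller's params
    params["rows"] = num_found
    return params

# Shape of a (hypothetical) parsed Solr JSON response:
probe = {"response": {"numFound": 4217, "docs": []}}
print(second_request_params(probe, {"q": "*:*", "rows": 0}))
# {'q': '*:*', 'rows': 4217}
```

On very old versions (like the 1.2 instance mentioned), the same numFound attribute is available in the XML response format.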
Re: Color search for images
Shawn, I have done some research into this, machine-vision especially on a large scale is a hard problem, not to be entered into lightly. I would recommend starting with OpenCV - a comprehensive toolkit for extracting various features such as Color, Edge etc from images. Also there is a project LIRE http://www.semanticmetadata.net/lire/ which attempts to do something along what you are thinking of. Not sure how well it works. HTH, Shashi On Wed, Sep 15, 2010 at 10:59 AM, Shawn Heisey s...@elyograg.org wrote: My index consists of metadata for a collection of 45 million objects, most of which are digital images. The executives have fallen in love with Google's color image search. Here's a search for flower with a red color filter: http://www.google.com/images?q=flowertbs=isch:1,ic:specific,isc:red I am interested in duplicating this. Can this group of fine people point me in the right direction? I don't want anyone to do it for me, just help me find software and/or algorithms that can extract the color information, then find a way to get Solr to index and search it. Thanks, Shawn
Re: Color search for images
On a related note, I'm curious if anyone has run across a good set of algorithms (or hopefully a library) for doing naive image classification. I'm looking for something that can classify images into something similar to the broad categories that Google image search has (Face, Photo, Clip Art, Line Drawing, etc.). --Paul OpenCV is the way to go. Very comprehensive set of algorithms.
Re: Color search for images
I'm sure there's some post doctoral types who could get a graphic shape analyzer, color analyzer, to at least say it's a flower. However, even Google would have to build new datacenters to have the horsepower to do that kind of graphic processing. Not necessarily true. Like.com - which incidentally got acquired by Google recently - built a true visual search technology and applied it on a large scale.
Re: Indexing all versions of Microsoft Office Documents
If you are on Windows, try the Microsoft IFilter API - it supports current Office versions. http://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CCdisplaylang=en On Tue, Apr 27, 2010 at 6:08 AM, Roland Villemoes r...@alpha-solutions.dk wrote: Hi All, Does anyone have a running solution for indexing Microsoft Office documents, e.g. .docx, .xlsx etc.? I can see a lot of examples using Tika for rich content extraction, but still nothing when it comes to newer versions of Microsoft Office? What libraries to use if not Tika? best regards Roland Villemoes Tel: (+45) 22 69 59 62 E-Mail: mailto:r...@alpha-solutions.dk Alpha Solutions A/S Borgergade 2, 3.sal, 1300 København K Tel: (+45) 70 20 65 38 Web: http://www.alpha-solutions.dk
Re: LucidWorks Solr
Why do these approaches have to be mutually exclusive? Do a dictionary lookup; if no satisfactory match is found, use an algorithmic stemmer. Would probably save a few CPU cycles by algorithmic stemming only when necessary. On Wed, Apr 21, 2010 at 1:31 PM, Robert Muir rcm...@gmail.com wrote: It's easy to look at the faults of some algorithmic stemmer; in truth its only purpose is to cause related forms of a word to conflate to the same form, while hopefully avoiding unrelated terms conflating to that form. A dictionary-based stemmer is out-of-date the day you put it into production: languages aren't static. For example, you can't expect a dictionary-based stemmer to properly deal with forms like googling or tweets that have recently slipped into English vocabulary, but an algorithmic stemmer will likely deal with these just fine.
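The dictionary-first-with-algorithmic-fallback idea is easy to sketch; the dictionary entries and suffix rules below are toy stand-ins for a real lexicon and a Porter-style stemmer:

```python
# Toy dictionary of known irregular forms -- illustrative only.
STEM_DICT = {"ran": "run", "mice": "mouse", "feet": "foot"}

def algorithmic_stem(word):
    """Crude suffix stripping, a stand-in for a real algorithmic stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def hybrid_stem(word):
    """Dictionary first; fall back to the algorithmic stemmer only when
    the word is unknown, so novel forms like 'googling' still conflate."""
    return STEM_DICT.get(word, algorithmic_stem(word))

print(hybrid_stem("mice"))      # mouse (dictionary hit)
print(hybrid_stem("googling"))  # googl (algorithmic fallback)
```

This gets the best of both sides of the thread: curated handling of irregular forms, plus graceful degradation for vocabulary the dictionary has never seen.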
Re: Query time only Ranges
In that case, you could just calculate an offset from 00:00:00 in seconds (ignoring the date). Pretty simple. On Wed, Mar 31, 2010 at 4:57 PM, abhatna...@vantage.com abhatna...@vantage.com wrote: Hi Shashi, Could you elaborate on point no. 1 in light of the case where a field should hold just a time? Ankit -- View this message in context: http://n3.nabble.com/Query-time-only-Ranges-tp688831p689413.html Sent from the Solr - User mailing list archive at Nabble.com.
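To spell out the offset idea (the field name time_s below is invented for illustration): store each time-of-day as an integer number of seconds past midnight, then time-only ranges become plain numeric range queries.

```python
def seconds_since_midnight(hhmmss):
    """Encode a time-of-day as an integer offset from 00:00:00,
    so that ordinary numeric range queries work (the date is ignored)."""
    h, m, s = (int(x) for x in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

print(seconds_since_midnight("09:30:00"))  # 34200
# A 09:00-17:00 window then becomes the range query:
# time_s:[32400 TO 61200]
```

The same encoding is applied at index time and at query time, so both sides compare like with like.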
Re: boost on certain keywords
Look at Payload. On Thu, Jan 28, 2010 at 6:48 AM, murali k ilar...@gmail.com wrote: Say I have a clothes store, i have ladies clothes, mens clothes when someone searches for clothes, i want to prioritize mens clothing results, how can I achieve this ? this logic should only apply for this keyword, other keywords should work as-is should I be trying with something on synonyms or during the process of indexing ? or something in dismax request handler ? -- View this message in context: http://old.nabble.com/boost-on-certain-keywords-tp27354717p27354717.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: boost on certain keywords
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ On Thu, Jan 28, 2010 at 6:54 AM, Shashi Kant sk...@sloan.mit.edu wrote: Look at Payload. On Thu, Jan 28, 2010 at 6:48 AM, murali k ilar...@gmail.com wrote: Say I have a clothes store, i have ladies clothes, mens clothes when someone searches for clothes, i want to prioritize mens clothing results, how can I achieve this ? this logic should only apply for this keyword, other keywords should work as-is should I be trying with something on synonyms or during the process of indexing ? or something in dismax request handler ? -- View this message in context: http://old.nabble.com/boost-on-certain-keywords-tp27354717p27354717.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: HI
http://lmgtfy.com/?q=lucene+basics On Sun, Dec 13, 2009 at 1:01 PM, Faire Mii faire@gmail.com wrote: Hi, I am a beginner and I wonder what a document, an entity and a field correspond to in a database? And I wonder if there are some good tutorials that teach you how to design your schema, because all the other articles I have read aren't very helpful for beginners. Regards Fayer
Re: Migrating to Solr
Here is a link that might be helpful: http://sesat.no/moving-from-fast-to-solr-review.html The site is chock-a-block with great information on their migration experience. On Tue, Nov 24, 2009 at 8:55 AM, Tommy Molto tommymo...@gmail.com wrote: Hi, I'm new to Solr and I need to run a pilot test of a migration from FAST ESP to Apache Solr. Has anyone had this experience before? Att,
Re: Solr - Load Increasing.
I think it would be useful for members of this list to realize that not everyone uses the same units and terminology. It is very easy for Americans to use the imperial system and presume everyone does the same, for Europeans to do likewise with the metric system, and so on. Hopefully members of this list can be persuaded to use, or at least clarify, their terminology. While the apocryphal saying goes that the great thing about standards is that there are so many to choose from, we should all make an effort to communicate across cultures and nations. On Mon, Nov 16, 2009 at 5:33 PM, Israel Ekpo israele...@gmail.com wrote: On Mon, Nov 16, 2009 at 5:22 PM, Walter Underwood wun...@wunderwood.org wrote: Probably lakh: 100,000. So, 900k qpd and 3M docs. http://en.wikipedia.org/wiki/Lakh wunder On Nov 16, 2009, at 2:17 PM, Otis Gospodnetic wrote: Hi, Your autoCommit settings are very aggressive. I'm guessing that's what's causing the CPU load. btw. what is laks? Otis -- Sematext is hiring -- http://sematext.com/about/jobs.html?mls Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message From: kalidoss kalidoss.muthuramalin...@sifycorp.com To: solr-user@lucene.apache.org Sent: Mon, November 16, 2009 9:11:21 AM Subject: Solr - Load Increasing. Hi All. My Solr server's CPU utilization fluctuates between 60 and 90%, and sometimes Solr goes down and we restart it manually. Number of documents in Solr: 30 laks. Number of add/update requests: 30 thousand per day, averaging around 500 writes every 30 minutes. Number of search requests: 9 laks per day. Size of the data directory: 4 GB. System RAM: 8 GB. Available disk space: 12 GB. Processor family: Pentium Pro. Our Solr data size could increase to around 90 laks documents, with writes around 1 lakh per day. I hope this is possible with Solr. For write commits I have configured autoCommit values of 1 and 10. Is all of the above possible: 90 laks documents, 1 lakh writes per day, and 30 laks reads per day? If yes, what type of system configuration would be required? Please suggest.
thanks, Kalidoss.m Thanks Walter for clarifying that. I too was wondering what laks meant. It was a bit distracting when I read the original post. -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.
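For readers unfamiliar with the unit, the conversion behind Walter's numbers is simple arithmetic (Python only for illustration):

```python
LAKH = 100_000  # Indian numbering unit, also spelled "lac" or, here, "laks"

docs_indexed = 30 * LAKH      # "30 laks" documents  = 3,000,000 docs
searches_per_day = 9 * LAKH   # "9 laks" searches/day = 900,000 qpd
```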
Re: Search Within
This post describes a search-within-search implementation: http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html Shashi On Sat, Apr 4, 2009 at 1:21 PM, Vernon Chapman chapman.li...@gmail.com wrote: Bess, I think that might work. I'll try it out and see how it works for my case. Thanks. Bess Sadler wrote: Hi, Vernon. In Blacklight, the way we've been doing this is just to stack queries on top of each other. It's a conceptual shift from the way one might think about search within, but it accomplishes the same thing. For example: search1 == q=horse search2 == q=horse AND dog The second search, from the user's point of view, takes the search results from the horse search and further narrows them to those items that also contain dog. But you're really just doing a new search, one that contains both search values. Does that help? Or am I misunderstanding your question? Bess On 4-Apr-09, at 12:10 PM, Vernon Chapman wrote: I am not sure if this is a really easy or newbie-ish type question. I would like to implement a search within these results type feature. Has anyone done this? Could you please share some tips, pointers, and/or documentation on how to implement this? Thanks Vern
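The query-stacking approach Bess describes can be sketched as a trivial string transformation (Python for illustration; in Solr one could equally keep the original q and add the narrowing term as an fq filter query, which also benefits from the filter cache):

```python
def search_within(previous_query, narrowing_term):
    # "Search within results": AND the new term onto the previous query.
    # Parentheses preserve the grouping of earlier boolean logic.
    return f"({previous_query}) AND ({narrowing_term})"

q1 = "horse"
q2 = search_within(q1, "dog")      # "(horse) AND (dog)"
q3 = search_within(q2, "leash")    # "((horse) AND (dog)) AND (leash)"
```

Each refinement is just a fresh query, so no per-user result state needs to be held on the server.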
Re: Hardware Questions...
Have you looked at http://wiki.apache.org/solr/SolrPerformanceData ? On Tue, Mar 24, 2009 at 4:51 PM, solr s...@highbeam.com wrote: We have three Solr servers (several two-processor Dell PowerEdge servers). I'd like to get three newer servers and wanted to see what we should be getting. I'm thinking the following: Dell PowerEdge 2950 III, 2x2.33GHz/12M 1333MHz quad core, 16GB RAM, 6 x 146GB 15K RPM RAID-5 drives. How do people spec out servers, especially CPU, memory and disk? Is this all based on the number of docs, indexes, etc.? Also, what are people using for benchmarking and monitoring Solr? Thanks - Mike
Re: Use of scanned documents for text extraction and indexing
Another project worth investigating is Tesseract. http://code.google.com/p/tesseract-ocr/ - Original Message From: Hannes Carl Meyer m...@hcmeyer.com To: solr-user@lucene.apache.org Sent: Thursday, February 26, 2009 11:35:14 AM Subject: Re: Use of scanned documents for text extraction and indexing Hi Sithu, there is a project called ocropus done by the DFKI, check the online demo here: http://demo.iupr.org/cgi-bin/main.cgi And also http://sites.google.com/site/ocropus/ Regards Hannes m...@hcmeyer.com http://mimblog.de On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. sithu.sudar...@fda.hhs.gov wrote: Hi All: Is there any study / research done on using scanned paper documents as images (may be PDF), and then use some OCR or other technique for extracting text, and the resultant index quality? Thanks in advance, Sithu D Sudarsan sithu.sudar...@fda.hhs.gov sdsudar...@ualr.edu
Re: Use of scanned documents for text extraction and indexing
Can anyone back that up? IMHO Tesseract is the state-of-the-art in OCR, but I am not sure that Ocropus builds on Tesseract. Can anyone confirm Vikram's point? Shashi - Original Message From: Vikram Kumar vikrambku...@gmail.com To: solr-user@lucene.apache.org; Shashi Kant sk...@sloan.mit.edu Sent: Thursday, February 26, 2009 9:21:07 PM Subject: Re: Use of scanned documents for text extraction and indexing Tesseract is pure OCR. Ocropus builds on Tesseract. Vikram On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant shashi_k...@yahoo.com wrote: Another project worth investigating is Tesseract. http://code.google.com/p/tesseract-ocr/
Re: why don't we have a forum for discussion?
One man's crap is another man's treasure. :-P So how would you decide what is worth posting? If you feel the list is overwhelming your email, set up some filters. Shashi - Original Message From: Tony Wang ivyt...@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, February 18, 2009 2:06:57 PM Subject: why don't we have a forum for discussion? I am just curious why we don't have a forum for discussion, or do you guys think it's really necessary to receive lots of crap information about Solr and Nutch in email? I can offer you a forum for discussion anyway. -- Are you RCholic? www.RCholic.com 温 良 恭 俭 让 仁 义 礼 智 信
Re: why don't we have a forum for discussion?
Steve - could you not just subscribe to the list from another (non-mobile) email account (Gmail or Yahoo, for example)? We discourage using corporate email for subscribing to mailing lists precisely for such reasons: volume, spam, malware risks, etc. Shashi - Original Message From: Stephen Weiss swe...@stylesight.com To: solr-user@lucene.apache.org Sent: Wednesday, February 18, 2009 7:34:30 PM Subject: Re: why don't we have a forum for discussion? Like an earlier poster, my issue isn't on the laptop, it's with my mobile device. The sheer volume of e-mail overwhelms the thing sometimes (right now, for instance). There's really no option for moving the e-mail off to some other folder; it all just goes to one place. Perhaps that means I need a better phone; it's just that the obvious solutions aren't always practical. Forums can conversely just as easily be set up to emulate mailing lists as well... Our company's internal forum works this way. -- Steve On Feb 18, 2009, at 7:16 PM, Mike Klaas wrote: 2. Many people greatly prefer the mailing list format (obviously, it takes a little bit of effort to use mailing lists effectively, e.g., directing the traffic to a folder/tag/etc.)