LSH in Solr/Lucene

2014-01-20 Thread Shashi Kant
Hi folks, have any of you successfully implemented LSH (MinHash) in
Solr? If so, could you share some details of how you went about it?

I know LSH is available in Mahout, but I was hoping someone had a
Solr or Lucene implementation.

Thanks
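For anyone landing on this thread later, the core MinHash idea is small enough to sketch outside Solr. This is an illustrative Python sketch only, not a Solr/Lucene integration; the hash scheme and signature length are arbitrary choices:

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value observed over the token set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates the
    Jaccard similarity of the original token sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature({"solr", "lucene", "search", "index"})
b = minhash_signature({"solr", "lucene", "search", "ranking"})
```

An LSH layer would then band these signatures into buckets so that only near-duplicates collide; Mahout's implementation does that part for you.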


Searching Numeric Data

2014-01-11 Thread Shashi Kant
Hi all, I have a use-case where I need to search a set of
numeric values using a query set. My business case is:

1. I have various rock samples from different locations {R1...Rn} with
multiple measurements, e.g. Porosity [255] - an array of values,
Conductivity [1028] - also an array of numbers, and several other such
metrics.

They are arrays because measurements are taken under various ambient conditions.

2. For a new rock sample Rn+1 I would like to query Solr and get a
ranked list of samples ordered by their multidimensional similarity.

I was thinking of using Solr to perform this query by
representing the numeric arrays as text and creating a document for
each sample with fields for each of the measurements.

Has anyone approached a problem in this fashion? If so, could you share
some details about your approach?

Regards
Shashi


-- 
sk...@alum.mit.edu
(604) 446-2460
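A quick way to prototype the ranking logic described above, before involving Solr at all, is plain vector distance over the measurement arrays. This is an illustrative Python sketch only: the sample values are made up, and Euclidean distance stands in for whatever similarity measure actually fits the domain:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length measurement vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Illustrative data: each sample's measurements flattened into one vector.
samples = {
    "R1": [0.21, 0.19, 0.22, 101.0, 99.5],
    "R2": [0.35, 0.33, 0.36, 250.0, 248.0],
    "R3": [0.20, 0.18, 0.21, 103.0, 100.1],
}

def rank_by_similarity(query_vec, samples):
    """Sample ids ordered from most to least similar (smallest distance first)."""
    return sorted(samples, key=lambda k: euclidean(query_vec, samples[k]))

new_rock = [0.20, 0.19, 0.22, 102.0, 99.0]
ranking = rank_by_similarity(new_rock, samples)
```

In practice the measurements would need per-dimension normalization first, since otherwise a large-valued metric like conductivity dominates the distance.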


Re: Solr Patent

2013-09-14 Thread Shashi Kant
You can ask on this site http://patents.stackexchange.com/



On Sat, Sep 14, 2013 at 10:03 AM, Michael Sokolov
msoko...@safaribooksonline.com wrote:
 On 9/13/2013 9:14 PM, Zaizen Ushio wrote:

 Hello
 I have a question about patents.  I believe the Apache license protects
 Solr developers from patent issues within the Solr community.  But are there
 any cases where a Solr developer or Solr user has been accused by a party
 outside the Solr community?  Are there any cases somebody has experienced?
 Any advice is appreciated.

 Thanks,  Zaizen



 Zaizen - I doubt you will get legal advice from this community.  If you do
 get any advice other than to consult a lawyer, you should ignore it and
 consult a lawyer.  Or move to New Zealand - I hear they outlawed software
 patents there.  See that's just the sort of unhelpful legal advice you're
 likely to get here :)

 -Mike



-- 
sk...@alum.mit.edu
(604) 446-2460


Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shashi Kant
Here is a paper that I found useful:
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
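The paper above fingerprints documents via k-gram shingles. The underlying shingle-overlap idea (without the paper's winnowing step) can be sketched in a few lines of illustrative Python; the threshold and shingle size are arbitrary:

```python
def shingles(text, k=5):
    """Set of word k-grams (shingles) from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def containment(candidate, original, k=5):
    """Fraction of the candidate's shingles that also occur in the
    original; values near 1.0 suggest copying or quotation."""
    c, o = shingles(candidate, k), shingles(original, k)
    return len(c & o) / len(c) if c else 0.0

orig = "the quick brown fox jumps over the lazy dog near the river bank"
copy = "as noted the quick brown fox jumps over the lazy dog near the river bank"
score = containment(copy, orig)
```

Solr's shingle filter plus MLT, as suggested below in the thread, effectively approximates this at index/query time.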


On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI furkankam...@gmail.com wrote:
 Thanks for your comments.

 2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com

 if you need a specialized algorithm for detecting blogposts plagiarism /
 quotations (which are different tasks IMHO) I think you have 2 options:
 1. implement a dedicated one based on your features / metrics / domain
 2. try to fine tune an existing algorithm that is flexible enough

 If I were to do it with Solr I'd probably do something like:
 1. index original blogposts in Solr (possibly using Jack's suggestion
 about ngrams / shingles)
  2. do MLT queries with the text of candidate blogpost copies
 3. get the first, say, 2-3 hits
 4. mark it as quote / plagiarism
 5. eventually train a classifier to help you mark other texts as quote /
 plagiarism

 HTH,
 Tommaso



 2013/7/23 Furkan KAMACI furkankam...@gmail.com

  Actually I need a specialized algorithm. I want to use that algorithm to
  detect duplicate blog posts.
 
  2013/7/23 Tommaso Teofili tommaso.teof...@gmail.com
 
   Hi,
  
    I think you may leverage and / or improve the MLT component [1].
  
   HTH,
   Tommaso
  
   [1] : http://wiki.apache.org/solr/MoreLikeThis
  
  
   2013/7/23 Furkan KAMACI furkankam...@gmail.com
  
Hi;
   
     Sometimes a huge part of a document may exist in another document, as
     in student plagiarism or the quotation of a blog post in another blog
     post. Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) have any
     class to detect this?
   
  
 



Re: Search for misspelled words in corpus

2013-06-09 Thread Shashi Kant
n-grams might help, followed by an edit distance metric such as Jaro-Winkler
or Smith-Waterman-Gotoh to filter further.
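A rough Python sketch of that two-stage idea: character n-gram overlap gathers candidates cheaply, then an edit-distance cutoff filters them. The thresholds, and the shared-first-letter heuristic (a crude stand-in for Jaro-Winkler's prefix weighting), are arbitrary choices for illustration:

```python
def char_ngrams(word, n=2):
    """Character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_overlap(a, b, n=2):
    """Jaccard overlap of the two words' character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def misspelling_candidates(query, vocabulary, min_overlap=0.3, max_edits=2):
    """Stage 1: n-gram overlap plus a shared-first-letter heuristic
    (so 'sight' is not treated as a misspelling of 'fight') prunes the
    vocabulary. Stage 2: edit distance filters the survivors."""
    rough = [w for w in vocabulary
             if w[:1] == query[:1] and ngram_overlap(query, w) >= min_overlap]
    return [w for w in rough if edit_distance(query, w) <= max_edits]

vocab = ["fight", "figth", "feight", "sight", "light", "flight"]
matches = misspelling_candidates("fight", vocab)
```

This matches the thread's example: "figth" and "feight" survive, while "sight" and "light" are rejected despite being only one edit away.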


On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Interesting problem.  The first thing that comes to mind is to do
 word expansion during indexing.  Kind of like synonym expansion, but
 maybe a bit more dynamic. If you can have a dictionary of correctly
 spelled words, then for each token emitted by the tokenizer you could
 look up the dictionary and expand the token to all other words that
 are similar/close enough.  This would not be super fast, and you'd
 likely have to add some custom heuristic for figuring out what
 similar/close enough means, but it might work.

 I'd love to hear other ideas...

 Otis
 --
 Solr & ElasticSearch Support
 http://sematext.com/





 On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
 kamesh...@gmail.com wrote:
  Hi,
 
   I have a problem where our text corpus, on which we need to do search,
   contains many misspelled words. The same word can also be misspelled in
   several different ways. The corpus can also have documents with correct
   spellings. However, the search term that we give in the query would always
   be the correct spelling. Now when we search on a term, we would like to get
   all the documents that contain both the correct and misspelled forms of the
   search term.
   We tried fuzzy search, but it doesn't work as per our expectations. It
   returns any close match, not specifically misspelled words. For example,
   if I'm searching for a word like "fight", I would like to return the
   documents that have words like "figth" and "feight", not documents with
   words like "sight" and "light".
  Is there any suggested approach for doing this?
 
  regards,
  Kamesh



Re: How apache solr stores indexes

2013-05-28 Thread Shashi Kant
Better still start here: http://en.wikipedia.org/wiki/Inverted_index

http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html

And there are several books on search engines and related algorithms.
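To make the answer to Kamal's question concrete, here is a toy illustration of what an inverted index does: each distinct term is kept once in a term dictionary, and documents are referenced by id in its postings list (real Lucene adds on-disk term dictionaries, compression, positions, and more):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each distinct term to the sorted ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# The skill-set example from the thread below.
users = {
    1: ["Java", "MySql", "PHP"],
    2: ["C++", "MySql", "PHP"],
    3: ["Java", "Android", "iOS"],
}
index = build_inverted_index(users)
```

So the repeated skill "MySql" is stored once as a term, with postings pointing at users 1 and 2, which is exactly why repetition across millions of documents stays cheap.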



On Tue, May 28, 2013 at 10:41 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 And you need to know this why?

 If you are really trying to understand how this all works under the
 covers, you need to look at Lucene's inverted index as a start. Start
 here:
 http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/codecs/lucene42/package-summary.html#package_description

 Might take you a couple of weeks to put it all together.

 Or you could try asking the actual business-level question that you
 need an answer to. :-)

 Regards,
Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all
 at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
 book)


 On Tue, May 28, 2013 at 10:13 PM, Kamal Palei palei.ka...@gmail.com
 wrote:
  Dear All
  I have a basic question about how the data is stored in Apache Solr indexes.
 
  Say I have a thousand registered users on my site. Let's say I want to store
  the skills of each user as a multivalued string field.
 
  Say
  user 1 has skill set - Java, MySql, PHP
  user 2 has skill set - C++, MySql, PHP
  user 3 has skill set - Java, Android, iOS
  ... so on
 
  You can see user 1 and 2 has two common skills that is MySql and PHP
  In an actual case there might be millions of repetition of words.
 
  Now the question is: does Apache Solr store them as just words, or convert
  each word to a unique number and store only the number?
 
  Best Regards
  Kamal
  Net Cloud Systems
  Bangalore, India



Re: Could I use Solr to index multiple applications?

2012-07-17 Thread Shashi Kant
Look up multicore Solr. Another choice could be ElasticSearch, which
is more straightforward at managing multiple indexes IMO.



On Tue, Jul 17, 2012 at 7:53 PM, Zhang, Lisheng
lisheng.zh...@broadvision.com wrote:
 Hi,

 We have an application where we index data into many different directories
 (each directory corresponds to a different Lucene IndexSearcher).

 Looking at the Solr config, it seems that Solr expects there to be only one
 indexed data directory; can we use Solr for our application?

 Thanks very much for your help, Lisheng



Re: Could I use Solr to index multiple applications?

2012-07-17 Thread Shashi Kant
My suggestion would be to look into Multi Tenancy http://www.elasticsearch.org/.
It is easy to set up and use for multiple indexes.


On Tue, Jul 17, 2012 at 9:26 PM, Zhang, Lisheng
lisheng.zh...@broadvision.com wrote:
 Thanks very much for the quick help! Multicore sounds interesting.
 I roughly read the doc; so we need to put each core name into the
 Solr config XML. If we add another core and change the XML, do we
 need to restart Solr?

 Best regards, Lisheng

 -Original Message-
 From: shashi@gmail.com [mailto:shashi@gmail.com]On Behalf Of
 Shashi Kant
 Sent: Tuesday, July 17, 2012 5:46 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Could I use Solr to index multiple applications?


 Look up multicore Solr. Another choice could be ElasticSearch, which
 is more straightforward at managing multiple indexes IMO.



 On Tue, Jul 17, 2012 at 7:53 PM, Zhang, Lisheng
 lisheng.zh...@broadvision.com wrote:
 Hi,

 We have an application where we index data into many different directories
 (each directory corresponds to a different Lucene IndexSearcher).

 Looking at the Solr config, it seems that Solr expects there to be only one
 indexed data directory; can we use Solr for our application?

 Thanks very much for your help, Lisheng



Re: Does Solr fit my needs?

2012-04-27 Thread Shashi Kant
We have used both Solr and graph databases for our XML file indexing. Both
are equivalent in terms of performance, but a graph DB (such as Neo4j)
offers a lot more flexibility in joining across nodes and traversing them.
If your data is strictly hierarchical, Solr might do it; otherwise I suggest
looking at a graph database such as Neo4j.



On Fri, Apr 27, 2012 at 10:36 AM, Bob Sandiford 
bob.sandif...@sirsidynix.com wrote:

 ...indexing and searching of the specific fields, it is certainly possible to
 retrieve the xml file.  While Solr isn't a DB, it does allow a binary field
 to be associated with an index document.  We store a GZipped XML file in a
 binary field and retrieve that under certain conditions to get at original
 document information.  We've found that Solr can handle these much faster
 than our DB can do.  (We regularly index a large portion of our documents,
 and the XML files are prone to frequent changes).  If you DO keep such a
 blob in your Solr index, make sure you retrieve that field ONLY when you
 really want it...


Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-23 Thread Shashi Kant
You can update the document in the index quite frequently. I don't know what
your requirement is; another option would be query-time boosting.
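The query-time alternative can be sketched as a second-phase re-rank: run the keyword query first, then fold a frequently-updated static score (the PageRank-like value) into the ordering. Illustrative Python only; the linear weighting is an arbitrary choice:

```python
def rerank(text_hits, static_scores, alpha=0.7):
    """Combine a first-phase text relevance score with a precomputed
    static score (e.g. a PageRank-like value); alpha weights text
    relevance against the static score."""
    def combined(hit):
        doc_id, text_score = hit
        return alpha * text_score + (1 - alpha) * static_scores.get(doc_id, 0.0)
    return sorted(text_hits, key=combined, reverse=True)

# Illustrative: (doc_id, normalized text score) pairs from a first-phase query.
hits = [("d1", 0.9), ("d2", 0.8), ("d3", 0.5)]
pagerank = {"d1": 0.1, "d2": 0.6, "d3": 0.9}
ranked = rerank(hits, pagerank)
```

The advantage over index-time boosts is that the static scores can live outside the index and be refreshed as often as needed.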

On Sun, Jan 22, 2012 at 5:51 AM, Bing Li lbl...@gmail.com wrote:
 Dear Shashi,

 Thanks so much for your reply!

 However, I think the value of PageRank is not a static one. It must be updated
 on the fly. As far as I know, a Lucene index is not suited to being updated
 too frequently. If so, how do I deal with that?

 Best regards,
 Bing


 On Sun, Jan 22, 2012 at 12:43 PM, Shashi Kant sk...@sloan.mit.edu wrote:

 Lucene has a mechanism to boost documents up/down using your custom
 ranking algorithm. So if you come up with something like PageRank,
 you might do something like doc.setBoost(myBoost) before writing to the
 index.



 On Sat, Jan 21, 2012 at 5:07 PM, Bing Li lbl...@gmail.com wrote:
  Hi, Kai,
 
  Thanks so much for your reply!
 
   If the retrieval is done on a string field, not a text field, a
   complete-matching approach should be used, according to my understanding,
   right? If so, how does Lucene rank the retrieved data?
 
  Best regards,
  Bing
 
  On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu lukai1...@gmail.com wrote:
 
   Solr handles the retrieval step; you can customize the score formula in
   Lucene. But it shouldn't be too complicated - ideally it can be
   factorized. It also depends on the stored information, like
   TF, DF, position, etc. You can do a 2nd-phase re-rank of the top N data
   you have retrieved.
 
  Sent from my iPad
 
  On Jan 21, 2012, at 1:33 PM, Bing Li lbl...@gmail.com wrote:
 
   Dear all,
  
   I am using SolrJ to implement a system that needs to provide users
   with
   searching services. I have some questions about Solr searching as
  follows.
  
   As I know, Lucene retrieves data according to the degree of keyword
   matching on text field (partial matching).
  
   But, if I search data by string field (complete matching), how does
  Lucene
   sort the retrieved data?
  
   If I want to add new sorting ways, Solr's function query seems to
   support
   this feature.
  
    However, for a complicated ranking strategy, such as PageRank, can Solr
    provide an interface for me to do that?

    My ranking methods are more complicated than PageRank. Now I have to
    load all of the matched data from Solr first by keyword and rank it
    again in my own way before showing it to users. Is that correct?
  
   Thanks so much!
   Bing
 




Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-21 Thread Shashi Kant
Lucene has a mechanism to boost documents up/down using your custom
ranking algorithm. So if you come up with something like PageRank,
you might do something like doc.setBoost(myBoost) before writing to the index.



On Sat, Jan 21, 2012 at 5:07 PM, Bing Li lbl...@gmail.com wrote:
 Hi, Kai,

 Thanks so much for your reply!

 If the retrieval is done on a string field, not a text field, a
 complete-matching approach should be used, according to my understanding,
 right? If so, how does Lucene rank the retrieved data?

 Best regards,
 Bing

 On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu lukai1...@gmail.com wrote:

 Solr handles the retrieval step; you can customize the score formula in
 Lucene. But it shouldn't be too complicated - ideally it can be
 factorized. It also depends on the stored information, like
 TF, DF, position, etc. You can do a 2nd-phase re-rank of the top N data
 you have retrieved.

 Sent from my iPad

 On Jan 21, 2012, at 1:33 PM, Bing Li lbl...@gmail.com wrote:

  Dear all,
 
  I am using SolrJ to implement a system that needs to provide users with
  searching services. I have some questions about Solr searching as
 follows.
 
  As I know, Lucene retrieves data according to the degree of keyword
  matching on text field (partial matching).
 
  But, if I search data by string field (complete matching), how does
 Lucene
  sort the retrieved data?
 
  If I want to add new sorting ways, Solr's function query seems to support
  this feature.
 
   However, for a complicated ranking strategy, such as PageRank, can Solr
   provide an interface for me to do that?

   My ranking methods are more complicated than PageRank. Now I have to load
   all of the matched data from Solr first by keyword and rank it again in my
   own way before showing it to users. Is that correct?
 
  Thanks so much!
  Bing



Re: Solr, SQL Server's LIKE

2011-12-29 Thread Shashi Kant
for a simple, hackish (albeit inefficient) approach look up wildcard searchers

e,g foo*, *bar
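For the edge n-gram route mentioned in the question, the analysis-time idea is simple: index every prefix of each token, so that a `foo%`-style lookup becomes an exact term match. An illustrative Python sketch of the same idea (Solr's edge n-gram filter does this during analysis):

```python
def edge_ngrams(token, min_len=2):
    """All prefixes of a token from min_len up to its full length,
    the same idea as Solr's edge n-gram filter."""
    return [token[:i] for i in range(min_len, len(token) + 1)]

def index_names(names):
    """Map each indexed prefix to the names it came from."""
    index = {}
    for name in names:
        for tok in name.lower().split():
            for gram in edge_ngrams(tok):
                index.setdefault(gram, set()).add(name)
    return index

idx = index_names(["Acme Corp", "Acorn Industries", "Baumgarten LLC"])
hits = idx.get("aco", set())  # behaves like SQL LIKE 'aco%' on tokens
```

Leading-wildcard (`%bar`) behavior needs the reversed-token variant of the same trick, which is why it is costlier in both SQL and Solr.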



On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten
dbaumgar...@nationalcorp.com wrote:
 I have been tinkering with Solr for a few weeks, and I am convinced that it 
 could be very helpful in many of my upcoming projects. I am trying to decide 
 whether Solr is appropriate for this one, and I haven't had luck looking for 
 answers on Google.

 I need to search a list of names of companies and individuals pretty exactly. 
 T-SQL's LIKE operator does this with decent performance, but I have a feeling 
 there is a way to configure Solr to do this better. I've tried using an edge 
 N-gram tokenizer, but it feels like it might be more complicated than 
 necessary. What would you suggest?

 I know this sounds kind of 'Golden Hammer,' but there has been talk of other, 
 more complicated (magic) searches that I don't think SQL Server can handle, 
 since its tokens (as far as I know) can't be smaller than one word.

 Thanks,

 Devon Baumgarten



Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-27 Thread Shashi Kant
You can also look at cosine similarity (or related metrics) to measure
document similarity.
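A sketch of cosine similarity over term-frequency vectors, the metric mentioned above. This is illustrative Python only; a real dedup pass would use tf-idf weights and a threshold tuned on your own data:

```python
import math
from collections import Counter

def cosine_sim(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def near_duplicates(docs, threshold=0.8):
    """All pairs of doc ids whose similarity meets the threshold."""
    ids = list(docs)
    return [(x, y) for i, x in enumerate(ids) for y in ids[i + 1:]
            if cosine_sim(docs[x], docs[y]) >= threshold]

docs = {
    "a": "solr dedup for documents that match almost exactly",
    "b": "solr dedup for documents that match almost exactly today",
    "c": "an entirely unrelated piece of text",
}
pairs = near_duplicates(docs)
```

The all-pairs loop is O(n^2), which is why MinHash/LSH-style candidate generation is usually put in front of it at scale.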

On Tue, Dec 27, 2011 at 6:51 AM, vibhoreng04 vibhoren...@gmail.com wrote:
 Hi iorixxx,

  Thanks for the quick update. I hope I can take it from here!


 Regards,

 Vibhor

 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/How-to-run-the-solr-dedup-for-the-document-which-match-80-or-match-almost-tp3614239p3614253.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Score

2011-08-15 Thread Shashi Kant
https://wiki.apache.org/lucene-java/ScoresAsPercentages
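The page above explains why percentage scores are misleading across queries. That said, the naive min-max scaling people usually ask for looks like this (illustrative Python, applied per result set):

```python
def scale_scores(scores, lo=0.0, hi=100.0):
    """Min-max scale raw scores into [lo, hi]. Per the ScoresAsPercentages
    page, such scaled scores are NOT comparable across different queries:
    100 only means 'best hit of this result set'."""
    if not scores:
        return []
    mn, mx = min(scores), max(scores)
    if mx == mn:
        return [hi for _ in scores]
    return [lo + (s - mn) * (hi - lo) / (mx - mn) for s in scores]

scaled = scale_scores([0.2, 1.4, 3.8])
```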



On Mon, Aug 15, 2011 at 8:13 PM, Bill Bell billnb...@gmail.com wrote:

 How do I change the score to scale it between 0 and 100, regardless of the
 raw score?

 q.alt=*:*&bq=lang:Spanish&defType=dismax

 Bill Bell
 Sent from mobile




Re: Multiple Cores on different machines?

2011-08-09 Thread Shashi Kant
Betamax VCR? really ? :-)



On Tue, Aug 9, 2011 at 3:38 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : A quick question - is it possible to have 2 cores in Solr on two
 different
 : machines?

  your question is a little vague ... like asking is it possible to have
  two betamax VCRs in two different rooms of my house ... sure, if you
 want ... but why are you asking the question?  are you expecting those
 VCRs to be doing something special that makes you wonder if that special
 thing will work when there are two of them?

 https://people.apache.org/~hossman/#xyproblem
 XY Problem

 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341


 -Hoss



Re: Solr can not index F**K!

2011-07-31 Thread Shashi Kant
Check your stopwords list.
On Jul 31, 2011 6:25 PM, François Schiettecatte fschietteca...@gmail.com
wrote:
 That seems a little far fetched, have you checked your analysis?

 François

 On Jul 31, 2011, at 4:58 PM, randohi wrote:

 One of our clients (a hot girl!) brought this to our attention:
 In this document there are many f* words:

 http://sec.gov/Archives/edgar/data/1474227/00014742271032/d424b3.htm

  and we have indexed it with the latest version of Solr (ver 3.3). But if
  we search for F**K, it does not return the document!

  We have tried to index it with different text types, but it is still not
  working.

 Any idea why F* can not be indexed - being censored by the government? :D


 --
 View this message in context:
http://lucene.472066.n3.nabble.com/Solr-can-not-index-F-K-tp3214246p3214246.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: searching a subset of SOLR index

2011-07-05 Thread Shashi Kant
Range query
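Concretely, assuming the document id is indexed as a numeric field (the field name `id` below is illustrative), a standard filter query restricts matching and scoring to that subset:

```
q=your+query&fq=id:[100 TO 1000]
```

Since `fq` filters are cached independently of the main query, repeated searches over the same subset stay cheap.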


On Tue, Jul 5, 2011 at 4:37 AM, Jame Vaalet jvaa...@capitaliq.com wrote:
 Hi,
  Let's say I have 10^10 documents in an index, with the unique id being a
  document id assigned to each of them from 1 to 10^10.
  Now I want to search for a particular query string in a subset of these
  documents, say document ids 100 to 1000.

  The question here is: will SOLR be able to search just this set of documents
  rather than the entire index? If yes, what should the query be to limit the
  search to this subset?

 Regards,
 JAME VAALET
 Software Developer
 EXT :8108
 Capital IQ




Re: Solr vs ElasticSearch

2011-05-31 Thread Shashi Kant
Here is a very interesting comparison

http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/


 -Original Message-
 From: Mark
 Sent: May-31-11 10:33 PM
 To: solr-user@lucene.apache.org
 Subject: Solr vs ElasticSearch

 I've been hearing more and more about ElasticSearch. Can anyone give me a
 rough overview on how these two technologies differ. What are the
 strengths/weaknesses of each. Why would one choose one of the other?

 Thanks




Re: I need an available solr lucene consultant

2011-05-17 Thread Shashi Kant
You might be better off looking for freelancers on sites such as
odesk.com, guru.com, rentacoder.com, elance.com & many more


On Tue, May 17, 2011 at 4:09 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 Check this out:
 http://wiki.apache.org/solr/Support

 Hi,

 I am looking for an experienced and skilled Solr & Lucene
 developer/consultant to work on a software project incorporating natural
 language processing and machine learning algorithms. As part of a larger
 NLP/AI project that is under way, we need someone to install, refine and
 optimize Solr and Lucene for our website. The data being analyzed will be
 from user-generated textual discussions around a multitude of topics that
 will continuously be updated. You must be able to work in a LAMP
 environment with other developers, be smart, reliable, and a self-starter
 with excellent problem solving and analytical abilities. You must have a
 solid grasp of English – written and verbal.

 Please note that I am a start-up and I am not going to be able to pay what
 a large established company can pay.

 Thank you,

 Lance

 -
 Lance



Re: Looking for help with Solr implementation

2010-11-12 Thread Shashi Kant
Have you tried posting on odesk.com? I have had decent success finding
Solr/Lucene resources there.


On Thu, Nov 11, 2010 at 7:52 PM, AC acanuc...@yahoo.com wrote:

 Hi,


 Not sure if this is the correct place to post but I'm looking for someone
 to
 help finish a Solr install on our LAMP based website.  This would be a paid
 project.


 The programmer that started the project got too busy with his full-time job
 to
 finish the project.  Solr has been installed and a basic search is working
 but
 we need to configure it to work across the site and also set-up faceted
  search. I tried posting on some popular freelance sites but haven't been
 able
 to find anyone with real Solr expertise / experience.


 If you think you can help me with this project please let me know and I can
 supply more details.


 Regards





Re: Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-29 Thread Shashi Kant
On Fri, Oct 29, 2010 at 6:00 PM, Ron Mayer r...@0ape.com wrote:

 I have some documents with a bunch of attachments (images, thumbnails
 for them, audio clips, word docs, etc); and am currently dealing with
 them by just putting a path on a filesystem to them in solr; and then
 jumping through hoops of keeping them in sync with solr.



Not sure why that is an issue. Keeping them in sync with Solr would be the
same as storing them on a file system. Why would storing them within Solr be
any different?


 Would it be nuts to stick the image data itself in solr?

 More specifically - if I have a bunch of large stored fields,
 would it significantly impact search performance in the
 cases when those fields aren't fetched.


Hard to say. I assume you mean storing them by converting to a base64 format.
If you do not retrieve the field when fetching, AFAIK it should not affect
performance significantly, if at all.
So if you manage your retrieval, you should be fine.


 Searches are very common in this system, and it's very rare
 that someone actually opens up one of these attachments
 so I'm not really worried about the time it takes to fetch
 them when someone does actually want one.




Re: Color search for images

2010-09-17 Thread Shashi Kant

 What I am envisioning (at least to start) is have all this add two fields in
 the index.  One would be for color information for the color similarity
 search.  The other would be a simple multivalued text field that we put
 keywords into based on what OpenCV can detect about the image.  If it
 detects faces, we would put face into this field.  Other things that it
 can detect would result in other keywords.

 For the color search, I have a few inter-related hurdles.  I've got to
 figure out what form the color data actually takes and how to represent it
 in Solr.  I need Java code for Solr that can take an input color value and
 find similar values in the index.  Then I need some code that can go in our
 feed processing scripts for new content.  That code would also go into a
 crawler script to handle existing images.


You are on the right track. You can create a set of representative
keywords from the image. OpenCV gets a color histogram from the image
- you can set the bin values to be as granular as you need, and create
a look-up list of color names to generate an MVF representative of the
image.
If you want to get more sophisticated, represent the colors with
payloads in correlation with the distribution of the color in the
image.

Another approach would be to segment the image and extract colors from
each segment. So if you have a red rose on an all-white background, the
textual representation would be something like:

white, white...red...white, white

Play around and see which works best.

HTH
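A toy sketch of that histogram-to-keywords step: quantize each pixel to the nearest named color and keep the dominant names as a multivalued field. The palette, thresholds, and RGB space are arbitrary illustrative choices; a real system would bin in a perceptual space like Lab:

```python
# Toy palette: color name -> representative RGB.
PALETTE = {
    "red":   (255, 0, 0),
    "green": (0, 255, 0),
    "blue":  (0, 0, 255),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}

def nearest_color(rgb):
    """Name of the palette color closest to rgb (squared distance)."""
    return min(PALETTE, key=lambda name: sum(
        (a - b) ** 2 for a, b in zip(rgb, PALETTE[name])))

def color_keywords(pixels, min_fraction=0.1):
    """Multivalued keyword field: palette names covering at least
    min_fraction of the pixels, most frequent first."""
    counts = {}
    for p in pixels:
        name = nearest_color(p)
        counts[name] = counts.get(name, 0) + 1
    total = len(pixels)
    return [name for name, c in
            sorted(counts.items(), key=lambda kv: -kv[1])
            if c / total >= min_fraction]

# A mostly-white image with a red region, like the rose example above.
pixels = [(250, 250, 250)] * 80 + [(200, 10, 10)] * 20
keywords = color_keywords(pixels)
```

The resulting keyword list indexes like any other text field, which is what makes the approach Solr-friendly.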


Re: Color search for images

2010-09-16 Thread Shashi Kant
 Lire looks promising, but how hard is it to integrate the content-based
 search into Solr as opposed to Lucene?  I myself am not a Java developer.  I
 have access to people who are, but their time is scarce.



Lire is a nascent effort and, based on a cursory overview a while back,
IMHO was an over-simplified version of what a CBIR engine should be.
They use CEDD (color & edge descriptors).
It wouldn't work for the kind of applications I am working on - which
need, among other things, color, shape, orientation, pose, edge/corner
detection, etc.

OpenCV has a steep learning curve but, having been through it, I can say it
is a very powerful toolkit - the best there is by far! BTW the code is in
C++, but it has both Java & .NET bindings.
This is a fabulous book to get hold of:
http://www.amazon.com/Learning-OpenCV-Computer-Vision-Library/dp/0596516134,
if you are seriously into OpenCV.

Pls feel free to reach out if you need any help with OpenCV +
Solr/Lucene. I have spent quite a bit of time on this.


Re: Get all results from a solr query

2010-09-16 Thread Shashi Kant
q=*:*

On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com wrote:
 I have some queries that I'm running against a solr instance (older,
 1.2 I believe), and I would like to get *all* the results back (and
 not have to put an absurdly large number as a part of the rows
 parameter).

 Is there a way that I can do that?  Any help would be appreciated.

 -- Chris



Re: Get all results from a solr query

2010-09-16 Thread Shashi Kant
Start with a *:* query; then the “numFound” attribute of the result
element should give you the number of rows to fetch in a 2nd request.
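That two-request pattern generalizes to a paging loop. The sketch below fakes the Solr response shape (numFound plus a window of docs) purely to show the logic; in real use, `query_fn` would issue HTTP requests with `start` and `rows` parameters:

```python
def fetch_all(query_fn, page_size=100):
    """First request learns numFound; subsequent requests page through
    with start/rows until every document is collected."""
    first = query_fn(start=0, rows=page_size)
    total = first["numFound"]
    docs = list(first["docs"])
    while len(docs) < total:
        page = query_fn(start=len(docs), rows=page_size)
        docs.extend(page["docs"])
    return docs

# Fake in-memory "Solr" with 250 documents, standing in for real requests.
ALL_DOCS = [{"id": i} for i in range(250)]

def fake_query(start, rows):
    return {"numFound": len(ALL_DOCS), "docs": ALL_DOCS[start:start + rows]}

results = fetch_all(fake_query)
```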


On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross cogr...@gmail.com wrote:
 That will stil just return 10 rows for me.  Is there something else in
 the configuration of solr to have it return all the rows in the
 results?

 -- Chris



 On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant sk...@sloan.mit.edu wrote:
 q=*:*

 On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross cogr...@gmail.com wrote:
 I have some queries that I'm running against a solr instance (older,
 1.2 I believe), and I would like to get *all* the results back (and
 not have to put an absurdly large number as a part of the rows
 parameter).

 Is there a way that I can do that?  Any help would be appreciated.

 -- Chris





Re: Color search for images

2010-09-15 Thread Shashi Kant
Shawn, I have done some research into this; machine vision, especially
on a large scale, is a hard problem, not to be entered into lightly. I
would recommend starting with OpenCV - a comprehensive toolkit for
extracting various features such as color, edges, etc. from images. Also
there is a project, LIRE (http://www.semanticmetadata.net/lire/), which
attempts to do something along the lines of what you are thinking of. Not
sure how well it works.

HTH,
Shashi


On Wed, Sep 15, 2010 at 10:59 AM, Shawn Heisey s...@elyograg.org wrote:
  My index consists of metadata for a collection of 45 million objects, most
 of which are digital images.  The executives have fallen in love with
 Google's color image search.  Here's a search for flower with a red color
 filter:

 http://www.google.com/images?q=flowertbs=isch:1,ic:specific,isc:red

 I am interested in duplicating this.  Can this group of fine people point me
 in the right direction?  I don't want anyone to do it for me, just help me
 find software and/or algorithms that can extract the color information, then
 find a way to get Solr to index and search it.

 Thanks,
 Shawn




Re: Color search for images

2010-09-15 Thread Shashi Kant

 On a related note, I'm curious if anyone has run across a good set of
 algorithms (or hopefully a library) for doing naive image
 classification. I'm looking for something that can classify images
 into something similar to the broad categories that Google image
 search has (Face, Photo, Clip Art, Line Drawing, etc.).


 --Paul


OpenCV is the way to go. Very comprehensive set of algorithms.


Re: Color search for images

2010-09-15 Thread Shashi Kant
 I'm sure there are some post-doctoral types who could get a graphic shape
 analyzer and color analyzer to at least say it's a flower.

 However, even Google would have to build new datacenters to have the 
 horsepower to do that kind of graphic processing.


Not necessarily true. Like.com - which incidentally got acquired by
Google recently - built a true visual search technology and applied it
on a large scale.


Re: Indexing all versions of Microsoft Office Documents

2010-04-27 Thread Shashi Kant
If you are on Windows try the Microsoft IFilter API - it supports
current Office versions.
http://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CC&displaylang=en



On Tue, Apr 27, 2010 at 6:08 AM, Roland Villemoes r...@alpha-solutions.dk 
wrote:
 Hi All,

 Does anyone have a running solution for indexing Microsoft Office documents,
 e.g. .docx, .xlsx, etc.?

 I can see a lot of examples using Tika for rich content extraction, but still
 nothing when it comes to newer versions of Microsoft Office.
 What libraries should I use, if not Tika?

 med venlig hilsen/best regards

 Roland Villemoes
 Tel: (+45) 22 69 59 62
 E-Mail: mailto:r...@alpha-solutions.dk

 Alpha Solutions A/S
 Borgergade 2, 3.sal, 1300 København K
 Tel: (+45) 70 20 65 38
 Web: http://www.alpha-solutions.dkhttp://www.alpha-solutions.dk/





Re: LucidWorks Solr

2010-04-21 Thread Shashi Kant
Why do these approaches have to be mutually exclusive?
Do a dictionary lookup; if no satisfactory match is found, use an
algorithmic stemmer. That would probably save a few CPU cycles by doing
algorithmic stemming only when necessary.
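A sketch of that hybrid in Python. The dictionary entries and suffix rules here are purely illustrative stand-ins for a real lexicon and a real algorithmic stemmer like Porter's:

```python
# Illustrative exception dictionary; a real one would be a full lexicon.
STEM_DICT = {"ran": "run", "geese": "goose", "went": "go"}

SUFFIXES = ["ing", "ed", "es", "s"]  # crude, Porter-like suffix rules

def algorithmic_stem(word):
    """Strip the first matching suffix, keeping at least a 3-letter stem;
    handles novel forms like 'googling' that no dictionary lists."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

def hybrid_stem(word):
    """Dictionary lookup first; fall back to the algorithmic stemmer."""
    return STEM_DICT.get(word) or algorithmic_stem(word)
```

This keeps the dictionary's precision for irregular forms while letting the algorithmic fallback absorb vocabulary the dictionary has never seen.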


On Wed, Apr 21, 2010 at 1:31 PM, Robert Muir rcm...@gmail.com wrote:
 ...easy to look at the faults of some algorithmic stemmer; in truth its
 only purpose is to cause related forms of a word to conflate to the same
 form, while hopefully keeping unrelated terms from conflating to this form.

 A dictionary-based stemmer is out-of-date the day you put it into
 production: languages aren't static. For example, you can't expect a
 dictionary-based stemmer to properly deal with forms like googling or
 tweets that have recently slipped into English vocabulary, but an
 algorithmic stemmer will likely deal with these just fine.


Re: Query time only Ranges

2010-03-31 Thread Shashi Kant
In that case, you could just calculate an offset from 00:00:00 in
seconds (ignoring the date).
Pretty simple.
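That offset is a one-liner to compute, and storing it in an integer field (the field name time_offset below is a made-up example) lets ordinary numeric range queries express time-only ranges:

```python
# Convert a time-of-day to an integer offset in seconds from 00:00:00,
# ignoring any date component, so it can be indexed in a numeric field.
def seconds_since_midnight(hhmmss):
    """'13:45:30' -> 49530"""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

# e.g. everything between 09:00:00 and 17:00:00, regardless of date:
#   fq=time_offset:[32400 TO 61200]
```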


On Wed, Mar 31, 2010 at 4:57 PM, abhatna...@vantage.com
abhatna...@vantage.com wrote:

 Hi Sashi,
 Could you elaborate point no .1 in the light of case where in a field should
 have just time?


 Ankit


 --
 View this message in context: 
 http://n3.nabble.com/Query-time-only-Ranges-tp688831p689413.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: boost on certain keywords

2010-01-28 Thread Shashi Kant
Look at Payload.

On Thu, Jan 28, 2010 at 6:48 AM, murali k ilar...@gmail.com wrote:

 Say I have a clothes store,  i have ladies clothes, mens clothes

 when someone searches for clothes, i want to prioritize mens clothing
 results,
 how can I achieve this ?
 this logic should only apply for this keyword, other keywords should work
 as-is

 should I be trying with something on synonyms or during the process of
 indexing ? or something in dismax request handler ?




 --
 View this message in context: 
 http://old.nabble.com/boost-on-certain-keywords-tp27354717p27354717.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: boost on certain keywords

2010-01-28 Thread Shashi Kant
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
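If payloads feel too heavyweight, a simpler query-time alternative is to attach a boost query only when the trigger keyword appears. The sketch below builds request parameters for the dismax handler; the category field, the boost value, and the keyword table are all made-up examples:

```python
# Query-time alternative to payloads: when the user's query contains a
# trigger keyword, attach a dismax boost query (bq) favouring a category.
KEYWORD_BOOSTS = {"clothes": "category:mens^5"}  # hypothetical mapping

def build_params(user_query):
    params = {"q": user_query, "defType": "dismax"}
    boosts = [bq for kw, bq in KEYWORD_BOOSTS.items()
              if kw in user_query.lower().split()]
    if boosts:
        params["bq"] = " ".join(boosts)  # boost, don't filter
    return params
```

Because the boost lives in bq, every other keyword is searched exactly as before.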


On Thu, Jan 28, 2010 at 6:54 AM, Shashi Kant sk...@sloan.mit.edu wrote:
 Look at Payload.

 On Thu, Jan 28, 2010 at 6:48 AM, murali k ilar...@gmail.com wrote:

 Say I have a clothes store,  i have ladies clothes, mens clothes

 when someone searches for clothes, i want to prioritize mens clothing
 results,
 how can I achieve this ?
 this logic should only apply for this keyword, other keywords should work
 as-is

 should I be trying with something on synonyms or during the process of
 indexing ? or something in dismax request handler ?




 --
 View this message in context: 
 http://old.nabble.com/boost-on-certain-keywords-tp27354717p27354717.html
 Sent from the Solr - User mailing list archive at Nabble.com.





Re: HI

2009-12-13 Thread Shashi Kant
http://lmgtfy.com/?q=lucene+basics


On Sun, Dec 13, 2009 at 1:01 PM, Faire Mii faire@gmail.com wrote:

 Hi,

 I am a beginner and i wonder what a document, entity and a field relates to
 in a database?

 And i wonder if there are some good tutorials that learn you how to design
 your schema. Because all other articles i have

 read aren't very helpful for beginners.

 Regards

 Fayer



Re: Migrating to Solr

2009-11-24 Thread Shashi Kant
Here is a link that might be helpful:

http://sesat.no/moving-from-fast-to-solr-review.html

The site is chock-a-block with great information on their migration
experience.


On Tue, Nov 24, 2009 at 8:55 AM, Tommy Molto tommymo...@gmail.com wrote:

 Hi,

 I'm new at Solr and i need to make a test pilot of a migration from Fast
 ESP to Apache Solr, anyone had this experience before?


 Att,



Re: Solr - Load Increasing.

2009-11-16 Thread Shashi Kant
I think it would be useful for members of this list to realize that not
everyone uses the same units and terminology.

It is very easy for Americans to use the imperial system and presume
everyone does the same, or for Europeans to use the metric system, and so
on. Hopefully members of this list can be persuaded to use, or at least
clarify, their terminology.

While the apocryphal saying goes, "the great thing about standards is that
there are so many to choose from," we should all make an effort to
communicate across cultures and nations.



On Mon, Nov 16, 2009 at 5:33 PM, Israel Ekpo israele...@gmail.com wrote:

 On Mon, Nov 16, 2009 at 5:22 PM, Walter Underwood wun...@wunderwood.org
 wrote:

  Probably lakh: 100,000.
 
  So, 900k qpd and 3M docs.
 
  http://en.wikipedia.org/wiki/Lakh
 
  wunder
 
  On Nov 16, 2009, at 2:17 PM, Otis Gospodnetic wrote:
 
   Hi,
  
   Your autoCommit settings are very aggressive.  I'm guessing that's
 what's
  causing the CPU load.
  
   btw. what is laks?
  
   Otis
   --
   Sematext is hiring -- http://sematext.com/about/jobs.html?mls
   Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
  
  
  
   - Original Message 
   From: kalidoss kalidoss.muthuramalin...@sifycorp.com
   To: solr-user@lucene.apache.org
   Sent: Mon, November 16, 2009 9:11:21 AM
   Subject: Solr - Load Increasing.
  
   Hi All.
  
 My server solr box cpu utilization  increasing b/w 60 to 90% and
 some
  time
   solr is getting down and we are restarting it manually.
  
 No of documents in solr 30 laks.
 No of add/update requrest solr 30 thousand / day. Avg of every 30
  minutes
   around 500 writes.
 No of search request 9laks / day.
 Size of the data directory: 4gb.
  
  
 My system ram is 8gb.
 System available space 12gb.
 processor Family: Pentium Pro
  
 Our solr data size can be increase in number like 90 laks. and
 writes
  per day
   will be around 1laks.   - Hope its possible by solr.
  
 For write commit i have configured like
  
1
10
  
  
 Is all above can be possible? 90laks datas and 1laks per day writes
  and
   30laks per day read??  - if yes what type of system configuration
 would
  require.
  
 Please suggest us.
  
   thanks,
   Kalidoss.m,
  
  
   Get your world in your inbox!
  
   Mail, widgets, documents, spreadsheets, organizer and much more with
  your
   Sifymail WIYI id!
   Log on to http://www.sify.com
  
   ** DISCLAIMER **
   Information contained and transmitted by this E-MAIL is proprietary to
  Sify
   Limited and is intended for use only by the individual or entity to
  which it is
   addressed, and may contain information that is privileged,
 confidential
  or
   exempt from disclosure under applicable law. If this is a forwarded
  message, the
   content of this E-MAIL may not have been sent with the authority of
 the
  Company.
   If you are not the intended recipient, an agent of the intended
  recipient or a
   person responsible for delivering the information to the named
  recipient,  you
   are notified that any use, distribution, transmission, printing,
 copying
  or
   dissemination of this information in any way or in any manner is
  strictly
   prohibited. If you have received this communication in error, please
  delete this
   mail  notify us immediately at ad...@sifycorp.com
  
 
 


 Thanks Walter for clarifying that.

 I too was wondering what laks meant.

 It was a bit distracting when I read the original post.
 --
 Good Enough is not good enough.
 To give anything less than your best is to sacrifice the gift.
 Quality First. Measure Twice. Cut Once.



Re: Search Within

2009-04-04 Thread Shashi Kant
This post describes the search-within-search implementation.

http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html
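The core of that approach can be reduced to a one-liner: each "search within" step just ANDs a new clause onto the previous query, so it is an ordinary new search rather than a special operation. A sketch:

```python
# "Search within results" as plain query stacking: each refinement
# ANDs a new clause onto the accumulated query string.
def refine(query, term):
    """Narrow an existing query by AND-ing in another term."""
    return f"({query}) AND ({term})" if query else term

q = refine("", "horse")   # first search
q = refine(q, "dog")      # narrows to items matching both terms
```

In Solr specifically, adding a new fq filter-query parameter per refinement achieves the same narrowing and lets each filter be cached independently.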


Shashi


On Sat, Apr 4, 2009 at 1:21 PM, Vernon Chapman chapman.li...@gmail.comwrote:

 Bess,

 I think that might work I'll try it out and see how it works for my case.

 thanks


 Bess Sadler wrote:

 Hi, Vernon.

 In Blacklight, the way we've been doing this is just to stack queries on
 top of each other. It's a conceptual shift from the way one might think
 about search within, but it accomplishes the same thing. For example:

 search1 == q=horse

 search2 == q=horse AND dog

 The second search, from the user's point of view, takes the search results
 from the horse search and further narrows them to those items that also
 contain dog. But you're really just doing a new search, one that contains
 both search values.

 Does that help? Or am I misunderstanding your question?

 Bess

 On 4-Apr-09, at 12:10 PM, Vernon Chapman wrote:

 I am not sure if this is a really easy or newbie-ish type question.
 I would like to implement a search within these results type feature.
 Has anyone done this and could you please share some tips, pointers and
 or documentation on how to implement this.

 Thanks

 Vern






Re: Hardware Questions...

2009-03-24 Thread Shashi Kant
Have you looked at http://wiki.apache.org/solr/SolrPerformanceData ?

On Tue, Mar 24, 2009 at 4:51 PM, solr s...@highbeam.com wrote:

 We have three Solr servers (several two processor Dell PowerEdge
 servers). I'd like to get three newer servers and I wanted to see what
 we should be getting. I'm thinking the following...



 Dell PowerEdge 2950 III

 2x2.33GHz/12M 1333MHz Quad Core

 16GB RAM
 6 x 146GB 15K RPM RAID-5 drives



 How do people spec out servers, especially CPU, memory and disk? Is this
 all based on the number of doc's, indexes, etc...



 Also, what are people using for benchmarking and monitoring Solr? Thanks
 - Mike




Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
Another project worth investigating is Tesseract.

http://code.google.com/p/tesseract-ocr/
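A hedged sketch of wiring Tesseract in front of an indexer: it assumes the tesseract binary is on the PATH and that, as in its classic CLI, `tesseract IMAGE OUTBASE` writes OUTBASE.txt. Command construction is split out so it can be checked without actually running OCR:

```python
# Shell out to the Tesseract CLI and read back the recognized text.
# Assumes `tesseract IMAGE OUTBASE` writes OUTBASE.txt (classic CLI).
import subprocess

def tesseract_command(image_path, out_base):
    """argv for one OCR run; separated out for easy inspection/testing."""
    return ["tesseract", image_path, out_base]

def ocr_to_text(image_path, out_base="page"):
    subprocess.run(tesseract_command(image_path, out_base), check=True)
    with open(out_base + ".txt", encoding="utf-8") as f:
        return f.read()  # this string is what you would feed to the indexer
```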




- Original Message 
From: Hannes Carl Meyer m...@hcmeyer.com
To: solr-user@lucene.apache.org
Sent: Thursday, February 26, 2009 11:35:14 AM
Subject: Re: Use of scanned documents for text extraction and indexing

Hi Sithu,

there is a project called ocropus done by the DFKI, check the online demo
here: http://demo.iupr.org/cgi-bin/main.cgi

And also http://sites.google.com/site/ocropus/

Regards

Hannes

m...@hcmeyer.com
http://mimblog.de

On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. 
sithu.sudar...@fda.hhs.gov wrote:


 Hi All:

 Is there any study / research done on using scanned paper documents as
 images (may be PDF), and then use some OCR or other technique for
 extracting text, and the resultant index quality?


 Thanks in advance,
 Sithu D Sudarsan

 sithu.sudar...@fda.hhs.gov
 sdsudar...@ualr.edu






Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
Can anyone back that up?

IMHO Tesseract is the state of the art in OCR, but I'm not sure that Ocropus
builds on Tesseract. Can anyone confirm Vikram's point?

Shashi




- Original Message 
From: Vikram Kumar vikrambku...@gmail.com
To: solr-user@lucene.apache.org; Shashi Kant sk...@sloan.mit.edu
Sent: Thursday, February 26, 2009 9:21:07 PM
Subject: Re: Use of scanned documents for text extraction and indexing

Tesseract is pure OCR. Ocropus builds on Tesseract.
Vikram

On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant shashi_k...@yahoo.com wrote:

 Another project worth investigating is Tesseract.

 http://code.google.com/p/tesseract-ocr/




 - Original Message 
 From: Hannes Carl Meyer m...@hcmeyer.com
 To: solr-user@lucene.apache.org
 Sent: Thursday, February 26, 2009 11:35:14 AM
 Subject: Re: Use of scanned documents for text extraction and indexing

 Hi Sithu,

 there is a project called ocropus done by the DFKI, check the online demo
 here: http://demo.iupr.org/cgi-bin/main.cgi

 And also http://sites.google.com/site/ocropus/

 Regards

 Hannes

 m...@hcmeyer.com
 http://mimblog.de

 On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. 
 sithu.sudar...@fda.hhs.gov wrote:

 
  Hi All:
 
  Is there any study / research done on using scanned paper documents as
  images (may be PDF), and then use some OCR or other technique for
  extracting text, and the resultant index quality?
 
 
  Thanks in advance,
  Sithu D Sudarsan
 
  sithu.sudar...@fda.hhs.gov
  sdsudar...@ualr.edu
 
 
 





Re: why don't we have a forum for discussion?

2009-02-18 Thread Shashi Kant
one man's crap is another man's treasure. :-P

So how would you decide what is worth posting? 
If you feel the list is overwhelming your email, set some filters.


Shashi


- Original Message 
From: Tony Wang ivyt...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wednesday, February 18, 2009 2:06:57 PM
Subject: why don't we have a forum for discussion?

I am just curious why we don't have a forum for discussion or you guys think
it's really necessary to receive lots of crap information about Solr and
nutch in email? I can offer you a forum for discussion anyway.

-- 
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信



Re: why don't we have a forum for discussion?

2009-02-18 Thread Shashi Kant
Steve - could you not just subscribe to the list from another email account 
(Gmail or Yahoo, for example) that you keep off your mobile device?
We discourage using corporate email to subscribe to mailing lists precisely 
for such reasons: volume, spam, malware risks, etc.

Shashi




- Original Message 
From: Stephen Weiss swe...@stylesight.com
To: solr-user@lucene.apache.org
Sent: Wednesday, February 18, 2009 7:34:30 PM
Subject: Re: why don't we have a forum for discussion?

Like an earlier poster, my issue isn't on the laptop, it's with my mobile 
device.  The sheer volume of e-mail overwhelms the thing sometimes (right now, 
for instance).  There's really no option for moving the e-mail off to some 
other folder, it just all goes to one place.

Perhaps that means I need a better phone; it's just that the obvious solutions 
aren't always practical. Conversely, forums can just as easily be set up to 
emulate mailing lists; our company's internal forum works this way.

--
Steve

On Feb 18, 2009, at 7:16 PM, Mike Klaas wrote:

 
 
 2. Many people greatly prefer the mailing list format (obviously, it takes a 
 little bit of effort to use mailinglists effectively (e.g., directing the 
 traffic to a folder/tag/etc.)