Re: Adding another dimension to Lucene searches

2010-05-10 Thread J. Delgado
Hierarchical documents are a key concept towards a unified structured+unstructured search. It should allow us to fully implement things such as XQuery + Full-Text (http://www.w3.org/TR/xquery-full-text/). Additionally, it solves a century-old problem: how to deal with section/sub-sections in very

Indexing Boolean Expressions

2012-02-21 Thread J. Delgado
Hi, I would like to propose implementing Indexing Boolean Expressions (See http://www.vldb.org/pvldb/2/vldb09-83.pdf) as a Lucene-based project for GSoC. Here is a snippet from the Abstract of the paper: We consider the problem of efficiently indexing Disjunctive Normal Form (DNF) and Conjunctive

Re: Indexing Boolean Expressions

2012-02-21 Thread J. Delgado
, Aayush Kothari aayush.kothar...@gmail.com wrote: That's a really nice application of DNF and CNF. I'd be happy to work at it if it gets approved in GSoC. On 21 February 2012 14:09, J. Delgado joaquin.delg...@gmail.com wrote: Hi, I would like to propose implementing Indexing Boolean

Re: Indexing Boolean Expressions

2012-03-05 Thread J. Delgado
I looked at LUCENE-2987 and its work on the query side (changes to the accepted syntax to accept lower case 'or' and 'and'), which isn't really related to my proposal. What I'm proposing is to be able to index complex boolean expressions using Lucene. This can be viewed as the opposite of the
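
A minimal sketch of the role reversal described in this message, under simplifying assumptions: each stored expression is a single conjunction of attribute=value predicates (no negation, no multivalued predicates, no ranking), and the names `ConjunctionIndex`, `add`, and `match` are hypothetical. It is not the algorithm from the VLDB paper; it only illustrates that the expressions are what gets indexed, while an incoming attribute assignment plays the part of the query.

```python
from collections import defaultdict


class ConjunctionIndex:
    """Toy inverted index over conjunctions of attribute=value predicates."""

    def __init__(self):
        # (attribute, value) -> ids of the conjunctions containing that predicate
        self.postings = defaultdict(set)
        # conjunction id -> number of predicates it contains
        self.sizes = {}

    def add(self, conj_id, predicates):
        """Index one conjunction, given as a dict of attribute -> required value."""
        self.sizes[conj_id] = len(predicates)
        for attr, value in predicates.items():
            self.postings[(attr, value)].add(conj_id)

    def match(self, assignment):
        """Return the ids of conjunctions fully satisfied by the assignment."""
        hits = defaultdict(int)
        for attr, value in assignment.items():
            for conj_id in self.postings.get((attr, value), ()):
                hits[conj_id] += 1
        # A conjunction matches only when every one of its predicates was hit.
        return sorted(c for c, n in hits.items() if n == self.sizes[c])


index = ConjunctionIndex()
index.add("ad1", {"age": "3", "state": "CA"})
index.add("ad2", {"state": "NY"})
index.add("ad3", {"age": "3"})
matched = index.match({"age": "3", "state": "CA", "gender": "F"})
```

Here `matched` holds the conjunctions whose predicates are all satisfied by the assignment. A DNF expression could be handled on top of this by indexing each of its conjunctions separately and mapping the matched conjunction ids back to their parent expression.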

Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-22 Thread J. Delgado
Mark, can you share more on which K-V (NoSQL) stores you've been benchmarking and what the results have been? Did you try all the well-known ones? http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis -- J On Thu, Mar 22, 2012 at 10:42 AM, mark harwood markharw...@yahoo.co.uk wrote:

Re: Indexing Boolean Expressions

2012-03-26 Thread J. Delgado
() Negative clauses, and multivalue can be covered also, I believe. WDYT? On Mon, Mar 5, 2012 at 10:05 PM, J. Delgado joaquin.delg...@gmail.com wrote: I looked at LUCENE-2987 and its work on the query side (changes to the accepted syntax to accept lower case 'or' and 'and'), which isn't really

Re: Indexing Boolean Expressions

2012-03-26 Thread J. Delgado
/text.111/b28303/query.htm#autoId8 http://docs.oracle.com/cd/B28359_01/text.111/b28303/classify.htm#g1011013 -- J On Mon, Mar 26, 2012 at 10:07 AM, J. Delgado joaquin.delg...@gmail.com wrote: In full disclosure, there is a patent application that Yahoo! has filed for the use of inverted indexes

Re: Welcome back, Wolfgang Hoschek!

2013-09-26 Thread J. Delgado
Percolator for Solr? :{ On Thursday, September 26, 2013, Otis Gospodnetic wrote: Another welcome back! Any specific area where you plan on contributing? Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Sep 27, 2013

Re: Indexing Boolean Expressions

2013-02-11 Thread J. Delgado
://www.linkedin.com/pub/profile/0/04b/277 On Mon, Mar 26, 2012 at 10:17 AM, Walter Underwood wun...@wunderwood.org wrote: Efficient rule matching goes further back, at least to alerting in Verity K2. wunder Search Guy, Chegg On Mar 26, 2012, at 10:15 AM, J. Delgado wrote: BTW, the idea

Re: Indexing Boolean Expressions

2013-02-11 Thread J. Delgado
top N best matching ads against queries derived from page content, you need that relevancy score to get not all matching docs, but just those top N. No? Otis -- http://sematext.com/ On Mon, Feb 11, 2013 at 11:22 AM, J. Delgado joaquin.delg...@gmail.com wrote: I guess ElasticSearch

Re: Where Search Meets Machine Learning

2015-05-02 Thread J. Delgado
tracking and other relevance feedback data: Good stuff! Again, thanks for sharing, -Doug On Wed, Apr 29, 2015 at 6:58 PM, J. Delgado joaquin.delg...@gmail.com wrote: Here is a presentation on the topic: http://www.slideshare.net/joaquindelgado1/where-search-meets-machine

Where Search Meets Machine Learning

2015-04-29 Thread J. Delgado
Here is a presentation on the topic: http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final Search can be viewed as a combination of a) A problem of constraint satisfaction, which is the process of finding a solution to a set of constraints (query) that impose

Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
as context (such as location, time-of-the-day, day-of-the-week, site-section, device type, etc) to make predictions/scoring. This can still be combined with the usual IR based scoring to keep semantics as the driving force. -J On Monday, May 4, 2015, J. Delgado joaquin.delg...@gmail.com wrote

Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
I totally agree that it depends on the task at hand and the amount/quality of the data that you can get hold of. The problem of relevancy in traditional document/semantic information retrieval (IR) tasks is such a hard thing because there is little or no source of truth you could use as training

Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
BTW, as I mentioned, the machine learning On Monday, May 4, 2015, J. Delgado joaquin.delg...@gmail.com wrote: I totally agree that it depends on the task at hand and the amount/quality of the data that you can get hold of. The problem of relevancy in traditional document/semantic information

Game time

2015-05-05 Thread J. Delgado
on the 16 Alex has baseball game at 12:30

Re: Game time

2015-05-05 Thread J. Delgado
Sorry mistake ... On Tuesday, May 5, 2015, J. Delgado joaquin.delg...@gmail.com wrote: on the 16 Alex has baseball game at 12:30

CFP RecSysTV 2015

2015-05-08 Thread J. Delgado
Apologies for any cross-posting. Please distribute to colleagues who may be interested. -- Joaquin (on behalf of the Organizers) CALL FOR PAPERS 2nd Workshop on Recommender Systems for Television and Online Video http://www.recsys.tv We are pleased to invite you to participate in the 2nd

Word Embedding stored in Lucene Index

2017-12-09 Thread J. Delgado
It has been a couple of years since the Neu-IR WS (https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/craswell-report-2016.pdf). I'm wondering if anyone has tinkered with storing word/document embeddings and using them inside Lucene to improve the core relevance model. One of the key

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread J. Delgado
What about the use of word embeddings (see https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa) to compute word similarity? On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull < dturnb...@opensourceconnections.com> wrote: > Hey folks, > > I wanted to open up a

Re: Vector based store and ANN

2019-02-28 Thread J. Delgado
Lucene’s scoring function (which I believe is Okapi BM25 https://en.m.wikipedia.org/wiki/Okapi_BM25) is a kind of nearest neighbor using the TF-IDF vector representation of documents and query. Are you interested in ANN to be applied to a different kind of vector representation, say for example

Re: Vector based store and ANN

2019-03-01 Thread J. Delgado
ng the nearest neighbors to a >given query could also benefit from these ANN algorithms (although doesn’t >necessarily need the vector based index) > > > > I would be grateful to hear your thoughts and whether the community is > open to a conversation on this topic with m

Re: Vector based store and ANN

2019-03-02 Thread J. Delgado
> > <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnmslib%2Fhnswlib=02%7C01%7Cpedramr%40microsoft.com%7Cd78c3778fd334445ca1c08d69e9cfe5d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636870794499787389=2ZxNZFReYuryCjGak9Szz5BmgjT9G59IBOw9q3RlCbo%3D=0> >

Parallel Scoring

2019-02-01 Thread J. Delgado
Hi folks, Assuming documents can be scored independently, what level of document scoring parallelism (thread or process wise) have people experimented with on a single multi-core machine containing a single shard?

Re: Maximum score estimation

2022-12-19 Thread J. Delgado
Actually, I believe that the Lucene scoring function is based on *Okapi BM25* (BM is an abbreviation of best matching) which is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson
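
For reference, a minimal sketch of the textbook BM25 scoring function mentioned in this message. The parameter values `k1=1.2` and `b=0.75` and the smoothed IDF form used here are common choices and are assumptions of this sketch; the function names are hypothetical and this is not a copy of any Lucene class.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    # Rare terms (low doc_freq) get a higher inverse document frequency.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * tf * (k1 + 1.0) / (tf + norm)


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, doc_freqs, num_docs):
    """Sum the per-term scores over the query terms present in the document."""
    return sum(
        bm25_term_score(doc_tf[t], doc_len, avg_doc_len, doc_freqs[t], num_docs)
        for t in query_terms
        if doc_tf.get(t, 0) > 0
    )
```

The saturation in `tf` also gives each term a finite maximum contribution of `idf * (k1 + 1)`, which is the kind of bound a maximum score estimate can build on.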

Job Opportunity (Sunnyvale, CA)

2007-01-09 Thread J. Delgado
(Sorry for the cross-posting) This is a full-time position with an exciting New Venture (now in stealth mode) and will be based out of Sunnyvale, CA. We are looking for a Java Developer with search, social networks and/or payment processing related experience. Required Skills: 2+ yrs of

Re: Lucene Scalability Question

2007-01-10 Thread J. Delgado
This is a more general question: Given the fact that most applications require querying a combination of full-text and structured data, has anyone looked into building data structures at the most fundamental level (e.g. combination of b-tree and inverted lists) that would enable scalable and

Re: Lucene Scalability Question

2007-01-10 Thread J. Delgado
that changes that! It loads Lucene into the Oracle database (it has a JVM), and allows Lucene syntax to perform full-text searching. On Jan 10, 2007, at 2:37 PM, J. Delgado wrote: No, Oracle Text does not use Lucene. It has its own proprietary full-text engine. It represents documents

Progressive Query Relaxation

2007-04-09 Thread J. Delgado
Has anyone within the Lucene or Solr community attempted to code a progressive query relaxation technique similar to the one described here for Oracle Text? http://www.oracle.com/technology/products/text/htdocs/prog_relax.html Thanks, -- J.D.

Re: Various Ideas from ApacheCon

2007-05-10 Thread J. Delgado
The ever growing presence of mingled structured and unstructured data is a fact of life and modern systems we have to deal with. Clearly, the tendency is that full-text indexing is moving towards DB functionality, i.e. attribute,value fields for projection/filtering, sorting, faceted queries,

Oracle-Lucene integration (OJVMDirectory and Lucene Domain Index) - LONG

2007-09-13 Thread J. Delgado
I'm very happy to announce the partial rework and extension to LUCENE-724 (Oracle-Lucene Integration), primarily based on new requirements from LendingClub.com, who commissioned the work to Marcelo Ochoa, the contributor of the original patch (great job Marcelo!). As contribution of

Re: Lucene Analyzers

2007-10-28 Thread J. Delgado
If you don't want to start from scratch you may look at what is available in the GATE framework, also written in Java: http://gate.ac.uk/gate/doc/plugins.html#hindi 2007/10/28, Grant Ingersoll [EMAIL PROTECTED]: A Google search reveals:

Oracle-Lucene Domain Index (New Release)

2007-12-13 Thread J. Delgado
Once again, LendingClub.com, a social lending network that today announced nation-wide expansion (see Tech Crunch), is pleased to contribute to the open source community a new release (2.2.0.2.0) of the Oracle-Lucene Domain Index, a fast implementation of text indexing and search using Lucene

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
I assume that Google also has a distributed index over their GFS/MapReduce implementation. Any idea how they achieve this? J.D. On Feb 6, 2008 11:33 AM, Clay Webster [EMAIL PROTECTED] wrote: There seem to be a few other players in this space too. Are you from Rackspace?

Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-06 Thread J. Delgado
. On Feb 6, 2008 4:22 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: (trimming excessive cc-s) Ning Li wrote: No. I'm curious too. :) On Feb 6, 2008 11:44 AM, J. Delgado [EMAIL PROTECTED] wrote: I assume that Google also has a distributed index over their GFS/MapReduce implementation. Any

Re: an API for synonym in Lucene-core

2008-03-13 Thread J. Delgado
Mathieu, Have you thought about incorporating a standard format for thesaurus and thus for query/index expansion? Here is the recommendation from NISO: http://www.niso.org/committees/MT-info.html Beyond synonyms, having the capabilities to specify the use of BT (broader terms or Hypernyms) or NT

Fwd: New binary distribution of Oracle-Lucene integration

2008-04-13 Thread J. Delgado
Here is the latest on the Oracle-Lucene Integration. J.D. -- Forwarded message -- From: Marcelo Ochoa [EMAIL PROTECTED] Date: Mon, Apr 7, 2008 at 10:01 AM Subject: New binary distribution of Oracle-Lucene integration To: [EMAIL PROTECTED] Hi all: I just released a new version

Re: How to do a query using less than or greater than

2008-06-24 Thread J. Delgado
I do not believe that the operators < and > are supported by Lucene, but you can use RANGE SEARCH to achieve what you want. Just put an unreachable upper boundary for greater than or lower boundary for less than. J.D. On Tue, Jun 24, 2008 at 3:31 PM, Kyle Miller [EMAIL PROTECTED] wrote: Hi all,
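
A minimal sketch of the workaround suggested in this reply, assuming a field of non-negative integers indexed as zero-padded strings (so that string order matches numeric order) and the classic query-parser range syntax `field:[low TO high]` with inclusive bounds. The helper names and the padding width are hypothetical.

```python
def pad(n, width=10):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(n).zfill(width)


def greater_than(field, n, width=10):
    """Emulate field > n with an inclusive range up to an unreachable upper bound."""
    upper = "9" * width  # largest value the chosen padding can represent
    return f"{field}:[{pad(n + 1, width)} TO {upper}]"


def less_than(field, n, width=10):
    """Emulate field < n (n >= 1) with an inclusive range from the lowest value."""
    lower = "0" * width  # smallest value the chosen padding can represent
    return f"{field}:[{lower} TO {pad(n - 1, width)}]"


gt_query = greater_than("price", 100)
lt_query = less_than("price", 100)
```

The resulting strings, for example `price:[0000000101 TO 9999999999]`, would be handed to the query parser as ordinary range queries.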

Re: My understanding about lucene internals.

2008-06-30 Thread J. Delgado
Prasen, Great summary! On Mon, Jun 30, 2008 at 4:27 AM, Mukherjee, Prasenjit [EMAIL PROTECTED] wrote: Hi, I have tried to consolidate my understanding of lucene with the following ppt slides. I would really appreciate your comments (especially where I am incorrect) specifically on slide16

Re: Re[4]: lucene scoring

2008-08-08 Thread J. Delgado
The only scores that I can think of that can measure quality across different queries are invariant scores such as PageRank. That is, score the document on its general information value and then use that as a filter regardless of the query. This is very different from the problem of trying to

Re: Moving SweetSpotSimilarity out of contrib

2008-09-06 Thread J. Delgado
I cannot agree more with Otis. It's all about exposure! Without references from main JavaDocs, some cool things in contrib just remain in obscurity. -- Joaquin On Sat, Sep 6, 2008 at 1:08 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Regarding SSS (and any other contrib visibility). Perhaps

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later. Otis, what do you mean exactly by

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED] wrote: for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions

Re: Realtime Search for Social Networks Collaboration

2008-09-07 Thread J. Delgado
:16 AM, J. Delgado [EMAIL PROTECTED] wrote: On Sun, Sep 7, 2008 at 2:41 AM, mark harwood [EMAIL PROTECTED] wrote: for example joins are not possible using SOLR). It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems

Re: Realtime Search for Social Networks Collaboration

2008-09-08 Thread J. Delgado
faster hierarchical queries and perhaps other types of queries that Lucene is not capable of. Is this something Joaquin you are interested in collaborating on? I am definitely interested in it. On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado [EMAIL PROTECTED] wrote: On Sat, Sep 6, 2008 at 1:36 AM

Re: Realtime Search for Social Networks Collaboration

2008-09-21 Thread J. Delgado
? On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado [EMAIL PROTECTED] wrote: Yes, both Marcelo and I would be interested. We looked into H2 and it looks like something similar to Oracle's ODCI can be implemented. Plus the primitive full-text implementation is based on Lucene. I say primitive

Re: Realtime Search for Social Networks Collaboration

2008-09-21 Thread J. Delgado
Sorry, I meant loose (replacing lose) On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado [EMAIL PROTECTED] wrote: On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् [EMAIL PROTECTED] wrote: Moving back to RDBMS model will be a big step backwards where we miss multivalued fields

Re: Ocean and GData

2008-09-27 Thread J. Delgado
On Sat, Sep 27, 2008 at 5:03 AM, Jason Rutherglen [EMAIL PROTECTED] wrote: Unlike MapReduce, there are no infrastructure whitepapers on how GData/Base works so I had to make a broad comparison rather than a specific one. My understanding is that GBase is based on the infrastructure that

Re: Realtime Search

2008-12-26 Thread J. Delgado
The addition of docs into tiny segments using the current data structures seems the right way to go. Sometime back one of my engineers implemented pseudo real-time using MultiSearcher by having an in-memory (RAM based) short-term index that auto-merged into a disk-based long term index that
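
A minimal sketch of the two-tier arrangement described in this message, under loud assumptions: both tiers are plain in-memory dictionaries (the "long-term" one merely stands in for a disk-based index), the merge is triggered by a document count, and the class and method names are hypothetical. It shows only the shape of the idea: new documents are searchable immediately from the short-term tier, and every search consults both tiers.

```python
from collections import defaultdict


class TwoTierIndex:
    """Toy short-term (RAM) plus long-term (stand-in for disk) inverted index."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.short_term = defaultdict(set)  # term -> doc ids, recently added docs
        self.long_term = defaultdict(set)   # term -> doc ids, merged docs
        self.pending = 0                    # docs added since the last merge

    def add(self, doc_id, text):
        """Index a document into the short-term tier; merge when it fills up."""
        for term in text.lower().split():
            self.short_term[term].add(doc_id)
        self.pending += 1
        if self.pending >= self.flush_threshold:
            self.merge()

    def merge(self):
        """Fold the short-term postings into the long-term tier and reset."""
        for term, ids in self.short_term.items():
            self.long_term[term] |= ids
        self.short_term.clear()
        self.pending = 0

    def search(self, term):
        """Search both tiers and combine the results."""
        term = term.lower()
        return sorted(self.short_term.get(term, set()) | self.long_term.get(term, set()))


idx = TwoTierIndex(flush_threshold=2)
idx.add(1, "realtime search")
first = idx.search("search")   # doc 1 is visible before any merge
idx.add(2, "lucene search")    # second doc triggers the merge
idx.add(3, "search again")     # lands in the fresh short-term tier
combined = idx.search("search")
```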

Re: Realtime Search

2008-12-26 Thread J. Delgado
-based search services was done using a federator component, very much like shard-based searches are done today (I believe). -- Joaquin. On Fri, Dec 26, 2008 at 10:48 AM, J. Delgado joaquin.delg...@gmail.com wrote: The addition of docs into tiny segments using the current data structures seems

Re: Grouping Lucene search results and calculating frequency by category

2009-04-11 Thread J. Delgado
Have you looked at SOLR? http://lucene.apache.org/solr/ It pretty much has what you are looking for. -- Joaquin On Fri, Apr 10, 2009 at 9:39 PM, mitu2009 musicfrea...@gmail.com wrote: Am working on a store search API using Lucene. I need to show store search results for each City,State

Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
the link? On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado joaquin.delg...@gmail.com wrote: Please find attached the paper on Efficient Query Evaluation using a Two-Level Retrieval Process. I believe that such an approach may improve the way Lucene/Solr evaluates queries today. Cheers

Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
will be skipped and thus we will need to compute full scores for fewer documents. I think it's worth a try... -- Joaquin On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki a...@getopt.org wrote: J. Delgado wrote: Here is the link to the paper. http://cis.poly.edu/westlab/papers/cntdstrb/p426
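
A simplified sketch of the skipping idea in this message: a cheap upper bound is checked first, and the full score is computed only for documents that could still enter the top k. This is not the algorithm from the paper (which works on posting lists with per-term bounds); the function name and the `upper_bound` / `full_score` callables are hypothetical, and the sketch assumes `k >= 1` and that `upper_bound(doc)` is never smaller than `full_score(doc)`.

```python
import heapq


def top_k_two_level(candidates, upper_bound, full_score, k):
    """Return (top-k list of (score, doc), best first; number of full evaluations)."""
    heap = []       # min-heap of (score, doc) holding the current top k
    evaluated = 0   # how many documents needed the expensive full score
    for doc in candidates:
        # Level 1: skip the document if even its upper bound cannot beat
        # the weakest score currently in the top k.
        if len(heap) == k and upper_bound(doc) <= heap[0][0]:
            continue
        # Level 2: compute the full score only for surviving candidates.
        evaluated += 1
        score = full_score(doc)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), evaluated


scores = {"d1": 5.0, "d2": 1.0, "d3": 4.0, "d4": 0.5}
bounds = {"d1": 6.0, "d2": 1.5, "d3": 4.5, "d4": 0.8}
top, evaluated = top_k_two_level(["d1", "d2", "d3", "d4"], bounds.get, scores.get, k=2)
```

In this toy run `d4` is skipped because its bound is already below the second-best score, so only three of the four documents are fully scored.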

Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
the proper weight assignment model Of course, the devil-is-in-the-details :-( -- Joaquin On Mon, Nov 16, 2009 at 20:26, J. Delgado joaquin.delg...@gmail.com wrote: As I understood it setMinimumNumberShouldMatch(int min) Is used to specify a minimum number of the optional BooleanClauses which must
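
A toy model of the matching rule discussed in this message, where a minimum number of the optional clauses must be satisfied. It ignores scoring entirely, treats a document as a set of terms, and the function name and parameters are hypothetical. The rule that a query with no required clauses needs at least one optional clause to match is this sketch's assumption about boolean query semantics.

```python
def matches(doc_terms, must=(), should=(), must_not=(), min_should_match=0):
    """Decide whether a document satisfies a toy boolean query."""
    terms = set(doc_terms)
    # Every required clause has to be present.
    if any(t not in terms for t in must):
        return False
    # No prohibited clause may be present.
    if any(t in terms for t in must_not):
        return False
    satisfied = sum(1 for t in should if t in terms)
    required = min_should_match
    # Assumption: with no required clauses, at least one optional clause must hit.
    if not must and should:
        required = max(required, 1)
    return satisfied >= required
```

For example, with optional clauses `a`, `b`, `c` and `min_should_match=2`, a document containing `a` and `b` matches while one containing only `a` does not.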

Job Opportunity (Sunnyvale, CA)

2007-01-09 Thread J. Delgado
This is a full-time position with an exciting New Venture (now in stealth mode) and will be based out of Sunnyvale, CA. We are looking for a Java Developer with search, social networks and/or payment processing related experience. Required Skills: 2+ yrs of industrial experience on Search

Re: Reviving Nutch 0.7

2007-01-23 Thread J. Delgado
Nutch Newbie wrote: Again not really proposing a new project but more easy to use re-usable code. IMHO, Nutch will be an umbrella project for ala-Google and Solr will be for ala-Enterprise where Lucene is the index lib, Hadoop is the Mapred/DFS lib ..what is missing is Common Crawler lib, Common

Re: Indexing the Interesting Part Only...

2007-03-09 Thread J. Delgado
You have to build a special HTML Junk parser. 2007/3/9, d e [EMAIL PROTECTED]: If I'm indexing a news article, I want to avoid getting the junk (other than the title, author and article) into the index. I want to avoid getting the advertisements, etc. How do I do that sort of thing? What parts

Re: Progressive Query Relaxation

2007-04-09 Thread J. Delgado
- Search - Share - Original Message From: J. Delgado [EMAIL PROTECTED] To: java-dev@lucene.apache.org; solr-dev@lucene.apache.org Sent: Monday, April 9, 2007 3:46:40 AM Subject: Progressive Query Relaxation Has anyone within the Lucene or Solr community attempted to code a progressive query

Re: Progressive Query Relaxation

2007-04-10 Thread J. Delgado
the best matching field. That is much more powerful and gives much better results. wunder On 4/9/07 12:46 AM, J. Delgado [EMAIL PROTECTED] wrote: Has anyone within the Lucene or Solr community attempted to code a progressive query relaxation technique similar to the one described here for Oracle

Re: Progressive Query Relaxation

2007-04-10 Thread J. Delgado
See my comments below. 2007/4/10, Walter Underwood [EMAIL PROTECTED]: On 4/10/07 10:06 AM, J. Delgado [EMAIL PROTECTED] wrote: Progressive relaxation, at least as Oracle has defined it, is a flexible, developer defined series of queries that are efficiently executed in progression
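
A minimal sketch of progressive relaxation as characterised in this reply: a developer-defined series of queries, ordered from strictest to most relaxed, executed in turn until enough results have been collected. The function name, the `run_query` callable, and the toy corpus are hypothetical; this is not Oracle Text's implementation.

```python
def progressive_search(queries, run_query, wanted):
    """Run queries from strictest to most relaxed until `wanted` hits are found."""
    results, seen = [], set()
    for query in queries:
        for doc in run_query(query):
            if doc not in seen:        # keep the rank from the strictest match
                seen.add(doc)
                results.append(doc)
                if len(results) >= wanted:
                    return results     # stop early: later queries never run
    return results


corpus = {
    1: "lucene query relaxation",
    2: "query relaxation in oracle text",
    3: "lucene scoring",
    4: "progressive query relaxation",
}


def phrase(p):
    return lambda text: p in text


def all_terms(*terms):
    return lambda text: all(t in text.split() for t in terms)


def any_term(*terms):
    return lambda text: any(t in text.split() for t in terms)


def run_query(predicate):
    """Toy executor: scan the corpus in id order and apply the predicate."""
    return [doc_id for doc_id, text in sorted(corpus.items()) if predicate(text)]


steps = [
    phrase("progressive query relaxation"),  # strictest: exact phrase
    all_terms("query", "relaxation"),        # relaxed: all terms, any position
    any_term("lucene", "query"),             # most relaxed: any term
]
top3 = progressive_search(steps, run_query, wanted=3)
everything = progressive_search(steps, run_query, wanted=10)
```

Documents found by a stricter step stay ahead of those found only by a more relaxed one, and the relaxed steps are skipped altogether once enough hits are in hand.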

Re: Progressive Query Relaxation

2007-05-11 Thread J. Delgado
Hoss, I never got to acknowledge your analysis. Well done. I do want to hear your opinion about the following posting I sent to the list, which looks at the analogy between search engines and relational/XML databases as they evolve into a single type of retrieval system:

Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-15 Thread J. Delgado
Please find attached the paper on Efficient Query Evaluation using a Two-Level Retrieval Process. I believe that such an approach may improve the way Lucene/Solr evaluates queries today. Cheers, -- Joaquin

Re: Efficient Query Evaluation using a Two-Level Retrieval Process

2009-11-16 Thread J. Delgado
PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Hey Joaquin, The mailing list strips off attachments. Can you please upload it somewhere and give us the link? On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado joaquin.delg...@gmail.com wrote: Please find attached the paper on Efficient
