[jira] Updated: (SOLR-828) A RequestProcessor to support updates

2008-10-28 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-828:


Description: 
This is the same as SOLR-139. A new issue has been opened so that the UpdateProcessor 
approach is highlighted and we can focus on that solution. 


The new {{UpdateProcessor}} (called {{UpdateableIndexProcessor}}) must be 
inserted before {{RunUpdateProcessor}} (see the sketch after this list). 

* The {{UpdateProcessor}} must add an update method. 
* The {{AddUpdateCommand}} gets a new boolean field, append. If append=true, 
multivalued fields are appended; otherwise the old values are removed and the 
new ones are added.
* The schema must have a {{uniqueKey}}.
* {{UpdateableIndexProcessor}} registers {{postCommit/postOptimize}} listeners.

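Below is a minimal sketch of how such a processor would slot into the chain, 
using the Solr 1.3-era {{UpdateRequestProcessor}} API. The backup-index writing 
is elided and the class bodies are illustrative, not the actual patch.

{code}
import java.io.IOException;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class UpdateableIndexProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    // "next" is the rest of the chain; configure this factory ahead of
    // RunUpdateProcessorFactory in solrconfig.xml so it sees documents first
    return new UpdateableIndexProcessor(next);
  }

  static class UpdateableIndexProcessor extends UpdateRequestProcessor {
    UpdateableIndexProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      // write cmd.solrDoc to temp.backup.index here (elided)
      super.processAdd(cmd); // hand the command on to RunUpdateProcessor
    }
  }
}
{code}
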
h1. Implementation
{{UpdateableIndexProcessor}} maintains two separate Lucene indexes for the backup:
 * *temp.backup.index*: stores (not indexed) all the fields in the document, 
except the uniqueKey, which is stored and indexed
 * *backup.index*: stores (not indexed) the fields which are not stored in the 
actual schema and the fields which are targets of copyField, again except the 
uniqueKey, which is stored and indexed
h1. Implementation of the various methods

h2. {{processAdd()}}
{{UpdateableIndexProcessor}} writes the document to *temp.backup.index*, then 
calls the next {{UpdateProcessor}}.

h2. {{processDelete()}}
{{UpdateableIndexProcessor}} gets a Searcher from the core, finds the documents 
matching the query, and deletes them from *backup.index*. If it is a 
delete-by-id, it deletes the document with that id from *temp.backup.index*. 
Then it calls the next {{UpdateProcessor}}. A sketch is below.
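
A hedged sketch of that deletion step against the side indexes, using 
Lucene-2.x-era APIs; the field name "id" and the writer handles are assumptions:

{code}
// inside processDelete(DeleteUpdateCommand cmd), before calling the chain
if (cmd.id != null) {
  // delete-by-id: drop the copy held in temp.backup.index
  tempBackupWriter.deleteDocuments(new Term("id", cmd.id));
} else {
  // delete-by-query: parse the query against the schema and delete the
  // matching documents from backup.index
  Query q = QueryParsing.parseQuery(cmd.query, req.getSchema());
  backupWriter.deleteDocuments(q);
}
{code}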

h2. {{processCommit()}}
Calls the next {{UpdateProcessor}}.

h2. On {{postCommit/postOptimize}}
{{UpdateableIndexProcessor}} commits *temp.backup.index*, then reads the 
documents from *temp.backup.index* one by one. If a document is present in the 
main index, it is copied to *backup.index*; otherwise it is thrown away, 
because a delete-by-query must have deleted it. Finally it commits 
*backup.index*. *temp.backup.index* is destroyed after that. A sketch follows.
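
A minimal sketch of that migration, assuming Lucene 2.x APIs of the era; the 
index paths and the "id" field are illustrative, and error handling is elided:

{code}
import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;

class BackupMigration {
  static void migrate(IndexSearcher mainSearcher) throws IOException {
    IndexReader tempReader = IndexReader.open("temp.backup.index");
    IndexWriter backupWriter = new IndexWriter("backup.index",
        new WhitespaceAnalyzer(), false, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < tempReader.maxDoc(); i++) {
      if (tempReader.isDeleted(i)) continue;
      Document doc = tempReader.document(i);
      // keep the backup copy only if the document still exists in the main
      // index; otherwise a delete-by-query removed it and the copy is dropped
      if (mainSearcher.docFreq(new Term("id", doc.get("id"))) > 0) {
        backupWriter.addDocument(doc);
      }
    }
    backupWriter.commit();
    backupWriter.close();
    tempReader.close();
    // temp.backup.index is destroyed (its directory deleted) afterwards
  }
}
{code}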

h2. {{processUpdate()}}
{{UpdateableIndexProcessor}} commits *temp.backup.index* and checks for the 
document first in *temp.backup.index*. If it is present, the document is read. 
If it is not present, it checks *backup.index*; if it is present there, it gets 
the searcher from the main index, reads all the missing fields from there, and 
the backup document is prepared.

The single-valued fields are taken from the incoming document (if present); the 
others are filled from the backup document. If append=true, all the multivalued 
values from the backup document are added to the incoming document; otherwise 
the values from the backup document are not used when they are also present in 
the incoming document. A sketch of this merge step is below.
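
One way that merge could look, as a sketch; {{SolrInputDocument}} and 
{{IndexSchema}} are real Solr types, but the method itself is hypothetical:

{code}
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.schema.IndexSchema;

class DocMerger {
  static SolrInputDocument merge(SolrInputDocument incoming,
      SolrInputDocument backup, IndexSchema schema, boolean append) {
    for (String name : backup.getFieldNames()) {
      if (!schema.getField(name).multiValued()) {
        // single-valued: the incoming document wins; fill gaps from backup
        if (incoming.getFieldValue(name) == null) {
          incoming.setField(name, backup.getFieldValue(name));
        }
      } else if (append || incoming.getFieldValue(name) == null) {
        // multivalued: append backup values (always when append=true,
        // otherwise only when the incoming document lacks the field)
        for (Object v : backup.getFieldValues(name)) {
          incoming.addField(name, v);
        }
      }
    }
    return incoming;
  }
}
{code}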

h2. A new {{BackupIndexRequestHandler}} registered automatically at {{/backup}}
This exposes the data present in the backup indexes. The user must be able to 
get any document by id by invoking {{/backup?id=}} (multiple id values 
can be sent, e.g. {{id=1&id=2&id=4}}). This lets the user query the backup 
index and construct the new doc if they wish to do so. The 
{{BackupIndexRequestHandler}} does a commit on *temp.backup.index*, searches 
*temp.backup.index* first for the id, and if the document is absent checks 
*backup.index* and returns the document. A hypothetical client call is below.
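
For illustration, this is how the handler might be hit from SolrJ; the {{qt}} 
routing and the {{/backup}} handler itself are the proposal above, not an 
existing API:

{code}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BackupLookup {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery();
    q.set("qt", "/backup");         // route to the proposed handler
    q.set("id", "1", "2", "4");     // fetch backup copies of three docs
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getResponse());
  }
}
{code}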





[jira] Created: (SOLR-828) A RequestProcessor to support updates

2008-10-28 Thread Noble Paul (JIRA)
A RequestProcessor to support updates
-

 Key: SOLR-828
 URL: https://issues.apache.org/jira/browse/SOLR-828
 Project: Solr
  Issue Type: Improvement
Reporter: Noble Paul
 Fix For: 1.4


This is the same as SOLR-139. A new issue has been opened so that the UpdateProcessor 
approach is highlighted and we can focus on that solution. 


The new {{UpdateProcessor}} (called {{UpdateableIndexProcessor}}) must be 
inserted before {{RunUpdateProcessor}}. 

* The {{UpdateProcessor}} must add an update method. 
* The {{AddUpdateCommand}} gets a new boolean field, append. If append=true, 
multivalued fields are appended; otherwise the old values are removed and the 
new ones are added.
* The schema must have a {{uniqueKey}}.
* {{UpdateableIndexProcessor}} registers {{postCommit/postOptimize}} listeners.

h1. Implementation
{{UpdateableIndexProcessor}} maintains two separate Lucene indexes for the backup:
 * *temp.backup.index*: stores (not indexed) all the fields in the document, 
except the uniqueKey, which is stored and indexed
 * *backup.index*: stores (not indexed) the fields which are not stored in the 
actual schema and the fields which are targets of copyField, again except the 
uniqueKey, which is stored and indexed
h1. Implementation of the various methods

h2. {{processAdd()}}
{{UpdateableIndexProcessor}} writes the document to *temp.backup.index*, then 
calls the next {{UpdateProcessor}}.

h2. {{processDelete()}}
{{UpdateableIndexProcessor}} gets a Searcher from the core, finds the documents 
matching the query, and deletes them from *backup.index*. If it is a 
delete-by-id, it deletes the document with that id from *temp.backup.index*. 
Then it calls the next {{UpdateProcessor}}.

h2. {{processCommit()}}
{{UpdateableIndexProcessor}} calls the next {{UpdateProcessor}}.

h2. On {{postCommit/postOptimize}}
{{UpdateableIndexProcessor}} commits *temp.backup.index*, then reads the 
documents from *temp.backup.index* one by one. If a document is present in the 
main index, it is copied to *backup.index*. Finally it commits *backup.index*. 
*temp.backup.index* is destroyed after that.

h2. {{processUpdate()}}
{{UpdateableIndexProcessor}} commits *temp.backup.index* and checks for the 
document first in *temp.backup.index*. If it is present, the document is read. 
If it is not present, it checks *backup.index*; if it is present there, it gets 
the searcher from the main index, reads all the missing fields from there, and 
the backup document is prepared.

The single-valued fields are taken from the incoming document (if present); the 
others are filled from the backup document. If append=true, all the multivalued 
values from the backup document are added to the incoming document; otherwise 
the values from the backup document are not used when they are also present in 
the incoming document.

h2. A new {{BackupIndexRequestHandler}} registered automatically at {{/backup}}
This exposes the data present in the backup indexes. The user must be able to 
get any document by id by invoking {{/backup?id=}} (multiple id values 
can be sent, e.g. {{id=1&id=2&id=4}}). This lets the user query the backup 
index and construct the new doc if they wish to do so. The 
{{BackupIndexRequestHandler}} does a commit on *temp.backup.index*, searches 
*temp.backup.index* first for the id, and if the document is absent checks 
*backup.index* and returns the document.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Idea: Add Documentum/Sharepoint/FileNet etc connectivity by emulating a Google Search Appliance's feed interface

2008-10-28 Thread Grant Ingersoll


On Oct 28, 2008, at 5:21 PM, markharw00d wrote:



It may be a good summer of code project if someone wanted to  
implement reading and writing data via the GSA feed interface...


Do you mean the Google Summer Of Code initiative? I can't imagine  
Google would be keen to support a project whose goal was to provide  
an open-source, drop-in replacement for one of their commercial  
products :)



Google doesn't vet GSOC applications...  ;-)  Heck, it's even Apache  
licensed...







I can't believe this idea received no replies!


I guess it's just not an itch anyone here feels a particular need to  
scratch right now - and that is always what is needed to get the ball  
rolling.


Maybe another avenue is to approach the commercial providers who  
have already contributed GSA connectors and ask them to consider  
writing a Solr-based consumer endpoint based on the GSA connector  
protocol. They may be commercially incentivised to do this and can  
then claim their products can hook up to either GSA or open-source  
Solr using the same interface.


I think it may be of interest at some point.  I don't have the cycles  
at the moment, but it's definitely something I think makes sense to  
have in Solr.


Re: Idea: Add Documentum/Sharepoint/FileNet etc connectivity by emulating a Google Search Appliance's feed interface

2008-10-28 Thread Lukáš Vlček
Hi,
I was thinking about using the GSA Connector infrastructure with Nutch or Solr
some time ago, because we were considering alternatives to MS SharePoint search
functionality, including GSA.

IMHO this is something that makes sense, and I think that open source tools
can beat the proprietary alternatives in many ways, but I can also see some
issues:

- first and most difficult: try to talk to your management about
replacing MS SharePoint or GSA with open source. This conversation can be
very difficult.

- GSA connectors are buggy... try looking through the Google Groups forums
(maybe this has gotten better by now).

- I found it is very hard to rely on open source when it comes to parsing the
Microsoft documents on your network (Word, Excel, PowerPoint). It can handle
98% or 99% of all your documents, but not 100% (correct me if I am wrong,
please). It should be possible to include some MS document server in the loop,
but this makes things more complicated and requires non-open-source
components.

Regards,
Lukas

On Tue, Oct 28, 2008 at 10:21 PM, markharw00d <[EMAIL PROTECTED]> wrote:

>
>  It may be a good summer of code project if someone wanted to implement
>> reading and writing data via the GSA feed interface...
>>
>
> Do you mean the Google Summer Of Code initiative? I can't imagine Google
> would be keen to support a project whose goal was to provide an open-source,
> drop-in replacement for one of their commercial products :)
>
>
>  I can't believe this idea received no replies!
>>>
>>
> I guess it's just not an itch anyone here feels a particular need to
> scratch right now  - and that is always what is needed to get the ball
> rolling.
>
> Maybe another avenue is to approach the commercial providers who have
> already contributed GSA connectors and ask them to consider writing a
> Solr-based consumer endpoint based on the GSA connector protocol. They may
> be commercially incentivised to do this and can then claim their products
> can hook up to either GSA or open-source Solr using the same interface.
>
> Cheers
> Mark
>
>
>
>


-- 
http://blog.lukas-vlcek.com/


Re: Idea: Add Documentum/Sharepoint/FileNet etc connectivity by emulating a Google Search Appliance's feed interface

2008-10-28 Thread markharw00d


It may be a good summer of code project if someone wanted to implement 
reading and writing data via the GSA feed interface...


Do you mean the Google Summer Of Code initiative? I can't imagine Google 
would be keen to support a project whose goal was to provide an 
open-source, drop-in replacement for one of their commercial products :)



I can't believe this idea received no replies! 


I guess it's just not an itch anyone here feels a particular need to 
scratch right now  - and that is always what is needed to get the ball 
rolling.


Maybe another avenue is to approach the commercial providers who have 
already contributed GSA connectors and ask them to consider writing a 
Solr-based consumer endpoint based on the GSA connector protocol. They 
may be commercially incentivised to do this and can then claim their 
products can hook up to either GSA or open-source Solr using the same 
interface.


Cheers
Mark





[jira] Commented: (SOLR-805) DisMax queries are not being cached in QueryResultCache

2008-10-28 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643335#action_12643335
 ] 

Shalin Shekhar Mangar commented on SOLR-805:


Apart from this one, I don't think there are any major fixes for the 1.3 branch. 
With Java-based replication being in the trunk now, we can also plan for an 
early 1.4 release -- replication alone is a huge user-facing feature. Of 
course, we need some time for the new features to stabilize; therefore, 1.4 
can't be done as quickly as a 1.3.1 release can.

> DisMax queries are not being cached in QueryResultCache
> ---
>
> Key: SOLR-805
> URL: https://issues.apache.org/jira/browse/SOLR-805
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.3
> Environment: Using Sun JDK 1.5 and Solr 1.3.0 release on Windows XP
>Reporter: Todd Feak
>Priority: Critical
> Fix For: 1.4
>
>
> I have a DisMax Search Handler set up in my solrconfig.xml to weight results 
> based on which field a hit was found in. Results seem to be coming back fine, 
> but the exact same query issued twice will *not* result in a cache hit.
> I have run far enough in the debugger to determine that the hashCode for the 
> BooleanQuery object is returning a different value each time for the same 
> query. This leads me to believe there is some random factor involved in its 
> calculation, such as a default Object hashCode() implementation somewhere in 
> the chain. Non-DisMax queries seem to be caching just fine.
> Where I see this behavior exhibited is on line 47 of the QueryResultKey 
> constructor. I have not dug in far enough to determine exactly where the 
> hashCode is being incorrectly calculated. I will try and dig in further 
> tomorrow, but wanted to get some attention on the bug. 
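
A toy illustration of the suspected cause (hypothetical class, not Solr code): 
equal queries must also have equal hashCodes for the QueryResultCache key to 
match, and a clause that inherits Object.hashCode() breaks that.

{code}
// A clause like this is equal to its twin but hashes differently on every
// construction, so a cache keyed on the enclosing query never gets a hit.
class BrokenClause {
  final String term;
  BrokenClause(String term) { this.term = term; }

  @Override
  public boolean equals(Object o) {
    return o instanceof BrokenClause && ((BrokenClause) o).term.equals(term);
  }
  // hashCode() is NOT overridden: it falls back to Object's identity hash,
  // violating the equals/hashCode contract that hash-based caches rely on.
}
{code}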

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-667) Alternate LRUCache implementation

2008-10-28 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-667.


Resolution: Fixed

Committed revision 708656.

Thanks Fuad, Noble and Yonik!

> Alternate LRUCache implementation
> -
>
> Key: SOLR-667
> URL: https://issues.apache.org/jira/browse/SOLR-667
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Noble Paul
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: ConcurrentLRUCache.java, ConcurrentLRUCache.java, 
> ConcurrentLRUCache.java, SOLR-667.patch, SOLR-667.patch, SOLR-667.patch, 
> SOLR-667.patch, SOLR-667.patch, SOLR-667.patch
>
>
> The only available SolrCache, i.e. LRUCache, is based on _LinkedHashMap_, 
> which has _get()_ synchronized as well. This can cause severe bottlenecks for 
> faceted search. Any alternate implementation which can be faster/better must 
> be considered. 
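
For reference, a hedged sketch of the idea behind the attached 
ConcurrentLRUCache (not the committed code): a ConcurrentHashMap gives 
lock-free get(), recency is a counter stamp, and eviction scans for the 
oldest stamps when the cache overflows.

{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class LRUSketch<K, V> {
  private final ConcurrentHashMap<K, Entry<K, V>> map =
      new ConcurrentHashMap<K, Entry<K, V>>();
  private final AtomicLong clock = new AtomicLong();
  private final int upperBound;

  LRUSketch(int upperBound) { this.upperBound = upperBound; }

  V get(K key) {
    Entry<K, V> e = map.get(key);
    if (e == null) return null;
    e.lastUsed = clock.incrementAndGet(); // stamp recency without locking
    return e.value;
  }

  void put(K key, V value) {
    map.put(key, new Entry<K, V>(key, value, clock.incrementAndGet()));
    if (map.size() > upperBound) evictOldest();
  }

  private void evictOldest() {
    // naive single-entry eviction; the real patch amortizes this scan
    Entry<K, V> oldest = null;
    for (Entry<K, V> e : map.values()) {
      if (oldest == null || e.lastUsed < oldest.lastUsed) oldest = e;
    }
    if (oldest != null) map.remove(oldest.key);
  }

  private static final class Entry<K, V> {
    final K key;
    final V value;
    volatile long lastUsed;
    Entry(K key, V value, long stamp) {
      this.key = key; this.value = value; this.lastUsed = stamp;
    }
  }
}
{code}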

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-795) spellcheck: buildOnOptimize

2008-10-28 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar resolved SOLR-795.


Resolution: Fixed

Committed revision 708653.

Thanks Jason!

> spellcheck: buildOnOptimize
> ---
>
> Key: SOLR-795
> URL: https://issues.apache.org/jira/browse/SOLR-795
> Project: Solr
>  Issue Type: New Feature
>  Components: spellchecker
>Affects Versions: 1.3
>Reporter: Jason Rennie
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: SOLR-795.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> I see that there's an option to automatically rebuild the spelling index on a 
> commit.  That's a nice feature that we'll consider using, but we run commits 
> every few thousand document updates, which would yield ~100 spelling index 
> rebuilds a day.  OTOH, we run an optimize about once a day, which seems like a 
> more appropriate schedule for rebuilding the spelling index.
> Is there or could there be an option to rebuild the spelling index on 
> optimize?
> Grant:
> Seems reasonable; could almost do it via the postOptimize callback already 
> in the config, except the SpellCheckComponent's EventListener is private 
> static and has an empty postCommit implementation (which is what is called 
> after optimization, since it is just like a commit in many ways).
> Thus, a patch would be needed.
> Shalin:
> postCommit/postOptimize callbacks happen after commit/optimize but before a
> new searcher is opened. Therefore, it is not possible to re-build the spellcheck
> index on those events without opening an IndexReader directly on the solr
> index. That is why the event listener in SpellCheckComponent uses the
> newSearcher listener to build on commits.
> I don't think there is anything in the API currently to do what Jason wants.
> Hoss:
> FWIW: I believe it has to work that way because postCommit events might
> modify the index. (but i'm just guessing)
> couldn't the Listener's newSearcher() method just do something like
> this...
> if (rebuildOnlyAfterOptimize &&
>     !(newSearcher.getReader().isOptimized() &&
>       !oldSearcher.getReader().isOptimized())) {
>   return;
> } else {
>   // current impl
> }
> ...assuming a new "rebuildOnlyAfterOptimize" option was added?
> Grant:
> That seems reasonable.
> Another thing to think about: maybe it is useful to provide some event 
> metadata to the events, containing information about what triggered them.  
> Something like a SolrEvent class such that postCommit looks like
> postCommit(SolrEvent evt)
> and
> public void newSearcher(SolrEvent evt, SolrIndexSearcher newSearcher, 
> SolrIndexSearcher currentSearcher);
> Of course, since SolrEventListener is an interface...
> Shalin:
> Yup, that will work.
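
A hedged sketch of what Hoss's suggestion could look like inside a 
SolrEventListener (Solr 1.3-era API); the buildOnOptimize flag and the 
rebuild call are placeholders for the eventual patch:

{code}
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;

class SpellCheckRebuildListener implements SolrEventListener {
  private boolean buildOnOptimize = true;

  public void init(NamedList args) {
    // read buildOnOptimize from the listener config here (elided)
  }

  public void postCommit() {
    // no searcher is open yet at this point, so nothing can be rebuilt here
  }

  public void newSearcher(SolrIndexSearcher newSearcher,
                          SolrIndexSearcher currentSearcher) {
    if (buildOnOptimize && !newSearcher.getReader().isOptimized()) {
      return; // plain commit: skip the rebuild
    }
    // rebuild the spellcheck index from newSearcher's reader (elided)
  }
}
{code}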

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-651) A SearchComponent for fetching TF-IDF values

2008-10-28 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643291#action_12643291
 ] 

Grant Ingersoll commented on SOLR-651:
--

{quote}
# Adding the uniqueKeyFieldName seems out of place; it's just one element of 
the schema and it doesn't seem like it belongs in this component.
# How about using the "id" as the key, as is done in other places like 
highlighting.
{quote}
That's fine.  I think my thinking was that by using a "constant" for the name, 
one could ask explicitly for that property in the NamedList, i.e. 
namedList.get("uniqueKey");

{quote}
It doesn't seem like we should link the ability to return term vectors with 
term vectors being stored. Like highlighting, they should be used when 
available for speed, but stored fields should also be possible. It's fine for 
the impl of that to wait, but perhaps the interface should support that via a 
tv.fl parameter. update: just looked at the code again, and I see there is a 
tv.fl param, so I guess the only discussion point is whether the default is right 
(all fields with term vectors stored).
{quote}

That's reasonable.  We can open a separate issue for it if anyone wants it.

{quote}
# "idf" actually isn't the idf; it's the doc freq that is being returned. The 
label should probably be changed to "df".
# instead of "freq", how about just using the shorter and well-known "tf"?
# the docs say that tf_idf "Calculates tf*idf for each term.", but the code is 
actually returning "freq"/"idf" (but the idf is actually a df, so it is a 
straight tf*idf). But this doesn't seem that useful, because the user could 
trivially do tf/df themselves. What would seem useful is to get the actual 
scoring tf-idf (via the Similarity). For better language mappings, I think we 
should avoid dashes in parameter names too; perhaps tv.tfidf or tv.tf_idf?
{quote}
All fine as well.  I just added the tf*idf computation based on Vaijanath's 
comments.  I'll update these and the wiki.


> A SearchComponent for fetching TF-IDF values
> 
>
> Key: SOLR-651
> URL: https://issues.apache.org/jira/browse/SOLR-651
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.3
>Reporter: Noble Paul
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-651-fixes.patch, SOLR-651.patch, SOLR-651.patch, 
> SOLR-651.patch, SOLR-651.patch, SOLR-651.patch, SOLR-651.patch
>
>
> A SearchComponent that can return TF-IDF vector for any given document in the 
> SOLR index
> Query : A Document Number / a query identifying a Document
> Response : A Map of term vs. TF-IDF value of every term in the Selected
> Document
> Why ?
> Most of the Machine Learning Algorithms work on the TFIDF representation of
> documents, hence adding a Request Handler providing the TFIDF representation
> will pave the way for incorporating Learning Paradigms into the SOLR framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-651) A SearchComponent for fetching TF-IDF values

2008-10-28 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643289#action_12643289
 ] 

Grant Ingersoll commented on SOLR-651:
--

{quote}
Is there a reason that this component asks for the latest searcher from the 
core instead of getting the one bound to the SolrQueryRequest? Assuming it's 
just a bug... patch attached
{quote}

Nope.  Go ahead and commit.

> A SearchComponent for fetching TF-IDF values
> 
>
> Key: SOLR-651
> URL: https://issues.apache.org/jira/browse/SOLR-651
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.3
>Reporter: Noble Paul
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-651-fixes.patch, SOLR-651.patch, SOLR-651.patch, 
> SOLR-651.patch, SOLR-651.patch, SOLR-651.patch, SOLR-651.patch
>
>
> A SearchComponent that can return TF-IDF vector for any given document in the 
> SOLR index
> Query : A Document Number / a query identifying a Document
> Response : A Map of term vs. TF-IDF value of every term in the Selected
> Document
> Why ?
> Most of the Machine Learning Algorithms work on the TFIDF representation of
> documents, hence adding a Request Handler providing the TFIDF representation
> will pave the way for incorporating Learning Paradigms into the SOLR framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Issue Comment Edited: (SOLR-651) A SearchComponent for fetching TF-IDF values

2008-10-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643270#action_12643270
 ] 

[EMAIL PROTECTED] edited comment on SOLR-651 at 10/28/08 10:23 AM:
--

Some random thoughts on this patch:
 - Adding the uniqueKeyFieldName seems out of place; it's just one element 
of the schema and it doesn't seem like it belongs in this component.
 - How about using the "id" as the key, as is done in other places like 
highlighting.
  So instead of 
{code}
3007WFP
...
{code}
it could look like
{code}
...
{code}
- It doesn't seem like we should link the ability to return term vectors with 
term vectors being stored.  Like highlighting, they should be used when 
available for speed, but stored fields should also be possible.  It's fine for 
the impl of that to wait, but perhaps the interface should support that via a 
tv.fl parameter.  update: just looked at the code again, and I see there is a 
tv.fl param, so I guess the only discussion point is whether the default is right 
(all fields with term vectors stored).
- "idf" actually isn't the idf; it's the doc freq that is being returned.  The 
label should probably be changed to "df".
- instead of "freq", how about just using the shorter and well-known "tf"?
- the docs say that tf_idf "Calculates tf*idf for each term.", but the code is 
actually returning "freq"/"idf" (but the idf is actually a df, so it is a 
straight tf*idf).  *But* this doesn't seem that useful, because the user could 
trivially do tf/df themselves.  What would seem useful is to get the actual 
scoring tf-idf (via the Similarity).  For better language mappings, I think we 
should avoid dashes in parameter names too; perhaps tv.tfidf or tv.tf_idf?


> A SearchComponent for fetching TF-IDF values
> 
>
> Key: SOLR-651
> URL: https://issues.apache.org/jira/browse/SOLR-651
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.3
>Reporter: Noble Paul
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-651-fixes.patch, SOLR-651.patch, SOLR-651.patch, 
> SOLR-651.patch, SOLR-651.patch, SOLR-651.patch, SOLR-651.patch
>
>
> A SearchComponent that can return TF-IDF vector for any given document in the 
> SOLR index
> Query : A Document Number / a query identifying a Document
> Response : A Map of term vs. TF-IDF value of every term in the Selected
> Document
> Why ?
> Most of the Machine Learning Algorithms work on the TFIDF representation of
> documents, hence adding a Request Handler providing the TFIDF representation
> will pave the way for incorporating Learning Paradigms into the SOLR framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-651) A SearchComponent for fetching TF-IDF values

2008-10-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643270#action_12643270
 ] 

Yonik Seeley commented on SOLR-651:
---

Some random thoughts on this patch:
 - Adding the uniqueKeyFieldName seems out of place; it's just one element 
of the schema and it doesn't seem like it belongs in this component.
 - How about using the "id" as the key, as is done in other places like 
highlighting.
  So instead of 
{code}
3007WFP
...
{code}
it could look like
{code}
...
{code}
- It doesn't seem like we should link the ability to return term vectors with 
term vectors being stored.  Like highlighting, they should be used when 
available for speed, but stored fields should also be possible.  It's fine for 
the impl of that to wait, but perhaps the interface should support that via a 
tv.fl parameter.
- "idf" actually isn't the idf; it's the doc freq that is being returned.  The 
label should probably be changed to "df".
- instead of "freq", how about just using the shorter and well-known "tf"?
- the docs say that tf_idf "Calculates tf*idf for each term.", but the code is 
actually returning "freq"/"idf" (but the idf is actually a df, so it is a 
straight tf*idf).  *But* this doesn't seem that useful, because the user could 
trivially do tf/df themselves.  What would seem useful is to get the actual 
scoring tf-idf (via the Similarity).  For better language mappings, I think we 
should avoid dashes in parameter names too; perhaps tv.tfidf or tv.tf_idf?
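
For what it's worth, a hedged sketch of the "scoring tf-idf via the Similarity" 
idea using the Lucene 2.x API; the field/term and the raw frequency are 
illustrative:

{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

class ScoringTfIdf {
  static float tfIdf(IndexReader reader, String field, String text,
                     int freqInDoc) throws Exception {
    Similarity sim = new DefaultSimilarity();
    int df = reader.docFreq(new Term(field, text)); // raw document frequency
    float tf = sim.tf(freqInDoc);                   // Similarity-scaled tf
    float idf = sim.idf(df, reader.numDocs());      // Similarity-scaled idf
    return tf * idf;                                // the scoring tf-idf
  }
}
{code}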


> A SearchComponent for fetching TF-IDF values
> 
>
> Key: SOLR-651
> URL: https://issues.apache.org/jira/browse/SOLR-651
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.3
>Reporter: Noble Paul
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-651-fixes.patch, SOLR-651.patch, SOLR-651.patch, 
> SOLR-651.patch, SOLR-651.patch, SOLR-651.patch, SOLR-651.patch
>
>
> A SearchComponent that can return TF-IDF vector for any given document in the 
> SOLR index
> Query : A Document Number / a query identifying a Document
> Response : A Map of term vs. TF-IDF value of every term in the Selected
> Document
> Why ?
> Most of the Machine Learning Algorithms work on the TFIDF representation of
> documents, hence adding a Request Handler providing the TFIDF representation
> will pave the way for incorporating Learning Paradigms into the SOLR framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (SOLR-827) CoreAdminRequest does not handle core/name differently based on request

2008-10-28 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley reassigned SOLR-827:
--

Assignee: Ryan McKinley

> CoreAdminRequest does not handle core/name differently based on request
> ---
>
> Key: SOLR-827
> URL: https://issues.apache.org/jira/browse/SOLR-827
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java
>Reporter: Sean Colombo
>Assignee: Ryan McKinley
> Fix For: 1.4
>
> Attachments: SOLR-827.patch
>
>
> This is closely related to SOLR-803.  In that issue, creating new cores 
> failed because the "core" parameter was set instead of "name".  As it turns 
> out, the CREATE action uses "name" and all other actions use "core".  This 
> means that the fix to 803 would also have broken the other actions.
> Documentation on parameters for certain actions:
> http://wiki.apache.org/solr/CoreAdmin#head-c6dd6a81d9af0c12de8c160fbfa82fe2c5411e71
> I have a patch ready.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-827) CoreAdminRequest does not handle core/name differently based on request

2008-10-28 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley resolved SOLR-827.


   Resolution: Fixed
Fix Version/s: 1.4

thanks Sean!

If you want to add a unit test, that would be great too!

> CoreAdminRequest does not handle core/name differently based on request
> ---
>
> Key: SOLR-827
> URL: https://issues.apache.org/jira/browse/SOLR-827
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java
>Reporter: Sean Colombo
>Assignee: Ryan McKinley
> Fix For: 1.4
>
> Attachments: SOLR-827.patch
>
>
> This is closely related to SOLR-803.  In that issue, creating new cores 
> failed because the "core" parameter was set instead of "name".  As it turns 
> out, the CREATE action uses "name" and all other actions use "core".  This 
> means that the fix to 803 would also have broken the other actions.
> Documentation on parameters for certain actions:
> http://wiki.apache.org/solr/CoreAdmin#head-c6dd6a81d9af0c12de8c160fbfa82fe2c5411e71
> I have a patch ready.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-827) CoreAdminRequest does not handle core/name differently based on request

2008-10-28 Thread Sean Colombo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Colombo updated SOLR-827:
--

Attachment: SOLR-827.patch

Patch file fixing the issue.

> CoreAdminRequest does not handle core/name differently based on request
> ---
>
> Key: SOLR-827
> URL: https://issues.apache.org/jira/browse/SOLR-827
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java
>Reporter: Sean Colombo
> Attachments: SOLR-827.patch
>
>
> This is closely related to SOLR-803.  In that issue, creating new cores 
> failed because the "core" parameter was set instead of "name".  As it turns 
> out, the CREATE action uses "name" and all other actions use "core".  This 
> means that the fix to 803 would also have broken the other actions.
> Documentation on parameters for certain actions:
> http://wiki.apache.org/solr/CoreAdmin#head-c6dd6a81d9af0c12de8c160fbfa82fe2c5411e71
> I have a patch ready.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-827) CoreAdminRequest does not handle core/name differently based on request

2008-10-28 Thread Sean Colombo (JIRA)
CoreAdminRequest does not handle core/name differently based on request
---

 Key: SOLR-827
 URL: https://issues.apache.org/jira/browse/SOLR-827
 Project: Solr
  Issue Type: Bug
  Components: clients - java
Reporter: Sean Colombo


This is closely related to SOLR-803.  In that issue, creating new cores failed 
because the "core" parameter was set instead of "name".  As it turns out, the 
CREATE action uses "name" and all other actions use "core".  This means that 
the fix to 803 would also have broken the other actions.

Documentation on parameters for certain actions:
http://wiki.apache.org/solr/CoreAdmin#head-c6dd6a81d9af0c12de8c160fbfa82fe2c5411e71

I have a patch ready.
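
For illustration, the gist of the fix is choosing the parameter name per 
action; a hedged sketch against the SolrJ CoreAdmin types (the surrounding 
request-building code is elided):

{code}
import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;
import org.apache.solr.common.params.ModifiableSolrParams;

class CoreParamFix {
  static void setCoreName(ModifiableSolrParams params,
                          CoreAdminAction action, String coreName) {
    // CREATE identifies the core with "name"; every other action uses "core"
    String paramName = (action == CoreAdminAction.CREATE) ? "name" : "core";
    params.set(paramName, coreName);
  }
}
{code}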

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-651) A SearchComponent for fetching TF-IDF values

2008-10-28 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-651:
--

Attachment: SOLR-651-fixes.patch

Is there a reason that this component asks for the latest searcher from the 
core instead of getting the one bound to the SolrQueryRequest?  Assuming it's 
just a bug... patch attached.

> A SearchComponent for fetching TF-IDF values
> 
>
> Key: SOLR-651
> URL: https://issues.apache.org/jira/browse/SOLR-651
> Project: Solr
>  Issue Type: New Feature
>Affects Versions: 1.3
>Reporter: Noble Paul
>Assignee: Grant Ingersoll
>Priority: Minor
> Fix For: 1.4
>
> Attachments: SOLR-651-fixes.patch, SOLR-651.patch, SOLR-651.patch, 
> SOLR-651.patch, SOLR-651.patch, SOLR-651.patch, SOLR-651.patch
>
>
> A SearchComponent that can return TF-IDF vector for any given document in the 
> SOLR index
> Query : A Document Number / a query identifying a Document
> Response :  A Map of term vs.TF-IDF value of every term in the Selected
> Document
> Why ?
> Most of the Machine Learning Algorithms work on TFIDF representation of
> documents, hence adding a Request Handler proving the TFIDF representation
> will pave the way for incorporating Learning Paradigms to SOLR framework.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-561) Solr replication by Solr (for windows also)

2008-10-28 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643214#action_12643214
 ] 

Yonik Seeley commented on SOLR-561:
---

Comments committed. Thanks!

> Solr replication by Solr (for windows also)
> ---
>
> Key: SOLR-561
> URL: https://issues.apache.org/jira/browse/SOLR-561
> Project: Solr
>  Issue Type: New Feature
>  Components: replication
>Affects Versions: 1.4
> Environment: All
>Reporter: Noble Paul
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: deletion_policy.patch, SOLR-561-core.patch, 
> SOLR-561-fixes.patch, SOLR-561-fixes.patch, SOLR-561-fixes.patch, 
> SOLR-561-full.patch, SOLR-561-full.patch, SOLR-561-full.patch, 
> SOLR-561-full.patch, SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, 
> SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, 
> SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, 
> SOLR-561.patch, SOLR-561.patch, SOLR-561.patch
>
>
> The current replication strategy in Solr involves shell scripts.  The 
> following are the drawbacks of that approach:
> * It does not work on Windows
> * Replication works as a separate piece, not integrated with Solr
> * Replication cannot be controlled from the Solr admin/JMX
> * Each operation requires a manual telnet to the host
> Doing the replication in Java has the following advantages:
> * Platform independence
> * Manual steps can be completely eliminated. Everything can be driven from 
> solrconfig.xml.
> ** Adding the url of the master in the slaves should be good enough to enable 
> replication. Other things like the frequency of
> snapshoot/snappull can also be configured.  All other information can be 
> automatically obtained.
> * Start/stop can be triggered from solr/admin or JMX
> * It can report status/progress while replication is going on, and can also 
> abort an ongoing replication
> * No need to have a login on the machine
> * From a development perspective, we can unit test it
> This issue can track the implementation of Solr replication in Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-826) For the solr-ruby-refactoring movement

2008-10-28 Thread Matt Mitchell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Mitchell updated SOLR-826:
---

Attachment: experiments.2.patch

This is even more stripped down: a basic connection with get/post methods. 
Keeping it simple like this would make it possible to mix in different modules 
for higher-level methods like add, delete_by_id, etc.

> For the solr-ruby-refactoring movement
> --
>
> Key: SOLR-826
> URL: https://issues.apache.org/jira/browse/SOLR-826
> Project: Solr
>  Issue Type: Improvement
>  Components: clients - ruby - flare
>Reporter: Matt Mitchell
> Attachments: experiments.2.patch, experiments.patch
>
>
> This is a patch to add a new directory to the solr-ruby-refactoring "branch". 
> It's a very lightweight blob of code for connecting, selecting, updating and 
> deleting using Ruby. It requires the URI and Net::HTTP libraries. No tests at 
> the moment but I think the comments will do for now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-667) Alternate LRUCache implementation

2008-10-28 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-667:
---

Attachment: SOLR-667.patch

# Added comments in the code
# Fixed a few concurrency issues

I'll commit this shortly.

> Alternate LRUCache implementation
> -
>
> Key: SOLR-667
> URL: https://issues.apache.org/jira/browse/SOLR-667
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Affects Versions: 1.3
>Reporter: Noble Paul
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: ConcurrentLRUCache.java, ConcurrentLRUCache.java, 
> ConcurrentLRUCache.java, SOLR-667.patch, SOLR-667.patch, SOLR-667.patch, 
> SOLR-667.patch, SOLR-667.patch, SOLR-667.patch
>
>
> The only available SolrCache, i.e. LRUCache, is based on _LinkedHashMap_, 
> which has _get()_ synchronized as well. This can cause severe bottlenecks for 
> faceted search. Any alternate implementation which can be faster/better must 
> be considered. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-561) Solr replication by Solr (for windows also)

2008-10-28 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul updated SOLR-561:


Attachment: SOLR-561.patch

comments only

> Solr replication by Solr (for windows also)
> ---
>
> Key: SOLR-561
> URL: https://issues.apache.org/jira/browse/SOLR-561
> Project: Solr
>  Issue Type: New Feature
>  Components: replication
>Affects Versions: 1.4
> Environment: All
>Reporter: Noble Paul
>Assignee: Shalin Shekhar Mangar
> Fix For: 1.4
>
> Attachments: deletion_policy.patch, SOLR-561-core.patch, 
> SOLR-561-fixes.patch, SOLR-561-fixes.patch, SOLR-561-fixes.patch, 
> SOLR-561-full.patch, SOLR-561-full.patch, SOLR-561-full.patch, 
> SOLR-561-full.patch, SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, 
> SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, 
> SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, SOLR-561.patch, 
> SOLR-561.patch, SOLR-561.patch, SOLR-561.patch
>
>
> The current replication strategy in Solr involves shell scripts.  The 
> following are the drawbacks of that approach:
> * It does not work on Windows
> * Replication works as a separate piece, not integrated with Solr
> * Replication cannot be controlled from the Solr admin/JMX
> * Each operation requires a manual telnet to the host
> Doing the replication in Java has the following advantages:
> * Platform independence
> * Manual steps can be completely eliminated. Everything can be driven from 
> solrconfig.xml.
> ** Adding the url of the master in the slaves should be good enough to enable 
> replication. Other things like the frequency of
> snapshoot/snappull can also be configured.  All other information can be 
> automatically obtained.
> * Start/stop can be triggered from solr/admin or JMX
> * It can report status/progress while replication is going on, and can also 
> abort an ongoing replication
> * No need to have a login on the machine
> * From a development perspective, we can unit test it
> This issue can track the implementation of Solr replication in Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.