[jira] [Commented] (SOLR-9651) Consider tracking modification time of external file fields for faster reloading

2016-10-17 Thread Mike (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583099#comment-15583099
 ] 

Mike commented on SOLR-9651:


Well, there's also SOLR-9653 which is specifically about deprecating EFFs. 
Might be a better place to discuss and then come back here if the decision is 
not to deprecate?

> Consider tracking modification time of external file fields for faster 
> reloading
> 
>
> Key: SOLR-9651
> URL: https://issues.apache.org/jira/browse/SOLR-9651
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 4.10.4
> Environment: Linux
>Reporter: Mike
>
> I have an index of about 4M legal documents that has pagerank boosting 
> configured as an external file field. The external file is about 100MB in 
> size and has one row per document in the index. Each row indicates the 
> pagerank score of a document. When we open new searchers, this file has to 
> get reloaded, and it creates a noticeable delay for our users -- takes 
> several seconds to reload. 
> An idea to fix this came up in [a recent discussion in the Solr mailing 
> list|https://www.mail-archive.com/solr-user@lucene.apache.org/msg125521.html]:
>  Could the file only be reloaded if it has changed on disk? In other words, 
> when new searchers are opened, could they check the modtime of the file, and 
> avoid reloading it if the file hasn't changed? 
> In our configuration, this would be a big improvement. We only change the 
> pagerank file once/week because computing it is intensive and new documents 
> don't tend to have a big impact. At the same time, because we're regularly 
> adding new documents, we do hundreds of commits per day, all of which have a 
> delay as the (largish) external file field is reloaded. 
> Is this a reasonable improvement to request? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9651) Consider tracking modification time of external file fields for faster reloading

2016-10-17 Thread Keith Laban (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583091#comment-15583091
 ] 

Keith Laban commented on SOLR-9651:
---

I wrote an extension of EFF called RemoteFileField (SOLR-9617). The idea is 
that you can drop your external file field in an s3 bucket or some remote 
hosted place and then tell solr to suck it down and update the EFF. We could 
probably use the same approach to have it just do atomic updates to the 
documents instead of writing an external file. 

Maybe the title/description of this ticket should be updated to be a discussion 
around finding a better approach for EFF.

> Consider tracking modification time of external file fields for faster 
> reloading
> 
>
> Key: SOLR-9651
> URL: https://issues.apache.org/jira/browse/SOLR-9651
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 4.10.4
> Environment: Linux
>Reporter: Mike
>
> I have an index of about 4M legal documents that has pagerank boosting 
> configured as an external file field. The external file is about 100MB in 
> size and has one row per document in the index. Each row indicates the 
> pagerank score of a document. When we open new searchers, this file has to 
> get reloaded, and it creates a noticeable delay for our users -- takes 
> several seconds to reload. 
> An idea to fix this came up in [a recent discussion in the Solr mailing 
> list|https://www.mail-archive.com/solr-user@lucene.apache.org/msg125521.html]:
>  Could the file only be reloaded if it has changed on disk? In other words, 
> when new searchers are opened, could they check the modtime of the file, and 
> avoid reloading it if the file hasn't changed? 
> In our configuration, this would be a big improvement. We only change the 
> pagerank file once/week because computing it is intensive and new documents 
> don't tend to have a big impact. At the same time, because we're regularly 
> adding new documents, we do hundreds of commits per day, all of which have a 
> delay as the (largish) external file field is reloaded. 
> Is this a reasonable improvement to request? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9651) Consider tracking modification time of external file fields for faster reloading

2016-10-17 Thread Mike (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583073#comment-15583073
 ] 

Mike commented on SOLR-9651:


That could be a solution, but our EFF file is about 90MB...so even with long 
updates we'd still have to send a lot of queries (or one TRULY massive one!). 

I'm happy so long as I have a way to quickly update a single numeric field for 
X million documents. If doing it via HTTP is easiest, that's fine, provided 
performance is reasonable.

> Consider tracking modification time of external file fields for faster 
> reloading
> 
>
> Key: SOLR-9651
> URL: https://issues.apache.org/jira/browse/SOLR-9651
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 4.10.4
> Environment: Linux
>Reporter: Mike
>
> I have an index of about 4M legal documents that has pagerank boosting 
> configured as an external file field. The external file is about 100MB in 
> size and has one row per document in the index. Each row indicates the 
> pagerank score of a document. When we open new searchers, this file has to 
> get reloaded, and it creates a noticeable delay for our users -- takes 
> several seconds to reload. 
> An idea to fix this came up in [a recent discussion in the Solr mailing 
> list|https://www.mail-archive.com/solr-user@lucene.apache.org/msg125521.html]:
>  Could the file only be reloaded if it has changed on disk? In other words, 
> when new searchers are opened, could they check the modtime of the file, and 
> avoid reloading it if the file hasn't changed? 
> In our configuration, this would be a big improvement. We only change the 
> pagerank file once/week because computing it is intensive and new documents 
> don't tend to have a big impact. At the same time, because we're regularly 
> adding new documents, we do hundreds of commits per day, all of which have a 
> delay as the (largish) external file field is reloaded. 
> Is this a reasonable improvement to request? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9651) Consider tracking modification time of external file fields for faster reloading

2016-10-17 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583037#comment-15583037
 ] 

Erick Erickson commented on SOLR-9651:
--

Good points. I wonder if the ability to send very long updates would serve? By 
that I mean instead of writing a file, a mechanism whereby I could send the 
gazillion updates to a DV field at once?

Mind you, I don't really have a lot of investment in deprecating EFF, mostly 
I'd like to be sure we discuss it and explore possibilities before embarking on 
a fix.



> Consider tracking modification time of external file fields for faster 
> reloading
> 
>
> Key: SOLR-9651
> URL: https://issues.apache.org/jira/browse/SOLR-9651
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 4.10.4
> Environment: Linux
>Reporter: Mike
>
> I have an index of about 4M legal documents that has pagerank boosting 
> configured as an external file field. The external file is about 100MB in 
> size and has one row per document in the index. Each row indicates the 
> pagerank score of a document. When we open new searchers, this file has to 
> get reloaded, and it creates a noticeable delay for our users -- takes 
> several seconds to reload. 
> An idea to fix this came up in [a recent discussion in the Solr mailing 
> list|https://www.mail-archive.com/solr-user@lucene.apache.org/msg125521.html]:
>  Could the file only be reloaded if it has changed on disk? In other words, 
> when new searchers are opened, could they check the modtime of the file, and 
> avoid reloading it if the file hasn't changed? 
> In our configuration, this would be a big improvement. We only change the 
> pagerank file once/week because computing it is intensive and new documents 
> don't tend to have a big impact. At the same time, because we're regularly 
> adding new documents, we do hundreds of commits per day, all of which have a 
> delay as the (largish) external file field is reloaded. 
> Is this a reasonable improvement to request? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9651) Consider tracking modification time of external file fields for faster reloading

2016-10-17 Thread Keith Laban (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583026#comment-15583026
 ] 

Keith Laban commented on SOLR-9651:
---

I wonder if we can do something even more hacky (and efficient) like write the 
external file field as a doc value segment and pretend like its an unstored, 
doc value field. I'm with Mike in thinking that dropping a file is much easier 
than updating all your documents.

> Consider tracking modification time of external file fields for faster 
> reloading
> 
>
> Key: SOLR-9651
> URL: https://issues.apache.org/jira/browse/SOLR-9651
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 4.10.4
> Environment: Linux
>Reporter: Mike
>
> I have an index of about 4M legal documents that has pagerank boosting 
> configured as an external file field. The external file is about 100MB in 
> size and has one row per document in the index. Each row indicates the 
> pagerank score of a document. When we open new searchers, this file has to 
> get reloaded, and it creates a noticeable delay for our users -- takes 
> several seconds to reload. 
> An idea to fix this came up in [a recent discussion in the Solr mailing 
> list|https://www.mail-archive.com/solr-user@lucene.apache.org/msg125521.html]:
>  Could the file only be reloaded if it has changed on disk? In other words, 
> when new searchers are opened, could they check the modtime of the file, and 
> avoid reloading it if the file hasn't changed? 
> In our configuration, this would be a big improvement. We only change the 
> pagerank file once/week because computing it is intensive and new documents 
> don't tend to have a big impact. At the same time, because we're regularly 
> adding new documents, we do hundreds of commits per day, all of which have a 
> delay as the (largish) external file field is reloaded. 
> Is this a reasonable improvement to request? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9651) Consider tracking modification time of external file fields for faster reloading

2016-10-17 Thread Mike (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583004#comment-15583004
 ] 

Mike commented on SOLR-9651:


I haven't read the super long SOLR-5944 issue, but the one thing I'll say about 
external file fields that's wonderful is that they can be updated in a flash. 
For a database with millions of documents, I can update them all just by 
updating a single file. That's very powerful and it'd be a shame if EFF were 
deprecated and I had to send a gazillion update queries to Solr to do the same.

> Consider tracking modification time of external file fields for faster 
> reloading
> 
>
> Key: SOLR-9651
> URL: https://issues.apache.org/jira/browse/SOLR-9651
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 4.10.4
> Environment: Linux
>Reporter: Mike
>
> I have an index of about 4M legal documents that has pagerank boosting 
> configured as an external file field. The external file is about 100MB in 
> size and has one row per document in the index. Each row indicates the 
> pagerank score of a document. When we open new searchers, this file has to 
> get reloaded, and it creates a noticeable delay for our users -- takes 
> several seconds to reload. 
> An idea to fix this came up in [a recent discussion in the Solr mailing 
> list|https://www.mail-archive.com/solr-user@lucene.apache.org/msg125521.html]:
>  Could the file only be reloaded if it has changed on disk? In other words, 
> when new searchers are opened, could they check the modtime of the file, and 
> avoid reloading it if the file hasn't changed? 
> In our configuration, this would be a big improvement. We only change the 
> pagerank file once/week because computing it is intensive and new documents 
> don't tend to have a big impact. At the same time, because we're regularly 
> adding new documents, we do hundreds of commits per day, all of which have a 
> delay as the (largish) external file field is reloaded. 
> Is this a reasonable improvement to request? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9651) Consider tracking modification time of external file fields for faster reloading

2016-10-17 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582033#comment-15582033
 ] 

Erick Erickson commented on SOLR-9651:
--

Consider deprecating instead?

> Consider tracking modification time of external file fields for faster 
> reloading
> 
>
> Key: SOLR-9651
> URL: https://issues.apache.org/jira/browse/SOLR-9651
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 4.10.4
> Environment: Linux
>Reporter: Mike
>
> I have an index of about 4M legal documents that has pagerank boosting 
> configured as an external file field. The external file is about 100MB in 
> size and has one row per document in the index. Each row indicates the 
> pagerank score of a document. When we open new searchers, this file has to 
> get reloaded, and it creates a noticeable delay for our users -- takes 
> several seconds to reload. 
> An idea to fix this came up in [a recent discussion in the Solr mailing 
> list|https://www.mail-archive.com/solr-user@lucene.apache.org/msg125521.html]:
>  Could the file only be reloaded if it has changed on disk? In other words, 
> when new searchers are opened, could they check the modtime of the file, and 
> avoid reloading it if the file hasn't changed? 
> In our configuration, this would be a big improvement. We only change the 
> pagerank file once/week because computing it is intensive and new documents 
> don't tend to have a big impact. At the same time, because we're regularly 
> adding new documents, we do hundreds of commits per day, all of which have a 
> delay as the (largish) external file field is reloaded. 
> Is this a reasonable improvement to request? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9651) Consider tracking modification time of external file fields for faster reloading

2016-10-17 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582006#comment-15582006
 ] 

Erick Erickson commented on SOLR-9651:
--

I understand the issue here, but do wonder if the effort would be best spent on 
https://issues.apache.org/jira/browse/SOLR-5944 (Updateable numeric docValues 
fields). That JIRA does _not_ require that the entire document be updated, The 
external file fields bit has always been something of a hack to get around 
having to update the entire document.

In fact, does it make sense to deprecate external file fields when SOLR-5944 
gets committed?

> Consider tracking modification time of external file fields for faster 
> reloading
> 
>
> Key: SOLR-9651
> URL: https://issues.apache.org/jira/browse/SOLR-9651
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 4.10.4
> Environment: Linux
>Reporter: Mike
>
> I have an index of about 4M legal documents that has pagerank boosting 
> configured as an external file field. The external file is about 100MB in 
> size and has one row per document in the index. Each row indicates the 
> pagerank score of a document. When we open new searchers, this file has to 
> get reloaded, and it creates a noticeable delay for our users -- takes 
> several seconds to reload. 
> An idea to fix this came up in [a recent discussion in the Solr mailing 
> list|https://www.mail-archive.com/solr-user@lucene.apache.org/msg125521.html]:
>  Could the file only be reloaded if it has changed on disk? In other words, 
> when new searchers are opened, could they check the modtime of the file, and 
> avoid reloading it if the file hasn't changed? 
> In our configuration, this would be a big improvement. We only change the 
> pagerank file once/week because computing it is intensive and new documents 
> don't tend to have a big impact. At the same time, because we're regularly 
> adding new documents, we do hundreds of commits per day, all of which have a 
> delay as the (largish) external file field is reloaded. 
> Is this a reasonable improvement to request? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org