[jira] [Closed] (LUCENE-5158) Allow StoredFieldVisitor instances to be stateful
[ https://issues.apache.org/jira/browse/LUCENE-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brendan Humphreys closed LUCENE-5158.
Resolution: Won't Fix

Allow StoredFieldVisitor instances to be stateful

Key: LUCENE-5158
URL: https://issues.apache.org/jira/browse/LUCENE-5158
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 4.4
Reporter: Brendan Humphreys
Priority: Minor
Attachments: LUCENE-5158.patch

Currently there is no way to build stateful {{StoredFieldVisitor}}s.

h3. Motivation

We would like to optimise our access to stored fields in our indexes by utilising the {{StoredFieldVisitor.Status.STOP}} feature to stop processing fields in a document. Unfortunately we have very large indexes, and rebuilding them to have the required field order is not an option. A stateful {{StoredFieldVisitor}} could solve this; it could track which fields have been loaded for a document and then {{STOP}} once the required fields have been loaded, regardless of the order in which they were loaded.

h3. Implementation

I've added a no-op {{public void reset()}} method to the {{StoredFieldVisitor}} base class, which gives a {{StoredFieldVisitor}} subclass an opportunity to reset its state before the fields of the next document are processed. I've added a call to {{reset()}} everywhere the {{StoredFieldVisitor}} was being used.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-5158) Allow StoredFieldVisitor instances to be stateful
[ https://issues.apache.org/jira/browse/LUCENE-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733191#comment-13733191 ]

Brendan Humphreys commented on LUCENE-5158:

bq. Separately if you want a reset() method to call before a document is processed, just add it to your own StoredFieldVisitor, and just call it yourself before the next ir.document(). Its not necessary to add this method to the lucene API for that.

Yes, I see now what you mean. I had come to this solution via a fairly circuitous route; stopping to smell the flowers, I see my modifications were unnecessary. I'll close this as Won't Fix.

Cheers,
-Brendan
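The advice in the comment above (keep the state in your own visitor and call reset() yourself before each ir.document() call) can be sketched as follows. This is a simplified, self-contained illustration: the Status enum and the visitor class below are hypothetical stand-ins for Lucene's real org.apache.lucene.index.StoredFieldVisitor API, condensed so the state-tracking idea stands out.

```java
import java.util.HashSet;
import java.util.Set;

// Stand-in for StoredFieldVisitor.Status (hypothetical, for illustration).
enum Status { YES, NO, STOP }

// A stateful visitor: loads only the required fields and stops the
// whole document once all of them have been seen, regardless of the
// order in which the stored fields appear.
class RequiredFieldsVisitor {
    private final Set<String> required;
    private final Set<String> seen = new HashSet<>();

    RequiredFieldsVisitor(Set<String> required) {
        this.required = required;
    }

    // The caller invokes this before processing the next document,
    // mirroring the reset() idea from the issue.
    void reset() {
        seen.clear();
    }

    // Decide per field: load it, skip it, or stop the whole document.
    Status needsField(String fieldName) {
        if (seen.containsAll(required)) {
            return Status.STOP;   // everything we wanted is already loaded
        }
        if (required.contains(fieldName)) {
            seen.add(fieldName);
            return Status.YES;
        }
        return Status.NO;
    }
}
```

In real Lucene code the same state would live in a StoredFieldVisitor subclass, with reset() called by the caller before each ir.document(reader, docId, visitor) invocation; no change to the Lucene API is needed.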
RE: [JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.8.0-ea-b99) - Build # 3121 - Failure!
Hi, this one looks crazy. Maybe a Windows-only problem, have never seen that before! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Policeman Jenkins Server [mailto:jenk...@thetaphi.de] Sent: Wednesday, August 07, 2013 6:35 AM To: dev@lucene.apache.org; rm...@apache.org; hoss...@apache.org Subject: [JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.8.0-ea-b99) - Build # 3121 - Failure! Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3121/ Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriterOutOfFileDescriptors.test Error Message: unreferenced files: before delete: [_0_TestBloomFilteredLucene41Postings_0.doc, _0_TestBloomFilteredLucene41Postings_0.pos, _k.fdt, _k.fdx, _k.fnm, _k.nvd, _k.nvm, _k.si, _k.tvd, _k.tvx, _k_Lucene41WithOrds_0.doc, _k_Lucene41WithOrds_0.pos, _k_Lucene41WithOrds_0.tib, _k_Lucene41WithOrds_0.tii, _k_Lucene42_0.dvd, _k_Lucene42_0.dvm, _k_MockFixedIntBlock_0.doc, _k_MockFixedIntBlock_0.frq, _k_MockFixedIntBlock_0.pos, _k_MockFixedIntBlock_0.pyl, _k_MockFixedIntBlock_0.skp, _k_MockFixedIntBlock_0.tib, _k_MockFixedIntBlock_0.tii, _k_MockVariableIntBlock_0.doc, _k_MockVariableIntBlock_0.frq, _k_MockVariableIntBlock_0.pos, _k_MockVariableIntBlock_0.pyl, _k_MockVariableIntBlock_0.skp, _k_MockVariableIntBlock_0.tib, _k_MockVariableIntBlock_0.tii, _k_TestBloomFilteredLucene41Postings_0.blm, _k_TestBloomFilteredLucene41Postings_0.doc, _k_TestBloomFilteredLucene41Postings_0.pos, _k_TestBloomFilteredLucene41Postings_0.tim, _k_TestBloomFilteredLucene41Postings_0.tip, _l.fdt, _l.fdx, _l.fnm, _l.nvd, _l.nvm, _l.si, _l.tvd, _l.tvx, _l_Lucene41WithOrds_0.doc, _l_Lucene41WithOrds_0.pos, _l_Lucene41WithOrds_0.tib, _l_Lucene41WithOrds_0.tii, _l_Lucene42_0.dvd, _l_Lucene42_0.dvm, _l_MockFixedIntBlock_0.doc, _l_MockFixedIntBlock_0.frq, _l_MockFixedIntBlock_0.pos, _l_MockFixedIntBlock_0.pyl, 
_l_MockFixedIntBlock_0.skp, _l_MockFixedIntBlock_0.tib, _l_MockFixedIntBlock_0.tii, _l_MockVariableIntBlock_0.doc, _l_MockVariableIntBlock_0.frq, _l_MockVariableIntBlock_0.pos, _l_MockVariableIntBlock_0.pyl, _l_MockVariableIntBlock_0.skp, _l_MockVariableIntBlock_0.tib, _l_MockVariableIntBlock_0.tii, _l_TestBloomFilteredLucene41Postings_0.blm, _l_TestBloomFilteredLucene41Postings_0.doc, _l_TestBloomFilteredLucene41Postings_0.pos, _l_TestBloomFilteredLucene41Postings_0.tim, _l_TestBloomFilteredLucene41Postings_0.tip, _m.cfe, _m.cfs, _m.si, _n.cfe, _n.cfs, _n.si, _o.fdt, _o.fdx, _o.fnm, _o.nvd, _o.nvm, _o.si, _o.tvd, _o.tvx, _o_Lucene41WithOrds_0.doc, _o_Lucene41WithOrds_0.pos, _o_Lucene41WithOrds_0.tib, _o_Lucene41WithOrds_0.tii, _o_Lucene42_0.dvd, _o_Lucene42_0.dvm, _o_MockFixedIntBlock_0.doc, _o_MockFixedIntBlock_0.frq, _o_MockFixedIntBlock_0.pos, _o_MockFixedIntBlock_0.pyl, _o_MockFixedIntBlock_0.skp, _o_MockFixedIntBlock_0.tib, _o_MockFixedIntBlock_0.tii, _o_MockVariableIntBlock_0.doc, _o_MockVariableIntBlock_0.frq, _o_MockVariableIntBlock_0.pos, _o_MockVariableIntBlock_0.pyl, _o_MockVariableIntBlock_0.skp, _o_MockVariableIntBlock_0.tib, _o_MockVariableIntBlock_0.tii, _o_TestBloomFilteredLucene41Postings_0.blm, _o_TestBloomFilteredLucene41Postings_0.doc, _o_TestBloomFilteredLucene41Postings_0.pos, _o_TestBloomFilteredLucene41Postings_0.tim, _o_TestBloomFilteredLucene41Postings_0.tip, _q.fdt, _q.fdx, _q.fnm, _q.nvd, _q.nvm, _q.si, _q.tvd, _q.tvx, _q_Lucene41WithOrds_0.doc, _q_Lucene41WithOrds_0.pos, _q_Lucene41WithOrds_0.tib, _q_Lucene41WithOrds_0.tii, _q_Lucene42_0.dvd, _q_Lucene42_0.dvm, _q_MockFixedIntBlock_0.doc, _q_MockFixedIntBlock_0.frq, _q_MockFixedIntBlock_0.pos, _q_MockFixedIntBlock_0.pyl, _q_MockFixedIntBlock_0.skp, _q_MockFixedIntBlock_0.tib, _q_MockFixedIntBlock_0.tii, _q_MockVariableIntBlock_0.doc, _q_MockVariableIntBlock_0.frq, _q_MockVariableIntBlock_0.pos, _q_MockVariableIntBlock_0.pyl, _q_MockVariableIntBlock_0.skp, _q_MockVariableIntBlock_0.tib, 
_q_MockVariableIntBlock_0.tii, _q_TestBloomFilteredLucene41Postings_0.blm, _q_TestBloomFilteredLucene41Postings_0.doc, _q_TestBloomFilteredLucene41Postings_0.pos, _q_TestBloomFilteredLucene41Postings_0.tim, _q_TestBloomFilteredLucene41Postings_0.tip, _r.fdt, _r.fdx, _r.fnm, _r.nvd, _r.nvm, _r.si, _r.tvd, _r.tvx, _r_Lucene41WithOrds_0.doc, _r_Lucene41WithOrds_0.pay, _r_Lucene41WithOrds_0.pos, _r_Lucene41WithOrds_0.tib, _r_Lucene41WithOrds_0.tii, _r_Lucene42_0.dvd, _r_Lucene42_0.dvm, _r_MockFixedIntBlock_0.doc, _r_MockFixedIntBlock_0.frq, _r_MockFixedIntBlock_0.pos, _r_MockFixedIntBlock_0.pyl, _r_MockFixedIntBlock_0.skp, _r_MockFixedIntBlock_0.tib, _r_MockFixedIntBlock_0.tii, _r_MockVariableIntBlock_0.doc, _r_MockVariableIntBlock_0.frq, _r_MockVariableIntBlock_0.pos, _r_MockVariableIntBlock_0.pyl, _r_MockVariableIntBlock_0.skp, _r_MockVariableIntBlock_0.tib, _r_MockVariableIntBlock_0.tii,
[jira] [Created] (SOLR-5123) NullPointerException on JdbcDataSource
Thomas SZADEL created SOLR-5123:

Summary: NullPointerException on JdbcDataSource
Key: SOLR-5123
URL: https://issues.apache.org/jira/browse/SOLR-5123
Project: Solr
Issue Type: Bug
Components: search
Affects Versions: 4.3
Environment: Linux
Reporter: Thomas SZADEL
Priority: Minor

We get an NPE with Solr 4.3 when obtaining a database connection (when JBoss fails to provide one). Solr runs on JBoss 7.1 and gets its connections via a JNDI call (the connection is provided by JBoss).

Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:38)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
... 5 more
Caused by: java.lang.NullPointerException
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:241)
... 12 more

In the code, the possibly null value is not checked:

239 try {
240     Connection c = getConnection();
241     stmt = c.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
...

A null check would be safer:

if (c == null) { throw new XXXException(); }
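The fix proposed in the report is a small fail-fast guard before the connection is first dereferenced. A minimal sketch of the pattern follows; the helper class name, the exception type, and the message text are my own placeholders, since the real fix would live inside JdbcDataSource$ResultSetIterator and would likely throw a DataImportHandlerException instead.

```java
import java.sql.Connection;

// Sketch of the proposed guard. In the report, the Connection comes
// from a JNDI lookup that JBoss can fail, returning null; without the
// check, the first c.createStatement(...) call throws a bare NPE.
class ConnectionGuard {
    static Connection requireConnection(Connection c) {
        if (c == null) {
            // Fail fast with a descriptive message instead of an NPE
            // deep inside the statement-creation call.
            throw new IllegalStateException(
                "Could not obtain a JDBC connection (JNDI lookup returned null)");
        }
        return c;
    }
}
```

The point of the guard is purely diagnostic: the import still fails either way, but the error now names the actual cause (no connection) rather than surfacing as a NullPointerException at JdbcDataSource.java:241.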
[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time
[ https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733281#comment-13733281 ]

ASF subversion and git services commented on SOLR-5113:

Commit 1511633 from [~noble.paul] in branch 'dev/trunk' [ https://svn.apache.org/r1511633 ]
SOLR-5113 CollectionsAPIDistributedZkTest fails all the time

Key: SOLR-5113
URL: https://issues.apache.org/jira/browse/SOLR-5113
Project: Solr
Issue Type: Bug
Components: Tests
Affects Versions: 4.5, 5.0
Reporter: Uwe Schindler
Assignee: Noble Paul
Priority: Blocker
Attachments: SOLR-5113.patch, SOLR-5113.patch
[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time
[ https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733292#comment-13733292 ]

ASF subversion and git services commented on SOLR-5113:

Commit 1511635 from [~noble.paul] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1511635 ]
SOLR-5113 CollectionsAPIDistributedZkTest fails all the time
[jira] [Resolved] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time
[ https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul resolved SOLR-5113.
Resolution: Fixed
Fix Version/s: 5.0, 4.5
[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time
[ https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733297#comment-13733297 ]

Uwe Schindler commented on SOLR-5113:

Hi Noble, thanks for committing! I think it is now up to Jenkins to verify that it works!
[jira] [Created] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances
Christoph Straßer created SOLR-5124:

Summary: Solr glues words when parsing PDFs under certain circumstances
Key: SOLR-5124
URL: https://issues.apache.org/jira/browse/SOLR-5124
Project: Solr
Issue Type: Bug
Components: update
Affects Versions: 4.4
Environment: Windows 7 (don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor

For some kinds of PDF documents Solr glues words together at line breaks under certain circumstances (e.g. the last word of line 1 and the first word of line 2 are merged into one word). (Stand-alone) Tika extracts the text correctly. Attached are one sample PDF and screenshots of the Tika output and of the corrupted content indexed by Solr. (This issue does not occur with all PDF documents. I tried to recreate the issue with new Word documents converted to PDF in multiple ways, without success.) The attached PDF document has a really weird internal structure, but Tika seems to do its work right, even with this weird document. In our Solr indices we have a good number of these weird documents, which results in worse suggestions by the Suggester.
[jira] [Updated] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christoph Straßer updated SOLR-5124:

Attachment: 04_Solr.png, 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput_GUI_PlainText.png, 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput.png, 02_PDF.png, 01_alz_2009_folge11_2009_05_28.pdf

Added sample PDF, screenshots of the Tika output, and a screenshot of the Solr index.
[jira] [Updated] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christoph Straßer updated SOLR-5124:

Description: For some kind of PDF-documents Solr glues words at linebreaks under some circumstances. (eg the last word of line 1 and the first word of line 2 are merged into one word) (Stand-alone-)Tika extracts the text correct. Attached you find one sample-PDF and screenshots of tika-output and the corrupted content indexed by solr. (This issue does not occur with all PDF-documents. Tried to recreate the issue with new word-documents, I converted into PDF on multiple ways without success.) The attached PDF-document has a real weird internal structure. But Tika seems to do it´s work right. Even with this weird document. In our Solr-indices we have a good amount of this weird documents. This results in worse suggestions by the Suggester.

was: For some kind of PDF-documents Solr glues words at linebreaks under some circumstances. (eg the last word of line 1 and the first word of line 2 are merged into one word) (Stand-alone-)Tika extracts the text correct. Attached you find one sample-PDF and screenshots of tika-output and the corrupted content indexed by solr. (This issue does not occur with all PDF-documents. Tried to recreate the issue with new word-documents, I converted into PDF on multiple ways without success.) The attached PDF-document has a real weird internal structure. But Tika seems to do it´s work right. Even with this weird document. In our Solr-indices we have a good amount of this wird documents. This results in worse suggestions by the Suggester.
Re: Problem using Benchmark
Anyone pls help!!

On Wed, Aug 7, 2013 at 12:36 PM, Abhishek Gupta <abhi.bansa...@gmail.com> wrote:

Hi, I am using PyLucene, and there I tried to use Lucene's Benchmark to evaluate TREC data. I had a question which I first asked on the pylucene-dev mailing list. After solving the first problem I got another one, which Andi said is a Java error. You can see the thread here: http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201308.mbox/%3CCAJBtL5GG-LghfKBCKFhi%2BPXVmEFMdnM1zC%3D9NtDd-kL-Pv1nuQ%40mail.gmail.com%3E

I am getting a class-not-found exception for Compressor (http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/package-summary.html). I am a newbie to Java development, so I don't know much about Ant. Please help in solving this issue.

Thanking you,
Abhishek Gupta, 897876422, 9416106204, 9624799165
[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733312#comment-13733312 ]

Uwe Schindler commented on SOLR-5124:

I have not looked into DIH's code, but I know that TIKA adds the extra whitespace as ignorable-whitespace XML data. It might be ignored by the extraction content handler when it consumes the SAX events.
[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733321#comment-13733321 ]

Christoph Straßer commented on SOLR-5124:

Maybe it's in some way related to SOLR-4679. (But I'm not sure; we use the ExtractingRequestHandler.)
[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733325#comment-13733325 ]

Uwe Schindler commented on SOLR-5124:

Hi, this is a duplicate of two other issues; SOLR-4679 is the main one. I will close this as a duplicate.
[jira] [Closed] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler closed SOLR-5124.
Resolution: Duplicate
Re: Problem using Benchmark
You can see the complete error I am getting here: http://codebin.org/view/8460aa0a

On Thu, Aug 8, 2013 at 3:10 PM, Abhishek Gupta <abhi.bansa...@gmail.com> wrote:

> Anyone pls help!!

--
Abhishek Gupta, 897876422, 9416106204, 9624799165
[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733328#comment-13733328 ] Uwe Schindler commented on SOLR-4679: - There is another occurrence of this bug with PDF files (SOLR-5124). I think we should apply the workaround and make the ignorable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so some additional whitespace would disappear. bq. i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for br tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it. I know this bug and I have been discussing it since the early beginning in TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes to correctly produce ignorable whitespace in some parsers, which were missing to do this). FYI: ignorable whitespace is XML semantics only; in (X)HTML this does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace SAX event to report this added whitespace. The rule that was chosen in TIKA is: - If you ignore all elements of HTML and only extract plain text, use the ignorable whitespace. This is e.g. done by TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignorable whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so if it exists, you know that it is coming from TIKA. - If you want to keep the XHTML structure and you understand block tags and <br/>, then you can ignore the ignorable whitespace.
Regarding this guideline, your patch is correct and should be applied to Solr. HTML line breaks (br) are removed during indexing; causes wrong search results Key: SOLR-4679 URL: https://issues.apache.org/jira/browse/SOLR-4679 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.2 Environment: Windows Server 2008 R2, Java 6, Tomcat 7 Reporter: Christoph Straßer Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during extraction of content from HTML files. They need to be replaced with an empty space. Test file:
<html>
<head>
<title>Test mit HTML-Zeilenschaltungen</title>
</head>
<p>
word1<br>word2<br/>
Some other words, a special name like linz<br>and another special name - vienna
</p>
</html>
The Solr content attribute contains the following text: Test mit HTML-Zeilenschaltungen word1word2 Some other words, a special name like linzand another special name - vienna So we are not able to find the word linz. We use the ExtractingRequestHandler to put content into Solr. (wiki.apache.org/solr/ExtractingRequestHandler)
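Uwe's first rule above (treat Tika's synthetic ignorableWhitespace events as significant when extracting plain text) can be sketched with a plain SAX handler. This is a hypothetical illustration only: the class name and the hand-fired event sequence are invented and this is not Solr's actual SolrContentHandler.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a text-only handler that keeps the synthetic
// ignorableWhitespace Tika emits for <br/> and block-element boundaries,
// so words on adjacent lines are not glued together.
public class TextOnlySketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // The key line: treat the whitespace as significant instead of dropping it.
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    public static void main(String[] args) {
        // Simulate the SAX events Tika would fire for "linz<br/>and":
        TextOnlySketch h = new TextOnlySketch();
        h.characters("linz".toCharArray(), 0, 4);
        h.ignorableWhitespace(" ".toCharArray(), 0, 1); // synthetic, from <br/>
        h.characters("and".toCharArray(), 0, 3);
        System.out.println(h.getText()); // prints "linz and", not "linzand"
    }
}
```

A handler that instead leaves ignorableWhitespace as a no-op reproduces the "linzand" gluing reported in this issue.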
[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733328#comment-13733328 ] Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:12 AM: -- There is another occurrence of this bug with PDF files (SOLR-5124). I think we should apply the workaround and make the ignorable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so some additional whitespace would disappear. bq. i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for br tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it. I know this bug and I have been discussing it since the early beginning in TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes to correctly produce ignorable whitespace in some parsers, which were missing to do this. I also added the XHTMLContentHandler stuff that makes block XHTML elements like <p/>, <div/> also emit a newline as ignorable on the closing element). FYI: ignorable whitespace is XML semantics only; in (X)HTML this does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace SAX event to report this added whitespace. The rule that was chosen in TIKA is: - If you ignore all elements of HTML and only extract plain text, use the ignorable whitespace. This is e.g. done by TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignorable whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so if it exists, you know that it is coming from TIKA. - If you want to keep the XHTML structure and you understand block tags and <br/>, then you can ignore the ignorable whitespace. Regarding this guideline, your patch is correct and should be applied to Solr.
[jira] [Assigned] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned SOLR-4679: --- Assignee: Uwe Schindler
[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733328#comment-13733328 ] Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:17 AM: -- There is another occurrence of this bug with PDF files (SOLR-5124). I think we should apply the workaround and make the ignorable whitespace significant. In my opinion this is not a problem at all, because the Analyzer will remove this stuff in any case, so some additional whitespace would disappear. bq. i did some experimenting and confirmed that the SolrContentHandler is getting ignorable whitespace SAX events for br tags in HTML – which makes no sense to me, so i've opened TIKA-1134 to try and get to the bottom of it. I know this bug and I have been discussing it since the early beginning in TIKA, and I don't think it will change! TIKA uses ignorable whitespace for all text-only glue stuff, which was decided at the beginning of the project. I can find the mail from their lists; I was involved in that, too (because I applied some fixes to correctly produce ignorable whitespace in some parsers, which were missing to do this. I also added the XHTMLContentHandler stuff that makes block XHTML elements like <p/>, <div/> also emit a newline as ignorable on the closing element, see TIKA-171). FYI: ignorable whitespace is XML semantics only; in (X)HTML this does not exist (it is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace SAX event to report this added whitespace. The rule that was chosen in TIKA is: - If you ignore all elements of HTML and only extract plain text, use the ignorable whitespace. This is e.g. done by TIKA's internal wrappers that produce plain text (TextOnlyContentHandler). They treat all ignorable whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so if it exists, you know that it is coming from TIKA. - If you want to keep the XHTML structure and you understand block tags and <br/>, then you can ignore the ignorable whitespace. Regarding this guideline, your patch is correct and should be applied to Solr.
[jira] [Comment Edited] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1377#comment-1377 ] Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:25 AM: -- The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in TIKA-171. I think this was the issue where we decided to emit ignorableWhitespace for all synthetic whitespace added to support text-only extraction. [~hossman]: I can take the issue if you like. I am +1 to committing your current patch, because it makes use of the stuff we decided in TIKA-171. In my opinion, TIKA-1134 is obsolete, but you/I can add a comment there to explain one more time and document under which circumstances TIKA emits ignorableWhitespace.
[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1377#comment-1377 ] Uwe Schindler commented on SOLR-4679: - The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in TIKA-171. I think this was the issue where we decided to emit ignorableWhitespace for all synthetic whitespace added to support text-only extraction. [~hossman]: I can take the issue if you like. I am +1 to committing your current patch, because it makes use of the stuff we decided in TIKA-171. In my opinion, TIKA-1134 is obsolete, but you/I can add a comment there to explain one more time and document under which circumstances TIKA emits ignorableWhitespace.
jar-checksums generates extra files?
When I do this on trunk: ant jar-checksums svn stat I get: ? solr\licenses\jcl-over-slf4j.jar.sha1 ? solr\licenses\jul-to-slf4j.jar.sha1 ? solr\licenses\log4j.jar.sha1 ? solr\licenses\slf4j-api.jar.sha1 ? solr\licenses\slf4j-log4j12.jar.sha1 Where should this be fixed? Should we svn-ignore those files or should they be somehow excluded from the re-generation of SHA checksums? Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: jar-checksums generates extra files?
Never mind, these were local files and they were svn-ignored, when I removed everything and checked out from scratch this problem is no longer there. I really wish svn had an equivalent of git clean -xfd . Dawid On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss dawid.we...@gmail.com wrote: When I do this on trunk: ant jar-checksums svn stat I get: ? solr\licenses\jcl-over-slf4j.jar.sha1 ? solr\licenses\jul-to-slf4j.jar.sha1 ? solr\licenses\log4j.jar.sha1 ? solr\licenses\slf4j-api.jar.sha1 ? solr\licenses\slf4j-log4j12.jar.sha1 Where should this be fixed? Should we svn-ignore those files or should they be somehow excluded from the re-generation of SHA checksums? Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: jar-checksums generates extra files?
Hi, Some GUIs like TortoiseSVN have this. I use this to delete all unversioned files in milliseconds(TM). But native svn does not have it, unfortunately. Uwe Dawid Weiss dawid.we...@gmail.com schrieb: Never mind, these were local files and they were svn-ignored, when I removed everything and checked out from scratch this problem is no longer there. I really wish svn had an equivalent of git clean -xfd . Dawid On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss dawid.we...@gmail.com wrote: When I do this on trunk: ant jar-checksums svn stat I get: ? solr\licenses\jcl-over-slf4j.jar.sha1 ? solr\licenses\jul-to-slf4j.jar.sha1 ? solr\licenses\log4j.jar.sha1 ? solr\licenses\slf4j-api.jar.sha1 ? solr\licenses\slf4j-log4j12.jar.sha1 Where should this be fixed? Should we svn-ignore those files or should they be somehow excluded from the re-generation of SHA checksums? Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de
Re: jar-checksums generates extra files?
I kind of use a workaround of removing everything except the .svn folder and then svn revert -R . But this is a dumb solution :) D. On Thu, Aug 8, 2013 at 1:12 PM, Uwe Schindler u...@thetaphi.de wrote: Hi, Some GUIs like TortoiseSVN have this. I use this to delete all unversioned files in milliseconds(TM). But native svn does not have it, unfortunately. Uwe Dawid Weiss dawid.we...@gmail.com schrieb: Never mind, these were local files and they were svn-ignored; when I removed everything and checked out from scratch this problem was no longer there. I really wish svn had an equivalent of git clean -xfd . Dawid On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss dawid.we...@gmail.com wrote: When I do this on trunk: ant jar-checksums svn stat I get: ? solr\licenses\jcl-over-slf4j.jar.sha1 ? solr\licenses\jul-to-slf4j.jar.sha1 ? solr\licenses\log4j.jar.sha1 ? solr\licenses\slf4j-api.jar.sha1 ? solr\licenses\slf4j-log4j12.jar.sha1 Where should this be fixed? Should we svn-ignore those files or should they be somehow excluded from the re-generation of SHA checksums? Dawid - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-Tests-trunk-Java7 - Build # 4219 - Failure
Build: https://builds.apache.org/job/Lucene-Solr-Tests-trunk-Java7/4219/ All tests passed Build Log: [...truncated 34909 lines...] BUILD FAILED /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/build.xml:389: The following error occurred while executing this line: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/build.xml:328: The following error occurred while executing this line: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/extra-targets.xml:66: The following error occurred while executing this line: /usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/extra-targets.xml:139: The following files are missing svn:eol-style (or binary svn:mime-type): * ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java Total time: 80 minutes 23 seconds Build step 'Invoke Ant' marked build as failure Archiving artifacts Recording test results Email was triggered for: Failure Sending email for trigger: Failure - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-trunk-Linux (64bit/jdk1.7.0_25) - Build # 6924 - Failure!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/6924/ Java: 64bit/jdk1.7.0_25 -XX:-UseCompressedOops -XX:+UseParallelGC All tests passed Build Log: [...truncated 34822 lines...] BUILD FAILED /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:389: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:328: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/extra-targets.xml:66: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/extra-targets.xml:139: The following files are missing svn:eol-style (or binary svn:mime-type): * ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java Total time: 52 minutes 24 seconds Build step 'Invoke Ant' marked build as failure Description set: Java: 64bit/jdk1.7.0_25 -XX:-UseCompressedOops -XX:+UseParallelGC Archiving artifacts Recording test results Email was triggered for: Failure Sending email for trigger: Failure - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.7.0_25) - Build # 3125 - Failure!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3125/ Java: 32bit/jdk1.7.0_25 -server -XX:+UseSerialGC All tests passed Build Log: [...truncated 31514 lines...] BUILD FAILED C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:389: The following error occurred while executing this line: C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:328: The following error occurred while executing this line: C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\extra-targets.xml:66: The following error occurred while executing this line: C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\extra-targets.xml:139: The following files are missing svn:eol-style (or binary svn:mime-type): * ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java Total time: 103 minutes 30 seconds Build step 'Invoke Ant' marked build as failure Description set: Java: 32bit/jdk1.7.0_25 -server -XX:+UseSerialGC Archiving artifacts Recording test results Email was triggered for: Failure Sending email for trigger: Failure - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time
[ https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733406#comment-13733406 ] ASF subversion and git services commented on SOLR-5113: --- Commit 1511715 from [~noble.paul] in branch 'dev/trunk' [ https://svn.apache.org/r1511715 ] SOLR-5113 setting svn:eol-style native CollectionsAPIDistributedZkTest fails all the time -- Key: SOLR-5113 URL: https://issues.apache.org/jira/browse/SOLR-5113 Project: Solr Issue Type: Bug Components: Tests Affects Versions: 4.5, 5.0 Reporter: Uwe Schindler Assignee: Noble Paul Priority: Blocker Fix For: 4.5, 5.0 Attachments: SOLR-5113.patch, SOLR-5113.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time
[ https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733408#comment-13733408 ] ASF subversion and git services commented on SOLR-5113: --- Commit 1511717 from [~noble.paul] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1511717 ] SOLR-5113 setting svn:eol-style native CollectionsAPIDistributedZkTest fails all the time -- Key: SOLR-5113 URL: https://issues.apache.org/jira/browse/SOLR-5113 Project: Solr Issue Type: Bug Components: Tests Affects Versions: 4.5, 5.0 Reporter: Uwe Schindler Assignee: Noble Paul Priority: Blocker Fix For: 4.5, 5.0 Attachments: SOLR-5113.patch, SOLR-5113.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[JENKINS] Lucene-Solr-4.x-Linux (32bit/jdk1.8.0-ea-b99) - Build # 6841 - Still Failing!
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Linux/6841/ Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC All tests passed Build Log: [...truncated 31168 lines...] BUILD FAILED /mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/build.xml:395: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/build.xml:334: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/extra-targets.xml:66: The following error occurred while executing this line: /mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/extra-targets.xml:139: The following files are missing svn:eol-style (or binary svn:mime-type): * ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java Total time: 45 minutes 31 seconds Build step 'Invoke Ant' marked build as failure Description set: Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC Archiving artifacts Recording test results Email was triggered for: Failure Sending email for trigger: Failure - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Problem using Benchmark
Abhishek, On Aug 8, 2013, at 12:08, Abhishek Gupta abhi.bansa...@gmail.com wrote: You can see the complete error I am getting here. Like I told you on pylucene-dev, you need to set up your classpath correctly so that these classes are found. If you are a Java newbie (as you said) and don't know what that means or how to achieve it, you need to research the issue yourself first. This mailing list is not the right forum for this question. Try a general Java programming forum first. Andi.. On Thu, Aug 8, 2013 at 3:10 PM, Abhishek Gupta abhi.bansa...@gmail.com wrote: Anyone pls help!! On Wed, Aug 7, 2013 at 12:36 PM, Abhishek Gupta abhi.bansa...@gmail.com wrote: Hi, I am using PyLucene and there I tried to use Lucene's Benchmark to evaluate TREC data. I had a doubt which I first asked on the pylucene-dev mailing list. After solving the first problem I got another problem, which Andi said was a Java error. You can see the thread here (http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201308.mbox/%3CCAJBtL5GG-LghfKBCKFhi%2BPXVmEFMdnM1zC%3D9NtDd-kL-Pv1nuQ%40mail.gmail.com%3E) I am getting a ClassNotFoundException for Compressor (http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/package-summary.html). I am a newbie to Java development, so I don't know much about Ant. Please help in solving this issue. Thanking You Abhishek Gupta, 9624799165 -- Abhishek Gupta, 897876422, 9416106204, 9624799165 -- Abhishek Gupta, 897876422, 9416106204, 9624799165
[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable
[ https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733454#comment-13733454 ] Michael McCandless commented on LUCENE-5152: bq. can you elaborate what you are concerned about? I'm worried about the O(N^2) cost of the assert: for every arc (a single byte of each term in a seekExact) we iterate over all root arcs (up to 256 arcs) in this assert. bq. findTargetArc is the only place where we actually use this cache? Ah, that's true, I hadn't realized that. Maybe, instead, we can move the assert just inside the if that actually uses the cached arcs? Ie, put it here: {code} if (follow.target == startNode && labelToMatch < cachedRootArcs.length) { assert assertRootArcs(); ... } {code} This would address my concern: the cost becomes O(N) not O(N^2). And the coverage is the same? Lucene FST is not immutable --- Key: LUCENE-5152 URL: https://issues.apache.org/jira/browse/LUCENE-5152 Project: Lucene - Core Issue Type: Bug Components: core/FSTs Affects Versions: 4.4 Reporter: Simon Willnauer Priority: Blocker Fix For: 5.0, 4.5 Attachments: LUCENE-5152.patch, LUCENE-5152.patch, LUCENE-5152.patch A spin-off from LUCENE-5120, where the analyzing suggester modified a returned output from an FST (BytesRef), which caused side effects in later execution. I added an assertion into the FST that checks if a cached root arc is modified, and in fact this happens for instance in our MemoryPostingsFormat, and I bet we find more places. We need to think about how to make this less trappy since it can cause bugs that are super hard to find. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
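The tradeoff McCandless describes can be sketched as follows (illustrative Java only, with made-up names and sizes, not the actual Lucene FST internals): the O(cache size) consistency scan runs only on the path that actually consults the cached root arcs, instead of on every arc visited.

```java
// Illustrative sketch, not org.apache.lucene.util.fst.FST code.
public class RootArcCacheSketch {
    static final int CACHE_SIZE = 128;
    static final int[] cachedRootArcs = new int[CACHE_SIZE];
    static final int[] canonicalRootArcs = cachedRootArcs.clone();

    // O(CACHE_SIZE) scan: verify no cached root arc was mutated by a caller.
    static boolean assertRootArcs() {
        for (int i = 0; i < CACHE_SIZE; i++) {
            if (cachedRootArcs[i] != canonicalRootArcs[i]) {
                return false;
            }
        }
        return true;
    }

    // Asserting only inside the branch that reads the cache keeps the cost
    // proportional to actual cache hits, not to every arc of every term.
    static int findTargetArc(int label, boolean followIsRoot) {
        if (followIsRoot && label < CACHE_SIZE) {
            assert assertRootArcs(); // runs only on the cache-hit path
            return cachedRootArcs[label];
        }
        return -1; // placeholder for the real arc traversal
    }

    public static void main(String[] args) {
        System.out.println(findTargetArc(42, true));
    }
}
```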
Re: [PROPOSAL] Make Luke a Lucene/Solr Module
Hello, Thanks so much for accepting the project proposal. I've started the coding work. I'll keep you all posted on my work. On Fri, Jul 26, 2013 at 1:48 PM, Ajay Bhat a.ajay.b...@gmail.com wrote: Hi, I have a question regarding one of the interfaces in the orig version. The IOReporter.java [1] is used by the Hadoop Plugin [2] and it only has 2 functions, which are implemented by the Hadoop Plugin. Is this interface really needed? Can't I just use the functions as-is in the Hadoop class without needing to get the IOReporter? [1] http://luke.googlecode.com/svn/trunk/src/org/getopt/luke/plugins/IOReporter.java [2] http://luke.googlecode.com/svn/trunk/src/org/getopt/luke/plugins/HadoopPlugin.java On Sat, Jul 20, 2013 at 11:12 PM, SUJIT PAL sujit@comcast.net wrote: Hi Ajay, Thanks for the reply and the links to the email threads. I saw a response on this thread from Shawn Heisey about this as well. I didn't realize your focus was Luke, then Lucene, then Solr - the proposal title and the JIRA both mention Lucene/Solr module, which probably misled me - I guess I should have read the doc more carefully... Thank you for the clarification and good luck with your project. -sujit On Jul 20, 2013, at 9:09 AM, Ajay Bhat wrote: Hi Sujit, Thanks for your comments. There was actually some discussion earlier about whether or not Solr was the highest priority. http://mail-archives.apache.org/mod_mbox/lucene-dev/201307.mbox/%3C0F7176D08A99494EBF1E129298E12904%40JackKrupansky%3E http://mail-archives.apache.org/mod_mbox/lucene-dev/201307.mbox/%3CCAOdYfZVQ1WzWhYVeKgwpA%3DmQVONxo4XiLza28geV2L1PCpcQJg%40mail.gmail.com%3E Right now I don't think I could do the integration with Solr since (a) I don't know enough Javascript to work with Solr and (b) the time for submitting proposals for the program is over. The project duration is scheduled till the end of October. After that, or if I get time during the project period, I'll try to work with other functionality of Luke and then try for Solr.
I think it's best to make Luke completely functional before integrating it with trunk, and this is better done in incremental steps. On Fri, Jul 19, 2013 at 9:59 PM, SUJIT PAL sujit@comcast.net wrote: Hi Ajay, Since you asked for feedback from the community... a lot of what Luke used to do is now already available in Solr's admin tool. From Luke's feature set that you had in your proposal Google doc, the only ones I think are /not/ present are the following: * Browse by document number * selectively delete documents from the index - there is no delete document page AFAIK, but you can still do this from the URL. * reconstruct the original document fields, edit them and re-insert to the index - you can do this using code as long as the fields are stored, but there is no reconstruct page. * optimize indexes - can be done from the URL but probably no page/button for this. As a Solr user, for me your tool would be most useful if it concentrated on these areas, and if it could be integrated into the existing admin tool (the Solr 4 one of course). I am not sure what the Solr 4 admin tool uses; if it's Pivot then I guess that's what you should use (and by extension, if not, you probably should use what the current tool uses so it's easy to maintain going forward). The benefit to users such as myself would be a unified look-and-feel, so not much of a learning curve/barrier to adoption. Just my $0.02... -sujit On Jul 19, 2013, at 8:06 AM, Ajay Bhat wrote: Hi Mark, I've added the proposal to the ASF-ICFOSS proposals page [1]. According to the ICFOSS programme [2] the last date for submission of project proposals is July 19th (today). The time period for mentors to review and rank students' project proposals is July 22nd to August 2nd, i.e. next week onwards. I'd like some feedback on my proposal from the community as well.
Link to proposal on Google Docs : https://docs.google.com/document/d/18Vu5YB6C7WLDxnG01BnZXFEKUC3EQYb0Y5_tCJFb_sc Link to proposal on CWiki page : https://cwiki.apache.org/confluence/display/COMDEV/Proposal+for+Apache+Lucene+-+Ajay+Bhat [1] https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme [2] http://community.apache.org/mentoringprogramme-icfoss-pilot.html On Thu, Jul 18, 2013 at 12:04 AM, Ajay Bhat a.ajay.b...@gmail.com wrote: Thanks Mark. I've given you comment access as well so you can comment on specific parts of the proposal. On Wed, Jul 17, 2013 at 11:51 PM, Mark Miller markrmil...@gmail.com wrote: You can put me down as the mentor. - Mark On Jul 17, 2013, at 2:04 PM, Ajay Bhat a.ajay.b...@gmail.com wrote: Hi all, I want to do the JIRA issue LUCENE-2562: Make Luke a Lucene/Solr module [1] as a project. This project will be for the ASF-ICFOSS programme [2] by Luciano Resende [3] and the proposal has to be
[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable
[ https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733474#comment-13733474 ] Simon Willnauer commented on LUCENE-5152: - bq. This would address my concern: the cost becomes O(N) not O(N^2). And the coverage is the same? The problem here is that we really need to check after we return from the cache, and that might be the case only once in a certain test. Yet, I think it's OK to do it there. I still don't get what you are concerned about; we only have -ea in tests, and the tests don't seem to be any slower? Can you elaborate on what you are afraid of?
[jira] [Commented] (SOLR-5124) Solr glues word´s when parsing PDFs under certan circumstances
[ https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733547#comment-13733547 ] Jack Krupansky commented on SOLR-5124: -- Try doing the update with the extractOnly=true parameter and look at the actual bytes where the two adjacent terms meet - it may be some odd Unicode value that Solr filters ignore rather than treat as white space. Solr glues word´s when parsing PDFs under certan circumstances -- Key: SOLR-5124 URL: https://issues.apache.org/jira/browse/SOLR-5124 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.4 Environment: Windows 7 (don´t think, this is relevant) Reporter: Christoph Straßer Priority: Minor Labels: tika,text-extraction Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png For some kind of PDF-documents Solr glues words at linebreaks under some circumstances. (eg the last word of line 1 and the first word of line 2 are merged into one word) (Stand-alone-)Tika extracts the text correct. Attached you find one sample-PDF and screenshots of tika-output and the corrupted content indexed by solr. (This issue does not occur with all PDF-documents. Tried to recreate the issue with new word-documents, I converted into PDF on multiple ways without success.) The attached PDF-document has a real weird internal structure. But Tika seems to do it´s work right. Even with this weird document. In our Solr-indices we have a good amount of this weird documents. This results in worse suggestions by the Suggester.
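Jack's suggestion, inspecting the raw code points where the two glued terms meet, can be done with a few lines of Java (a hypothetical helper, not part of Solr or Tika):

```java
// Print each code point of a string so that an invisible character between
// two glued words (e.g. a soft hyphen, U+00AD) becomes obvious.
public class CodePointDump {
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // a soft hyphen sandwiched between two letters
        System.out.println(dump("a\u00ADb")); // prints U+0061 U+00AD U+0062
    }
}
```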
[jira] [Commented] (SOLR-5120) Solrj Query response error with result number
[ https://issues.apache.org/jira/browse/SOLR-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733589#comment-13733589 ] Shawn Heisey commented on SOLR-5120: [~lukasw44] I have a question to ask you: what resources did you look at in order to decide that you should file a bug to get an answer to your question? The reason that I ask is that we have been seeing an increase recently in the number of people who file a bug for support issues instead of asking for help via our discussion resources like the mailing list. This suggests that there might be some incorrect support information out there that needs correction. Related to your issue: if setting the start parameter to 0 or omitting the parameter didn't fix your issue, then this issue can be reopened, but I'm confident that this is the problem. Solrj Query response error with result number -- Key: SOLR-5120 URL: https://issues.apache.org/jira/browse/SOLR-5120 Project: Solr Issue Type: Bug Environment: linux, lubuntu, java version 1.7.0_13. Reporter: Łukasz Woźniczka Priority: Critical This is my simple code: {code} QueryResponse qr; try { qr = fs.execute(solrServer); System.out.println("QUERY RESPONSE : " + qr); for (Entry<String, Object> r : qr.getResponse()) { System.out.println("RESPONSE: " + r.getKey() + " - " + r.getValue()); } SolrDocumentList dl = qr.getResults(); System.out.println("--RESULT SIZE: [" + dl.size()); } catch (SolrServerException e) { e.printStackTrace(); } {code} I am using solrj and solr-core version 4.4.0, and there is probably a bug in solrj in the query result. I create one simple txt doc with content 'anna', then I restart Solr and try to search for this phrase. Nothing is found, but this is my query response system out: {numFound=1,start=1,docs=[]}. So as you can see there is info that numFound=1 but docs=[] is empty. Next I add another document with only the one word 'anna' and then try to search for that string, and this is the sysout: {numFound=2,start=1,docs=[SolrDocument{file_id=9882, file_name=luk-search2.txt, file_create_user=-1, file_department=10, file_mime_type=text/plain, file_extension=.txt, file_parents_folder=[5021, 4781, 341, -20, -1], _version_=1442647024934584320}]} So as you can see there is numFound = 2 but only one document is listed.
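The reported output follows directly from Shawn's diagnosis. A toy model of the page arithmetic, assuming start is a 0-based offset into the numFound matches (rows defaults to 10 in Solr):

```java
// numFound counts all matches; start skips that many before docs are listed.
public class PagingSketch {
    static int docsOnPage(long numFound, int start, int rows) {
        long remaining = numFound - start;
        return remaining <= 0 ? 0 : (int) Math.min(rows, remaining);
    }

    public static void main(String[] args) {
        // numFound=1, start=1: the single hit is skipped, docs=[] as reported
        System.out.println(docsOnPage(1, 1, 10));
        // numFound=2, start=1: only one of the two hits is listed
        System.out.println(docsOnPage(2, 1, 10));
        // numFound=1, start=0: the hit is returned
        System.out.println(docsOnPage(1, 0, 10));
    }
}
```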
[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable
[ https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733598#comment-13733598 ] Michael McCandless commented on LUCENE-5152: bq. Can you elaborate what you are afraid of? In general I think it's bad if an assert changes too much how the code would run without asserts. E.g., maybe this O(N^2) assert alters how threads are scheduled and changes how / whether an issue appears in practice. Similarly, if a user is having trouble, I'll recommend turning on asserts to see if one trips, but if this causes a change in how the code runs then this can change whether the issue reproduces. I also just don't like O(N^2) code, even when it's under an assert :) I think asserts should minimize their impact on the real code when possible, and it certainly seems possible in this case. Separately, we really should run our tests w/o asserts, too, since this is how our users typically run (I know some tests fail if assertions are off ... we'd have to fix them). What if we accidentally commit real code behind an assert? Our tests wouldn't catch it ...
[jira] [Commented] (SOLR-3076) Solr(Cloud) should support block joins
[ https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733601#comment-13733601 ] Yonik Seeley commented on SOLR-3076: Making progress... currently working on randomized testing (using our current join implementation to cross-check this implementation). I've hit some snags and am working through them... bq. one of inconveniences is the necessity to provide user cache for BJQParser Yeah, I had some things in mind to handle that as well. Solr(Cloud) should support block joins -- Key: SOLR-3076 URL: https://issues.apache.org/jira/browse/SOLR-3076 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Yonik Seeley Fix For: 4.5, 5.0 Attachments: 27M-singlesegment-histogram.png, 27M-singlesegment.png, bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, child-bjqparser.patch, dih-3076.patch, dih-config.xml, parent-bjq-qparser.patch, parent-bjq-qparser.patch, Screen Shot 2012-07-17 at 1.12.11 AM.png, SOLR-3076-childDocs.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-7036-childDocs-solr-fork-trunk-patched, solrconf-bjq-erschema-snippet.xml, solrconfig.xml.patch, tochild-bjq-filtered-search-fix.patch Lucene has the ability to do block joins, we should add it to Solr.
[jira] [Commented] (SOLR-5120) Solrj Query response error with result number
[ https://issues.apache.org/jira/browse/SOLR-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733606#comment-13733606 ] Łukasz Woźniczka commented on SOLR-5120: Shawn Heisey, it's my fault, sorry. I was setting the start parameter to 1.
[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733656#comment-13733656 ] Hoss Man commented on SOLR-4679: Uwe: I defer to your judgement on this. If you think the patch is the right way to go, then +1 from me. HTML line breaks (br) are removed during indexing; causes wrong search results Key: SOLR-4679 URL: https://issues.apache.org/jira/browse/SOLR-4679 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.2 Environment: Windows Server 2008 R2, Java 6, Tomcat 7 Reporter: Christoph Straßer Assignee: Uwe Schindler Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during extraction of content from HTML files. They need to be replaced with an empty space. Test file: {code:xml} <html> <head> <title>Test mit HTML-Zeilenschaltungen</title> </head> <p> word1<br>word2<br/> Some other words, a special name like linz<br>and another special name - vienna </p> </html> {code} The Solr content attribute contains the following text: Test mit HTML-Zeilenschaltungen word1word2 Some other words, a special name like linzand another special name - vienna So we are not able to find the word linz. We use the ExtractingRequestHandler to put content into Solr. (wiki.apache.org/solr/ExtractingRequestHandler)
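The behavior the reporter expects can be illustrated by mapping tags to whitespace before tokenization. This is a toy sketch of the expected outcome only; the real fix is in the Tika/Solr HTML extraction path (see the attached patch), not a regex:

```java
// Replace every HTML tag with a space so that "word1<br>word2" yields two
// tokens instead of the glued "word1word2".
public class BrToWhitespace {
    static String stripTags(String html) {
        // replace any tag with a space, then collapse whitespace runs
        return html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("a special name like linz<br>and another"));
    }
}
```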
[jira] [Commented] (SOLR-2548) Multithreaded faceting
[ https://issues.apache.org/jira/browse/SOLR-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733690#comment-13733690 ] Gun Akkor commented on SOLR-2548: - I would like to revive this ticket, if possible. We have an index with about 10 fields that we regularly facet on. These fields are either multi-valued or are of type TextField, so facet code chooses FC as the facet method, and uses the UnInvertedField instances to count each facet field, which takes several seconds per field in our case. So, multi-thread execution of getTermCounts() reduces the overall facet time considerably. I started with the patch that was posted against 3.1 and modified it a little bit to take into account previous comments made by Yonik and Adrien. The new patch applies against 4.2.1, uses the already existing facetExecutor thread pool, and is configured per request via a facet.threads request param. If the param is not supplied, the code defaults to directExecutor and runs sequential as before. So, code should behave as is if user chooses not to submit number of threads to use. Also in the process of testing, I noticed that UnInvertedField.getUnInvertedField() call was synchronized too early, before the call to new UnInvertedField(field, searcher) if the field is not in the field value cache. Because its init can take several seconds, synchronizing on the cache in that duration was effectively serializing the execution of the multiple threads. So, I modified it (albeit inelegantly) to synchronize later (in our case cache hit ratio is low, so this makes a difference). The patch is still incomplete, as it does not extend this framework to possibly other calls like ranges and dates, but it is a start. 
Multithreaded faceting -- Key: SOLR-2548 URL: https://issues.apache.org/jira/browse/SOLR-2548 Project: Solr Issue Type: Improvement Components: search Affects Versions: 3.1 Reporter: Janne Majaranta Priority: Minor Labels: facet Attachments: SOLR-2548_for_31x.patch, SOLR-2548.patch Add multithreading support for faceting.
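The threading model Gun describes, one task per facet field with a sequential fallback when facet.threads is not supplied, might be sketched like this (illustrative code with invented names, not the actual Solr patch):

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelFacetSketch {
    // Stand-in for the expensive per-field UnInvertedField counting.
    static Map<String, Integer> countTerms(String field) {
        return Collections.singletonMap(field, field.length());
    }

    // threads <= 1 degenerates to a single worker, i.e. sequential as before.
    static Map<String, Map<String, Integer>> facet(List<String> fields, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (String f : fields) {
                futures.add(pool.submit(() -> countTerms(f))); // one task per field
            }
            Map<String, Map<String, Integer>> result = new LinkedHashMap<>();
            for (int i = 0; i < fields.size(); i++) {
                try {
                    result.put(fields.get(i), futures.get(i).get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(facet(Arrays.asList("cat", "manu_exact"), 4));
    }
}
```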
[jira] [Updated] (SOLR-2548) Multithreaded faceting
[ https://issues.apache.org/jira/browse/SOLR-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gun Akkor updated SOLR-2548: Attachment: SOLR-2548_4.2.1.patch Patch against 4.2.1
[jira] [Commented] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
[ https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733701#comment-13733701 ] Adrien Grand commented on LUCENE-5157: -- I discussed this issue with Robert to see how we can move forward: - moving OrdinalMap to MultiTermsEnum can be controversial, as Robert explained, so let's only tackle the naming and getSegmentOrd API issues here, - another option to make getSegmentOrd less trappy is to add an assertion that the provided segment number is the same as the one returned by {{getSegmentNumber}}; this would allow for returning the segment ordinals of any segment in the future without changing the API, - renaming subIndex to segment is OK as it makes the naming more consistent. Robert, please correct me if you think this doesn't reflect correctly what we said. Boaz, what do you think? Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure. Key: LUCENE-5157 URL: https://issues.apache.org/jira/browse/LUCENE-5157 Project: Lucene - Core Issue Type: Improvement Reporter: Boaz Leskes Priority: Minor Attachments: LUCENE-5157.patch I refactored MultiDocValues.OrdinalMap, removing one unused parameter and renaming some methods to more clearly communicate what they do. Also I renamed subIndex references to segmentIndex.
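The assertion Adrien proposes might look roughly like this (a hypothetical sketch with invented data and method names, not the real MultiDocValues.OrdinalMap): the lookup only answers for the segment that first introduced the value, so asserting the caller passed that segment traps misuse while leaving room to widen the API later.

```java
// Toy ordinal map over 3 global ords; arrays are illustrative data only.
public class OrdinalMapSketch {
    // per global ord: which segment first saw the value, and the delta
    // between the global ord and that segment's local ord
    static final int[] firstSegment = {0, 1, 0};
    static final long[] globalOrdDelta = {0, 1, 1};

    static int getFirstSegmentNumber(long globalOrd) {
        return firstSegment[(int) globalOrd];
    }

    static long getSegmentOrd(int segment, long globalOrd) {
        // trap misuse: only valid for the segment that introduced this value
        assert segment == getFirstSegmentNumber(globalOrd)
            : "segment " + segment + " did not introduce global ord " + globalOrd;
        return globalOrd - globalOrdDelta[(int) globalOrd];
    }

    public static void main(String[] args) {
        // global ord 1 maps back to local ord 0 inside segment 1
        System.out.println(getSegmentOrd(1, 1));
    }
}
```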
[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard
[ https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733703#comment-13733703 ] Shawn Heisey commented on SOLR-4414: [~shalinmangar] I came across this issue while looking into my problems with distributed MoreLikeThis. Things look a little off, so I'm writing this. At a quick glance, the commit comment doesn't seem to be related to this issue, because it doesn't mention MLT at all. Also, you have never commented on this issue outside the commit comment. This is the issue number in CHANGES.txt, though. Is the commit for this issue or another one? If the commit is for this issue, I think this probably needs to be closed, fixed in 4.2 and 5.0. If not, CHANGES.txt probably needs some cleanup. MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard --- Key: SOLR-4414 URL: https://issues.apache.org/jira/browse/SOLR-4414 Project: Solr Issue Type: Bug Components: MoreLikeThis, SolrCloud Affects Versions: 4.1 Reporter: Colin Bartolome Running a MoreLikeThis query in a cloud works only when the document being queried exists in whatever shard serves the request. If the document is not present in the shard, no interesting terms are found and, consequently, no matches are found. h5. Steps to reproduce * Edit example/solr/collection1/conf/solrconfig.xml and add this line, with the rest of the request handlers: {code:xml} <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" /> {code} * Follow the [simplest SolrCloud example|http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster] to get two shards running. * Hit this URL: [http://localhost:8983/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1] * Compare that output to that of this URL: [http://localhost:7574/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1] The former URL will return a result and list some interesting terms. The latter URL will return no results and list no interesting terms. It will also show this odd XML element: {code:xml} <null name="response"/> {code}
[jira] [Updated] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
[ https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-5157: - Assignee: Adrien Grand
[jira] [Commented] (LUCENE-5150) WAH8DocIdSet: dense sets compression
[ https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733704#comment-13733704 ] Adrien Grand commented on LUCENE-5150: -- I'll commit soon if there is no objection. These dense sets can be common in cases where e.g. users are allowed to see everything but something. WAH8DocIdSet: dense sets compression Key: LUCENE-5150 URL: https://issues.apache.org/jira/browse/LUCENE-5150 Project: Lucene - Core Issue Type: Improvement Reporter: Adrien Grand Assignee: Adrien Grand Priority: Trivial Attachments: LUCENE-5150.patch In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets.
[jira] [Created] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries
Robert Muir created LUCENE-5159: --- Summary: compressed diskdv sorted/sortedset termdictionaries Key: LUCENE-5159 URL: https://issues.apache.org/jira/browse/LUCENE-5159 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Robert Muir Sorted/SortedSet give you ordinal(s) per document, but then separately have a term dictionary of all the values. You can do a few operations on these: * ord -> term lookup (e.g. retrieving facet labels) * term -> ord lookup (reverse lookup: e.g. FieldCacheRangeFilter) * get a term enumerator (e.g. merging, ordinalmap construction) The current implementation for diskdv was the simplest thing that could possibly work: under the hood it just makes a binary DV for these (treating ordinals as document ids). When the terms are fixed length, you can address a term directly with multiplication. When they are variable length though, we have to store a packed ints structure in RAM. This variable-length case is overkill and chews up a lot of RAM if you have many unique values. It also chews up a lot of disk since all the values are just concatenated (no sharing).
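The fixed-length case Robert mentions, where an ordinal becomes a byte offset by multiplication with no per-ord address table, can be shown in a few lines (illustrative only, not the diskdv format itself):

```java
import java.nio.charset.StandardCharsets;

public class FixedLengthOrdLookup {
    // With fixed-length terms, ord -> byte offset is just ord * length,
    // so no packed-ints offset structure has to be held in RAM.
    static String lookup(byte[] blob, int ord, int length) {
        return new String(blob, ord * length, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // three concatenated terms, each 3 bytes long
        byte[] blob = "aaabbbccc".getBytes(StandardCharsets.UTF_8);
        System.out.println(lookup(blob, 1, 3)); // prints bbb
    }
}
```

Variable-length terms break this trick: without a uniform length there is no closed-form offset, which is why the current implementation stores per-ord offsets, the RAM cost this issue wants to shrink.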
[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard
[ https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733713#comment-13733713 ] Mark Miller commented on SOLR-4414: --- I think it was simply mis-tagged. MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard --- Key: SOLR-4414 URL: https://issues.apache.org/jira/browse/SOLR-4414 Project: Solr Issue Type: Bug Components: MoreLikeThis, SolrCloud Affects Versions: 4.1 Reporter: Colin Bartolome Running a MoreLikeThis query in a cloud works only when the document being queried exists in whatever shard serves the request. If the document is not present in the shard, no interesting terms are found and, consequently, no matches are found. h5. Steps to reproduce * Edit example/solr/collection1/conf/solrconfig.xml and add this line, with the rest of the request handlers: {code:xml} <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" /> {code} * Follow the [simplest SolrCloud example|http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster] to get two shards running. * Hit this URL: [http://localhost:8983/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1] * Compare that output to that of this URL: [http://localhost:7574/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1] The former URL will return a result and list some interesting terms. The latter URL will return no results and list no interesting terms. It will also show this odd XML element: {code:xml} <null name="response"/> {code}
[jira] [Updated] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries
[ https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-5159: Attachment: LUCENE-5159.patch Here's an in-progress patch... all the core/codec tests pass, but I'm sure there are a few bugs to knock out (improving the tests is the way to go here). I'm also unhappy with the complexity. The idea is that for the variable case, we just prefix-share (I set interval=16), like the Lucene 3.x dictionary. The current patch specializes the termsenum and reverse lookup for this case (but again, I'm sure there are bugs; it's hairy). compressed diskdv sorted/sortedset termdictionaries --- Key: LUCENE-5159 URL: https://issues.apache.org/jira/browse/LUCENE-5159 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Robert Muir Attachments: LUCENE-5159.patch
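The prefix-sharing scheme the patch describes (store every interval-th term in full, encode the rest as a shared-prefix length plus suffix, then replay from the block start on ord lookup) can be sketched like this. This is an illustrative toy under assumed names, not the patch's code, and the interval is shrunk to 4 to keep the demo small:

```java
import java.util.*;

public class PrefixSharedDict {
    // Every INTERVAL-th term is stored in full; the rest store only the
    // length of the prefix shared with the previous term, plus the suffix.
    static final int INTERVAL = 4; // tiny for the demo; the patch reportedly uses interval=16

    final String[] full;    // full terms at block starts
    final int[] prefixLen;  // shared-prefix length per term (unused at block starts)
    final String[] suffix;  // suffix per term (unused at block starts)

    PrefixSharedDict(String[] sortedTerms) {
        int n = sortedTerms.length;
        full = new String[(n + INTERVAL - 1) / INTERVAL];
        prefixLen = new int[n];
        suffix = new String[n];
        for (int i = 0; i < n; i++) {
            if (i % INTERVAL == 0) {
                full[i / INTERVAL] = sortedTerms[i];
            } else {
                String prev = sortedTerms[i - 1], cur = sortedTerms[i];
                int p = 0;
                while (p < prev.length() && p < cur.length() && prev.charAt(p) == cur.charAt(p)) p++;
                prefixLen[i] = p;
                suffix[i] = cur.substring(p);
            }
        }
    }

    // ord -> term: seek to the block start, then replay at most INTERVAL-1 entries.
    String lookup(int ord) {
        String term = full[ord / INTERVAL];
        for (int i = (ord / INTERVAL) * INTERVAL + 1; i <= ord; i++) {
            term = term.substring(0, prefixLen[i]) + suffix[i];
        }
        return term;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "apply", "banana", "band", "bandana", "cat"};
        PrefixSharedDict d = new PrefixSharedDict(terms);
        for (int ord = 0; ord < terms.length; ord++) {
            System.out.println(d.lookup(ord)); // round-trips each term
        }
    }
}
```

This is why lookups stay O(interval) while long shared prefixes ("band"/"bandana") are stored only once.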
[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard
[ https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733717#comment-13733717 ] Shalin Shekhar Mangar commented on SOLR-4414: - [~elyograg] - That was a mistake. The commit mentioned here actually belonged to SOLR-4415. I fixed the issue number in the change log but I forgot to put a comment here. MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard --- Key: SOLR-4414 URL: https://issues.apache.org/jira/browse/SOLR-4414 Project: Solr Issue Type: Bug Components: MoreLikeThis, SolrCloud Affects Versions: 4.1 Reporter: Colin Bartolome
[jira] [Commented] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
[ https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733745#comment-13733745 ] Robert Muir commented on LUCENE-5157: - +1, let's improve it for now and not expand it to try to be a general termsenum merger. But on the other hand, I am still not convinced we can't improve the efficiency of this thing, so it's good if we can prevent its innards from being too exposed (unless that's causing some use case an actual problem). Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure. Key: LUCENE-5157 URL: https://issues.apache.org/jira/browse/LUCENE-5157 Project: Lucene - Core Issue Type: Improvement Reporter: Boaz Leskes Assignee: Adrien Grand Priority: Minor Attachments: LUCENE-5157.patch I refactored MultiDocValues.OrdinalMap, removing one unused parameter and renaming some methods to more clearly communicate what they do. Also I renamed subIndex references to segmentIndex.
[jira] [Updated] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries
[ https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-5159: Attachment: LUCENE-5159.patch Fixes an off-by-one bug. I'll beef up the DV base test case to really exercise this termsenum... compressed diskdv sorted/sortedset termdictionaries --- Key: LUCENE-5159 URL: https://issues.apache.org/jira/browse/LUCENE-5159 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Robert Muir Attachments: LUCENE-5159.patch, LUCENE-5159.patch
[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733776#comment-13733776 ] Uwe Schindler commented on SOLR-4679: - Hoss: I just took this issue because it was unassigned and I was the one mandating to add ignorable whitespace at that time in TIKA. So Jukka and I decided this would be the best. Because you are still not convinced by my argumentation, let me recapitulate TIKA's problems: - TIKA decided to use XHTML as its output format to report the parsed documents to the consumer. This is nice, because it allows to preserve some of the formatting (like bold fonts, paragraphs, ...) originating from the original document. Of course most of this formatting is lost, but you can still detect things like emphasized text. By choosing XHTML as output format, TIKA must of course use XHTML formatting for new lines and similar. So whenever a line break is needed, the TIKA parser emits a <br/> tag or places the paragraph (in a PDF) inside a <p> element. As we all know, HTML ignores formatting like newlines and tabs (all runs of whitespace are treated as a single whitespace, i.e. the regex replacement {{s/\s+/ /}}). - On the other hand, TIKA wants to make it simple for people to extract the *plain text* contents. With the XHTML-only approach this would be hard for the consumer, because to add the correct newlines, the consumer has to fully understand XHTML, detect block elements, and replace them by \n. To support both usages of TIKA, the idea was to embed this information, which is unimportant to HTML (as HTML ignores whitespace completely), as ignorableWhitespace as a convenience for the user. A fully compliant XHTML consumer would not parse the ignorable stuff: as it understands HTML, it would detect a <p> element as a block element and format the output accordingly.
Solr unfortunately has a somewhat strange approach: it is mainly interested in the text-only contents, so ideally, when consuming the XHTML, it could use {{WriteOutContentHandler(StringBuilder, BodyContentHandler(parserContentHandler))}}. In that case TIKA would do the right thing automatically: it would extract only text from the body element and would use the convenience whitespace to format the text in an ASCII-ART-like way (using tabs, newlines, ...) :-) Solr has a hybrid approach: it collects everything into a content tag (which is similar to the above approach), but the bug is that, in contrast to TIKA's official WriteOutContentHandler, it does not use the ignorable whitespace inserted for convenience. In addition, TIKA also has a stack where it allows processing parts of the documents (like the title element or all em elements). In that case it has several StringBuilders in parallel that are populated with the contents. The problems exist here too, but cannot be solved by using ignorable whitespace: e.g. if one indexes only all em elements (which are inline HTML elements, not block elements), there is no whitespace, so all em elements would be glued together in the em field of your index... I just mention this because, in my opinion, the SolrContentHandler needs more work to correctly understand HTML and not just collect element names in a map! Now to your complaint: you proposed to report the newlines as real {{characters()}} events - but this is not the right thing to do here. As I said, HTML does not know these characters; they are ignored. The formatting is done by the element names (like p, div, table). So the helper whitespace for text-only consumers should be inserted as ignorableWhitespace only; if we added it to the real character data, we would report things that every HTML parser (like nekohtml) would never report to the consumer. Nekohtml would also report this useless extra whitespace as ignorable.
The convenience here is that TIKA's XHTMLContentHandler used by all parsers is configured to help the text-only user but not hurt the HTML-only user. This differentiation is done by reporting the HTML element names (p, div, table, th, td, tr, abbr, em, strong, ...) but also reporting the ASCII-ART-text-only content like TABs inside tables and newlines after block elements. This is always done as ignorableWhitespace (for convenience); a real HTML parser must ignore it - and it is correct to do this. HTML line breaks (br) are removed during indexing; causes wrong search results Key: SOLR-4679 URL: https://issues.apache.org/jira/browse/SOLR-4679 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.2 Environment: Windows Server 2008 R2, Java 6, Tomcat 7 Reporter: Christoph Straßer Assignee: Uwe Schindler Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch,
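Uwe's distinction between {{characters()}} and {{ignorableWhitespace()}} events is easy to demonstrate with a plain SAX-style handler. The sketch below is hypothetical consumer code, not Tika's or Solr's actual classes: it simulates the events Tika's XHTMLContentHandler would emit for two adjacent block elements, and shows how a consumer that drops ignorable whitespace glues the words together (the reported bug), while one that keeps it gets readable plain text:

```java
import org.xml.sax.helpers.DefaultHandler;

// A toy consumer: collects character data, optionally also ignorable whitespace.
class TextCollector extends DefaultHandler {
    final StringBuilder sb = new StringBuilder();
    final boolean keepIgnorable;
    TextCollector(boolean keepIgnorable) { this.keepIgnorable = keepIgnorable; }
    @Override public void characters(char[] ch, int start, int len) { sb.append(ch, start, len); }
    @Override public void ignorableWhitespace(char[] ch, int start, int len) {
        if (keepIgnorable) sb.append(ch, start, len);
    }
}

public class WhitespaceDemo {
    public static void main(String[] args) {
        TextCollector solrLike = new TextCollector(false); // drops ignorable whitespace
        TextCollector tikaLike = new TextCollector(true);  // keeps the convenience whitespace
        for (TextCollector h : new TextCollector[] {solrLike, tikaLike}) {
            // Simulated event stream for <p>word1</p><p>word2</p>:
            h.characters("word1".toCharArray(), 0, 5);
            h.ignorableWhitespace("\n".toCharArray(), 0, 1); // newline after a block element
            h.characters("word2".toCharArray(), 0, 5);
        }
        System.out.println(solrLike.sb); // word1word2 (glued together -> the bug)
        System.out.println(tikaLike.sb); // word1 and word2 separated by a newline
    }
}
```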
[jira] [Commented] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries
[ https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733787#comment-13733787 ] Michael McCandless commented on LUCENE-5159: +1, patch looks great. compressed diskdv sorted/sortedset termdictionaries --- Key: LUCENE-5159 URL: https://issues.apache.org/jira/browse/LUCENE-5159 Project: Lucene - Core Issue Type: Improvement Components: core/index Reporter: Robert Muir Attachments: LUCENE-5159.patch, LUCENE-5159.patch
[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733791#comment-13733791 ] Hoss Man commented on SOLR-4679: bq. Because you are still not convinced with my argumentation, let me recapitulate TIKA's problems: I never said that ... you said I can take the issue if you like. and you explained why the existing patch should be committed -- I'm totally willing to go along with that, so have at it. It seems sketchy to me, but if that's the way Tika works, that's the way Tika works; you certainly understand it better than me, so I defer to your assessment. (As mentioned in TIKA-1134, it would be nice if this type of behavior was better documented for people implementing their own ContentHandlers, but that's a Tika issue, not a Solr issue.) HTML line breaks (br) are removed during indexing; causes wrong search results Key: SOLR-4679 URL: https://issues.apache.org/jira/browse/SOLR-4679 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.2 Environment: Windows Server 2008 R2, Java 6, Tomcat 7 Reporter: Christoph Straßer Assignee: Uwe Schindler Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during extraction of content from HTML files. They need to be replaced with an empty space. Test file: {code:xml} <html> <head> <title>Test mit HTML-Zeilenschaltungen</title> </head> <p>word1<br>word2<br/> Some other words, a special name like linz<br>and another special name - vienna</p> </html> {code} The Solr content attribute contains the following text: Test mit HTML-Zeilenschaltungen word1word2 Some other words, a special name like linzand another special name - vienna So we are not able to find the word linz. We use the ExtractingRequestHandler to put content into Solr. (wiki.apache.org/solr/ExtractingRequestHandler)
[jira] [Created] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException
Shawn Heisey created SOLR-5125: -- Summary: Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException Key: SOLR-5125 URL: https://issues.apache.org/jira/browse/SOLR-5125 Project: Solr Issue Type: Bug Components: MoreLikeThis Affects Versions: 4.4 Reporter: Shawn Heisey Fix For: 4.5, 5.0 A distributed MoreLikeThis query that works perfectly on 4.2.1 is failing on 4.4.0. The original query returns a NullPointerException. The Solr log shows that the shard queries are throwing EarlyTerminatingCollectorException. Full details to follow in the comments.
[jira] [Commented] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException
[ https://issues.apache.org/jira/browse/SOLR-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733816#comment-13733816 ] Shawn Heisey commented on SOLR-5125: The query that works fine in 4.2.1 has the following URL: /solr/ncmain/ncdismax?q=tag_id:ugphotos000996&mlt=true&mlt.fl=catchall&mlt.count=100 The ncmain handler has the shards parameter in solrconfig.xml and is set up for edismax. The shards.qt parameter is /search, a handler using the default query parser. On 4.2.1, it had a QTime of 49641, a performance issue that I mentioned on the mailing list and will be pursuing there. Here's a server log excerpt, showing a shard request, the shard exception, the original query, and the final exception. {noformat} INFO - 2013-08-08 12:18:20.030; org.apache.solr.core.SolrCore; [s3live] webapp=/solr path=/search params={mlt.fl=catchall&sort=score+desc&tie=0.1&shards.qt=/search&mlt.dist.id=ugphotos000996&mlt=true&q.alt=*:*&distrib=false&shards.tolerant=true&version=2&NOW=1375985885078&shard.url=bigindy5.REDACTED.com:8982/solr/s3live&df=catchall&fl=score,tag_id&qs=3&qt=/search&lowercaseOperators=false&mm=100%25&qf=catchall&wt=javabin&rows=100&defType=edismax&pf=catchall^2&mlt.count=100&start=0&q=%2B(catchall:arabian+catchall:close-up+catchall:horse+catchall:closeup+catchall:close+catchall:white+catchall:up+catchall:sassy+catchall:154+catchall:equestrian+catchall:domestic+catchall:animals+catchall:of)+-tag_id:ugphotos000996&shards.info=true&boost=min(recip(abs(ms(NOW/HOUR,pd)),1.92901e-10,1.5,1.5),0.85)&isShard=true&ps=3} 6815483 status=500 QTime=14639 ERROR - 2013-08-08 12:18:20.030; org.apache.solr.common.SolrException; null:org.apache.solr.search.EarlyTerminatingCollectorException at org.apache.solr.search.EarlyTerminatingCollector.collect(EarlyTerminatingCollector.java:62) at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:289) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:624) at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297) at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1494) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1363) at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474) at org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:1226) at org.apache.solr.handler.MoreLikeThisHandler$MoreLikeThisHelper.getMoreLikeThis(MoreLikeThisHandler.java:365) at org.apache.solr.handler.component.MoreLikeThisComponent.getMoreLikeThese(MoreLikeThisComponent.java:356) at org.apache.solr.handler.component.MoreLikeThisComponent.process(MoreLikeThisComponent.java:107) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:365) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
[jira] [Commented] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException
[ https://issues.apache.org/jira/browse/SOLR-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733833#comment-13733833 ] Shawn Heisey commented on SOLR-5125: Here's someone else having the same problem. They don't say whether it's a single index or distributed, though. http://stackoverflow.com/questions/17866313/earlyterminatingcollectorexception-in-mlt-component-of-solr-4-4 Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException -- Key: SOLR-5125 URL: https://issues.apache.org/jira/browse/SOLR-5125 Project: Solr Issue Type: Bug Components: MoreLikeThis Affects Versions: 4.4 Reporter: Shawn Heisey Fix For: 4.5, 5.0
[jira] [Commented] (SOLR-4952) audit test configs to use solrconfig.snippet.randomindexconfig.xml in more tests
[ https://issues.apache.org/jira/browse/SOLR-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733844#comment-13733844 ] ASF subversion and git services commented on SOLR-4952: --- Commit 1511954 from hoss...@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1511954 ] SOLR-4952: TestIndexSearcher.testReopen needs fixed segment merging audit test configs to use solrconfig.snippet.randomindexconfig.xml in more tests Key: SOLR-4952 URL: https://issues.apache.org/jira/browse/SOLR-4952 Project: Solr Issue Type: Sub-task Reporter: Hoss Man Assignee: Hoss Man In SOLR-4942 I updated every solrconfig.xml to either... * include solrconfig.snippet.randomindexconfig.xml where it was easy to do so * use the useCompoundFile sys prop if it already had an {{indexConfig}} section, or if including the snippet wasn't going to be easy (ie: contrib tests) As an improvement on this: * audit all core configs not already using solrconfig.snippet.randomindexconfig.xml and either: ** make them use it, ignoring any previously unimportant explicit indexConfig settings ** make them use it, using explicit sys props to overwrite random values in cases where explicit indexConfig values are important for the test ** add a comment why it's not using the include snippet in cases where the explicit parsing is part of the test * try to figure out a way for contrib tests to easily include the same file and/or apply the same rules as above
[jira] [Commented] (SOLR-4952) audit test configs to use solrconfig.snippet.randomindexconfig.xml in more tests
[ https://issues.apache.org/jira/browse/SOLR-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733851#comment-13733851 ] ASF subversion and git services commented on SOLR-4952: --- Commit 1511958 from hoss...@apache.org in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1511958 ] SOLR-4952: TestIndexSearcher.testReopen needs fixed segment merging (merge r1511954) audit test configs to use solrconfig.snippet.randomindexconfig.xml in more tests Key: SOLR-4952 URL: https://issues.apache.org/jira/browse/SOLR-4952 Project: Solr Issue Type: Sub-task Reporter: Hoss Man Assignee: Hoss Man
[jira] [Created] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions
Grant Ingersoll created LUCENE-5160: --- Summary: NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions Key: LUCENE-5160 URL: https://issues.apache.org/jira/browse/LUCENE-5160 Project: Lucene - Core Issue Type: Bug Affects Versions: 4.4, 5.0 Reporter: Grant Ingersoll Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly handle the -1 condition that can be returned from FileChannel.read(). If it returns -1, then it will move the file pointer back and you will enter an infinite loop. SimpleFSDirectory displays the same characteristics, although I have only seen the issue on NIOFSDirectory. The code in question from NIOFSDirectory: {code}
try {
  while (readLength > 0) {
    final int limit;
    if (readLength > chunkSize) {
      // LUCENE-1566 - work around JVM Bug by breaking
      // very large reads into chunks
      limit = readOffset + chunkSize;
    } else {
      limit = readOffset + readLength;
    }
    bb.limit(limit);
    int i = channel.read(bb, pos);
    pos += i;
    readOffset += i;
    readLength -= i;
  }
{code}
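A minimal sketch of the kind of fix discussed here: check the return value of FileChannel.read() and throw EOFException on -1 instead of letting the counters go backwards. The {{readFully}} helper below is hypothetical, not Lucene's actual code:

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SafeChannelRead {
    // Read exactly bb.remaining() bytes starting at pos, failing fast at EOF
    // instead of looping forever when read() returns -1.
    static void readFully(FileChannel channel, ByteBuffer bb, long pos) throws IOException {
        while (bb.hasRemaining()) {
            int n = channel.read(bb, pos);
            if (n < 0) {
                throw new EOFException("read past EOF");
            }
            pos += n;
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("demo", ".bin");
        Files.write(p, new byte[] {1, 2, 3, 4});
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer bb = ByteBuffer.allocate(4);
            readFully(ch, bb, 0);
            System.out.println(bb.position()); // 4

            // Asking for more bytes than the file holds now fails cleanly:
            try {
                readFully(ch, ByteBuffer.allocate(8), 0);
            } catch (EOFException expected) {
                System.out.println("EOF detected");
            }
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```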
[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions
[ https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733865#comment-13733865 ] Uwe Schindler commented on LUCENE-5160: --- This is a bug which is never hit by Lucene, because we never read sequentially until end of file. +1 to fix this. Theoretically, to comply with MMapDirectory, it should throw EOFException if it gets -1, because Lucene code should not read beyond the file end. NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions Key: LUCENE-5160 URL: https://issues.apache.org/jira/browse/LUCENE-5160 Project: Lucene - Core Issue Type: Bug Affects Versions: 5.0, 4.4 Reporter: Grant Ingersoll
[jira] [Commented] (LUCENE-4774) Add FieldComparator that allows sorting parent docs based on field inside the child docs
[ https://issues.apache.org/jira/browse/LUCENE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733872#comment-13733872 ] Mikhail Khludnev commented on LUCENE-4774: -- fwiw something like http://www.gossamer-threads.com/lists/lucene/java-dev/199372?do=post_view_threaded happens to me NOTE: reproduce with: ant test -Dtestcase=TestBlockJoinSorting -Dtests.method=testNestedSorting -Dtests.seed=FB4F1BE85579255B -Dtests.slow=true -Dtests.locale=da_DK -Dtests.timezone=Asia/Qatar -Dtests.file.encoding=UTF-8 NOTE: test params are: codec=Asserting, sim=RandomSimilarityProvider(queryNorm=true,coord=crazy): {}, locale=da_DK, timezone=Asia/Qatar NOTE: Linux 2.6.32-131.0.15.el6.x86_64 amd64/Sun Microsystems Inc. 1.6.0_29 (64-bit)/cpus=4,threads=1,free=317130512,total=349241344 NOTE: All tests run in this JVM: [TestJoinUtil, TestBlockJoin, TestBlockJoinSorting] --- Test set: org.apache.lucene.search.join.TestBlockJoinSorting --- Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.06 sec FAILURE! testNestedSorting(org.apache.lucene.search.join.TestBlockJoinSorting) Time elapsed: 0.021 sec FAILURE! 
java.lang.AssertionError: expected:<3> but was:<28> at __randomizedtesting.SeedInfo.seed([FB4F1BE85579255B:F3A6F6A915D02835]:0) at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.junit.Assert.assertEquals(Assert.java:472) at org.junit.Assert.assertEquals(Assert.java:456) at org.apache.lucene.search.join.TestBlockJoinSorting.testNestedSorting(TestBlockJoinSorting.java:226) Add FieldComparator that allows sorting parent docs based on field inside the child docs Key: LUCENE-4774 URL: https://issues.apache.org/jira/browse/LUCENE-4774 Project: Lucene - Core Issue Type: New Feature Components: modules/join Reporter: Martijn van Groningen Assignee: Martijn van Groningen Fix For: 5.0, 4.3 Attachments: LUCENE-4774.patch, LUCENE-4774.patch, LUCENE-4774.patch A field comparator for sorting block join parent docs based on a field in the associated child docs.
[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results
[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733901#comment-13733901 ] Uwe Schindler commented on SOLR-4679: - bq. I never said that ... You somehow said: bq. I defer to your judgement on this So I assumed that you were still not 100% convinced. Sorry. In any case I will take the issue. In my opinion there is more work to be done with this crazy stack of StringBuilders to better handle the ignorableWhitespace when a new field begins/ends. Currently it's inserted after the block's end tag, so it would only go one level up in the stack. I have to think a little bit about it, but the fix in your patch is the easiest for now. And the maybe-useless whitespace on some lower-stacked StringBuilders is generally removed by text analysis. HTML line breaks (br) are removed during indexing; causes wrong search results Key: SOLR-4679 URL: https://issues.apache.org/jira/browse/SOLR-4679 Project: Solr Issue Type: Bug Components: update Affects Versions: 4.2 Environment: Windows Server 2008 R2, Java 6, Tomcat 7 Reporter: Christoph Straßer Assignee: Uwe Schindler Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png HTML line breaks (br, BR, br/, ...) seem to be removed during extraction of content from HTML files. They need to be replaced with an empty space. Test file:
{code}
<html>
<head><title>Test mit HTML-Zeilenschaltungen</title></head>
<p>word1<br>word2<br/>
Some other words, a special name like linz<br>and another special name - vienna</p>
</html>
{code}
The Solr content attribute contains the following text: Test mit HTML-Zeilenschaltungen word1word2 Some other words, a special name like linzand another special name - vienna So we are not able to find the word linz. We use the ExtractingRequestHandler to put content into Solr. (wiki.apache.org/solr/ExtractingRequestHandler)
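The symptom is straightforward to reproduce with plain string handling. The following is an illustrative sketch only (it is not the Tika/Solr extraction code, and a regex is not a real HTML parser): stripping tags outright fuses the words around a br, while mapping br to a space first keeps "linz" searchable.

```java
public class BrExtractSketch {
    // Naive extraction: delete every tag. "linz<br>and" becomes "linzand",
    // so the token "linz" is never produced at analysis time.
    static String extractNaive(String html) {
        return html.replaceAll("<[^>]+>", "");
    }

    // Fixed extraction: turn <br>, <BR>, <br/> (and variants) into a
    // space before stripping the remaining tags.
    static String extractFixed(String html) {
        return html.replaceAll("(?i)<br\\s*/?>", " ").replaceAll("<[^>]+>", "");
    }

    public static void main(String[] args) {
        String html = "a special name like linz<br>and another special name - vienna";
        System.out.println(extractNaive(html)); // words fused around the br
        System.out.println(extractFixed(html)); // "linz" survives as its own word
    }
}
```
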
[jira] [Assigned] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions
[ https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned LUCENE-5160: --- Assignee: Grant Ingersoll
[jira] [Updated] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions
[ https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-5160: Attachment: LUCENE-5160.patch Patch adds the -1 check and throws an EOFException.
[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions
[ https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733912#comment-13733912 ] Uwe Schindler commented on LUCENE-5160: --- +1 to commit. Looks good. Writing a test is a bit hard. MMapDirectory is not affected, as it already has a check for the length of the MappedByteBuffers.
[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions
[ https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733914#comment-13733914 ] ASF subversion and git services commented on LUCENE-5160: - Commit 1512011 from [~gsingers] in branch 'dev/trunk' [ https://svn.apache.org/r1512011 ] LUCENE-5160: check for -1 return conditions in file reads
[jira] [Created] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking
Robert Muir created LUCENE-5161: --- Summary: review FSDirectory chunking defaults and test the chunking Key: LUCENE-5161 URL: https://issues.apache.org/jira/browse/LUCENE-5161 Project: Lucene - Core Issue Type: Improvement Reporter: Robert Muir Today there is a loop in SimpleFS/NIOFS:
{code}
try {
  do {
    final int readLength;
    if (total + chunkSize > len) {
      readLength = len - total;
    } else {
      // LUCENE-1566 - work around JVM Bug by breaking very large reads into chunks
      readLength = chunkSize;
    }
    final int i = file.read(b, offset + total, readLength);
    total += i;
  } while (total < len);
} catch (OutOfMemoryError e) {
{code}
I bet if you look at the clover report it's untested, because it's fixed at 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even good?!). Also, if you call the setter on a 64-bit machine to change the size, it just totally ignores it. We should remove that; the setter should always work. And we should set it to small values in tests so this loop is actually executed.
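The chunking behaviour can be exercised with a small, self-contained sketch (hypothetical code; a plain InputStream stands in for the RandomAccessFile, and the call counter is added for verification): requests are capped at chunkSize per call, so a 1000-byte read with a 128-byte chunk size takes several calls rather than one.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedReadSketch {
    // Hypothetical version of the loop above: each call asks for at most
    // chunkSize bytes (the LUCENE-1566 workaround). Returns the number of
    // read() calls so a test can verify the chunking path really ran.
    static int readChunked(InputStream in, byte[] b, int len, int chunkSize)
            throws IOException {
        int total = 0;
        int calls = 0;
        while (total < len) {
            int readLength = Math.min(chunkSize, len - total);
            int i = in.read(b, total, readLength);
            if (i < 0) {
                throw new EOFException("read past EOF");
            }
            total += i;
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] dst = new byte[1000];
        int calls = readChunked(new ByteArrayInputStream(new byte[1000]), dst, 1000, 128);
        // ByteArrayInputStream always fills the requested length,
        // so this takes ceil(1000 / 128) = 8 calls.
        System.out.println("calls=" + calls);
    }
}
```

Randomizing chunkSize down to small values in tests, as the patch later does, forces this loop body to execute more than once instead of being dead code under the huge defaults.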
[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions
[ https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733921#comment-13733921 ] ASF subversion and git services commented on LUCENE-5160: - Commit 1512016 from [~gsingers] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1512016 ] LUCENE-5160: merge from trunk
[jira] [Resolved] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions
[ https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved LUCENE-5160. - Resolution: Fixed Fix Version/s: 4.5 5.0 Lucene Fields: (was: New)
[jira] [Updated] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking
[ https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-5161: Attachment: LUCENE-5161.patch This patch makes the setter always work, and changes LuceneTestCase to use small values for the chunking. I didn't adjust any defaults (maybe Uwe can help; he knows about the code in question).
Re: Interesting failure scenario, SolrCloud and ZK nodes on different times
I seem to recall seeing this on my cluster when we didn't have clocks in sync, but perhaps my memory is fuzzy as well. -Grant On Aug 7, 2013, at 7:41 AM, Erick Erickson erickerick...@gmail.com wrote: Well, we're reconstructing a chain of _possibilities_ post-mortem, so there's not much I can say for sure. Mostly just throwing this out there in case it sparks some aha moments. Not knowing ZK well, anything I say is speculation. But I speculate that this isn't really the root of the problem, given that we haven't been seeing the ClusterState says we are the leader... error go by the user lists for a while. It may well be a coincidence. The place where this happened reported that the problem seemed to be better after adjusting the ZK nodes' times. I know when I reconstruct events like this I'm never sure about cause and effect, since I'm usually doing several things at once. Erick On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : When the times were coordinated, many of the problems with recovery went : away. We're trying to reconstruct the scenario from memory, but it : prompted me to pass the incident in case it sparked any thoughts. : Specifically, I wonder if there's anything that comes to mind if the ZK : nodes are significantly out of synch with each other time-wise. : : Does this mean that ntp or other strict time synchronization is important for : SolrCloud? I strive for this anyway, just to ensure that when I'm researching : log files between two machines that I can match things up properly. I don't know if/how Solr/ZK is affected by having machines with clocks out of sync, but I do remember seeing discussions a while back about weird things happening to ZK client apps *while* time adjustments are taking place to get back in sync.
IIRC: as the local clock starts accelerating and jumping ahead in increments to correct itself with ntp, those jumps can confuse the ZK code into thinking it's been waiting a lot longer than it really has for the ZK heartbeat (or whatever it's called), and it can trigger a timeout situation. -Hoss Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Commented] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking
[ https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734047#comment-13734047 ] Uwe Schindler commented on LUCENE-5161: --- Thanks Robert for opening. It is too late today, so I will respond tomorrow morning about the NIO stuff. I have now inspected the JVM code, so I can explain why the OOMs occur in SimpleFSDir and NIOFSDir if you read large buffers. More details tomorrow; just one thing before: it has nothing to do with 32 or 64 bits, it is more about limitations of the JVM with direct memory and heap size leading to the OOM under certain conditions. But Integer.MAX_VALUE for 64-bit JVMs is just wrong, too (it could also lead to OOM). In general I would not make the buffers too large, so the chunk size should be limited to no more than a few megabytes. Making them large brings no performance improvement at all; it just wastes memory in thread-local direct buffers allocated internally by the JVM's NIO code.
[jira] [Comment Edited] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking
[ https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734047#comment-13734047 ] Uwe Schindler edited comment on LUCENE-5161 at 8/8/13 9:44 PM: --- Thanks Robert for opening. It is too late today, so I will respond tomorrow morning about the NIO stuff. I have now inspected the JVM code, so I can explain why the OOMs occur in SimpleFSDir and NIOFSDir if you read large buffers. More details tomorrow; just one thing before: it has nothing to do with 32 or 64 bits, it is more about limitations of the JVM with direct memory and heap size leading to the OOM under certain conditions. But Integer.MAX_VALUE for 64-bit JVMs is just wrong, too (it could also lead to OOM). In general I would not make the buffers too large, so the chunk size should be limited to no more than a few megabytes. Making them large brings no performance improvement at all; it just wastes memory in large *thread-local* direct buffers allocated internally by the JVM's NIO code.
[jira] [Commented] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking
[ https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734058#comment-13734058 ] Robert Muir commented on LUCENE-5161: - Thanks Uwe, I will leave the issue for you tomorrow to fix the defaults. I can only say the chunking does not seem buggy (all tests pass with the randomization in the patch), so at least we have that.
[jira] [Commented] (LUCENE-4774) Add FieldComparator that allows sorting parent docs based on field inside the child docs
[ https://issues.apache.org/jira/browse/LUCENE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734138#comment-13734138 ] Hoss Man commented on LUCENE-4774: -- Mikhail: can you please open a new bug with the details of your test failure -- specifically: what branch/revision you are testing and whether or not that seed reproduces for you. (It's not really appropriate to comment on closed issues that added features with concerns about bugs in that feature -- that's what Jira issue linking can be helpful for.)
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734159#comment-13734159 ] Hoss Man commented on SOLR-5084: Elran: 1) there are still several sections in your patch that have a lot of reformatting, making it hard to see what exactly you've added. (I realize that the formatting may not be 100% uniform in all of these files, but the key to making patches easy to read is not to change anything that doesn't *have* to be changed ... formatting changes should be done separately and independently from functionality changes) 2) could you please add a few unit tests to show how the type can be used when indexing/querying/faceting/returning stored fields, so it's more clear what this patch does? 3) I'm not sure that it makes sense to customize the response writers and the JavaBinCodec to know about the enum values -- it seems like it would make a lot more sense (and be much simpler) to have clients just treat the enum values as strings 4) a lot of your code seems to be cut/paste from TrieField ... why can't the EnumField class subclass TrieField to re-use this behavior (or worst case: wrap a TrieIntField similar to how TrieDateField works) new field type - EnumField -- Key: SOLR-5084 URL: https://issues.apache.org/jira/browse/SOLR-5084 Project: Solr Issue Type: New Feature Reporter: Elran Dvir Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, Solr-5084.patch We have encountered a use case in our system where we have a few fields (Severity, Risk, etc.) with a closed set of values, where the sort order for these values is pre-determined but not lexicographic (Critical is higher than High). Generically this is very close to how enums work. To implement, I have prototyped a new type of field: EnumField, where the inputs are a closed predefined set of strings in a special configuration file (similar to currency.xml). The code is based on 4.2.1.
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734178#comment-13734178 ] Robert Muir commented on SOLR-5084:

I agree with Hossman. Stick with it though -- I really like the idea of an efficient enumerated type. A few other ideas/questions (just from a glance, I could be wrong):

* Should we enforce from the enum config that the integer values are 0-N, or something simple? This way, things like valuesources don't have to do hashing, just simple array lookups.
* It isn't clear to me what happens if you send a bogus value. I think an enumerated type would be best if it's strongly typed and just throws an exception if the value is bogus.
* Should the config, instead of being a separate config file, just be a nested element underneath the field type? I don't know if this is even possible or a good idea, but it's an idea that would remove some xml files.
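Robert's array-lookup point can be sketched in isolation. This is a hypothetical standalone illustration, not Solr's actual implementation (the class and method names are made up): with dense 0..N-1 ordinals, resolving ordinal -> label on the hot path is a plain array index, while label -> ordinal still wants a hash, and a bogus label fails fast as he suggests.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: why dense 0..N-1 ordinals matter for an enum field.
// ordinal -> label is an O(1) array index (no hashing on the hot path);
// label -> ordinal uses a hash map; unknown labels throw immediately.
public class DenseEnumLookup {
    private final String[] labels;           // ordinal -> label
    private final Map<String, Integer> ords; // label -> ordinal

    public DenseEnumLookup(String... labels) {
        this.labels = labels;
        this.ords = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            ords.put(labels[i], i);
        }
    }

    public String label(int ord) {
        return labels[ord];                  // plain array lookup
    }

    public int ord(String label) {
        Integer ord = ords.get(label);
        if (ord == null) {                   // strongly typed: reject bogus values
            throw new IllegalArgumentException("Unknown enum value: " + label);
        }
        return ord;
    }

    public static void main(String[] args) {
        DenseEnumLookup severity =
            new DenseEnumLookup("Not Available", "Low", "Medium", "High", "Critical");
        System.out.println(severity.label(4));    // Critical
        System.out.println(severity.ord("High")); // 3
    }
}
```

If the config allowed sparse values (e.g. a jump from 4 to 11), the `labels` array would either waste slots or force a hash in both directions, which is the efficiency cost being debated.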
[jira] [Commented] (LUCENE-5150) WAH8DocIdSet: dense sets compression
[ https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734202#comment-13734202 ] Robert Muir commented on LUCENE-5150:

Thanks Adrien. I am curious whether it's possible for you to re-run http://people.apache.org/~jpountz/doc_id_sets.html -- because now, with smaller sets in the dense case, maybe there is no need for wacky heuristics in CachingWrapperFilter and we could just always cache (I am sure some cases would be slower, but if in general it's faster...). This would really simplify LUCENE-5101.

WAH8DocIdSet: dense sets compression
Key: LUCENE-5150
URL: https://issues.apache.org/jira/browse/LUCENE-5150
Project: Lucene - Core
Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
Attachments: LUCENE-5150.patch

In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be able to encode the inverse set to also compress very dense sets.
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734210#comment-13734210 ] Hoss Man commented on SOLR-5084:

bq. ...nested element underneath the field type? I don't know if this is even possible or a good idea, but it's an idea that would remove some xml files.

I don't think the schema parsing code can handle that -- it's attribute based, not nested-element based.

bq. should we enforce from the enum config that the integer values are 0-N or something simple? ...

Yeah ... it would be tempting to not even let the config specify numeric values -- just an ordered list -- except: 1) all hell would break loose if someone accidentally inserted a new element anywhere other than the end of the list, and 2) you'd need/want a way to disable values from the middle of the list from working again.

#2 is a problem you'd need to worry about even if we keep the mappings explicit but enforce 0-N ... there needs to be something like:

{code}
<enum name="severity">
  <pair name="Not Available" value="0"/>
  <pair name="Low" value="1"/>
  <!-- value w/o name passes validation but prevents it from being used -->
  <pair value="2"/> <!-- Medium used to exist, but was phased out -->
  <pair name="High" value="3"/>
  <pair name="Critical" value="4"/>
  <!-- this however would fail, because we skipped 5-10 -->
  <pair name="Super Nova" value="11"/>
</enum>
{code}

bq. ... This way, things like valuesources dont have to do hashing but simple array lookups.

I was actually thinking it would be nice to support multiple legal names (with one canonical for responses) per value, but that would prevent the simple array lookups...

{code}
<enum name="severity">
  <value int="0"><label>Not Available</label></value>
  <value int="1"><label>Low</label></value>
  <!-- value w/o a label passes validation but prevents it from being used -->
  <value int="2"/> <!-- Medium used to exist, but was phased out -->
  <value int="3"><label>High</label></value>
  <value int="4">
    <label canonical="true">Critical</label>
    <label>Highest</label>
  </value>
</enum>
{code}
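The validation rule being discussed (explicit values must cover the contiguous range 0..N-1, with label-less entries as deliberate placeholders, and any skipped integer rejected) might look roughly like the following. This is a hypothetical sketch; all class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed enum-config validation: explicit
// values must be exactly 0..N-1 in order. A null label marks a retired
// value (placeholder) and passes; a skipped integer fails loudly.
public class EnumConfigValidator {
    record Entry(int value, String label) {}

    static void validate(List<Entry> entries) {
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            if (e.value() != i) {
                throw new IllegalStateException(
                    "Expected value " + i + " but got " + e.value()
                    + (e.label() != null ? " (" + e.label() + ")" : ""));
            }
        }
    }

    public static void main(String[] args) {
        List<Entry> ok = List.of(
            new Entry(0, "Not Available"),
            new Entry(1, "Low"),
            new Entry(2, null),        // Medium retired: placeholder, still valid
            new Entry(3, "High"),
            new Entry(4, "Critical"));
        validate(ok);                  // passes

        List<Entry> bad = new ArrayList<>(ok);
        bad.add(new Entry(11, "Super Nova"));  // skipped 5-10: fails
        try {
            validate(bad);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The design point is that validation guarantees density at config-load time, so everything downstream (sorting, valuesources) can rely on ordinals being array indexes.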
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734223#comment-13734223 ] Robert Muir commented on SOLR-5084:

{quote}
...nested element underneath the field type? I don't know if this is even possible or a good idea, but it's an idea that would remove some xml files.
I don't think the schema parsing code can handle that -- it's attribute based, not nested-element based
{quote}

Right, but code can change. Other parts of solr allow this kind of stuff.

{quote}
yeah ... it would be tempting to not even let the config specify numeric values -- just an ordered list, except: 1) all hell would break loose if someone accidentally inserted a new element anywhere other than the end of the list 2) you'd need/want a way to disable values from the middle of the list from working again.
{quote}

Well, I guess I look at it differently: this is, in a sense, like an analyzer -- you can't change the config without reindexing.

{quote}
I was actually thinking it would be nice to support multiple legal names (with one canonical for responses) per value, but that would prevent the simple array lookups...
{quote}

Why? I'm talking about int -> canonical name (e.g. in the valuesource impl), not anything else. As far as name -> int, you want a hash anyway.
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734228#comment-13734228 ] Hoss Man commented on SOLR-5084:

bq. Well i guess i look at it differently. That this is in a sense like an analyzer. you cant change the config without reindexing.

I dunno ... that seems like it would really kill the utility of the field for a lot of use cases -- if it had that kind of limitation, I would just use an int field and manage the mappings myself, so I'd always know I could add/remove values w/o needing to reindex. To follow your example: if I completely change the analyzer, then yes, I have to reindex -- but if I want to stop using a synonym, I don't have to re-index every doc, just the ones that used that synonym.

bq. as far as name-int, you want a hash anyway.

Right ... never mind, I was thinking about it backwards.
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734239#comment-13734239 ] Robert Muir commented on SOLR-5084:

{quote}
i dunno ... that seems like it would really kill the utility of the field for a lot of use cases -- if it had that kind of limitation, i would just use an int field and manage the mappings myself so i'd always know i could add/remove values w/o needing to reindex.
{quote}

This isn't really going to work here, because the idea is that you want to assign sort order (not just values mapped to ints). If you want to rename a label, that's fine, but you can't really change the sort order without reindexing.
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734248#comment-13734248 ] Hoss Man commented on SOLR-5084:

bq. If you want to rename a label, thats fine, but you cant really change the sort order without reindexing.

No, no ... of course not ... I wasn't suggesting you could change the order, just:

* *remove* a legal value from the list (w/o causing the validation to complain)
* add new values to the end of the list
* (as you mentioned) modify the label on an existing value

See the example I posted before about removing Medium but keeping High and Critical exactly as they are -- no change in indexed data, just a way to tell the validation logic you were talking about adding "skip this value, I removed it on purpose" (or, I suppose, "skip this value, I'm reserving it as a placeholder for future use").
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734254#comment-13734254 ] Robert Muir commented on SOLR-5084:

I think adding new values to the end of the list is no issue at all; neither is renaming labels. But removing a legal value from the list -- I think you need to reindex, because what do you do with documents that have that integer value?

In general I'm just trying to make sure we keep things sane here, so that the underlying stuff is efficient.
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734259#comment-13734259 ] Hoss Man commented on SOLR-5084:

bq. but removing a legal value from the list, i think you need to reindex. Because what to do with documents that have that integer value?

For sorting and value sources etc., nothing special happens -- they still have the same numeric value under the covers; it's just that when writing out the stored values (i.e. the label) you act as if they have no value in the field at all (shouldn't affect efficiency at all). If the user wants some other behavior, the burden is on them to re-index or delete the affected docs -- but the simple stuff stays just as simple as if they were dealing with the int/label mappings in their own code; the validation of legal labels just moves from the client to solr.
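The retired-value semantics described here could be sketched as follows. This is a hypothetical illustration (names invented): a null label marks a retired ordinal that still participates in sorting exactly as before but is rendered as absent when stored values are written.

```java
// Hypothetical sketch: documents keep their indexed ordinal forever, so
// sorting is unchanged; only the stored-field output treats a retired
// ordinal (null label) as if the document had no value in the field.
public class RetiredValueLookup {
    private final String[] labels; // null marks a retired/placeholder ordinal

    public RetiredValueLookup(String[] labels) {
        this.labels = labels;
    }

    // Sort key: the raw ordinal, unchanged even for retired values.
    public int sortValue(int ord) {
        return ord;
    }

    // Stored output: null means "act as if the field has no value".
    public String storedLabel(int ord) {
        return labels[ord];
    }

    public static void main(String[] args) {
        RetiredValueLookup severity = new RetiredValueLookup(new String[] {
            "Not Available", "Low", null /* Medium retired */, "High", "Critical"});
        System.out.println(severity.sortValue(2));   // 2: still sorts between Low and High
        System.out.println(severity.storedLabel(2)); // null: no label is written out
    }
}
```

This is the crux of the disagreement above: the ordinal space stays dense and efficient, and "removal" is purely a presentation-time decision.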
[jira] [Commented] (SOLR-5084) new field type - EnumField
[ https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734266#comment-13734266 ] Robert Muir commented on SOLR-5084:

{quote}
For sorting and value sources etc... nothing special happens -- they still have the same numeric value under the covers; it's just that when writing out the stored values (ie: label) you act as if they have no value in the field at all (shouldn't affect efficiency at all.)
{quote}

Then this is just renaming a label to some special value. I really think the best thing is to keep it simple, like java.lang.Enum: just give a list of values. This way it will be efficient everywhere, since the values will be dense. It's also conceptually simple.

Otherwise, things get complicated, and the implementation may suffer due to sparse ordinals. Really, I don't care, as docvalues will do the right thing as long as you have < 256 values (regardless of sparsity). Fieldcache won't, but that doesn't bother me a bit. But still, there is no sense in making things complicated and inefficient for no good reason. Someone could make a HairyComplicatedAndInefficientEnumType for that.
Re: jar-checkums generates extra files?
If you google 'svn remove unversioned' you find a couple of one-liners you can alias. I also found http://svn.apache.org/repos/asf/subversion/trunk/contrib/client-side/svn-clean -- weird that it has a GPL license though!

On Thu, Aug 8, 2013 at 4:14 AM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

I kind of use a workaround of removing everything except the .svn folder and then svn revert -R . But this is a dumb solution :) D.

On Thu, Aug 8, 2013 at 1:12 PM, Uwe Schindler u...@thetaphi.de wrote:

Hi, some GUIs like TortoiseSVN have this. I use this to delete all unversioned files in milliseconds(TM). But native svn does not have it, unfortunately. Uwe

Dawid Weiss dawid.we...@gmail.com schrieb:

Never mind, these were local files and they were svn-ignored; when I removed everything and checked out from scratch this problem was no longer there. I really wish svn had an equivalent of git clean -xfd. Dawid

On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss dawid.we...@gmail.com wrote:

When I do this on trunk:

ant jar-checksums
svn stat

I get:

? solr\licenses\jcl-over-slf4j.jar.sha1
? solr\licenses\jul-to-slf4j.jar.sha1
? solr\licenses\log4j.jar.sha1
? solr\licenses\slf4j-api.jar.sha1
? solr\licenses\slf4j-log4j12.jar.sha1

Where should this be fixed? Should we svn-ignore those files, or should they be somehow excluded from the re-generation of SHA checksums? Dawid

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de
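The one-liners mentioned above all reduce to the same step: unversioned entries are the `svn status` lines beginning with `?`, so the core of such an alias is a filter over that output. Here is a sketch applied to sample output rather than a live working copy (in real use you would pipe `svn status` in, and only append a destructive `| xargs rm -rf` once you trust the resulting list):

```shell
# Extract unversioned paths (status '?') from sample 'svn status' output.
# Against a real working copy: svn status | awk '/^\?/ {print $2}'
printf '?       solr/licenses/log4j.jar.sha1\nM       build.xml\n?       solr/licenses/slf4j-api.jar.sha1\n' \
  | awk '/^\?/ {print $2}'
```

Note this simple version breaks on paths containing spaces; `svn status --xml` plus an XML-aware filter is the robust variant.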