[jira] [Closed] (LUCENE-5158) Allow StoredFieldVisitor instances to be stateful

2013-08-08 Thread Brendan Humphreys (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brendan Humphreys closed LUCENE-5158.
-

Resolution: Won't Fix

 Allow StoredFieldVisitor instances to be stateful
 -

 Key: LUCENE-5158
 URL: https://issues.apache.org/jira/browse/LUCENE-5158
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 4.4
Reporter: Brendan Humphreys
Priority: Minor
 Attachments: LUCENE-5158.patch


 Currently there is no way to build stateful {{StoredFieldVisitor}}s. 
 h3. Motivation
 We would like to optimise our access to stored fields in our indexes by 
 utilising the {{StoredFieldVisitor.Status.STOP}} feature to stop processing 
 fields in a document. Unfortunately we have very large indexes, and 
 rebuilding them to have the required field order is not an option.
 A stateful {{StoredFieldVisitor}} could solve this; it could track which 
 fields have been loaded for a document, and then {{STOP}} when the fields 
 required have been loaded, regardless of the order they were loaded.
 h3. Implementation
 I've added a no-op {{public void reset()}} method to the 
 {{StoredFieldVisitor}} base class, which gives a {{StoredFieldVisitor}} 
 subclass an opportunity to reset its state before the fields of the next 
 document are processed. I've added a call to {{reset()}} in all places where 
 the {{StoredFieldVisitor}} was being used.
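 For illustration, a minimal sketch of such a stateful visitor against the 
 Lucene 4.x API (the class and field names are hypothetical; in 4.x 
 {{stringField}} receives a {{String}}). The owner calls {{reset()}} itself 
 before each document, as the discussion below concludes:

 import java.io.IOException;
 import java.util.HashSet;
 import java.util.Set;

 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.StoredFieldVisitor;

 // Hypothetical stateful visitor: STOPs once all required fields have been
 // seen, regardless of the order in which they are stored.
 class RequiredFieldsVisitor extends StoredFieldVisitor {
   private final Set<String> required;
   private final Set<String> seen = new HashSet<String>();

   RequiredFieldsVisitor(Set<String> required) {
     this.required = required;
   }

   // Called by the owner before each document, not by Lucene itself.
   void reset() {
     seen.clear();
   }

   @Override
   public Status needsField(FieldInfo fieldInfo) throws IOException {
     if (seen.containsAll(required)) {
       return Status.STOP; // everything we need has been loaded
     }
     if (required.contains(fieldInfo.name)) {
       seen.add(fieldInfo.name);
       return Status.YES;
     }
     return Status.NO;
   }

   @Override
   public void stringField(FieldInfo fieldInfo, String value) {
     // Collect the value, e.g. into a map keyed by fieldInfo.name.
   }
 }

 // Usage: reset the visitor's state before visiting each document.
 //   visitor.reset();
 //   reader.document(docID, visitor);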
  
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5158) Allow StoredFieldVisitor instances to be stateful

2013-08-08 Thread Brendan Humphreys (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733191#comment-13733191
 ] 

Brendan Humphreys commented on LUCENE-5158:
---


bq. Separately, if you want a reset() method to call before a document is 
processed, just add it to your own StoredFieldVisitor, and just call it 
yourself before the next ir.document().
It's not necessary to add this method to the Lucene API for that.

Yes, I see now what you mean. I had come to this solution via a fairly 
circuitous route; stopping to smell the flowers, I see my modifications were 
unnecessary. I'll close this as Won't Fix. 

Cheers,
-Brendan








RE: [JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.8.0-ea-b99) - Build # 3121 - Failure!

2013-08-08 Thread Uwe Schindler
Hi,

This one looks crazy. Maybe a Windows-only problem; I have never seen that before!

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Policeman Jenkins Server [mailto:jenk...@thetaphi.de]
 Sent: Wednesday, August 07, 2013 6:35 AM
 To: dev@lucene.apache.org; rm...@apache.org; hoss...@apache.org
 Subject: [JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.8.0-ea-b99) -
 Build # 3121 - Failure!
 
 Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3121/
 Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC
 
 1 tests failed.
 REGRESSION:
 org.apache.lucene.index.TestIndexWriterOutOfFileDescriptors.test
 
 Error Message:
 unreferenced files: before delete:
 [_0_TestBloomFilteredLucene41Postings_0.doc,
 _0_TestBloomFilteredLucene41Postings_0.pos, _k.fdt, _k.fdx, _k.fnm,
 _k.nvd, _k.nvm, _k.si, _k.tvd, _k.tvx, _k_Lucene41WithOrds_0.doc,
 _k_Lucene41WithOrds_0.pos, _k_Lucene41WithOrds_0.tib,
 _k_Lucene41WithOrds_0.tii, _k_Lucene42_0.dvd, _k_Lucene42_0.dvm,
 _k_MockFixedIntBlock_0.doc, _k_MockFixedIntBlock_0.frq,
 _k_MockFixedIntBlock_0.pos, _k_MockFixedIntBlock_0.pyl,
 _k_MockFixedIntBlock_0.skp, _k_MockFixedIntBlock_0.tib,
 _k_MockFixedIntBlock_0.tii, _k_MockVariableIntBlock_0.doc,
 _k_MockVariableIntBlock_0.frq, _k_MockVariableIntBlock_0.pos,
 _k_MockVariableIntBlock_0.pyl, _k_MockVariableIntBlock_0.skp,
 _k_MockVariableIntBlock_0.tib, _k_MockVariableIntBlock_0.tii,
 _k_TestBloomFilteredLucene41Postings_0.blm,
 _k_TestBloomFilteredLucene41Postings_0.doc,
 _k_TestBloomFilteredLucene41Postings_0.pos,
 _k_TestBloomFilteredLucene41Postings_0.tim,
 _k_TestBloomFilteredLucene41Postings_0.tip, _l.fdt, _l.fdx, _l.fnm, _l.nvd,
 _l.nvm, _l.si, _l.tvd, _l.tvx, _l_Lucene41WithOrds_0.doc,
 _l_Lucene41WithOrds_0.pos, _l_Lucene41WithOrds_0.tib,
 _l_Lucene41WithOrds_0.tii, _l_Lucene42_0.dvd, _l_Lucene42_0.dvm,
 _l_MockFixedIntBlock_0.doc, _l_MockFixedIntBlock_0.frq,
 _l_MockFixedIntBlock_0.pos, _l_MockFixedIntBlock_0.pyl,
 _l_MockFixedIntBlock_0.skp, _l_MockFixedIntBlock_0.tib,
 _l_MockFixedIntBlock_0.tii, _l_MockVariableIntBlock_0.doc,
 _l_MockVariableIntBlock_0.frq, _l_MockVariableIntBlock_0.pos,
 _l_MockVariableIntBlock_0.pyl, _l_MockVariableIntBlock_0.skp,
 _l_MockVariableIntBlock_0.tib, _l_MockVariableIntBlock_0.tii,
 _l_TestBloomFilteredLucene41Postings_0.blm,
 _l_TestBloomFilteredLucene41Postings_0.doc,
 _l_TestBloomFilteredLucene41Postings_0.pos,
 _l_TestBloomFilteredLucene41Postings_0.tim,
 _l_TestBloomFilteredLucene41Postings_0.tip, _m.cfe, _m.cfs, _m.si, _n.cfe,
 _n.cfs, _n.si, _o.fdt, _o.fdx, _o.fnm, _o.nvd, _o.nvm, _o.si, _o.tvd, _o.tvx,
 _o_Lucene41WithOrds_0.doc, _o_Lucene41WithOrds_0.pos,
 _o_Lucene41WithOrds_0.tib, _o_Lucene41WithOrds_0.tii,
 _o_Lucene42_0.dvd, _o_Lucene42_0.dvm, _o_MockFixedIntBlock_0.doc,
 _o_MockFixedIntBlock_0.frq, _o_MockFixedIntBlock_0.pos,
 _o_MockFixedIntBlock_0.pyl, _o_MockFixedIntBlock_0.skp,
 _o_MockFixedIntBlock_0.tib, _o_MockFixedIntBlock_0.tii,
 _o_MockVariableIntBlock_0.doc, _o_MockVariableIntBlock_0.frq,
 _o_MockVariableIntBlock_0.pos, _o_MockVariableIntBlock_0.pyl,
 _o_MockVariableIntBlock_0.skp, _o_MockVariableIntBlock_0.tib,
 _o_MockVariableIntBlock_0.tii,
 _o_TestBloomFilteredLucene41Postings_0.blm,
 _o_TestBloomFilteredLucene41Postings_0.doc,
 _o_TestBloomFilteredLucene41Postings_0.pos,
 _o_TestBloomFilteredLucene41Postings_0.tim,
 _o_TestBloomFilteredLucene41Postings_0.tip, _q.fdt, _q.fdx, _q.fnm,
 _q.nvd, _q.nvm, _q.si, _q.tvd, _q.tvx, _q_Lucene41WithOrds_0.doc,
 _q_Lucene41WithOrds_0.pos, _q_Lucene41WithOrds_0.tib,
 _q_Lucene41WithOrds_0.tii, _q_Lucene42_0.dvd, _q_Lucene42_0.dvm,
 _q_MockFixedIntBlock_0.doc, _q_MockFixedIntBlock_0.frq,
 _q_MockFixedIntBlock_0.pos, _q_MockFixedIntBlock_0.pyl,
 _q_MockFixedIntBlock_0.skp, _q_MockFixedIntBlock_0.tib,
 _q_MockFixedIntBlock_0.tii, _q_MockVariableIntBlock_0.doc,
 _q_MockVariableIntBlock_0.frq, _q_MockVariableIntBlock_0.pos,
 _q_MockVariableIntBlock_0.pyl, _q_MockVariableIntBlock_0.skp,
 _q_MockVariableIntBlock_0.tib, _q_MockVariableIntBlock_0.tii,
 _q_TestBloomFilteredLucene41Postings_0.blm,
 _q_TestBloomFilteredLucene41Postings_0.doc,
 _q_TestBloomFilteredLucene41Postings_0.pos,
 _q_TestBloomFilteredLucene41Postings_0.tim,
 _q_TestBloomFilteredLucene41Postings_0.tip, _r.fdt, _r.fdx, _r.fnm, _r.nvd,
 _r.nvm, _r.si, _r.tvd, _r.tvx, _r_Lucene41WithOrds_0.doc,
 _r_Lucene41WithOrds_0.pay, _r_Lucene41WithOrds_0.pos,
 _r_Lucene41WithOrds_0.tib, _r_Lucene41WithOrds_0.tii,
 _r_Lucene42_0.dvd, _r_Lucene42_0.dvm, _r_MockFixedIntBlock_0.doc,
 _r_MockFixedIntBlock_0.frq, _r_MockFixedIntBlock_0.pos,
 _r_MockFixedIntBlock_0.pyl, _r_MockFixedIntBlock_0.skp,
 _r_MockFixedIntBlock_0.tib, _r_MockFixedIntBlock_0.tii,
 _r_MockVariableIntBlock_0.doc, _r_MockVariableIntBlock_0.frq,
 _r_MockVariableIntBlock_0.pos, _r_MockVariableIntBlock_0.pyl,
 _r_MockVariableIntBlock_0.skp, _r_MockVariableIntBlock_0.tib,
 _r_MockVariableIntBlock_0.tii,
 

[jira] [Created] (SOLR-5123) NullPointerException on JdbcDataSource

2013-08-08 Thread Thomas SZADEL (JIRA)
Thomas SZADEL created SOLR-5123:
---

 Summary: NullPointerException on JdbcDataSource
 Key: SOLR-5123
 URL: https://issues.apache.org/jira/browse/SOLR-5123
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 4.3
 Environment: Linux
Reporter: Thomas SZADEL
Priority: Minor


We got an NPE with Solr 4.3 when getting a database connection (JBoss fails 
to get the connection).

Solr runs on JBoss 7.1 and gets its connections via a JNDI call (the 
connection is provided by JBoss).


Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:253)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:38)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
... 5 more
Caused by: java.lang.NullPointerException
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:241)
... 12 more


In the code, the possible null value is not checked:
239  try {
240    Connection c = getConnection();
241    stmt = c.createStatement(ResultSet.TYPE_FORWARD_ONLY,
       ResultSet.CONCUR_READ_ONLY);

... maybe a check would be safer:
if (c == null) {
    throw new XXXException();
}
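A sketch of what such a guard inside {{JdbcDataSource$ResultSetIterator}} 
might look like (only a sketch; it assumes DIH's own 
{{DataImportHandlerException}} with its {{SEVERE}} error code is the right way 
to signal this, and the message text is illustrative):

// Sketch of a defensive check (not a committed fix): fail fast with a
// clear error instead of an NPE when no connection can be obtained.
Connection c = getConnection();
if (c == null) {
  throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
      "Unable to obtain a JDBC connection; the JNDI lookup returned null");
}
stmt = c.createStatement(ResultSet.TYPE_FORWARD_ONLY,
    ResultSet.CONCUR_READ_ONLY);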




[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733281#comment-13733281
 ] 

ASF subversion and git services commented on SOLR-5113:
---

Commit 1511633 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1511633 ]

SOLR-5113

 CollectionsAPIDistributedZkTest fails all the time
 --

 Key: SOLR-5113
 URL: https://issues.apache.org/jira/browse/SOLR-5113
 Project: Solr
  Issue Type: Bug
  Components: Tests
Affects Versions: 4.5, 5.0
Reporter: Uwe Schindler
Assignee: Noble Paul
Priority: Blocker
 Attachments: SOLR-5113.patch, SOLR-5113.patch







[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733292#comment-13733292
 ] 

ASF subversion and git services commented on SOLR-5113:
---

Commit 1511635 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1511635 ]

SOLR-5113




[jira] [Resolved] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread Noble Paul (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noble Paul resolved SOLR-5113.
--

   Resolution: Fixed
Fix Version/s: 5.0
   4.5




[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733297#comment-13733297
 ] 

Uwe Schindler commented on SOLR-5113:
-

Hi Noble,
Thanks for committing! I think it is now up to Jenkins to verify that it works! 






[jira] [Created] (SOLR-5124) Solr glues words together when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA
Christoph Straßer created SOLR-5124:
---

 Summary: Solr glues words together when parsing PDFs under certain 
circumstances
 Key: SOLR-5124
 URL: https://issues.apache.org/jira/browse/SOLR-5124
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.4
 Environment: Windows 7 (don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor


For some kinds of PDF documents, Solr glues words together at line breaks 
under some circumstances (e.g. the last word of line 1 and the first word of 
line 2 are merged into one word).
(Stand-alone) Tika extracts the text correctly. Attached you find one sample 
PDF and screenshots of the Tika output and the corrupted content indexed by 
Solr.
(This issue does not occur with all PDF documents. I tried to recreate the 
issue with new Word documents that I converted into PDF in multiple ways, 
without success.) The attached PDF document has a really weird internal 
structure, but Tika seems to do its work right, even with this weird document.
In our Solr indices we have a good number of these weird documents. This 
results in worse suggestions by the Suggester.




[jira] [Updated] (SOLR-5124) Solr glues words together when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Straßer updated SOLR-5124:


Attachment: 04_Solr.png
03_TikaOutput_GUI_StructuredText.png
03_TikaOutput_GUI_PlainText.png
03_TikaOutput_GUI_MainContent.png
03_TikaOutput.png
02_PDF.png
01_alz_2009_folge11_2009_05_28.pdf

Added the sample PDF, screenshots of the Tika output, and a screenshot of the Solr index.




[jira] [Updated] (SOLR-5124) Solr glues words together when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Straßer updated SOLR-5124:


Description: 
For some kinds of PDF documents, Solr glues words together at line breaks 
under some circumstances (e.g. the last word of line 1 and the first word of 
line 2 are merged into one word).
(Stand-alone) Tika extracts the text correctly. Attached you find one sample 
PDF and screenshots of the Tika output and the corrupted content indexed by 
Solr.
(This issue does not occur with all PDF documents. I tried to recreate the 
issue with new Word documents that I converted into PDF in multiple ways, 
without success.) The attached PDF document has a really weird internal 
structure, but Tika seems to do its work right, even with this weird document.
In our Solr indices we have a good number of these weird documents. This 
results in worse suggestions by the Suggester.

  was:
For some kinds of PDF documents, Solr glues words together at line breaks 
under some circumstances (e.g. the last word of line 1 and the first word of 
line 2 are merged into one word).
(Stand-alone) Tika extracts the text correctly. Attached you find one sample 
PDF and screenshots of the Tika output and the corrupted content indexed by 
Solr.
(This issue does not occur with all PDF documents. I tried to recreate the 
issue with new Word documents that I converted into PDF in multiple ways, 
without success.) The attached PDF document has a really weird internal 
structure, but Tika seems to do its work right, even with this weird document.
In our Solr indices we have a good number of these wird documents. This 
results in worse suggestions by the Suggester.





Re: Problem using Benchmark

2013-08-08 Thread Abhishek Gupta
Anyone, please help!



On Wed, Aug 7, 2013 at 12:36 PM, Abhishek Gupta abhi.bansa...@gmail.com wrote:

 Hi,
 I am using PyLucene and there I tried to use Lucene's benchmark module to
 evaluate TREC data. I had a doubt which I first asked on the
 pylucene-dev mailing list. After solving the first problem I got another
 problem, which Andi said is a Java error. You can see the thread here
 (
 http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201308.mbox/%3CCAJBtL5GG-LghfKBCKFhi%2BPXVmEFMdnM1zC%3D9NtDd-kL-Pv1nuQ%40mail.gmail.com%3E
 )

 I am getting a ClassNotFoundException for Compressor (
 http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/package-summary.html).
 I am a newbie to Java development, so I don't know much about Ant. Please
 help in solving this issue.


 Thank you,
 Abhishek Gupta,
 9624799165




-- 
Abhishek Gupta,
897876422, 9416106204, 9624799165


[jira] [Commented] (SOLR-5124) Solr glues words together when parsing PDFs under certain circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733312#comment-13733312
 ] 

Uwe Schindler commented on SOLR-5124:
-

I have not looked into DIH's code, but I know that TIKA adds the extra 
whitespace as ignorable whitespace XML data. It might be ignored by the 
extraction content handler when it consumes the SAX events.




[jira] [Commented] (SOLR-5124) Solr glues words together when parsing PDFs under certain circumstances

2013-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733321#comment-13733321
 ] 

Christoph Straßer commented on SOLR-5124:
-

Maybe it's in some way related to SOLR-4679. (But I'm not sure; we use the 
ExtractingRequestHandler.) 




[jira] [Commented] (SOLR-5124) Solr glues words together when parsing PDFs under certain circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733325#comment-13733325
 ] 

Uwe Schindler commented on SOLR-5124:
-

Hi, this is a duplicate of two other issues; SOLR-4679 is the main one. I will 
close this as a duplicate.




[jira] [Closed] (SOLR-5124) Solr glues words together when parsing PDFs under certain circumstances

2013-08-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler closed SOLR-5124.
---

Resolution: Duplicate




Re: Problem using Benchmark

2013-08-08 Thread Abhishek Gupta
You can see the complete error I am getting here:
http://codebin.org/view/8460aa0a


On Thu, Aug 8, 2013 at 3:10 PM, Abhishek Gupta abhi.bansa...@gmail.com wrote:

 Anyone, please help!



 On Wed, Aug 7, 2013 at 12:36 PM, Abhishek Gupta 
 abhi.bansa...@gmail.com wrote:

 Hi,
 I am using PyLucene and there I tried to use Lucene's benchmark module to
 evaluate TREC data. I had a doubt which I first asked on the
 pylucene-dev mailing list. After solving the first problem I got another
 problem, which Andi said is a Java error. You can see the thread here
 (
 http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201308.mbox/%3CCAJBtL5GG-LghfKBCKFhi%2BPXVmEFMdnM1zC%3D9NtDd-kL-Pv1nuQ%40mail.gmail.com%3E
 )

 I am getting a ClassNotFoundException for Compressor (
 http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/package-summary.html).
 I am a newbie to Java development, so I don't know much about Ant. Please
 help in solving this issue.


 Thank you,
 Abhishek Gupta,
 9624799165




 --
 Abhishek Gupta,
 897876422, 9416106204, 9624799165




-- 
Abhishek Gupta,
897876422, 9416106204, 9624799165


[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler commented on SOLR-4679:
-

There is another occurrence of this bug with PDF files (SOLR-5124). I think we 
should apply the workaround and make the ignorable whitespace significant. In 
my opinion this is not a problem at all, because the Analyzer will remove this 
stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug and have been discussing it since the early days of TIKA, and 
I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes to correctly produce ignorable whitespace in some parsers that 
were missing it).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but never reported by HTML parsers), so 
the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace 
SAX event to report this added whitespace. The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and 
<br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.
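
A minimal sketch of the first rule (plain-text extraction that treats 
ignorable whitespace as significant); the handler class name is hypothetical, 
and this is not the actual Solr patch:

import org.xml.sax.helpers.DefaultHandler;

// Treat Tika's synthetic ignorableWhitespace events as significant text,
// so words separated by <br> or block-element boundaries are not glued
// together in the extracted plain text.
class TextWithWhitespaceHandler extends DefaultHandler {
  private final StringBuilder text = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  @Override
  public void ignorableWhitespace(char[] ch, int start, int length) {
    // Re-route the synthetic whitespace into the extracted text.
    characters(ch, start, length);
  }

  public String getText() {
    return text.toString();
  }
}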

 HTML line breaks (<br>) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png


 HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during 
 extraction of content from HTML files. They need to be replaced with an empty 
 space.
 Test-File:
 <html>
 <head>
 <title>Test mit HTML-Zeilenschaltungen</title>
 </head>
 <p>
 word1<br>word2<br/>
 Some other words, a special name like linz<br>and another special name - 
 vienna
 </p>
 </html>
 The Solr-content-attribute contains the following text:
 Test mit HTML-Zeilenschaltungen
 word1word2
 Some other words, a special name like linzand another special name - vienna
 So we are not able to find the word linz.
 We use the ExtractingRequestHandler to put content into Solr. 
 (wiki.apache.org/solr/ExtractingRequestHandler)




[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:12 AM:
--

There is another occurrence of this bug with PDF files (SOLR-5124). I think we 
should apply the workaround and make the ignorable whitespace significant. In 
my opinion this is not a problem at all, because the Analyzer will remove this 
stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug and have been discussing it since the early days of TIKA, and 
I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes to correctly produce ignorable whitespace in some parsers that 
were missing it. I also added the XHTMLContentHandler stuff that 
makes block XHTML elements like <p/>, <div/> also emit a newline as 
ignorable on the closing element).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but never reported by HTML parsers), so 
the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace 
SAX event to report this added whitespace. The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and 
<br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.


[jira] [Assigned] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned SOLR-4679:
---

Assignee: Uwe Schindler




[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733328#comment-13733328
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:17 AM:
--

There is another occurrence of this bug with PDF files (SOLR-5124). I think we 
should apply the workaround and make the ignorable whitespace significant. In 
my opinion this is not a problem at all, because the Analyzer will remove this 
stuff in any case, so any additional whitespace would disappear.

bq. i did some experimenting and confirmed that the SolrContentHandler is 
getting ignorable whitespace SAX events for <br> tags in HTML – which makes no 
sense to me, so i've opened TIKA-1134 to try and get to the bottom of it.

I know this bug and have been discussing it since the early days of TIKA, and 
I don't think it will change! TIKA uses ignorable whitespace for all 
text-only glue stuff, which was decided at the beginning of the project. I can 
find the mail from their lists; I was involved in that, too (because I applied 
some fixes to correctly produce ignorable whitespace in some parsers that 
were missing it. I also added the XHTMLContentHandler stuff that 
makes block XHTML elements like <p/>, <div/> also emit a newline as 
ignorable on the closing element, see TIKA-171).

FYI: ignorable whitespace is XML semantics only; in (X)HTML it does not 
exist (it is handled differently, but never reported by HTML parsers), so 
the idea in TIKA is to reuse (it's a bit incorrect) the ignorableWhitespace 
SAX event to report this added whitespace. The rule that was chosen in TIKA 
is:
- If you ignore all elements of HTML and only extract plain text, use the 
ignorable whitespace. This is e.g. done by TIKA's internal wrappers that 
produce plain text (TextOnlyContentHandler). They treat all ignorable 
whitespace as significant. Ignorable whitespace is *only* produced by TIKA, so 
if it exists, you know that it is coming from TIKA.
- If you want to keep the XHTML structure and you understand block tags and 
<br/>, then you can ignore the ignorable whitespace.

Regarding this guideline, your patch is correct and should be applied to Solr.


[jira] [Comment Edited] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1377#comment-1377
 ] 

Uwe Schindler edited comment on SOLR-4679 at 8/8/13 10:25 AM:
--

The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in 
TIKA-171. I think this was the issue where we decided to emit 
ignorableWhitespace for all synthetic whitespace added to support text-only 
extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your 
current patch, because it makes use of the stuff we decided in TIKA-171. In my 
opinion, TIKA-1134 is obsolete, but you or I can add a comment there to explain 
one more time and document under which circumstances TIKA emits 
ignorableWhitespace.




[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1377#comment-1377
 ] 

Uwe Schindler commented on SOLR-4679:
-

The stuff with ignorableWhitespace was discussed between [~jukkaz] and me in 
TIKA-171. I think this was the issue where we decided to emit 
ignorableWhitespace for all synthetic whitespace added to support text-only 
extraction.

[~hossman]: I can take the issue if you like. I am +1 to committing your 
current patch, because it makes use of the stuff we decided in TIKA-171. In my 
opinion, TIKA-1134 is obsolete, but you or I can add a comment there to explain 
one more time and document under which circumstances TIKA emits 
ignorableWhitespace.




jar-checksums generates extra files?

2013-08-08 Thread Dawid Weiss
When I do this on trunk:

ant jar-checksums
svn stat

I get:
?   solr\licenses\jcl-over-slf4j.jar.sha1
?   solr\licenses\jul-to-slf4j.jar.sha1
?   solr\licenses\log4j.jar.sha1
?   solr\licenses\slf4j-api.jar.sha1
?   solr\licenses\slf4j-log4j12.jar.sha1

Where should this be fixed?  Should we svn-ignore those files or
should they be somehow excluded from the re-generation of SHA
checksums?

Dawid




Re: jar-checksums generates extra files?

2013-08-08 Thread Dawid Weiss
Never mind, these were local files and they were svn-ignored; when I
removed everything and checked out from scratch, this problem was no
longer there.

I really wish svn had an equivalent of "git clean -xfd .".

Dawid

On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss dawid.we...@gmail.com wrote:
 When I do this on trunk:

 ant jar-checksums
 svn stat

 I get:
 ?   solr\licenses\jcl-over-slf4j.jar.sha1
 ?   solr\licenses\jul-to-slf4j.jar.sha1
 ?   solr\licenses\log4j.jar.sha1
 ?   solr\licenses\slf4j-api.jar.sha1
 ?   solr\licenses\slf4j-log4j12.jar.sha1

 Where should this be fixed?  Should we svn-ignore those files or
 should they be somehow excluded from the re-generation of SHA
 checksums?

 Dawid




Re: jar-checksums generates extra files?

2013-08-08 Thread Uwe Schindler
Hi,

Some GUIs like TortoiseSVN have this. I use this to delete all unversioned 
files in milliseconds(TM). But native svn does not have it, unfortunately. 

Uwe



Dawid Weiss dawid.we...@gmail.com wrote:
Never mind, these were local files and they were svn-ignored; when I
removed everything and checked out from scratch, this problem was no
longer there.

I really wish svn had an equivalent of "git clean -xfd .".

Dawid

On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss dawid.we...@gmail.com
wrote:
 When I do this on trunk:

 ant jar-checksums
 svn stat

 I get:
 ?   solr\licenses\jcl-over-slf4j.jar.sha1
 ?   solr\licenses\jul-to-slf4j.jar.sha1
 ?   solr\licenses\log4j.jar.sha1
 ?   solr\licenses\slf4j-api.jar.sha1
 ?   solr\licenses\slf4j-log4j12.jar.sha1

 Where should this be fixed?  Should we svn-ignore those files or
 should they be somehow excluded from the re-generation of SHA
 checksums?

 Dawid


--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

Re: jar-checksums generates extra files?

2013-08-08 Thread Dawid Weiss
I kind of use a workaround: remove everything except the .svn
folder and then run svn revert -R .
But this is a dumb solution :)

D.

On Thu, Aug 8, 2013 at 1:12 PM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 Some GUIs like TortoiseSVN have this. I use this to delete all unversioned
 files in milliseconds(TM). But native svn does not have it, unfortunately.

 Uwe



 Dawid Weiss dawid.we...@gmail.com schrieb:

 Never mind, these were local files and they were svn-ignored, when I
 removed everything and checked out from scratch this problem is no
 longer there.

 I really wish svn had an equivalent of git clean -xfd .

 Dawid

 On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss dawid.we...@gmail.com
 wrote:

  When I do this on trunk:

  ant jar-checksums
  svn stat

  I get:
  ?   solr\licenses\jcl-over-slf4j.jar.sha1
  ?   solr\licenses\jul-to-slf4j.jar.sha1
  ?   solr\licenses\log4j.jar.sha1
  ?   solr\licenses\slf4j-api.jar.sha1
  ?   solr\licenses\slf4j-log4j12.jar.sha1

  Where should this be fixed?  Should we svn-ignore those files or
  should they be somehow excluded from the re-generation of SHA
  checksums?

  Daw
  id


 

 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org


 --
 Uwe Schindler
 H.-H.-Meier-Allee 63, 28213 Bremen
 http://www.thetaphi.de

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-Tests-trunk-Java7 - Build # 4219 - Failure

2013-08-08 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/Lucene-Solr-Tests-trunk-Java7/4219/

All tests passed

Build Log:
[...truncated 34909 lines...]
BUILD FAILED
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/build.xml:389:
 The following error occurred while executing this line:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/build.xml:328:
 The following error occurred while executing this line:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/extra-targets.xml:66:
 The following error occurred while executing this line:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-Tests-trunk-Java7/extra-targets.xml:139:
 The following files are missing svn:eol-style (or binary svn:mime-type):
* ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java

Total time: 80 minutes 23 seconds
Build step 'Invoke Ant' marked build as failure
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[JENKINS] Lucene-Solr-trunk-Linux (64bit/jdk1.7.0_25) - Build # 6924 - Failure!

2013-08-08 Thread Policeman Jenkins Server
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/6924/
Java: 64bit/jdk1.7.0_25 -XX:-UseCompressedOops -XX:+UseParallelGC

All tests passed

Build Log:
[...truncated 34822 lines...]
BUILD FAILED
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:389: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/build.xml:328: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/extra-targets.xml:66: The 
following error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-trunk-Linux/extra-targets.xml:139: The 
following files are missing svn:eol-style (or binary svn:mime-type):
* ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java

Total time: 52 minutes 24 seconds
Build step 'Invoke Ant' marked build as failure
Description set: Java: 64bit/jdk1.7.0_25 -XX:-UseCompressedOops 
-XX:+UseParallelGC
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[JENKINS] Lucene-Solr-trunk-Windows (32bit/jdk1.7.0_25) - Build # 3125 - Failure!

2013-08-08 Thread Policeman Jenkins Server
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3125/
Java: 32bit/jdk1.7.0_25 -server -XX:+UseSerialGC

All tests passed

Build Log:
[...truncated 31514 lines...]
BUILD FAILED
C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:389: The 
following error occurred while executing this line:
C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\build.xml:328: The 
following error occurred while executing this line:
C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\extra-targets.xml:66: 
The following error occurred while executing this line:
C:\Users\JenkinsSlave\workspace\Lucene-Solr-trunk-Windows\extra-targets.xml:139:
 The following files are missing svn:eol-style (or binary svn:mime-type):
* ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java

Total time: 103 minutes 30 seconds
Build step 'Invoke Ant' marked build as failure
Description set: Java: 32bit/jdk1.7.0_25 -server -XX:+UseSerialGC
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733406#comment-13733406
 ] 

ASF subversion and git services commented on SOLR-5113:
---

Commit 1511715 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1511715 ]

SOLR-5113 setting svn:eol-style native

 CollectionsAPIDistributedZkTest fails all the time
 --

 Key: SOLR-5113
 URL: https://issues.apache.org/jira/browse/SOLR-5113
 Project: Solr
  Issue Type: Bug
  Components: Tests
Affects Versions: 4.5, 5.0
Reporter: Uwe Schindler
Assignee: Noble Paul
Priority: Blocker
 Fix For: 4.5, 5.0

 Attachments: SOLR-5113.patch, SOLR-5113.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5113) CollectionsAPIDistributedZkTest fails all the time

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733408#comment-13733408
 ] 

ASF subversion and git services commented on SOLR-5113:
---

Commit 1511717 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1511717 ]

SOLR-5113 setting svn:eol-style native

 CollectionsAPIDistributedZkTest fails all the time
 --

 Key: SOLR-5113
 URL: https://issues.apache.org/jira/browse/SOLR-5113
 Project: Solr
  Issue Type: Bug
  Components: Tests
Affects Versions: 4.5, 5.0
Reporter: Uwe Schindler
Assignee: Noble Paul
Priority: Blocker
 Fix For: 4.5, 5.0

 Attachments: SOLR-5113.patch, SOLR-5113.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[JENKINS] Lucene-Solr-4.x-Linux (32bit/jdk1.8.0-ea-b99) - Build # 6841 - Still Failing!

2013-08-08 Thread Policeman Jenkins Server
Build: http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-Linux/6841/
Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC

All tests passed

Build Log:
[...truncated 31168 lines...]
BUILD FAILED
/mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/build.xml:395: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/build.xml:334: The following 
error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/extra-targets.xml:66: The 
following error occurred while executing this line:
/mnt/ssd/jenkins/workspace/Lucene-Solr-4.x-Linux/extra-targets.xml:139: The 
following files are missing svn:eol-style (or binary svn:mime-type):
* ./solr/core/src/test/org/apache/solr/cloud/CustomCollectionTest.java

Total time: 45 minutes 31 seconds
Build step 'Invoke Ant' marked build as failure
Description set: Java: 32bit/jdk1.8.0-ea-b99 -server -XX:+UseG1GC
Archiving artifacts
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Problem using Benchmark

2013-08-08 Thread Andi Vajda
 Abhishek,

On Aug 8, 2013, at 12:08, Abhishek Gupta abhi.bansa...@gmail.com wrote:

 You can see the complete error I am getting here.

Like I told you on pylucene-dev, you need to set up your classpath correctly so 
that these classes are found. If you are a Java newbie (as you said) and don't 
know what that means or how to achieve it, you need to research the issue 
yourself first.

This mailing list is not the right forum for this question. Try a general Java 
programming forum first.

Andi..

 
 
 On Thu, Aug 8, 2013 at 3:10 PM, Abhishek Gupta abhi.bansa...@gmail.com 
 wrote:
 Anyone pls help!!
 
 
 
 On Wed, Aug 7, 2013 at 12:36 PM, Abhishek Gupta abhi.bansa...@gmail.com 
 wrote:
 Hi,
 I am using PyLucene, and there I tried to use Lucene's Benchmark to evaluate 
 TREC data. I had a doubt which I first asked on the pylucene-dev mailing 
 list. After solving the first problem I got another problem, which Andi 
 said was a Java error. You can see the thread here 
 (http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201308.mbox/%3CCAJBtL5GG-LghfKBCKFhi%2BPXVmEFMdnM1zC%3D9NtDd-kL-Pv1nuQ%40mail.gmail.com%3E)
 
 I am getting the class-not-found exception for 
 Compressor(http://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/package-summary.html).
  I am a newbie to Java development, so I don't know much about Ant. Please 
 help in solving this issue.
  
 
 Thanking You
 Abhishek Gupta,
 9624799165
 
 
 
 -- 
 Abhishek Gupta,
 897876422, 9416106204, 9624799165
 
 
 
 -- 
 Abhishek Gupta,
 897876422, 9416106204, 9624799165


[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable

2013-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733454#comment-13733454
 ] 

Michael McCandless commented on LUCENE-5152:


bq. can you elaborate what you are concerned about?

I'm worried about the O(N^2) cost of the assert: for every arc (single
byte of each term in a seekExact) we are iterating over all root arcs
(up to 256 arcs) in this assert.

bq. findTargetArc is the only place where we actually use this cache?

Ahh that's true, I hadn't realized that.

Maybe, instead, we can move the assert just inside the if that
actually uses the cached arcs?  Ie, put it here:

{code}
  if (follow.target == startNode  labelToMatch  cachedRootArcs.length) {
assert assertRootArcs();
...
  }
{code}

This would address my concern: the cost becomes O(N) not O(N^2).  And
the coverage is the same?


 Lucene FST is not immutable
 ---

 Key: LUCENE-5152
 URL: https://issues.apache.org/jira/browse/LUCENE-5152
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/FSTs
Affects Versions: 4.4
Reporter: Simon Willnauer
Priority: Blocker
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5152.patch, LUCENE-5152.patch, LUCENE-5152.patch


 a spin-off from LUCENE-5120, where the analyzing suggester modified a returned 
 output from an FST (BytesRef), which caused side effects in later execution. 
 I added an assertion into the FST that checks if a cached root arc is 
 modified, and in fact this happens for instance in our MemoryPostingsFormat, 
 and I bet we'll find more places. We need to think about how to make this less 
 trappy since it can cause bugs that are super hard to find.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [PROPOSAL] Make Luke a Lucene/Solr Module

2013-08-08 Thread Ajay Bhat
Hello,

Thanks so much for accepting the project proposal. I've started the coding
work. I'll keep you all posted on my progress.


On Fri, Jul 26, 2013 at 1:48 PM, Ajay Bhat a.ajay.b...@gmail.com wrote:

 Hi,

 I have a question regarding one of the interfaces in the original version.

 The IOReporter.java [1] is used by the Hadoop plugin [2], and it only has 2
 functions, which are implemented by the Hadoop plugin. Is this interface really
 needed? Can't I just use the functions as-is in the Hadoop class without
 needing the IOReporter?

 [1]
 http://luke.googlecode.com/svn/trunk/src/org/getopt/luke/plugins/IOReporter.java

 [2]
 http://luke.googlecode.com/svn/trunk/src/org/getopt/luke/plugins/HadoopPlugin.java


 On Sat, Jul 20, 2013 at 11:12 PM, SUJIT PAL sujit@comcast.net wrote:

 Hi Ajay,

 Thanks for the reply and the links to the email threads. I saw a response
 on this thread from Shawn Heisey about this as well. I didn't realize your
 focus was Luke, then Lucene, then Solr - the proposal title and the JIRA
 both mention Lucene/Solr module, which probably misled me - I guess I
 should have read the doc more carefully... Thank you for the clarification
 and good luck with your project.

 -sujit

 On Jul 20, 2013, at 9:09 AM, Ajay Bhat wrote:

  Hi Sujit,
 
  Thanks for your comments. There was actually some discussion earlier
 about whether or not Solr was the highest priority.
 
 
 http://mail-archives.apache.org/mod_mbox/lucene-dev/201307.mbox/%3C0F7176D08A99494EBF1E129298E12904%40JackKrupansky%3E
 
 http://mail-archives.apache.org/mod_mbox/lucene-dev/201307.mbox/%3CCAOdYfZVQ1WzWhYVeKgwpA%3DmQVONxo4XiLza28geV2L1PCpcQJg%40mail.gmail.com%3E
 
  Right now I don't think I could do the integration with Solr since (a)
 I don't know enough Javascript to work with Solr and (b) The time for
 submitting proposals for the program is over.
 
  The project is scheduled to run until the end of October. After that, or if
 I get time during the project period, I'll try to work on other
 functionality of Luke and then try for Solr. I think it's best to make
 Luke completely functional before integrating with the trunk, and this is
 better done in incremental steps.
 
 
  On Fri, Jul 19, 2013 at 9:59 PM, SUJIT PAL sujit@comcast.net
 wrote:
  Hi Ajay,
 
   Since you asked for feedback from the community... a lot of what Luke
  used to do is now already available in Solr's admin tool. From Luke's
  feature set that you had in your proposal Google doc, the only ones I think
  are /not/ present are the following:
 
  * Browse by document number
  * selectively delete documents from the index - there is no delete
 document page AFAIK, but you can still do this from the URL.
  * reconstruct the original document fields, edit them and re-insert to
 the index - you can do this using code as long as the fields are stored,
 but there is no reconstruct page.
  * optimize indexes - can be done from the URL but probably no
 page/button for this.
 
  As a Solr user, for me your tool would be most useful if it
 concentrated on these areas, and if it could be integrated into the
  existing admin tool (the Solr 4 one, of course). I am not sure what the
  Solr 4 admin tool uses; if it's Pivot then I guess that's what you should use
  (and by extension, if not, you probably should use what the current tool
  uses, so it's easy to maintain going forward). The benefit to users such as
  myself would be a unified look-and-feel, so not much of a learning
  curve/barrier to adoption.
 
  Just my $0.02...
 
  -sujit
 
  On Jul 19, 2013, at 8:06 AM, Ajay Bhat wrote:
 
   Hi Mark,
  
   I've added the proposal to the ASF-ICFOSS proposals page [1].
  According to the ICFOSS programme [2] the last date for submission of
  project proposals is July 19th (today).
   The time period for mentors to review and rank students' project
  proposals is July 22nd to August 2nd, i.e. from next week onwards.
  
   I'd like some feedback on my proposal from the community as well.
   Link to proposal on Google Docs :
 https://docs.google.com/document/d/18Vu5YB6C7WLDxnG01BnZXFEKUC3EQYb0Y5_tCJFb_sc
   Link to proposal on CWiki page :
 https://cwiki.apache.org/confluence/display/COMDEV/Proposal+for+Apache+Lucene+-+Ajay+Bhat
  
   [1]
 https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme
  
   [2] http://community.apache.org/mentoringprogramme-icfoss-pilot.html
  
  
   On Thu, Jul 18, 2013 at 12:04 AM, Ajay Bhat a.ajay.b...@gmail.com
 wrote:
   Thanks Mark. I've given you comment access as well so you can comment
 on specific parts of the proposal
  
  
   On Wed, Jul 17, 2013 at 11:51 PM, Mark Miller markrmil...@gmail.com
 wrote:
   You can put my down for the mentor.
  
   - Mark
  
   On Jul 17, 2013, at 2:04 PM, Ajay Bhat a.ajay.b...@gmail.com wrote:
  
   Hi all,
  
   I want to do the Jira issue LUCENE 2562 : Make Luke a Lucene/Solr
 module [1] as a project. This project will be for the ASF-ICFOSS programme
 [2] by Luciano Resende [3] and the proposal has to be 

[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable

2013-08-08 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733474#comment-13733474
 ] 

Simon Willnauer commented on LUCENE-5152:
-

bq. This would address my concern: the cost becomes O(N) not O(N^2). And the 
coverage is the same?

The problem here is that we really need to check after we returned from the 
cache, and that might happen only once in a certain test. Yet, I think it's 
OK to do it there. I still don't get what you are concerned about; we only have -ea 
in tests, and the tests don't seem to be any slower. Can you elaborate on what you 
are afraid of?

 Lucene FST is not immutable
 ---

 Key: LUCENE-5152
 URL: https://issues.apache.org/jira/browse/LUCENE-5152
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/FSTs
Affects Versions: 4.4
Reporter: Simon Willnauer
Priority: Blocker
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5152.patch, LUCENE-5152.patch, LUCENE-5152.patch


 a spin-off from LUCENE-5120, where the analyzing suggester modified a returned 
 output from an FST (BytesRef), which caused side effects in later execution. 
 I added an assertion into the FST that checks if a cached root arc is 
 modified, and in fact this happens for instance in our MemoryPostingsFormat, 
 and I bet we'll find more places. We need to think about how to make this less 
 trappy since it can cause bugs that are super hard to find.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5124) Solr glues words when parsing PDFs under certain circumstances

2013-08-08 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733547#comment-13733547
 ] 

Jack Krupansky commented on SOLR-5124:
--

Try doing the update with the extractOnly=true parameter and look at the actual 
bytes where the two adjacent terms meet - it may be some odd Unicode value 
that Solr's filters ignore rather than treat as whitespace.
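
For example, a hedged SolrJ sketch of that check (the file name, server variable, and the default /update/extract endpoint are assumptions):

{code}
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(new File("sample.pdf"), "application/pdf");
req.setParam("extractOnly", "true");   // extract and return content, don't index
NamedList<Object> rsp = server.request(req);
System.out.println(rsp);  // inspect the exact characters where the words are glued
{code}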

 Solr glues words when parsing PDFs under certain circumstances
 --

 Key: SOLR-5124
 URL: https://issues.apache.org/jira/browse/SOLR-5124
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.4
 Environment: Windows 7 (don't think this is relevant)
Reporter: Christoph Straßer
Priority: Minor
  Labels: tika,text-extraction
 Attachments: 01_alz_2009_folge11_2009_05_28.pdf, 02_PDF.png, 
 03_TikaOutput_GUI_MainContent.png, 03_TikaOutput_GUI_PlainText.png, 
 03_TikaOutput_GUI_StructuredText.png, 03_TikaOutput.png, 04_Solr.png


 For some kinds of PDF documents, Solr glues words together at line breaks under some 
 circumstances. (E.g. the last word of line 1 and the first word of line 2 are 
 merged into one word.)
 (Stand-alone) Tika extracts the text correctly. Attached you'll find a 
 sample PDF and screenshots of the Tika output and the corrupted content indexed 
 by Solr.
 (This issue does not occur with all PDF documents. I tried to recreate the 
 issue with new Word documents that I converted into PDF in multiple ways, without 
 success.) The attached PDF document has a really weird internal structure, but 
 Tika seems to do its work right, even with this weird document.
 In our Solr indices we have a good number of these weird documents. This 
 results in worse suggestions by the Suggester.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5120) Solrj Query response error with result number

2013-08-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733589#comment-13733589
 ] 

Shawn Heisey commented on SOLR-5120:


[~lukasw44] I have a question to ask you:

What resources did you look at in order to decide that you should file a bug to 
get an answer to your question?  The reason that I ask is because we have been 
seeing an increase recently in the number of people who file a bug for support 
issues instead of asking for help via our discussion resources like the mailing 
list.  This suggests that there might be some incorrect support information out 
there that needs correction.

Related to your issue: If setting the start parameter to 0 or omitting the 
parameter didn't fix your issue, then this issue can be reopened, but I'm 
confident that this is the problem.
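
For reference, a minimal SolrJ sketch (the server variable and query string are assumptions): start is a 0-based offset, so with a single matching document start=1 skips past it, giving numFound=1 but docs=[].

{code}
SolrQuery q = new SolrQuery("anna");
q.setStart(0);   // or simply leave start unset; 0 is the default
q.setRows(10);
QueryResponse rsp = solrServer.query(q);
System.out.println(rsp.getResults().getNumFound() + " found, "
    + rsp.getResults().size() + " returned");
{code}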


 Solrj Query response error with result number 
 --

 Key: SOLR-5120
 URL: https://issues.apache.org/jira/browse/SOLR-5120
 Project: Solr
  Issue Type: Bug
 Environment: linux, lubuntu, java version 1.7.0_13.
Reporter: Łukasz Woźniczka
Priority: Critical

 This is my simple code: 
  QueryResponse qr;
 try {
 qr = fs.execute(solrServer);
 System.out.println("QUERY RESPONSE : " + qr);
 for (Entry<String, Object> r : qr.getResponse()) {
 System.out.println("RESPONSE: " + r.getKey() + " - " + 
 r.getValue());
 }
 SolrDocumentList dl = qr.getResults();
 System.out.println("--RESULT SIZE:[ " + dl.size() );
 } catch (SolrServerException e) {
 e.printStackTrace();
 }
 I am using SolrJ and solr-core version 4.4.0, and there is probably a bug in 
 SolrJ in the query result. I created one simple txt doc with content 'anna', 
 then restarted Solr and tried to search for this phrase. Nothing is found, but 
 this is my query response output: {numFound=1,start=1,docs=[]}.
 So as you can see there is info that numFound=1 but docs=[] -- it is empty. Next 
 I added another document with only the one word 'anna' and then tried to search for 
 that string, and this is the output: 
 {numFound=2,start=1,docs=[SolrDocument{file_id=9882, 
 file_name=luk-search2.txt, file_create_user=-1, file_department=10, 
 file_mime_type=text/plain, file_extension=.txt, file_parents_folder=[5021, 
 4781, 341, -20, -1], _version_=1442647024934584320}]}
 So as you can see there is numFound = 2 but only one document is listed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5152) Lucene FST is not immutable

2013-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733598#comment-13733598
 ] 

Michael McCandless commented on LUCENE-5152:


bq. Can you elaborate what you are afraid of?

In general I think it's bad if an assert changes too much about how the code
would run without asserts. E.g., maybe this O(N^2) assert alters how
threads are scheduled and changes how / whether an issue appears in
practice.

Similarly, if a user is having trouble, I'll recommend turning on
asserts to see if one trips, but if this causes a change in how the
code runs then this can change whether the issue reproduces.

I also just don't like O(N^2) code, even when it's under an assert :)

I think asserts should minimize their impact to the real code when
possible, and it certainly seems possible in this case.

Separately, we really should run our tests w/o asserts, too, since
this is how our users typically run (I know some tests fail if
assertions are off ... we'd have to fix them).  What if we accidentally
commit real code behind an assert?  Our tests wouldn't catch it ...
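
To make that trap concrete, a tiny hypothetical illustration (the method name is invented):

{code}
// Hypothetical illustration: with assertions disabled (-da), the JVM drops the
// whole statement, so any real work hidden inside the assert silently never runs.
assert advanceState();   // WRONG if advanceState() mutates state the real code needs

// Safer pattern: do the work unconditionally and assert only on the result.
boolean advanced = advanceState();
assert advanced;
{code}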


 Lucene FST is not immutable
 ---

 Key: LUCENE-5152
 URL: https://issues.apache.org/jira/browse/LUCENE-5152
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/FSTs
Affects Versions: 4.4
Reporter: Simon Willnauer
Priority: Blocker
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5152.patch, LUCENE-5152.patch, LUCENE-5152.patch


 a spin-off from LUCENE-5120, where the analyzing suggester modified a returned 
 output from an FST (BytesRef), which caused side effects in later execution. 
 I added an assertion into the FST that checks if a cached root arc is 
 modified, and in fact this happens for instance in our MemoryPostingsFormat, 
 and I bet we'll find more places. We need to think about how to make this less 
 trappy since it can cause bugs that are super hard to find.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3076) Solr(Cloud) should support block joins

2013-08-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733601#comment-13733601
 ] 

Yonik Seeley commented on SOLR-3076:


Making progress... currently working on randomized testing (using our current 
join implementation to cross-check this implementation).  I've hit some snags 
and am working through them...

bq. one of inconveniences is the necessity to provide user cache for BJQParser 

Yeah, I had some things in mind to handle that as well.

 Solr(Cloud) should support block joins
 --

 Key: SOLR-3076
 URL: https://issues.apache.org/jira/browse/SOLR-3076
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Yonik Seeley
 Fix For: 4.5, 5.0

 Attachments: 27M-singlesegment-histogram.png, 27M-singlesegment.png, 
 bjq-vs-filters-backward-disi.patch, bjq-vs-filters-illegal-state.patch, 
 child-bjqparser.patch, dih-3076.patch, dih-config.xml, 
 parent-bjq-qparser.patch, parent-bjq-qparser.patch, Screen Shot 2012-07-17 at 
 1.12.11 AM.png, SOLR-3076-childDocs.patch, SOLR-3076.patch, SOLR-3076.patch, 
 SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, 
 SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, SOLR-3076.patch, 
 SOLR-3076.patch, SOLR-3076.patch, 
 SOLR-7036-childDocs-solr-fork-trunk-patched, 
 solrconf-bjq-erschema-snippet.xml, solrconfig.xml.patch, 
 tochild-bjq-filtered-search-fix.patch


 Lucene has the ability to do block joins, we should add it to Solr.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5120) Solrj Query response error with result number

2013-08-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733606#comment-13733606
 ] 

Łukasz Woźniczka commented on SOLR-5120:


Shawn Heisey, it's my fault, sorry. I was setting the start parameter to 1. 

 Solrj Query response error with result number 
 --

 Key: SOLR-5120
 URL: https://issues.apache.org/jira/browse/SOLR-5120
 Project: Solr
  Issue Type: Bug
 Environment: linux, lubuntu, java version 1.7.0_13.
Reporter: Łukasz Woźniczka
Priority: Critical

 This is my simple code: 
  QueryResponse qr;
 try {
 qr = fs.execute(solrServer);
 System.out.println("QUERY RESPONSE : " + qr);
 for (Entry<String, Object> r : qr.getResponse()) {
 System.out.println("RESPONSE: " + r.getKey() + " - " + 
 r.getValue());
 }
 SolrDocumentList dl = qr.getResults();
 System.out.println("--RESULT SIZE:[ " + dl.size() );
 } catch (SolrServerException e) {
 e.printStackTrace();
 }
 I am using SolrJ and solr-core version 4.4.0, and there is probably a bug in 
 SolrJ in the query result. I created one simple txt doc with content 'anna', 
 then restarted Solr and tried to search for this phrase. Nothing is found, but 
 this is my query response output: {numFound=1,start=1,docs=[]}.
 So as you can see there is info that numFound=1 but docs=[] -- it is empty. Next 
 I added another document with only the one word 'anna' and then tried to search for 
 that string, and this is the output: 
 {numFound=2,start=1,docs=[SolrDocument{file_id=9882, 
 file_name=luk-search2.txt, file_create_user=-1, file_department=10, 
 file_mime_type=text/plain, file_extension=.txt, file_parents_folder=[5021, 
 4781, 341, -20, -1], _version_=1442647024934584320}]}
 So as you can see there is numFound = 2 but only one document is listed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733656#comment-13733656
 ] 

Hoss Man commented on SOLR-4679:


Uwe: I defer to your judgement on this. If you think the patch is the right 
way to go, then +1 from me.

 HTML line breaks (<br>) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png


 HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during 
 extraction of content from HTML-Files. They need to be replaced with an empty 
 space.
 Test-File:
 <html>
 <head>
 <title>Test mit HTML-Zeilenschaltungen</title>
 </head>
 <p>
 word1<br>word2<br/>
 Some other words, a special name like linz<br>and another special name - 
 vienna
 </p>
 </html>
 The Solr-content-attribute contains the following text:
 Test mit HTML-Zeilenschaltungen
 word1word2
 Some other words, a special name like linzand another special name - vienna
 So we are not able to find the word linz.
 We use the ExtractingRequestHandler to put content into Solr. 
 (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2548) Multithreaded faceting

2013-08-08 Thread Gun Akkor (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733690#comment-13733690
 ] 

Gun Akkor commented on SOLR-2548:
-

I would like to revive this ticket, if possible. We have an index with about 10 
fields that we regularly facet on. These fields are either multi-valued or are 
of type TextField, so facet code chooses FC as the facet method, and uses the 
UnInvertedField instances to count each facet field, which takes several 
seconds per field in our case. So, multi-thread execution of getTermCounts() 
reduces the overall facet time considerably.

I started with the patch that was posted against 3.1 and modified it a little 
bit to take into account previous comments made by Yonik and Adrien. The new 
patch applies against 4.2.1, uses the already existing facetExecutor thread 
pool, and is configured per request via a facet.threads request param. If the 
param is not supplied, the code defaults to directExecutor and runs sequential 
as before. So, code should behave as is if user chooses not to submit number of 
threads to use.

Also in the process of testing, I noticed that the 
UnInvertedField.getUnInvertedField() call was synchronized too early, before 
the call to new UnInvertedField(field, searcher) when the field is not in the 
field value cache. Because its init can take several seconds, synchronizing on 
the cache for that duration was effectively serializing the execution of the 
multiple threads.
So, I modified it (albeit inelegantly) to synchronize later; in our case the cache 
hit ratio is low, so this makes a difference. A sketch of the idea is below.
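
A minimal sketch of that narrower locking (names hypothetical; not the exact patch):

{code}
// Build the expensive UnInvertedField outside the cache lock; synchronize only
// around the get-or-put, so concurrent facet threads don't serialize on init.
UnInvertedField uif;
synchronized (cache) {
  uif = cache.get(field);
}
if (uif == null) {
  UnInvertedField fresh = new UnInvertedField(field, searcher);  // slow init, outside the lock
  synchronized (cache) {
    uif = cache.get(field);   // re-check: another thread may have finished first
    if (uif == null) {
      cache.put(field, fresh);
      uif = fresh;
    }
  }
}
{code}

Two threads can still build the same field concurrently, but neither blocks the other during the multi-second init.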

The patch is still incomplete, as it does not yet extend this framework to 
other calls like ranges and dates, but it is a start.
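
With the patch applied, a request might opt in like this (hedged SolrJ sketch; the field names are hypothetical, and facet.threads is the parameter introduced by this patch, not a stock 4.2.1 parameter):

{code}
SolrQuery q = new SolrQuery("*:*");
q.setFacet(true);
q.addFacetField("author", "category", "keywords");
q.set("facet.threads", 4);   // omit the param to keep the old sequential behavior
QueryResponse rsp = solrServer.query(q);
{code}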

 Multithreaded faceting
 --

 Key: SOLR-2548
 URL: https://issues.apache.org/jira/browse/SOLR-2548
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 3.1
Reporter: Janne Majaranta
Priority: Minor
  Labels: facet
 Attachments: SOLR-2548_for_31x.patch, SOLR-2548.patch


 Add multithreading support for faceting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2548) Multithreaded faceting

2013-08-08 Thread Gun Akkor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gun Akkor updated SOLR-2548:


Attachment: SOLR-2548_4.2.1.patch

Patch against 4.2.1

 Multithreaded faceting
 --

 Key: SOLR-2548
 URL: https://issues.apache.org/jira/browse/SOLR-2548
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 3.1
Reporter: Janne Majaranta
Priority: Minor
  Labels: facet
 Attachments: SOLR-2548_4.2.1.patch, SOLR-2548_for_31x.patch, 
 SOLR-2548.patch


 Add multithreading support for faceting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.

2013-08-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733701#comment-13733701
 ] 

Adrien Grand commented on LUCENE-5157:
--

I discussed this issue with Robert to see how we can move forward:
 - moving OrdinalMap to MultiTermsEnum can be controversial, as Robert explained, 
so let's only tackle the naming and getSegmentOrd API issues here,
 - another option to make getSegmentOrd less trappy is to add an assertion that 
the provided segment number is the same as the one returned by 
{{getSegmentNumber}} (sketched below); this would allow returning the segment 
ordinals for any segment in the future without changing the API,
 - renaming subIndex to segment is OK as it makes the naming more consistent.

Robert, please correct me if you think it doesn't reflect correctly what we 
said.
Boaz, what do you think?
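
A rough sketch of what that assertion could look like (hedged; the delta lookup is a hypothetical stand-in for OrdinalMap's real internals):

{code}
// Hedged sketch only: getSegmentNumber() is the method discussed above;
// ordDelta() is a hypothetical stand-in for the real internal lookup.
public long getSegmentOrd(int segmentIndex, long globalOrd) {
  // The result is only meaningful for the segment that getSegmentNumber()
  // picks for this global ord, so trip an assert for any other caller.
  assert segmentIndex == getSegmentNumber(globalOrd);
  return globalOrd - ordDelta(globalOrd);
}
{code}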

 Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
 

 Key: LUCENE-5157
 URL: https://issues.apache.org/jira/browse/LUCENE-5157
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Boaz Leskes
Priority: Minor
 Attachments: LUCENE-5157.patch


 I refactored MultiDocValues.OrdinalMap, removing one unused parameter and 
 renaming some methods to more clearly communicate what they do. Also I 
 renamed subIndex references to segmentIndex.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard

2013-08-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733703#comment-13733703
 ] 

Shawn Heisey commented on SOLR-4414:


[~shalinmangar] I came across this issue while looking into my problems with 
distributed MoreLikeThis.  Things look a little off, so I'm writing this.

At a quick glance, the commit comment doesn't seem to be related to this issue, 
because it doesn't mention MLT at all.  Also, you have never commented on this 
issue outside the commit comment.  This is the issue number in CHANGES.txt, 
though.  Is the commit for this issue or another one?

If the commit is for this issue, I think this probably needs to be closed, 
fixed in 4.2 and 5.0.  If not, CHANGES.txt probably needs some cleanup.


 MoreLikeThis on a shard finds no interesting terms if the document queried is 
 not in that shard
 ---

 Key: SOLR-4414
 URL: https://issues.apache.org/jira/browse/SOLR-4414
 Project: Solr
  Issue Type: Bug
  Components: MoreLikeThis, SolrCloud
Affects Versions: 4.1
Reporter: Colin Bartolome

 Running a MoreLikeThis query in a cloud works only when the document being 
 queried exists in whatever shard serves the request. If the document is not 
 present in the shard, no interesting terms are found and, consequently, no 
 matches are found.
 h5. Steps to reproduce
 * Edit example/solr/collection1/conf/solrconfig.xml and add this line, with 
 the rest of the request handlers:
 {code:xml}
 <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />
 {code}
 * Follow the [simplest SolrCloud 
 example|http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster]
  to get two shards running.
 * Hit this URL: 
 [http://localhost:8983/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
 * Compare that output to that of this URL: 
 [http://localhost:7574/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
 The former URL will return a result and list some interesting terms. The 
 latter URL will return no results and list no interesting terms. It will also 
 show this odd XML element:
 {code:xml}
 <null name="response"/>
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.

2013-08-08 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-5157:
-

Assignee: Adrien Grand

 Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
 

 Key: LUCENE-5157
 URL: https://issues.apache.org/jira/browse/LUCENE-5157
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Boaz Leskes
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5157.patch


 I refactored MultiDocValues.OrdinalMap, removing one unused parameter and 
 renaming some methods to more clearly communicate what they do. Also I 
 renamed subIndex references to segmentIndex.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5150) WAH8DocIdSet: dense sets compression

2013-08-08 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733704#comment-13733704
 ] 

Adrien Grand commented on LUCENE-5150:
--

I'll commit soon if there are no objections. These dense sets can be common in 
cases where, e.g., users are allowed to see everything except a few documents.
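
A small sketch of the idea (FixedBitSet usage; maxDoc and the set contents are assumptions):

{code}
// A very dense set has a sparse complement, so flipping the bits and encoding
// the flipped set (plus an "inverted" flag on the read side) compresses better.
FixedBitSet visible = new FixedBitSet(maxDoc);
visible.set(0, maxDoc);          // everyone can see everything...
visible.clear(forbiddenDoc);     // ...except one document (hypothetical docID)

FixedBitSet complement = visible.clone();
complement.flip(0, maxDoc);      // now only the forbidden document is set
// encode `complement` and remember that it represents the inverse set
{code}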

 WAH8DocIdSet: dense sets compression
 

 Key: LUCENE-5150
 URL: https://issues.apache.org/jira/browse/LUCENE-5150
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Attachments: LUCENE-5150.patch


 In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be 
 able to encode the inverse set to also compress very dense sets.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries

2013-08-08 Thread Robert Muir (JIRA)
Robert Muir created LUCENE-5159:
---

 Summary: compressed diskdv sorted/sortedset termdictionaries
 Key: LUCENE-5159
 URL: https://issues.apache.org/jira/browse/LUCENE-5159
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Robert Muir


Sorted/SortedSet give you ordinal(s) per document, but then separately have a 
term dictionary of all the values.

You can do a few operations on these:
* ord -> term lookup (e.g. retrieving facet labels)
* term -> ord lookup (reverse lookup: e.g. FieldCacheRangeFilter)
* get a term enumerator (e.g. merging, OrdinalMap construction)

The current implementation for diskdv was the simplest thing that can possibly 
work: under the hood it just makes a binary DV for these (treating ordinals as 
document ids). When the terms are fixed length, you can address a term directly 
with multiplication. When they are variable length though, we have to store a 
packed ints structure in RAM.

This variable length case is overkill and chews up a lot of RAM if you have 
many unique values. It also chews up a lot of disk since all the values are 
just concatenated (no sharing).
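
For illustration, a hedged sketch of the three operations against the Lucene 4.x doc values API (the reader, field name, and docID are assumptions):

{code}
SortedDocValues dv = MultiDocValues.getSortedValues(reader, "field");

// ord -> term (e.g. retrieving facet labels)
BytesRef label = new BytesRef();
dv.lookupOrd(dv.getOrd(docID), label);

// term -> ord (reverse lookup, e.g. FieldCacheRangeFilter)
int ord = dv.lookupTerm(new BytesRef("value"));   // negative if the value is absent

// term enumerator over all unique values (e.g. merging, OrdinalMap construction)
TermsEnum values = dv.termsEnum();
{code}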



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard

2013-08-08 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733713#comment-13733713
 ] 

Mark Miller commented on SOLR-4414:
---

I think it was simply mis-tagged.

 MoreLikeThis on a shard finds no interesting terms if the document queried is 
 not in that shard
 ---

 Key: SOLR-4414
 URL: https://issues.apache.org/jira/browse/SOLR-4414
 Project: Solr
  Issue Type: Bug
  Components: MoreLikeThis, SolrCloud
Affects Versions: 4.1
Reporter: Colin Bartolome

 Running a MoreLikeThis query in a cloud works only when the document being 
 queried exists in whatever shard serves the request. If the document is not 
 present in the shard, no interesting terms are found and, consequently, no 
 matches are found.
 h5. Steps to reproduce
 * Edit example/solr/collection1/conf/solrconfig.xml and add this line, with 
 the rest of the request handlers:
 {code:xml}
 <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />
 {code}
 * Follow the [simplest SolrCloud 
 example|http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster]
  to get two shards running.
 * Hit this URL: 
 [http://localhost:8983/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
 * Compare that output to that of this URL: 
 [http://localhost:7574/solr/collection1/mlt?mlt.fl=includes&q=id:3007WFP&mlt.match.include=false&mlt.interestingTerms=list&mlt.mindf=1&mlt.mintf=1]
 The former URL will return a result and list some interesting terms. The 
 latter URL will return no results and list no interesting terms. It will also 
 show this odd XML element:
 {code:xml}
 <null name="response"/>
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries

2013-08-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-5159:


Attachment: LUCENE-5159.patch

Here's an in-progress patch... all the core/codec tests pass, but I'm sure 
there are a few bugs to knock out (improving the tests is the way to go here).

I'm also unhappy with the complexity.

The idea is that for the variable-length case, we just prefix-share (I set interval=16), 
like the Lucene 3.x term dictionary; a sketch of the encoding is below. The current 
patch specializes the TermsEnum and reverse lookup for this case (but again, I'm sure 
there are bugs; it's hairy).
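
For readers following along, a hedged sketch of interval-based prefix sharing in the abstract (this is not the patch itself; DataOutput and BytesRef are the usual Lucene utility classes):

{code}
// Hedged sketch, not the actual patch: store every 16th term in full; the
// terms in between store only (sharedPrefixLen, suffixBytes) vs. the previous term.
static final int INTERVAL = 16;

static void encode(List<BytesRef> sortedTerms, DataOutput out) throws IOException {
  BytesRef last = new BytesRef();
  for (int i = 0; i < sortedTerms.size(); i++) {
    BytesRef term = sortedTerms.get(i);
    int prefix = 0;
    if (i % INTERVAL != 0) {  // interval boundaries get a full copy (random-access points)
      int max = Math.min(last.length, term.length);
      while (prefix < max
          && last.bytes[last.offset + prefix] == term.bytes[term.offset + prefix]) {
        prefix++;
      }
    }
    out.writeVInt(prefix);                 // how much of the previous term to reuse
    out.writeVInt(term.length - prefix);   // suffix length
    out.writeBytes(term.bytes, term.offset + prefix, term.length - prefix);
    last = BytesRef.deepCopyOf(term);
  }
}
{code}

An ord -> term lookup then seeks to the nearest interval boundary and applies at most 15 suffixes, trading a little CPU for much less RAM and disk.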

 compressed diskdv sorted/sortedset termdictionaries
 ---

 Key: LUCENE-5159
 URL: https://issues.apache.org/jira/browse/LUCENE-5159
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Robert Muir
 Attachments: LUCENE-5159.patch


 Sorted/SortedSet give you ordinal(s) per document, but then separately have a 
 term dictionary of all the values.
 You can do a few operations on these:
 * ord -> term lookup (e.g. retrieving facet labels)
 * term -> ord lookup (reverse lookup: e.g. FieldCacheRangeFilter)
 * get a term enumerator (e.g. merging, OrdinalMap construction)
 The current implementation for diskdv was the simplest thing that can 
 possibly work: under the hood it just makes a binary DV for these (treating 
 ordinals as document ids). When the terms are fixed length, you can address a 
 term directly with multiplication. When they are variable length though, we 
 have to store a packed ints structure in RAM.
 This variable length case is overkill and chews up a lot of RAM if you have 
 many unique values. It also chews up a lot of disk since all the values are 
 just concatenated (no sharing).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4414) MoreLikeThis on a shard finds no interesting terms if the document queried is not in that shard

2013-08-08 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733717#comment-13733717
 ] 

Shalin Shekhar Mangar commented on SOLR-4414:
-

[~elyograg] - That was a mistake. The commit mentioned here actually belonged 
to SOLR-4415. I fixed the issue number in the change log but I forgot to put a 
comment here.

 MoreLikeThis on a shard finds no interesting terms if the document queried is 
 not in that shard
 ---

 Key: SOLR-4414
 URL: https://issues.apache.org/jira/browse/SOLR-4414
 Project: Solr
  Issue Type: Bug
  Components: MoreLikeThis, SolrCloud
Affects Versions: 4.1
Reporter: Colin Bartolome

 Running a MoreLikeThis query in a cloud works only when the document being 
 queried exists in whatever shard serves the request. If the document is not 
 present in the shard, no interesting terms are found and, consequently, no 
 matches are found.
 h5. Steps to reproduce
 * Edit example/solr/collection1/conf/solrconfig.xml and add this line, with 
 the rest of the request handlers:
 {code:xml}
 requestHandler name=/mlt class=solr.MoreLikeThisHandler /
 {code}
 * Follow the [simplest SolrCloud 
 example|http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster]
  to get two shards running.
 * Hit this URL: 
 [http://localhost:8983/solr/collection1/mlt?mlt.fl=includesq=id:3007WFPmlt.match.include=falsemlt.interestingTerms=listmlt.mindf=1mlt.mintf=1]
 * Compare that output to that of this URL: 
 [http://localhost:7574/solr/collection1/mlt?mlt.fl=includesq=id:3007WFPmlt.match.include=falsemlt.interestingTerms=listmlt.mindf=1mlt.mintf=1]
 The former URL will return a result and list some interesting terms. The 
 latter URL will return no results and list no interesting terms. It will also 
 show this odd XML element:
 {code:xml}
 null name=response/
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5157) Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733745#comment-13733745
 ] 

Robert Muir commented on LUCENE-5157:
-

+1, let's improve it for now and not expand it to try to be a general 
TermsEnum merger. But on the other hand, I am still not convinced we can't 
improve the efficiency of this thing, so it's good if we can prevent the innards 
from being too exposed (unless it's causing some use case an actual problem).

 Refactoring MultiDocValues.OrdinalMap to clarify API and internal structure.
 

 Key: LUCENE-5157
 URL: https://issues.apache.org/jira/browse/LUCENE-5157
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Boaz Leskes
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-5157.patch


 I refactored MultiDocValues.OrdinalMap, removing one unused parameter and 
 renaming some methods to more clearly communicate what they do. Also I 
 renamed subIndex references to segmentIndex.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries

2013-08-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-5159:


Attachment: LUCENE-5159.patch

Fixes an off-by-one bug. I'll beef up the DV base test case to really exercise this 
TermsEnum...

 compressed diskdv sorted/sortedset termdictionaries
 ---

 Key: LUCENE-5159
 URL: https://issues.apache.org/jira/browse/LUCENE-5159
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Robert Muir
 Attachments: LUCENE-5159.patch, LUCENE-5159.patch


 Sorted/SortedSet give you ordinal(s) per document, but then separately have a 
 term dictionary of all the values.
 You can do a few operations on these:
 * ord -> term lookup (e.g. retrieving facet labels)
 * term -> ord lookup (reverse lookup: e.g. FieldCacheRangeFilter)
 * get a term enumerator (e.g. merging, OrdinalMap construction)
 The current implementation for diskdv was the simplest thing that can 
 possibly work: under the hood it just makes a binary DV for these (treating 
 ordinals as document ids). When the terms are fixed length, you can address a 
 term directly with multiplication. When they are variable length though, we 
 have to store a packed ints structure in RAM.
 This variable length case is overkill and chews up a lot of RAM if you have 
 many unique values. It also chews up a lot of disk since all the values are 
 just concatenated (no sharing).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733776#comment-13733776
 ] 

Uwe Schindler commented on SOLR-4679:
-

Hoss: I just took this issue because it was unassigned and I was the one who 
pushed to add ignorable whitespace in TIKA at that time, so Jukka and I 
decided this would be best.

Because you are still not convinced by my argument, let me recapitulate 
TIKA's problems:

- TIKA decided to use XHTML as its output format to report the parsed documents 
to the consumer. This is nice, because it allows preserving some of the 
formatting (like bold fonts, paragraphs, ...) originating from the original 
document. Of course most of this formatting is lost, but you can still detect 
things like emphasized text. By choosing XHTML as the output format, TIKA must 
of course use XHTML formatting for newlines and similar. So whenever a line break 
is needed, the TIKA parser emits a <br/> tag or places the paragraph (in a 
PDF) inside a <p/> element. As we all know, HTML ignores formatting like 
newlines, tabs, ... (all are treated as one single whitespace, something like 
this regex replace: {{s/\s+/ /}})
- On the other hand, TIKA wants to make it simple for people to extract the 
*plain text* contents. With the XHTML-only approach this would be hard for the 
consumer. Because to add the correct newlines, the consumer has to fully 
understand XHTML and detect block elements and replace them by \n

To support both usages of TIKA the idea was to embed this information which is 
unimportant to HTML (as HTML ignores whitespaces completely) as 
ignorableWhitespace as convenience for the user. A fully compliant XHTML 
consumer would not parse the ignoreable stuff. As it understands HTML it would 
detect a p element as a block element and format the output.

Solr unfortunately has some strange approach: It is mainly interested in the 
text only contents, so ideally when consuming the HTLL it could use 
{{WriteoutContentHandler(StringBuilder, 
BodyContentHandler(parserConmtentHandler)}}. In that case TIKA would do the 
right thing automatically: It would extract only text from the body element and 
would use the convenience whitespace to format the text in ASCII-ART-like way 
(using tabs, newlines,...) :-)
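
For illustration, a minimal sketch of that ideal text-only path, assuming 
Tika's stock HtmlParser and BodyContentHandler and the attached external.htm on 
the classpath (a sketch of the usage pattern only, not Solr's actual code):

{code}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

// BodyContentHandler keeps only events inside <body> and also writes the
// ignorableWhitespace() events, so block elements come out as newlines/tabs.
public class PlainTextSketch {
  public static void main(String[] args) throws Exception {
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    new HtmlParser().parse(
        PlainTextSketch.class.getResourceAsStream("/external.htm"),
        handler, new Metadata(), new ParseContext());
    System.out.println(handler.toString()); // "word1\nword2 ..." rather than "word1word2"
  }
}
{code}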
Solr has a hybrid approach: It collects everything into a content tag (which is 
similar to the above approach), but the bug is that, in contrast to TIKA's 
official WriteOutContentHandler, it does not use the ignorable whitespace 
inserted for convenience. In addition, TIKA also has a stack where it allows 
processing parts of the documents (like the title element or all <em> elements). 
In that case it has several StringBuilders in parallel that are populated with 
the contents. The problems exist here too, but cannot be solved by using 
ignorable whitespace: e.g. if one indexes only all <em> elements (which are 
inline HTML elements, not block elements), there is no whitespace, so all <em> 
elements would be glued together in the em field of your index... I just mention 
this because, in my opinion, the SolrContentHandler needs more work to correctly 
understand HTML and not just collect element names in a map!

Now to your complaint: You proposed to report the newlines as real 
{{characters()}} events - but this is not the right thing to do here. As I said, 
HTML does not know these characters; they are ignored. The formatting is done 
by the element names (like p, div, table). So the helper whitespace for 
text-only consumers should be inserted as ignorableWhitespace only; if we 
added it to the real character data, we would report things that every HTML 
parser (like NekoHTML) would never report to the consumer. NekoHTML would also 
report this useless extra whitespace as ignorable.

The convenience here is that TIKA's XHTMLContentHandler, used by all parsers, is 
configured to help the text-only user but not hurt the HTML-only user. 
This differentiation is done by reporting the HTML element names (p, div, 
table, th, td, tr, abbr, em, strong,...) but also reporting the 
ASCII-art text-only content like tabs inside tables and newlines after block 
elements. This is always done as ignorableWhitespace (for convenience); a 
real HTML parser must ignore it - and it's correct to do so.



 HTML line breaks (br) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 

[jira] [Commented] (LUCENE-5159) compressed diskdv sorted/sortedset termdictionaries

2013-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733787#comment-13733787
 ] 

Michael McCandless commented on LUCENE-5159:


+1, patch looks great.

 compressed diskdv sorted/sortedset termdictionaries
 ---

 Key: LUCENE-5159
 URL: https://issues.apache.org/jira/browse/LUCENE-5159
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Robert Muir
 Attachments: LUCENE-5159.patch, LUCENE-5159.patch


 Sorted/SortedSet give you ordinal(s) per document, but then separately have a 
 term dictionary of all the values.
 You can do a few operations on these:
 * ord -> term lookup (e.g. retrieving facet labels)
 * term -> ord lookup (reverse lookup: e.g. fieldcacherangefilter)
 * get a term enumerator (e.g. merging, ordinalmap construction)
 The current implementation for diskdv was the simplest thing that can 
 possibly work: under the hood it just makes a binary DV for these (treating 
 ordinals as document ids). When the terms are fixed length, you can address a 
 term directly with multiplication. When they are variable length though, we 
 have to store a packed ints structure in RAM.
 This variable length case is overkill and chews up a lot of RAM if you have 
 many unique values. It also chews up a lot of disk since all the values are 
 just concatenated (no sharing).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733791#comment-13733791
 ] 

Hoss Man commented on SOLR-4679:


bq. Because you are still not convinced by my argumentation, let me 
recapitulate TIKA's problems:

I never said that ... you said "I can take the issue if you like" and you 
explained why the existing patch should be committed -- I'm totally willing to 
go along with that, so have at it.  It seems sketchy to me, but if that's the 
way Tika works, that's the way Tika works; you certainly understand it better 
than me, so I defer to your assessment.

(as mentioned in TIKA-1134 it would be nice if this type of behavior was better 
documented for people implementing their own ContentHandlers, but that's a Tika 
issue not a Solr issue.)

 HTML line breaks (br) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png


 HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during 
 extraction of content from HTML files. They need to be replaced with an empty 
 space.
 Test-File:
 <html>
 <head>
 <title>Test mit HTML-Zeilenschaltungen</title>
 </head>
 <p>
 word1<br>word2<br/>
 Some other words, a special name like linz<br>and another special name - 
 vienna
 </p>
 </html>
 The Solr-content-attribute contains the following text:
 Test mit HTML-Zeilenschaltungen
 word1word2
 Some other words, a special name like linzand another special name - vienna
 So we are not able to find the word linz.
 We use the ExtractingRequestHandler to put content into Solr. 
 (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException

2013-08-08 Thread Shawn Heisey (JIRA)
Shawn Heisey created SOLR-5125:
--

 Summary: Distributed MoreLikeThis fails with NullPointerException, 
shard query gives EarlyTerminatingCollectorException
 Key: SOLR-5125
 URL: https://issues.apache.org/jira/browse/SOLR-5125
 Project: Solr
  Issue Type: Bug
  Components: MoreLikeThis
Affects Versions: 4.4
Reporter: Shawn Heisey
 Fix For: 4.5, 5.0


A distributed MoreLikeThis query that works perfectly on 4.2.1 is failing on 
4.4.0.  The original query returns a NullPointerException.  The Solr log shows 
that the shard queries are throwing EarlyTerminatingCollectorException.  Full 
details to follow in the comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException

2013-08-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733816#comment-13733816
 ] 

Shawn Heisey commented on SOLR-5125:


The query that works fine in 4.2.1 has the following URL:

/solr/ncmain/ncdismax?q=tag_id:ugphotos000996&mlt=true&mlt.fl=catchall&mlt.count=100

The ncmain handler has the shards parameter in solrconfig.xml and is set up for 
edismax. The shards.qt parameter is /search, a handler using the default query 
parser.  On 4.2.1, it had a QTime of 49641, a performance issue that I 
mentioned on the mailing list and will be pursuing there.  Here's a server log 
excerpt, showing a shard request, the shard exception, the original query, and 
the final exception.

{noformat}
INFO  - 2013-08-08 12:18:20.030; org.apache.solr.core.SolrCore; [s3live] 
webapp=/solr path=/search 
params={mlt.fl=catchall&sort=score+desc&tie=0.1&shards.qt=/search&mlt.dist.id=ugphotos000996&mlt=true&q.alt=*:*&distrib=false&shards.tolerant=true&version=2&NOW=1375985885078&shard.url=bigindy5.REDACTED.com:8982/solr/s3live&df=catchall&fl=score,tag_id&qs=3&qt=/search&lowercaseOperators=false&mm=100%25&qf=catchall&wt=javabin&rows=100&defType=edismax&pf=catchall^2&mlt.count=100&start=0&q=%2B(catchall:arabian+catchall:close-up+catchall:horse+catchall:closeup+catchall:close+catchall:white+catchall:up+catchall:sassy+catchall:154+catchall:equestrian+catchall:domestic+catchall:animals+catchall:of)+-tag_id:ugphotos000996&shards.info=true&boost=min(recip(abs(ms(NOW/HOUR,pd)),1.92901e-10,1.5,1.5),0.85)&isShard=true&ps=3}
 6815483 status=500 QTime=14639
ERROR - 2013-08-08 12:18:20.030; org.apache.solr.common.SolrException; 
null:org.apache.solr.search.EarlyTerminatingCollectorException
at 
org.apache.solr.search.EarlyTerminatingCollector.collect(EarlyTerminatingCollector.java:62)
at 
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:289)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:624)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1494)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1363)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
at 
org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:1226)
at 
org.apache.solr.handler.MoreLikeThisHandler$MoreLikeThisHelper.getMoreLikeThis(MoreLikeThisHandler.java:365)
at 
org.apache.solr.handler.component.MoreLikeThisComponent.getMoreLikeThese(MoreLikeThisComponent.java:356)
at 
org.apache.solr.handler.component.MoreLikeThisComponent.process(MoreLikeThisComponent.java:107)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at 
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
 

[jira] [Commented] (SOLR-5125) Distributed MoreLikeThis fails with NullPointerException, shard query gives EarlyTerminatingCollectorException

2013-08-08 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733833#comment-13733833
 ] 

Shawn Heisey commented on SOLR-5125:


Here's someone else having the same problem.  They don't say whether it's a 
single index or distributed, though.

http://stackoverflow.com/questions/17866313/earlyterminatingcollectorexception-in-mlt-component-of-solr-4-4
 


 Distributed MoreLikeThis fails with NullPointerException, shard query gives 
 EarlyTerminatingCollectorException
 --

 Key: SOLR-5125
 URL: https://issues.apache.org/jira/browse/SOLR-5125
 Project: Solr
  Issue Type: Bug
  Components: MoreLikeThis
Affects Versions: 4.4
Reporter: Shawn Heisey
 Fix For: 4.5, 5.0


 A distributed MoreLikeThis query that works perfectly on 4.2.1 is failing on 
 4.4.0.  The original query returns a NullPointerException.  The Solr log 
 shows that the shard queries are throwing EarlyTerminatingCollectorException. 
  Full details to follow in the comments.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4952) audit test configs to use solrconfig.snippet.randomindexconfig.xml in more tests

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733844#comment-13733844
 ] 

ASF subversion and git services commented on SOLR-4952:
---

Commit 1511954 from hoss...@apache.org in branch 'dev/trunk'
[ https://svn.apache.org/r1511954 ]

SOLR-4952: TestIndexSearcher.testReopen needs fixed segment merging

 audit test configs to use solrconfig.snippet.randomindexconfig.xml in more 
 tests
 

 Key: SOLR-4952
 URL: https://issues.apache.org/jira/browse/SOLR-4952
 Project: Solr
  Issue Type: Sub-task
Reporter: Hoss Man
Assignee: Hoss Man

 in SOLR-4942 I updated every solrconfig.xml to either...
 * include solrconfig.snippet.randomindexconfig.xml where it was easy to do so
 * use the useCompoundFile sys prop if it already had an {{indexConfig}} 
 section, or if including the snippet wasn't going to be easy (ie: contrib 
 tests)
 As an improvement on this:
 * audit all core configs not already using 
 solrconfig.snippet.randomindexconfig.xml and either:
 ** make them use it, ignoring any previously unimportant explicit 
 indexConfig settings
 ** make them use it, using explicit sys props to overwrite random values in 
 cases where explicit indexConfig values are important for the test
 ** add a comment why it's not using the include snippet in cases where the 
 explicit parsing is part of the test
 * try to figure out a way for contrib tests to easily include the same file 
 and/or apply the same rules as above

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4952) audit test configs to use solrconfig.snippet.randomindexconfig.xml in more tests

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733851#comment-13733851
 ] 

ASF subversion and git services commented on SOLR-4952:
---

Commit 1511958 from hoss...@apache.org in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1511958 ]

SOLR-4952: TestIndexSearcher.testReopen needs fixed segment merging (merge 
r1511954)

 audit test configs to use solrconfig.snippet.randomindexconfig.xml in more 
 tests
 

 Key: SOLR-4952
 URL: https://issues.apache.org/jira/browse/SOLR-4952
 Project: Solr
  Issue Type: Sub-task
Reporter: Hoss Man
Assignee: Hoss Man

 in SOLR-4942 I updated every solrconfig.xml to either...
 * include solrconfig.snippet.randomindexconfig.xml where it was easy to do so
 * use the useCompoundFile sys prop if it already had an {{indexConfig}} 
 section, or if including the snippet wasn't going to be easy (ie: contrib 
 tests)
 As an improvement on this:
 * audit all core configs not already using 
 solrconfig.snippet.randomindexconfig.xml and either:
 ** make them use it, ignoring any previously unimportant explicit 
 indexConfig settings
 ** make them use it, using explicit sys props to overwrite random values in 
 cases where explicit indexConfig values are important for the test
 ** add a comment why it's not using the include snippet in cases where the 
 explicit parsing is part of the test
 * try to figure out a way for contrib tests to easily include the same file 
 and/or apply the same rules as above

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Grant Ingersoll (JIRA)
Grant Ingersoll created LUCENE-5160:
---

 Summary: NIOFSDirectory, SimpleFSDirectory (others?) don't 
properly handle valid file and FileChannel read conditions
 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 4.4, 5.0
Reporter: Grant Ingersoll


Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
handle the -1 condition that can be returned from FileChannel.read().  If it 
returns -1, then it will move the file pointer back and you will enter an 
infinite loop.  SimpleFSDirectory displays the same characteristics, although I 
have only seen the issue on NIOFSDirectory.

The code in question from NIOFSDirectory:
{code}
try {
while (readLength > 0) {
  final int limit;
  if (readLength > chunkSize) {
// LUCENE-1566 - work around JVM Bug by breaking
// very large reads into chunks
limit = readOffset + chunkSize;
  } else {
limit = readOffset + readLength;
  }
  bb.limit(limit);
  int i = channel.read(bb, pos);
  pos += i;
  readOffset += i;
  readLength -= i;
}
{code}
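
For reference, a sketch of the guarded loop with the fix described later in 
this thread (check for -1 and throw a java.io.EOFException instead of looping 
forever); this sketches the idea, not the committed patch:

{code}
while (readLength > 0) {
  final int limit;
  if (readLength > chunkSize) {
    // LUCENE-1566 - work around JVM Bug by breaking very large reads into chunks
    limit = readOffset + chunkSize;
  } else {
    limit = readOffset + readLength;
  }
  bb.limit(limit);
  int i = channel.read(bb, pos);
  if (i < 0) {
    // -1 signals end-of-file: fail loudly instead of corrupting the counters
    throw new EOFException("read past EOF");
  }
  pos += i;
  readOffset += i;
  readLength -= i;
}
{code}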

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733865#comment-13733865
 ] 

Uwe Schindler commented on LUCENE-5160:
---

This is a bug which is never hit by Lucene, because we never read sequentially 
until end of file.

+1 to fix this. Theoretically, to comply with MMapDirectory it should throw 
EOFException if it gets -1, because Lucene code should not read beyond the file end.

 NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
 and FileChannel read conditions
 

 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0, 4.4
Reporter: Grant Ingersoll

 Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
 handle the -1 condition that can be returned from FileChannel.read().  If it 
 returns -1, then it will move the file pointer back and you will enter an 
 infinite loop.  SimpleFSDirectory displays the same characteristics, although 
 I have only seen the issue on NIOFSDirectory.
 The code in question from NIOFSDirectory:
 {code}
 try {
 while (readLength > 0) {
   final int limit;
   if (readLength > chunkSize) {
 // LUCENE-1566 - work around JVM Bug by breaking
 // very large reads into chunks
 limit = readOffset + chunkSize;
   } else {
 limit = readOffset + readLength;
   }
   bb.limit(limit);
   int i = channel.read(bb, pos);
   pos += i;
   readOffset += i;
   readLength -= i;
 }
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4774) Add FieldComparator that allows sorting parent docs based on field inside the child docs

2013-08-08 Thread Mikhail Khludnev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733872#comment-13733872
 ] 

Mikhail Khludnev commented on LUCENE-4774:
--

fwiw something like 
http://www.gossamer-threads.com/lists/lucene/java-dev/199372?do=post_view_threaded
 happens to me 

NOTE: reproduce with: ant test  -Dtestcase=TestBlockJoinSorting 
-Dtests.method=testNestedSorting -Dtests.seed=FB4F1BE85579255B 
-Dtests.slow=true -Dtests.locale=da_DK -Dtests.timezone=Asia/Qatar 
-Dtests.file.encoding=UTF-8
NOTE: test params are: codec=Asserting, 
sim=RandomSimilarityProvider(queryNorm=true,coord=crazy): {}, locale=da_DK, 
timezone=Asia/Qatar
NOTE: Linux 2.6.32-131.0.15.el6.x86_64 amd64/Sun Microsystems Inc. 1.6.0_29 
(64-bit)/cpus=4,threads=1,free=317130512,total=349241344
NOTE: All tests run in this JVM: [TestJoinUtil, TestBlockJoin, 
TestBlockJoinSorting]

---
Test set: org.apache.lucene.search.join.TestBlockJoinSorting
---
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.06 sec <<< 
FAILURE!
testNestedSorting(org.apache.lucene.search.join.TestBlockJoinSorting)  Time 
elapsed: 0.021 sec  <<< FAILURE!
java.lang.AssertionError: expected:<3> but was:<28>
at 
__randomizedtesting.SeedInfo.seed([FB4F1BE85579255B:F3A6F6A915D02835]:0)
at org.junit.Assert.fail(Assert.java:93)
at org.junit.Assert.failNotEquals(Assert.java:647)
at org.junit.Assert.assertEquals(Assert.java:128)
at org.junit.Assert.assertEquals(Assert.java:472)
at org.junit.Assert.assertEquals(Assert.java:456)
at 
org.apache.lucene.search.join.TestBlockJoinSorting.testNestedSorting(TestBlockJoinSorting.java:226)

 Add FieldComparator that allows sorting parent docs based on field inside the 
 child docs
 

 Key: LUCENE-4774
 URL: https://issues.apache.org/jira/browse/LUCENE-4774
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/join
Reporter: Martijn van Groningen
Assignee: Martijn van Groningen
 Fix For: 5.0, 4.3

 Attachments: LUCENE-4774.patch, LUCENE-4774.patch, LUCENE-4774.patch


 A field comparator for sorting block join parent docs based on a field in 
 the associated child docs. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-4679) HTML line breaks (br) are removed during indexing; causes wrong search results

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733901#comment-13733901
 ] 

Uwe Schindler commented on SOLR-4679:
-

bq. I never said that ...

You somehow said:

bq. I defer to your judgement on this

So I assumed that you are still not 100% convinced. Sorry.

In any case I will take the issue. In my opinion there is more work to be done 
with this crazy stack of StringBuilders to better handle the ignorableWhitespace 
when a new field begins/ends. Currently it's inserted after the block end tag, so 
it would go one up in the stack only. I have to think a little bit about it, 
but the fix in your patch is the easiest for now. And the maybe-useless 
whitespace on some lower stacked StringBuilders is generally removed by text 
analysis.

 HTML line breaks (br) are removed during indexing; causes wrong search 
 results
 

 Key: SOLR-4679
 URL: https://issues.apache.org/jira/browse/SOLR-4679
 Project: Solr
  Issue Type: Bug
  Components: update
Affects Versions: 4.2
 Environment: Windows Server 2008 R2, Java 6, Tomcat 7
Reporter: Christoph Straßer
Assignee: Uwe Schindler
 Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
 Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png


 HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during 
 extraction of content from HTML files. They need to be replaced with an empty 
 space.
 Test-File:
 <html>
 <head>
 <title>Test mit HTML-Zeilenschaltungen</title>
 </head>
 <p>
 word1<br>word2<br/>
 Some other words, a special name like linz<br>and another special name - 
 vienna
 </p>
 </html>
 The Solr-content-attribute contains the following text:
 Test mit HTML-Zeilenschaltungen
 word1word2
 Some other words, a special name like linzand another special name - vienna
 So we are not able to find the word linz.
 We use the ExtractingRequestHandler to put content into Solr. 
 (wiki.apache.org/solr/ExtractingRequestHandler)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned LUCENE-5160:
---

Assignee: Grant Ingersoll

 NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
 and FileChannel read conditions
 

 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0, 4.4
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll

 Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
 handle the -1 condition that can be returned from FileChannel.read().  If it 
 returns -1, then it will move the file pointer back and you will enter an 
 infinite loop.  SimpleFSDirectory displays the same characteristics, although 
 I have only seen the issue on NIOFSDirectory.
 The code in question from NIOFSDirectory:
 {code}
 try {
 while (readLength > 0) {
   final int limit;
   if (readLength > chunkSize) {
 // LUCENE-1566 - work around JVM Bug by breaking
 // very large reads into chunks
 limit = readOffset + chunkSize;
   } else {
 limit = readOffset + readLength;
   }
   bb.limit(limit);
   int i = channel.read(bb, pos);
   pos += i;
   readOffset += i;
   readLength -= i;
 }
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-5160:


Attachment: LUCENE-5160.patch

Patch adds the -1 check and throws an EOFException.

 NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
 and FileChannel read conditions
 

 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0, 4.4
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: LUCENE-5160.patch


 Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
 handle the -1 condition that can be returned from FileChannel.read().  If it 
 returns -1, then it will move the file pointer back and you will enter an 
 infinite loop.  SimpleFSDirectory displays the same characteristics, although 
 I have only seen the issue on NIOFSDirectory.
 The code in question from NIOFSDirectory:
 {code}
 try {
 while (readLength > 0) {
   final int limit;
   if (readLength > chunkSize) {
 // LUCENE-1566 - work around JVM Bug by breaking
 // very large reads into chunks
 limit = readOffset + chunkSize;
   } else {
 limit = readOffset + readLength;
   }
   bb.limit(limit);
   int i = channel.read(bb, pos);
   pos += i;
   readOffset += i;
   readLength -= i;
 }
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733912#comment-13733912
 ] 

Uwe Schindler commented on LUCENE-5160:
---

+1 to commit. Looks good. Writing a test is a bit hard.

MMapDirectory is not affected as it already has a check for the length of the 
MappedByteBuffers.

 NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
 and FileChannel read conditions
 

 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0, 4.4
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: LUCENE-5160.patch


 Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
 handle the -1 condition that can be returned from FileChannel.read().  If it 
 returns -1, then it will move the file pointer back and you will enter an 
 infinite loop.  SimpleFSDirectory displays the same characteristics, although 
 I have only seen the issue on NIOFSDirectory.
 The code in question from NIOFSDirectory:
 {code}
 try {
 while (readLength > 0) {
   final int limit;
   if (readLength > chunkSize) {
 // LUCENE-1566 - work around JVM Bug by breaking
 // very large reads into chunks
 limit = readOffset + chunkSize;
   } else {
 limit = readOffset + readLength;
   }
   bb.limit(limit);
   int i = channel.read(bb, pos);
   pos += i;
   readOffset += i;
   readLength -= i;
 }
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733914#comment-13733914
 ] 

ASF subversion and git services commented on LUCENE-5160:
-

Commit 1512011 from [~gsingers] in branch 'dev/trunk'
[ https://svn.apache.org/r1512011 ]

LUCENE-5160: check for -1 return conditions in file reads

 NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
 and FileChannel read conditions
 

 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0, 4.4
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: LUCENE-5160.patch


 Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
 handle the -1 condition that can be returned from FileChannel.read().  If it 
 returns -1, then it will move the file pointer back and you will enter an 
 infinite loop.  SimpleFSDirectory displays the same characteristics, although 
 I have only seen the issue on NIOFSDirectory.
 The code in question from NIOFSDirectory:
 {code}
 try {
 while (readLength > 0) {
   final int limit;
   if (readLength > chunkSize) {
 // LUCENE-1566 - work around JVM Bug by breaking
 // very large reads into chunks
 limit = readOffset + chunkSize;
   } else {
 limit = readOffset + readLength;
   }
   bb.limit(limit);
   int i = channel.read(bb, pos);
   pos += i;
   readOffset += i;
   readLength -= i;
 }
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Robert Muir (JIRA)
Robert Muir created LUCENE-5161:
---

 Summary: review FSDirectory chunking defaults and test the chunking
 Key: LUCENE-5161
 URL: https://issues.apache.org/jira/browse/LUCENE-5161
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Robert Muir


Today there is a loop in SimpleFS/NIOFS:
{code}
try {
  do {
final int readLength;
if (total + chunkSize > len) {
  readLength = len - total;
} else {
  // LUCENE-1566 - work around JVM Bug by breaking very large reads 
into chunks
  readLength = chunkSize;
}
final int i = file.read(b, offset + total, readLength);
total += i;
  } while (total < len);
} catch (OutOfMemoryError e) {
{code}

I bet if you look at the clover report it's untested, because it's fixed at 100MB 
for 32-bit users and 2GB for 64-bit users (are these defaults even good?!).

Also if you call the setter on a 64-bit machine to change the size, it just 
totally ignores it. We should remove that; the setter should always work.

And we should set it to small values in tests so this loop is actually executed.
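
As a rough sketch of what "the setter should always work" could look like (the 
shape is assumed from the description above, not taken from the patch):

{code}
public final void setReadChunkSize(int chunkSize) {
  // honor the configured value unconditionally; no 32-bit/64-bit special case
  if (chunkSize <= 0) {
    throw new IllegalArgumentException("chunkSize must be positive");
  }
  this.chunkSize = chunkSize;
}
{code}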

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13733921#comment-13733921
 ] 

ASF subversion and git services commented on LUCENE-5160:
-

Commit 1512016 from [~gsingers] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1512016 ]

LUCENE-5160: merge from trunk

 NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
 and FileChannel read conditions
 

 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0, 4.4
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: LUCENE-5160.patch


 Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
 handle the -1 condition that can be returned from FileChannel.read().  If it 
 returns -1, then it will move the file pointer back and you will enter an 
 infinite loop.  SimpleFSDirectory displays the same characteristics, although 
 I have only seen the issue on NIOFSDirectory.
 The code in question from NIOFSDirectory:
 {code}
 try {
 while (readLength > 0) {
   final int limit;
   if (readLength > chunkSize) {
 // LUCENE-1566 - work around JVM Bug by breaking
 // very large reads into chunks
 limit = readOffset + chunkSize;
   } else {
 limit = readOffset + readLength;
   }
   bb.limit(limit);
   int i = channel.read(bb, pos);
   pos += i;
   readOffset += i;
   readLength -= i;
 }
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-5160) NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file and FileChannel read conditions

2013-08-08 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-5160.
-

   Resolution: Fixed
Fix Version/s: 4.5
   5.0
Lucene Fields:   (was: New)

 NIOFSDirectory, SimpleFSDirectory (others?) don't properly handle valid file 
 and FileChannel read conditions
 

 Key: LUCENE-5160
 URL: https://issues.apache.org/jira/browse/LUCENE-5160
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 5.0, 4.4
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 5.0, 4.5

 Attachments: LUCENE-5160.patch


 Around line 190 of NIOFSDirectory, the loop to read in bytes doesn't properly 
 handle the -1 condition that can be returned from FileChannel.read().  If it 
 returns -1, then it will move the file pointer back and you will enter an 
 infinite loop.  SimpleFSDirectory displays the same characteristics, although 
 I have only seen the issue on NIOFSDirectory.
 The code in question from NIOFSDirectory:
 {code}
 try {
 while (readLength > 0) {
   final int limit;
   if (readLength > chunkSize) {
 // LUCENE-1566 - work around JVM Bug by breaking
 // very large reads into chunks
 limit = readOffset + chunkSize;
   } else {
 limit = readOffset + readLength;
   }
   bb.limit(limit);
   int i = channel.read(bb, pos);
   pos += i;
   readOffset += i;
   readLength -= i;
 }
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-5161:


Attachment: LUCENE-5161.patch

This patch makes the setter always work, and changes LuceneTestCase to use 
small values for the chunking.

I didn't adjust any defaults (maybe Uwe can help, he knows about the code in 
question).

 review FSDirectory chunking defaults and test the chunking
 --

 Key: LUCENE-5161
 URL: https://issues.apache.org/jira/browse/LUCENE-5161
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-5161.patch


 Today there is a loop in SimpleFS/NIOFS:
 {code}
 try {
   do {
 final int readLength;
  if (total + chunkSize > len) {
   readLength = len - total;
 } else {
   // LUCENE-1566 - work around JVM Bug by breaking very large 
 reads into chunks
   readLength = chunkSize;
 }
 final int i = file.read(b, offset + total, readLength);
 total += i;
   } while (total < len);
 } catch (OutOfMemoryError e) {
 {code}
 I bet if you look at the clover report it's untested, because it's fixed at 
 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even 
 good?!).
 Also if you call the setter on a 64-bit machine to change the size, it just 
 totally ignores it. We should remove that; the setter should always work.
 And we should set it to small values in tests so this loop is actually 
 executed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

2013-08-08 Thread Grant Ingersoll
I seem to recall seeing this on my cluster when we didn't have clocks in sync, 
but perhaps my memory is fuzzy as well.

-Grant

On Aug 7, 2013, at 7:41 AM, Erick Erickson erickerick...@gmail.com wrote:

 Well, we're reconstructing a chain of _possibilities_ post-mortem,
 so there's not much I can say for sure. Mostly just throwing this 
 out there in case it sparks some aha moments. Not knowing
 ZK well, anything I say is speculation.
 
 But I speculate that this isn't really the root of the problem, given
 that we haven't been seeing the "ClusterState says we are the leader..."
 error go by the user lists for a while. It may well be a coincidence. The
 place that this happened reported that the problem seemed to 
 be better after adjusting the ZK nodes' times. I know when I
 reconstruct events like this I'm never sure about cause and
 effect since I'm usually doing several things at once.
 
 Erick
 
 
 On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter hossman_luc...@fucit.org 
 wrote:
 
 :  When the times were coordinated, many of the problems with recovery went
 :  away. We're trying to reconstruct the scenario from memory, but it
 :  prompted me to pass the incident in case it sparked any thoughts.
 :  Specifically, I wonder if there's anything that comes to mind if the ZK
 :  nodes are significantly out of synch with each other time-wise.
 :
 : Does this mean that ntp or other strict time synchronization is important 
 for
 : SolrCloud?  I strive for this anyway, just to ensure that when I'm 
 researching
 : log files between two machines that I can match things up properly.
 
 I don't know if/how Solr/ZK is affected by having machines with clocks out
 of sync, but i do remember seeing discussions a while back about weird
 things happening to ZK client apps *while* time adjustments are taking
 place to get back in sync.
 
 IIRC: as the local clock starts accelerating and jumping ahead in
 increments to correct itself with ntp, those jumps can confuse the
 ZK code into thinking it's been waiting a lot longer than it really
 has for the zk heartbeat (or whatever it's called), and it can trigger a
 timeout situation.
 
 
 -Hoss
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 


Grant Ingersoll | @gsingers
http://www.lucidworks.com







[jira] [Commented] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734047#comment-13734047
 ] 

Uwe Schindler commented on LUCENE-5161:
---

Thanks Robert for opening.

It is too late today, so I will respond tomorrow morning about the NIO stuff. I 
am now aware and inspected the JVM code, so I can explain why the OOMs occur in 
SimpleFSDir and NIOFSDir if you read large buffers. More details tomorrow, just 
one thing before: It has nothing to do with 32 or 64 bits, it is more 
limitations of the JVM with direct memory and heap size leading to the OOM 
under certain conditions. But the Integer.MAX_VALUE for 64 bit JVMs is just 
wrong, too (could also lead to OOM).

In general I would not make the buffers too large, so the chunk size should be 
limited to not more than a few megabytes. Making them large brings no 
performance improvement at all; it just wastes memory in thread-local direct 
buffers allocated internally by the JVM's NIO code.
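
To make the failure mode concrete, a small sketch of the capped-read pattern 
(an assumed shape, not Lucene's code): the JDK copies a heap-buffer read 
through a per-thread temporary direct buffer sized to the requested window, so 
capping the window keeps that hidden allocation small:

{code}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

static final int CHUNK = 1 << 23; // 8 MB cap, illustrative

static void readFully(FileChannel channel, long offset, ByteBuffer dst) throws IOException {
  final int end = dst.limit();
  while (dst.position() < end) {
    // shrink the visible window so the JVM's temporary direct buffer stays small
    dst.limit(Math.min(dst.position() + CHUNK, end));
    int n = channel.read(dst, offset);
    if (n < 0) {
      throw new EOFException("read past EOF");
    }
    offset += n;
  }
  dst.limit(end); // restore the caller's limit
}
{code}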

 review FSDirectory chunking defaults and test the chunking
 --

 Key: LUCENE-5161
 URL: https://issues.apache.org/jira/browse/LUCENE-5161
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-5161.patch


 Today there is a loop in SimpleFS/NIOFS:
 {code}
 try {
   do {
 final int readLength;
if (total + chunkSize > len) {
   readLength = len - total;
 } else {
   // LUCENE-1566 - work around JVM Bug by breaking very large 
 reads into chunks
   readLength = chunkSize;
 }
 final int i = file.read(b, offset + total, readLength);
 total += i;
  } while (total < len);
 } catch (OutOfMemoryError e) {
 {code}
 I bet if you look at the clover report it's untested, because it's fixed at 
 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even 
 good?!).
 Also if you call the setter on a 64-bit machine to change the size, it just 
 totally ignores it. We should remove that; the setter should always work.
 And we should set it to small values in tests so this loop is actually 
 executed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734047#comment-13734047
 ] 

Uwe Schindler edited comment on LUCENE-5161 at 8/8/13 9:44 PM:
---

Thanks Robert for opening.

It is too late today, so I will respond tomorrow morning about the NIO stuff. I 
am now aware and inspected the JVM code, so I can explain why the OOMs occur in 
SimpleFSDir and NIOFSDir if you read large buffers. More details tomorrow, just 
one thing before: It has nothing to do with 32 or 64 bits, it is more 
limitations of the JVM with direct memory and heap size leading to the OOM 
under certain conditions. But the Integer.MAX_VALUE for 64 bit JVMs is just 
wrong, too (could also lead to OOM).

In general I would not make the buffers too large, so the chunk size should be 
limited to not more than a few megabytes. Making them large brings no 
performance improvement at all; it just wastes memory in large *thread-local* 
direct buffers allocated internally by the JVM's NIO code.

  was (Author: thetaphi):
Thanks Robert for opening.

It is too late today, so I will respond tomorrow morning about the NIO stuff. I 
am now aware and inspected the JVM code, so I can explain why the OOMs occur in 
SimpleFSDir and NIOFSDir if you read large buffers. More details tomorrow, just 
one thing before: It has nothing to do with 32 or 64 bits, it is more 
limitations of the JVM with direct memory and heap size leading to the OOM 
under certain conditions. But the Integer.MAX_VALUE for 64 bit JVMs is just 
wrong, too (could also lead to OOM).

In general I would not make the buffers too large, so the chunk size should be 
limited to not more than a few megabytes. Making them large brings no 
performance improvement at all; it just wastes memory in thread-local direct 
buffers allocated internally by the JVM's NIO code.
  
 review FSDirectory chunking defaults and test the chunking
 --

 Key: LUCENE-5161
 URL: https://issues.apache.org/jira/browse/LUCENE-5161
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-5161.patch


 Today there is a loop in SimpleFS/NIOFS:
 {code}
 try {
   do {
 final int readLength;
if (total + chunkSize > len) {
   readLength = len - total;
 } else {
   // LUCENE-1566 - work around JVM Bug by breaking very large 
 reads into chunks
   readLength = chunkSize;
 }
 final int i = file.read(b, offset + total, readLength);
 total += i;
  } while (total < len);
 } catch (OutOfMemoryError e) {
 {code}
 I bet if you look at the clover report it's untested, because it's fixed at 
 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even 
 good?!).
 Also if you call the setter on a 64-bit machine to change the size, it just 
 totally ignores it. We should remove that; the setter should always work.
 And we should set it to small values in tests so this loop is actually 
 executed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5161) review FSDirectory chunking defaults and test the chunking

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734058#comment-13734058
 ] 

Robert Muir commented on LUCENE-5161:
-

Thanks Uwe, I will leave the issue for you tomorrow to fix the defaults.

I can only say the chunking does not seem buggy (all tests pass with the 
randomization in the patch), so at least we have that.

 review FSDirectory chunking defaults and test the chunking
 --

 Key: LUCENE-5161
 URL: https://issues.apache.org/jira/browse/LUCENE-5161
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Robert Muir
 Attachments: LUCENE-5161.patch


 Today there is a loop in SimpleFS/NIOFS:
 {code}
 try {
   do {
 final int readLength;
if (total + chunkSize > len) {
   readLength = len - total;
 } else {
   // LUCENE-1566 - work around JVM Bug by breaking very large 
 reads into chunks
   readLength = chunkSize;
 }
 final int i = file.read(b, offset + total, readLength);
 total += i;
  } while (total < len);
 } catch (OutOfMemoryError e) {
 {code}
 I bet if you look at the clover report it's untested, because it's fixed at 
 100MB for 32-bit users and 2GB for 64-bit users (are these defaults even 
 good?!).
 Also if you call the setter on a 64-bit machine to change the size, it just 
 totally ignores it. We should remove that; the setter should always work.
 And we should set it to small values in tests so this loop is actually 
 executed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4774) Add FieldComparator that allows sorting parent docs based on field inside the child docs

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734138#comment-13734138
 ] 

Hoss Man commented on LUCENE-4774:
--

Mikhail: can you please open a new bug with the details of your test failure -- 
specifically: what branch/revision you are testing and whether or not that seed 
reproduces for you.

(it's not really appropriate to comment on closed issues that added features 
with concerns about bugs in that feature -- that's what Jira issue linking can 
be helpful for).



 Add FieldComparator that allows sorting parent docs based on field inside the 
 child docs
 

 Key: LUCENE-4774
 URL: https://issues.apache.org/jira/browse/LUCENE-4774
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/join
Reporter: Martijn van Groningen
Assignee: Martijn van Groningen
 Fix For: 5.0, 4.3

 Attachments: LUCENE-4774.patch, LUCENE-4774.patch, LUCENE-4774.patch


 A field comparator for sorting block join parent docs based on the a field in 
 the associated child docs. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734159#comment-13734159
 ] 

Hoss Man commented on SOLR-5084:


Elran:

1) there are still several sections in your patch that have a lot of reformatting 
making it hard to see what exactly you've added.  (I realize that the 
formatting may not be 100% uniform in all of these files, but the key to making 
patches easy to read is not to change anything that doesn't *have* to be changed 
... formatting changes should be done separately and independently from 
functionality changes)

2) could you please add a few unit tests to show how the type can be used when 
indexing/querying/faceting/returning stored fields so it's more clear what this 
patch does?

3) I'm not sure that it makes sense to customize the response writers and the 
JavaBinCodec to know about the enum values -- it seems like it would make a lot 
more sense (and be much simpler) to have clients just treat the enum values as 
strings

4) a lot of your code seems to be cut/paste from TrieField ... why can't the 
EnumField class subclass TrieField to re-use this behavior (or worst case: wrap 
a TrieIntField similar to how TrieDateField works)

 new field type - EnumField
 --

 Key: SOLR-5084
 URL: https://issues.apache.org/jira/browse/SOLR-5084
 Project: Solr
  Issue Type: New Feature
Reporter: Elran Dvir
 Attachments: enumsConfig.xml, schema_example.xml, Solr-5084.patch, 
 Solr-5084.patch


 We have encountered a use case in our system where we have a few fields 
 (Severity, Risk, etc.) with a closed set of values, where the sort order for 
 these values is pre-determined but not lexicographic (Critical is higher than 
 High). Generically this is very close to how enums work.
 To implement, I have prototyped a new type of field: EnumField, where the 
 inputs are a closed predefined set of strings in a special configuration 
 file (similar to currency.xml).
 The code is based on 4.2.1.




[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734178#comment-13734178
 ] 

Robert Muir commented on SOLR-5084:
---

I agree with Hossman ... stick with it though, I really like the idea of an 
efficient enumerated type.

A few other ideas/questions (just from a glance, I could be wrong):
* should we enforce from the enum config that the integer values are 0-N or 
something simple? This way, things like value sources don't have to do hashing, 
just simple array lookups (see the sketch after this list).
* it isn't clear to me what happens if you send a bogus value. I think an 
enumerated type would be best if it's strongly typed and just throws an 
exception if the value is bogus.
* should the config, instead of being a separate config file, just be a nested 
element underneath the field type? I don't know if this is even possible or a 
good idea, but it's an idea that would remove some XML files.
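
A rough sketch of the dense-ordinal point (the class and method names here are 
hypothetical, not from the attached patch): ordinal-to-label resolution becomes 
a flat array index on the per-document hot path, name-to-int stays a hash at 
index time, and a bogus label can simply throw:

{code}
import java.util.HashMap;
import java.util.Map;

public class EnumMapping {
  private final String[] labels;  // index == ordinal, dense 0..N-1
  private final Map<String, Integer> ordinals = new HashMap<String, Integer>();

  public EnumMapping(String... labels) {
    this.labels = labels;
    for (int i = 0; i < labels.length; i++) {
      ordinals.put(labels[i], i);
    }
  }

  // Per-document hot path (e.g. a value source): O(1) array access, no hashing.
  public String labelForOrdinal(int ordinal) {
    return labels[ordinal];
  }

  // Index-time path: a hash lookup is fine here; bogus values fail fast.
  public int ordinalForLabel(String label) {
    Integer ord = ordinals.get(label);
    if (ord == null) {
      throw new IllegalArgumentException("Unknown enum label: " + label);
    }
    return ord;
  }
}
{code}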





[jira] [Commented] (LUCENE-5150) WAH8DocIdSet: dense sets compression

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734202#comment-13734202
 ] 

Robert Muir commented on LUCENE-5150:
-

Thanks Adrien -- I am also curious whether it's possible for you to re-run 
http://people.apache.org/~jpountz/doc_id_sets.html

Because now, with smaller sets in the dense case, maybe there is no need for 
wacky heuristics in CachingWrapperFilter and we could just always cache (I am 
sure some cases would be slower, but if in general it's faster...). This would 
really simplify LUCENE-5101.

 WAH8DocIdSet: dense sets compression
 

 Key: LUCENE-5150
 URL: https://issues.apache.org/jira/browse/LUCENE-5150
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Trivial
 Attachments: LUCENE-5150.patch


 In LUCENE-5101, Paul Elschot mentioned that it would be interesting to be 
 able to encode the inverse set to also compress very dense sets.




[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734210#comment-13734210
 ] 

Hoss Man commented on SOLR-5084:


bq. ...nested element underneath the field type? I don't know if this is even 
possible or a good idea, but it's an idea that would remove some XML files.

I don't think the schema parsing code can handle that -- it's attribute based, 
not nested element based.

bq. should we enforce from the enum config that the integer values are 0-N or 
something simple? ...

Yeah ... it would be tempting to not even let the config specify numeric values 
-- just an ordered list, except:

1) all hell would break loose if someone accidentally inserted a new element 
anywhere other than the end of the list
2) you'd need/want a way to stop values in the middle of the list from being 
used again.

#2 is a problem you'd need to worry about even if we keep the mappings explicit 
but enforce 0-N ... there needs to be something like...

{code}
  <enum name="severity">
    <pair name="Not Available" value="0"/>
    <pair name="Low" value="1"/>

    <!-- value w/o name passes validation but prevents it from being used -->
    <pair value="2"/> <!-- Medium used to exist, but was phased out -->

    <pair name="High" value="3"/>
    <pair name="Critical" value="4"/>

    <!-- this however would fail, because we skipped 5-10 -->
    <pair name="Super Nova" value="11"/>
  </enum>
{code}

bq. ... This way, things like value sources don't have to do hashing, just 
simple array lookups.

I was actually thinking it would be nice to support multiple legal names (with 
one canonical for responses) per value, but that would prevent the simple array 
lookups...

{code}
  <enum name="severity">
    <value int="0"><label>Not Available</label></value>
    <value int="1"><label>Low</label></value>

    <!-- value w/o label passes validation but prevents it from being used -->
    <value int="2"/> <!-- Medium used to exist, but was phased out -->

    <value int="3"><label>High</label></value>

    <value int="4">
      <label canonical="true">Critical</label>
      <label>Highest</label>
    </value>
  </enum>
{code}





[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734223#comment-13734223
 ] 

Robert Muir commented on SOLR-5084:
---

{quote}
...nested element underneath the field type? I don't know if this is even 
possible or a good idea, but it's an idea that would remove some XML files.

I don't think the schema parsing code can handle that -- it's attribute based, 
not nested element based.
{quote}

Right, but code can change. Other parts of Solr allow this kind of thing.

{quote}
Yeah ... it would be tempting to not even let the config specify numeric values 
-- just an ordered list, except:

1) all hell would break loose if someone accidentally inserted a new element 
anywhere other than the end of the list
2) you'd need/want a way to stop values in the middle of the list from being 
used again.
{quote}

Well, I guess I look at it differently: this is in a sense like an analyzer; 
you can't change the config without reindexing.

{quote}
I was actually thinking it would be nice to support multiple legal names (with 
one canonical for responses) per value, but that would prevent the simple array 
lookups...
{quote}

Why? I'm talking about int->canonical-name (e.g. in the value source impl), not 
anything else. As far as name->int, you want a hash anyway.




[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734228#comment-13734228
 ] 

Hoss Man commented on SOLR-5084:


bq. Well, I guess I look at it differently: this is in a sense like an 
analyzer; you can't change the config without reindexing.

I dunno ... that seems like it would really kill the utility of the field type 
for a lot of use cases -- if it had that kind of limitation, I would just use 
an int field and manage the mappings myself so I'd always know I could 
add/remove fields w/o needing to reindex.

To follow your example: if I completely change the analyzer, then yes I have to 
reindex -- but if I want to stop using a synonym, I don't have to re-index 
every doc, just the ones that used that synonym.

bq. as far as name->int, you want a hash anyway.

Right ... never mind, I was thinking about it backwards.




[jira] [Comment Edited] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734228#comment-13734228
 ] 

Hoss Man edited comment on SOLR-5084 at 8/9/13 12:14 AM:
-

bq. Well, I guess I look at it differently: this is in a sense like an 
analyzer; you can't change the config without reindexing.

I dunno ... that seems like it would really kill the utility of the field type 
for a lot of use cases -- if it had that kind of limitation, I would just use 
an int field and manage the mappings myself so I'd always know I could 
add/remove (EDIT) -fields- values w/o needing to reindex.

To follow your example: if I completely change the analyzer, then yes I have to 
reindex -- but if I want to stop using a synonym, I don't have to re-index 
every doc, just the ones that used that synonym.

bq. as far as name->int, you want a hash anyway.

Right ... never mind, I was thinking about it backwards.

  was (Author: hossman):
bq. Well, I guess I look at it differently: this is in a sense like an 
analyzer; you can't change the config without reindexing.

I dunno ... that seems like it would really kill the utility of the field type 
for a lot of use cases -- if it had that kind of limitation, I would just use 
an int field and manage the mappings myself so I'd always know I could 
add/remove fields w/o needing to reindex.

To follow your example: if I completely change the analyzer, then yes I have to 
reindex -- but if I want to stop using a synonym, I don't have to re-index 
every doc, just the ones that used that synonym.

bq. as far as name->int, you want a hash anyway.

Right ... never mind, I was thinking about it backwards.
  



[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734239#comment-13734239
 ] 

Robert Muir commented on SOLR-5084:
---

{quote}
I dunno ... that seems like it would really kill the utility of the field type 
for a lot of use cases -- if it had that kind of limitation, I would just use 
an int field and manage the mappings myself so I'd always know I could 
add/remove values w/o needing to reindex.
{quote}

This isn't really going to work here, because the idea is that you want to 
assign sort order (not just values mapped to ints). If you want to rename a 
label, that's fine, but you can't really change the sort order without 
reindexing.




[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734248#comment-13734248
 ] 

Hoss Man commented on SOLR-5084:


bq. If you want to rename a label, that's fine, but you can't really change the 
sort order without reindexing.

No, no ... of course not ... I wasn't suggesting you could change the order, 
just:
* *remove* a legal value from the list (w/o causing the validation to complain)
* add new values to the end of the list
* (as you mentioned) modify the label on an existing value

See the example I posted before about removing Medium but keeping High & 
Critical exactly as they are -- no change in indexed data, just a way to tell 
the validation logic you were talking about adding "skip this value, I removed 
it on purpose" (or I suppose: "skip this value, I'm reserving it as a 
placeholder for future use").




[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734254#comment-13734254
 ] 

Robert Muir commented on SOLR-5084:
---

I think adding new values to the end of the list is no issue at all. Neither is 
renaming labels.

But for removing a legal value from the list, I think you need to reindex, 
because what do you do with documents that have that integer value?

In general I'm just trying to make sure we keep things sane here, so that the 
underlying implementation is efficient.




[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734259#comment-13734259
 ] 

Hoss Man commented on SOLR-5084:


bq. But for removing a legal value from the list, I think you need to reindex, 
because what do you do with documents that have that integer value?

For sorting and value sources etc., nothing special happens -- they still have 
the same numeric value under the covers; it's just that when writing out the 
stored values (i.e. the label) you act as if they have no value in the field at 
all (which shouldn't affect efficiency at all).

If the user wants some other behavior, the burden is on them to re-index or 
delete the affected docs -- but the simple stuff stays just as simple as if 
they were dealing with the int->label mappings in their own code; the 
validation of legal labels just moves from the client to Solr.
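
A hypothetical illustration of that behavior (a standalone sketch, not code 
from any patch): the removed value keeps its ordinal, so comparisons are 
untouched, while the missing label is rendered as if the field had no value:

{code}
public class RemovedValueExample {
  public static void main(String[] args) {
    // "Medium" (ordinal 2) was removed from the config, but its slot remains.
    String[] labels =
        {"Not Available", "Low", null /* Medium removed */, "High", "Critical"};
    int storedOrdinal = 2;  // a doc indexed before "Medium" was removed

    // Sorting / value sources still compare the raw int, exactly as before:
    System.out.println(Integer.compare(storedOrdinal, 3));  // -1: still below High

    // But when writing out stored values, a missing label means "no value here":
    String label = labels[storedOrdinal];
    System.out.println(label == null ? "(field omitted from response)" : label);
  }
}
{code}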




[jira] [Commented] (SOLR-5084) new field type - EnumField

2013-08-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734266#comment-13734266
 ] 

Robert Muir commented on SOLR-5084:
---

{quote}
For sorting and value sources etc., nothing special happens -- they still have 
the same numeric value under the covers; it's just that when writing out the 
stored values (i.e. the label) you act as if they have no value in the field at 
all (which shouldn't affect efficiency at all).
{quote}

Then this is just renaming a label to some special value.

I really think the best thing is to keep it simple, like java.lang.Enum: just 
give a list of values. This way it will be efficient everywhere, since the 
values will be dense. It's also conceptually simple.
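
For illustration only, here is the java.lang.Enum analogy in plain Java (a 
sketch, not code from the proposed field type):

{code}
public class EnumOrderExample {
  enum Severity { NOT_AVAILABLE, LOW, MEDIUM, HIGH, CRITICAL }

  public static void main(String[] args) {
    // Declaration order is the sort order: CRITICAL outranks HIGH even though
    // it is lexicographically smaller.
    System.out.println(Severity.CRITICAL.ordinal() > Severity.HIGH.ordinal());  // true
    // Ordinals are automatically dense (0..N-1), so flat-array lookups just work.
    System.out.println(Severity.values().length);  // 5
  }
}
{code}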

Otherwise, things get complicated, and the implementation may suffer due to 
sparse ordinals. Really, I don't care, as docvalues will do the right thing as 
long as you have < 256 values (regardless of sparsity). FieldCache won't, but 
that doesn't bother me a bit.

But still, there is no sense in making things complicated and inefficient for 
no good reason. Someone could make a HairyComplicatedAndInefficientEnumType for 
that.




Re: jar-checksums generates extra files?

2013-08-08 Thread Robert Muir
If you google 'svn remove unversioned' you'll find a couple of one-liners you 
can alias.

I also found 
http://svn.apache.org/repos/asf/subversion/trunk/contrib/client-side/svn-clean

Weird that it has a GPL license, though!

On Thu, Aug 8, 2013 at 4:14 AM, Dawid Weiss
dawid.we...@cs.put.poznan.pl wrote:
 I kind of use a workaround of removing everything except the .svn
 folder and then svn revert -R .
 But this is a dumb solution :)

 D.

 On Thu, Aug 8, 2013 at 1:12 PM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 Some GUIs like TortoiseSVN have this. I use this to delete all unversioned
 files in milliseconds(TM). But native svn does not have it, unfortunately.

 Uwe



 Dawid Weiss dawid.we...@gmail.com schrieb:

 Never mind, these were local files and they were svn-ignored; when I
 removed everything and checked out from scratch, the problem was no
 longer there.

 I really wish svn had an equivalent of git clean -xfd .

 Dawid

 On Thu, Aug 8, 2013 at 12:39 PM, Dawid Weiss dawid.we...@gmail.com
 wrote:

  When I do this on trunk:

  ant jar-checksums
  svn stat

  I get:
  ?   solr\licenses\jcl-over-slf4j.jar.sha1
  ?   solr\licenses\jul-to-slf4j.jar.sha1
  ?   solr\licenses\log4j.jar.sha1
  ?   solr\licenses\slf4j-api.jar.sha1
  ?   solr\licenses\slf4j-log4j12.jar.sha1

  Where should this be fixed?  Should we svn-ignore those files or
  should they be somehow excluded from the re-generation of SHA
  checksums?

  Dawid


 



 --
 Uwe Schindler
 H.-H.-Meier-Allee 63, 28213 Bremen
 http://www.thetaphi.de


