[jira] Commented: (LUCENENET-383) System.IO.IOException: read past EOF while deleting the file from upload folder of filemanager.

2010-12-05 Thread chaitanya (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENENET-383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967101#action_12967101
 ] 

chaitanya commented on LUCENENET-383:
-

The error is thrown from the Lucene.Net.Store.BufferedIndexInput.Refill() method.
Please see the relevant code below:
if (this.bufferLength == 0)
{
    throw new IOException("read past EOF");
}

But we don't know when this bufferLength becomes zero.

Neal mentioned that it is possible that the document id they are
attempting to delete from the Lucene index does not exist. Maybe the above
error occurs in that case.

But the file exists in the upload folder. What we found is that when deleting
the file we get the above error, yet the file is deleted anyway. The only
annoying thing here is that even when the file deletion succeeds, we still get
this error.

What I suspect is that once the file is deleted, Lucene searches for the
ID a second time; by then the id is no longer available, hence the error.

If that is the situation, why is Lucene searching for this id
twice for a single request?





 System.IO.IOException: read past EOF while deleting the file from upload 
 folder of filemanager.
 ---

 Key: LUCENENET-383
 URL: https://issues.apache.org/jira/browse/LUCENENET-383
 Project: Lucene.Net
  Issue Type: Bug
 Environment: production
Reporter: chaitanya

 We are getting System.IO.IOException: read past EOF when deleting a file 
 from the upload folder of the file manager. It used to work fine, but for the 
 past few days we have been getting this error.
 We are using the EPiServer content management system, and EPiServer in turn 
 uses Lucene for indexing.
 Please find the stack trace of the error below. Help me overcome 
 this error. Thanks in advance.
 [IOException: read past EOF]
Lucene.Net.Store.BufferedIndexInput.Refill() +233
Lucene.Net.Store.BufferedIndexInput.ReadByte() +21
Lucene.Net.Store.IndexInput.ReadInt() +13
Lucene.Net.Index.SegmentInfos.Read(Directory directory) +60
Lucene.Net.Index.AnonymousClassWith.DoBody() +45
Lucene.Net.Store.With.Run() +67
Lucene.Net.Index.IndexReader.Open(Directory directory, Boolean 
 closeDirectory) +110
Lucene.Net.Index.IndexReader.Open(String path) +65

 EPiServer.Web.Hosting.Versioning.Store.FileOperations.DeleteItemIdFromIndex(String
  filePath, Object fileId) +78
EPiServer.Web.Hosting.Versioning.Store.FileOperations.DeleteFile(Object 
 dirId, Object fileId) +118
EPiServer.Web.Hosting.Versioning.VersioningFileHandler.Delete() +28
EPiServer.Web.Hosting.VersioningFile.Delete() +118
EPiServer.UI.Hosting.UploadFile.ConfirmReplaceButton_Click(Object sender, 
 EventArgs e) +578
EPiServer.UI.WebControls.ToolButton.OnClick(EventArgs e) +107
EPiServer.UI.WebControls.ToolButton.RaisePostBackEvent(String 
 eventArgument) +135
System.Web.UI.Page.RaisePostBackEvent(IPostBackEventHandler sourceControl, 
 String eventArgument) +13
System.Web.UI.Page.RaisePostBackEvent(NameValueCollection postData) +36
System.Web.UI.Page.ProcessRequestMain(Boolean 
 includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +1565

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [VOTE] Release PyLucene 2.9.4-1 and 3.0.3-1

2010-12-05 Thread Michael McCandless
+1 to both.

I installed both on Linux (Fedora 13) and ran my test python script
that indexes first 100K line docs from wikipedia and runs a few
searches.  No problems!

Mike

On Sun, Dec 5, 2010 at 1:50 AM, Andi Vajda va...@apache.org wrote:

 With the recent releases of Lucene Java 2.9.4 and 3.0.3, the PyLucene
 2.9.4-1 and 3.0.3-1 releases closely tracking them are ready.

 Release candidates are available from:

    http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_2_9/CHANGES
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_0/CHANGES

 All versions of PyLucene are built with the same version of JCC, currently
 version 2.7, included in these release artifacts.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/CHANGES.txt
 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0/CHANGES.txt

 Please vote to release these artifacts as PyLucene 2.9.4-1 and 3.0.3-1.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
    http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
    http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1



Re: [VOTE] Release PyLucene 2.9.4-1 and 3.0.3-1

2010-12-05 Thread Robert Muir
On Sun, Dec 5, 2010 at 1:50 AM, Andi Vajda va...@apache.org wrote:

 With the recent releases of Lucene Java 2.9.4 and 3.0.3, the PyLucene
 2.9.4-1 and 3.0.3-1 releases closely tracking them are ready.

 Release candidates are available from:

    http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_2_9/CHANGES
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_0/CHANGES

 All versions of PyLucene are built with the same version of JCC, currently
 version 2.7, included in these release artifacts.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/CHANGES.txt
 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0/CHANGES.txt

 Please vote to release these artifacts as PyLucene 2.9.4-1 and 3.0.3-1.


+1, everything looks in order, building pylucene and running 'make
test' seemed fine on both versions.


Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread DM Smith
The following code works on Android with 2.9.1, but fails with 3.0.2:

Directory dir = FSDirectory.open(file);
...
do something with directory
...

The error we're seeing is:
12-04 21:34:41.629: WARN/System.err(23160): java.lang.NoClassDefFoundError: 
java.lang.management.ManagementFactory
12-04 21:34:41.639: WARN/System.err(23160): at 
org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:87)
12-04 21:34:41.639: WARN/System.err(23160): at 
org.apache.lucene.store.NativeFSLockFactory.makeLock(NativeFSLockFactory.java:142)
12-04 21:34:41.649: WARN/System.err(23160): at 
org.apache.lucene.store.Directory.makeLock(Directory.java:106)
12-04 21:34:41.649: WARN/System.err(23160): at 
org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1058)

Turns out Android does not have java.lang.management.ManagementFactory. 

There are several work arounds in client code, but not sure what is best.

The bigger question is whether and how Lucene should be modified to accommodate this.

Ultimately FSDirectory.open does the following:
if (Constants.WINDOWS) {
  return new SimpleFSDirectory(path, lockFactory);
} else {
  return new NIOFSDirectory(path, lockFactory);
}

Should Android be a supported client OS?

If so, wouldn't it be better not to have OS specific if-then-else and use 
reflection or something else?
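
Not Lucene code, just a minimal client-side sketch of one such workaround,
assuming the 3.0.x signatures FSDirectory.open(File, LockFactory) and
new SimpleFSLockFactory() (the class name AndroidSafeOpen is made up):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockFactory;
import org.apache.lucene.store.SimpleFSLockFactory;

public final class AndroidSafeOpen {
  public static Directory open(File path) throws IOException {
    LockFactory lf;
    try {
      // NativeFSLockFactory (the 3.0.2 default) needs this class.
      Class.forName("java.lang.management.ManagementFactory");
      lf = null; // null means: keep the default lock factory
    } catch (ClassNotFoundException missingOnAndroid) {
      lf = new SimpleFSLockFactory(); // avoids the native test lock entirely
    }
    return FSDirectory.open(path, lf);
  }
}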

Thanks,
DM
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Gérard Dupont
On 5 December 2010 00:16, DM Smith dm-sm...@woh.rr.com wrote:

 Should Android be a supported client OS?
 If so, wouldn't it be better not to have OS specific if-then-else and use
 reflection or something else?


Well, Lucene relies only on the standard JVM API. The fact that Android is
using a non-standard JVM is IMHO outside the scope of Lucene.

-- 
Gérard Dupont
Information Processing Control and Cognition (IPCC)
CASSIDIAN - an EADS company

Document & Learning team - LITIS Laboratory


RE: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Uwe Schindler
Hi DM,

In Lucene 3.0.3, NativeFSLockFactory no longer acquires a test lock and does
not need the process ID anymore, so the java.lang.management package is no
longer used.

In general, Lucene Java is compatible with the Java 5 SE specification.
Android uses Harmony, and therefore we cannot guarantee compatibility, as
Harmony is not TCK tested (but we do test with the latest versions, and soon
there will also be tests on Hudson with Harmony). But only the latest versions
of Harmony are really compatible with Lucene; previous versions fail lots of
tests (ask Robert), and Android phones use very antique versions of Harmony -
it is not even certain that the Java 5 Memory Model is correctly implemented
in Dalvik!

About 3.0.2: of course this version works even with the latest Harmony, so
Harmony has the java.lang.management package (which is java.lang!!!). The bug
is therefore in Android, which simply excludes an SE package. So you should
open a bug report at Google and then hope that they fix it and that all the
phone manufacturers like Motor-Roller will update their Android versions.

For your problem: the easy workaround is to use Lucene 3.0.3, or simply use
another LockFactory (Android is single user, so even NoLockFactory would be
fine in most cases). These are the same limitations as with the NFS
filesystem. Just use FSDir.open(dir, lockFactory).
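
For example, a minimal sketch (the index path here is made up;
NoLockFactory.getNoLockFactory() is the 3.0.x accessor):

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NoLockFactory;

// Android is effectively a single-user environment, so skipping index
// locking entirely is acceptable in most cases.
Directory dir = FSDirectory.open(new File("/sdcard/myindex"),
    NoLockFactory.getNoLockFactory());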

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: DM Smith [mailto:dm-sm...@woh.rr.com]
 Sent: Sunday, December 05, 2010 12:16 AM
 To: dev@lucene.apache.org
 Subject: Exception in migrating from 2.9.x to 3.0.2 on Android
 
 The following code works on Android with 2.9.1, but fails with 3.0.2:
 
 Directory dir = FSDirectory.open(file);
 ...
 do something with directory
 ...
 
 The error we're seeing is:
 12-04 21:34:41.629: WARN/System.err(23160):
 java.lang.NoClassDefFoundError:
 java.lang.management.ManagementFactory
 12-04 21:34:41.639: WARN/System.err(23160): at
 org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLock
 Factory.java:87)
 12-04 21:34:41.639: WARN/System.err(23160): at
 org.apache.lucene.store.NativeFSLockFactory.makeLock(NativeFSLockFactor
 y.java:142)
 12-04 21:34:41.649: WARN/System.err(23160): at
 org.apache.lucene.store.Directory.makeLock(Directory.java:106)
 12-04 21:34:41.649: WARN/System.err(23160): at
 org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1058)
 
 Turns out Android does not have
 java.lang.management.ManagementFactory.
 
 There are several work arounds in client code, but not sure what is best.
 
 The bigger question is whether and how Lucene should be modified to
 accommodate this.
 
 Ultimately FSDirectory.open does the following:
 if (Constants.WINDOWS) {
   return new SimpleFSDirectory(path, lockFactory);
 } else {
   return new NIOFSDirectory(path, lockFactory);
 }
 
 Should Android be a supported client OS?
 
 If so, wouldn't it be better not to have OS specific if-then-else and use
 reflection or something else?
 
 Thanks,
   DM
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
 commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Lucene-Solr-tests-only-trunk - Build # 2218 - Failure

2010-12-05 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2218/

1 tests failed.
REGRESSION:  org.apache.solr.TestDistributedSearch.testDistribSearch

Error Message:
Some threads threw uncaught exceptions!

Stack Trace:
junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:979)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:917)
at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:466)
at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




Build Log (for compile errors):
[...truncated 8716 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2798) Randomize indexed collation key testing

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966933#action_12966933
 ] 

Robert Muir commented on LUCENE-2798:
-

Steven, before working too hard on the JDK collation tests, I just had this 
idea:

Are we sure we shouldn't deprecate the JDK collation functionality (remove in 
trunk) and only offer ICU?

I was just thinking that the JDK Collator integration is basically a RAM trap 
due to its awful key size, etc.:
http://site.icu-project.org/charts/collation-icu4j-sun



 Randomize indexed collation key testing
 ---

 Key: LUCENE-2798
 URL: https://issues.apache.org/jira/browse/LUCENE-2798
 Project: Lucene - Java
  Issue Type: Test
  Components: Analysis
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Assignee: Steven Rowe
Priority: Minor
 Fix For: 3.1, 4.0


 Robert Muir noted on #lucene IRC channel today that Lucene's indexed 
 collation key testing is currently fragile (for example, they had to be 
 revisited when Robert upgraded the ICU dependency in LUCENE-2797 because of 
 Unicode 6.0 collation changes) and coverage is trivial (only 5 locales 
 tested, and no collator options are exercised).  This affects both the JDK 
 implementation in {{modules/analysis/common/}} and the ICU implementation 
 under {{modules/icu/}}.
 The key thing to test is that the order of the indexed terms is the same as 
 that provided by the Collator itself.  Instead of the current set of static 
 tests, this could be achieved via indexing randomly generated terms' 
 collation keys (and collator options) and then comparing the index terms' 
 order to the order provided by the Collator over the original terms.
 Since different terms may produce the same collation key, however, the order 
 of indexed terms is inherently unstable.  When performing runtime collation, 
 the Collator addresses the sort stability issue by adding a secondary sort 
 over the normalized original terms.  In order to directly compare Collator's 
 sort with Lucene's collation key sort, a secondary sort will need to be 
 applied to Lucene's indexed terms as well. Robert has suggested indexing the 
 original terms in addition to their collation keys, then using a Sort over 
 the original terms as the secondary sort.
 Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and 
 trunk uses UTF-8 order, so the implemented secondary sort will need to 
 respect that.
 From #lucene:
 {quote}
 rmuir__: so i think we have to on 3.x, sort the 'expected list' with 
 Collator.compare, if thats equal, then as a tiebreak use String.compareTo
 rmuir__: and in the index sort on the collated field, followed by the 
 original term
 rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the 
 tiebreak for the expected list
 rmuir__: instead compare codepoints (iterating character.codepointAt, or 
 comparing .getBytes(UTF-8))
 {quote}
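
A minimal sketch of the two tiebreak comparators described in the quote
(illustrative only, not from a patch; the class and method names are made up):

{code}
import java.text.Collator;
import java.util.Comparator;

final class ExpectedTermOrder {
  // 3.x: Collator order, ties broken by UTF-16 order (String.compareTo).
  static Comparator<String> forBranch3x(final Collator collator) {
    return new Comparator<String>() {
      public int compare(String a, String b) {
        int cmp = collator.compare(a, b);
        return cmp != 0 ? cmp : a.compareTo(b);
      }
    };
  }

  // trunk: ties broken by code point order, which matches UTF-8 byte order.
  static Comparator<String> forTrunk(final Collator collator) {
    return new Comparator<String>() {
      public int compare(String a, String b) {
        int cmp = collator.compare(a, b);
        if (cmp != 0) return cmp;
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
          int ca = a.codePointAt(i), cb = b.codePointAt(j);
          if (ca != cb) return ca - cb;
          i += Character.charCount(ca);
          j += Character.charCount(cb);
        }
        return (a.length() - i) - (b.length() - j); // shorter prefix sorts first
      }
    };
  }
}
{code}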

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2763) Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966943#action_12966943
 ] 

Robert Muir commented on LUCENE-2763:
-

+1, looks good to me.


 Swap URL+Email recognizing StandardTokenizer and UAX29Tokenizer
 ---

 Key: LUCENE-2763
 URL: https://issues.apache.org/jira/browse/LUCENE-2763
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Assignee: Steven Rowe
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2763.patch


 Currently, in addition to implementing the UAX#29 word boundary rules, 
 StandardTokenizer recognizes email addresses and URLs, but doesn't provide a 
 way to turn this behavior off and/or provide overlapping tokens with the 
 components (username from email address, hostname from URL, etc.).
 UAX29Tokenizer should become StandardTokenizer, and current StandardTokenizer 
 should be renamed to something like UAX29TokenizerPlusPlus (or something like 
 that).
 For rationale, see [the discussion at the reopened 
 LUCENE-2167|https://issues.apache.org/jira/browse/LUCENE-2167?focusedCommentId=12929325&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12929325].

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene-Solr-tests-only-trunk - Build # 2221 - Failure

2010-12-05 Thread Yonik Seeley
Well, darn, upgrading jetty didn't seem to help this.

-Yonik
http://www.lucidimagination.com



On Sun, Dec 5, 2010 at 7:05 AM, Apache Hudson Server
hud...@hudson.apache.org wrote:
 Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2221/

 1 tests failed.
 REGRESSION:  org.apache.solr.TestDistributedSearch.testDistribSearch

 Error Message:
 Some threads threw uncaught exceptions!

 Stack Trace:
 junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
        at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:979)
        at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:917)
        at 
 org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:466)
        at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:92)
        at 
 org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)




 Build Log (for compile errors):
 [...truncated 8716 lines...]



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene-Solr-tests-only-trunk - Build # 2221 - Failure

2010-12-05 Thread Robert Muir
On Sun, Dec 5, 2010 at 9:00 AM, Yonik Seeley yo...@lucidimagination.com wrote:
 Well, darn, upgrading jetty didn't seem to help this.


I was getting really hopeful for a while!

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Lucene-Solr-tests-only-trunk - Build # 2211 - Failure

2010-12-05 Thread Robert Muir
On Sun, Dec 5, 2010 at 1:46 AM, Apache Hudson Server
hud...@hudson.apache.org wrote:
 Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/2211/

 1 tests failed.
 REGRESSION:  org.apache.solr.update.AutoCommitTest.testMaxTime


There's still a timing issue in this test I think. I modified it a
while ago to make it better but Hoss mentioned on the mailing list
some way we could change it to not be fragile...

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Assigned: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reassigned SOLR-1979:
-

Assignee: Grant Ingersoll

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.
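
A minimal sketch of that core step (illustrative only, not the attached patch;
inputFields, outputField and fallback stand for the configured values):

{code}
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.language.LanguageIdentifier;

void detectAndStore(SolrInputDocument doc, String[] inputFields,
                    String outputField, String fallback) {
  // Concatenate the configured input fields...
  StringBuilder text = new StringBuilder();
  for (String field : inputFields) {
    Object value = doc.getFieldValue(field);
    if (value != null) text.append(value).append(' ');
  }
  // ...run Tika's detector, and use the fallback when it is not confident.
  LanguageIdentifier identifier = new LanguageIdentifier(text.toString());
  String lang = identifier.isReasonablyCertain()
      ? identifier.getLanguage() : fallback;
  doc.setField(outputField, lang);
}
{code}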

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (SOLR-2244) Add Language Identification support

2010-12-05 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved SOLR-2244.
---

Resolution: Won't Fix

Actually, I'm going to switch back to SOLR-1979, as it is a superset of this 
patch.  I should have a patch up shortly.

 Add Language Identification support
 ---

 Key: SOLR-2244
 URL: https://issues.apache.org/jira/browse/SOLR-2244
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Attachments: solr2244.patch


 For starters, Tika has language identification capabilities that we can 
 likely leverage, but moreover, make it easier for people to plug in language 
 identification into the indexing process.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966955#action_12966955
 ] 

Grant Ingersoll commented on SOLR-1979:
---

See http://wiki.apache.org/solr/LanguageDetection for the start of 
documentation.

bq. isReasonablyCertain() always returns false

See TIKA-568.

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12966963#action_12966963
 ] 

Robert Muir commented on LUCENE-2793:
-

There is another problem we should solve here, and that is the buffer size 
problem.

This is totally broken at the moment for custom directories; here's an example.
I wanted to set the buffer size to 4096 by default (since I measured this as 
roughly a 20% improvement for my directory impl).

Looking at the APIs, you would think that you simply override the openInput that 
takes no buffer size, like this:
{noformat}
  @Override
  public IndexInput openInput(String name) throws IOException {
    return openInput(name, 4096);
  }
{noformat}

Unfortunately this doesn't work at all! Instead you have to do something like 
this for it to actually work:
{noformat}
  @Override
  public IndexInput openInput(String name, int bufferSize) throws IOException {
    ensureOpen();
    return new IndexInput(name, Math.max(bufferSize, 4096));
  }
{noformat}

The problem is, throughout Lucene's APIs, the directory's default is never 
used; instead the static BufferedIndexInput.BUFFER_SIZE is used everywhere, 
e.g. SegmentReader.get:

{noformat}
  public static SegmentReader get(boolean readOnly, SegmentInfo si,
      int termInfosIndexDivisor) throws CorruptIndexException, IOException {
    return get(readOnly, si.dir, si, BufferedIndexInput.BUFFER_SIZE, true,
        termInfosIndexDivisor);
  }
{noformat}

So I think Lucene's APIs should never specify a buffer size; we should remove it 
completely from the codecs API, and it should be *replaced* with IOContext.
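
A rough sketch of what that replacement could look like (all names here are
hypothetical, not from any patch):

{noformat}
// Hypothetical IOContext: callers say *why* they are opening a file, and each
// Directory derives its own buffer size (or DIRECT/SEQUENTIAL flags) from it.
public enum IOContext { READ, MERGE, FLUSH }

public abstract class Directory {
  // No bufferSize parameter anywhere; the context travels with every call.
  public abstract IndexInput openInput(String name, IOContext context)
      throws IOException;
  public abstract IndexOutput createOutput(String name, IOContext context)
      throws IOException;
}
{noformat}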


 Directory createOutput and openInput should take an IOContext
 -

 Key: LUCENE-2793
 URL: https://issues.apache.org/jira/browse/LUCENE-2793
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless

 Today for merging we pass down a larger readBufferSize than for searching 
 because we get better performance.
 I think we should generalize this to a class (IOContext), which would hold 
 the buffer size, but then could hold other flags like DIRECT (bypass OS's 
 buffer cache), SEQUENTIAL, etc.
 Then, we can make the DirectIOLinuxDirectory fully usable because we would 
 only use DIRECT/SEQUENTIAL during merging.
 This will require fixing how IW pools readers, so that a reader opened for 
 merging is not then used for searching, and vice/versa.  Really, it's only 
 all the open file handles that need to be different -- we could in theory 
 share del docs, norms, etc, if that were somehow possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966964#action_12966964
 ] 

Jan Høydahl commented on SOLR-1979:
---

Simply allowing the threshold for isReasonablyCertain() to be set is probably 
not enough to get robust detection. This is because the distance measure is very 
sensitive to the length of the profiles in use. Thus, it is a bit dangerous to 
expose getDistance() as in TIKA-568, because that distance measure is kind of an 
internal value, not very normalized, and is bound to change in future versions 
of TIKA.

See TIKA-369 and TIKA-496.

I think the right way to go is to solve these two issues first. By fixing 
getDistance() so that it is not biased towards profile length, we can make a new 
isReasonablyCertain() implementation that takes into account the relative 
distance between the first and second candidate languages...

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966970#action_12966970
 ] 

Jan Høydahl commented on SOLR-1979:
---

The idField input parameter is just used for decent logging if detection fails. 
It would be more elegant to get the id field name automatically through 
SolrCore...

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2158) TestDistributedSearch.testDistribSearch fails often

2010-12-05 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966971#action_12966971
 ] 

Yonik Seeley commented on SOLR-2158:


OK, so we upgraded jetty... but the "failed to respond" exception still happens.
Just to try and narrow things down, I put a long sleep inside solr request 
handling and then tried a distributed search... it worked fine. So
it doesn't appear to be something getting hung up in Solr. That leaves:

- a jetty bug
- an embedded jetty bug
- an HttpClient bug
- a bug in the way solr uses HttpClient

Another data point: with my load testing tool, I can run millions of requests 
against Jetty/Solr (and I just did again). It doesn't use HttpClient though, 
and it uses GET instead of POST.

Some things to try:
 - Modify the load tool to use POST and verify things still work
 - Put a long pause in TestDistributedSearch after the solr servers are brought 
   up, and then try load testing against those servers w/ an external tool.
   - if this fails, we know it's an issue with how we embed Jetty
 - Make a load testing tool that uses SolrJ exactly the way that distributed 
   search uses it, and try it on a normal Solr server
   - if this fails, it could be an HttpClient bug, or a jetty bug tickled by 
     HttpClient specifically
   - if this fails, make a small self-contained load tool that uses only 
     HttpClient to remove the possibility of SolrJ bugs 

 TestDistributedSearch.testDistribSearch fails often
 ---

 Key: SOLR-2158
 URL: https://issues.apache.org/jira/browse/SOLR-2158
 Project: Solr
  Issue Type: Bug
  Components: Build
Affects Versions: 3.1, 4.0
 Environment: Hudson
Reporter: Robert Muir
 Fix For: 3.1, 4.0

 Attachments: TEST-org.apache.solr.TestDistributedSearch.txt


 TestDistributedSearch.testDistribSearch fails often in hudson, with some 
 threads throwing uncaught exceptions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966972#action_12966972
 ] 

Robert Muir commented on SOLR-1979:
---

bq. because that distance measure is kind of an internal value, not very 
normalized, and is bound to change in future versions of TIKA.

bq. we can make a new isReasonablyCertain() implementation that takes into 
account the relative distance between the first and second candidate 
languages...

I don't follow the logic: if it's not very normalized, then it seems like this 
approach doesn't tell you anything... language 1 could be uncertain,
and language 2 just completely uncertain, but that tells you nothing. Isn't 
this like trying to determine whether a good lucene search result score is 
certainly a hit, and not really the right way to go?

For example: consider the case where the language isn't supported at all by 
Tika (I don't see a list of supported languages anywhere, by the way!).
It would be good for us to know that the detection is uncertain at all; how 
relatively uncertain it is with regard to the next language is not very 
important.

I think it's also important that we be able to get this uncertainty, or 
whatever the differentiator is, agnostic of the implementation.
For example, we should be able to somehow think of chaining detectors... 

It's really important to be able to cheat and not use heuristics for languages 
that don't need them.
For example, disregarding some strange theoretical/historical cases, you can 
simply look at the unicode properties 
in the document to determine that it is in Greek, as Greek is basically 
the only modern language using the Greek alphabet.
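
A minimal sketch of that kind of script-based shortcut (illustrative only; the
90% threshold is made up):

{code}
// If nearly every letter falls in the Greek Unicode block, we can label the
// text as Greek without consulting any statistical n-gram model.
static boolean looksGreek(String text) {
  int greek = 0, letters = 0;
  for (int i = 0; i < text.length(); i += Character.charCount(text.codePointAt(i))) {
    int cp = text.codePointAt(i);
    if (Character.isLetter(cp)) {
      letters++;
      if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.GREEK) greek++;
    }
  }
  return letters > 0 && greek * 10 >= letters * 9; // >= 90% Greek letters
}
{code}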


 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (SOLR-2158) TestDistributedSearch.testDistribSearch fails often

2010-12-05 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966718#action_12966718
 ] 

Yonik Seeley edited comment on SOLR-2158 at 12/5/10 10:38 AM:
--

Moving Robert's stack trace from the description to the comments.

{code}
[junit] Testsuite: org.apache.solr.TestDistributedSearch
[junit] Testcase: testDistribSearch(org.apache.solr.TestDistributedSearch): 
FAILED
[junit] Some threads threw uncaught exceptions!
[junit] junit.framework.AssertionFailedError: Some threads threw uncaught 
exceptions!
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:795)
[junit] at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:768)
[junit] at 
org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:416)
[junit] at 
org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:76)
[junit] at 
org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144)
[junit] 
[junit] 
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 382.297 sec
[junit] 
[junit] - Standard Error -
[junit] 2010. 10. 15 ?? 2:08:04 org.apache.solr.common.SolrException log
[junit] ??: org.apache.solr.common.SolrException: 
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available 
to handle this request
[junit] at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:318)
[junit] at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
[junit] at org.apache.solr.core.SolrCore.execute(SolrCore.java:1325)
[junit] at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337)
[junit] at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
[junit] at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
[junit] at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
[junit] at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
[junit] at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
[junit] at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
[junit] at org.mortbay.jetty.Server.handle(Server.java:326)
[junit] at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
[junit] at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:923)
[junit] at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:547)
[junit] at 
org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
[junit] at 
org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
[junit] at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
[junit] at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
[junit] Caused by: org.apache.solr.client.solrj.SolrServerException: No 
live SolrServers available to handle this request
[junit] at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:297)
[junit] at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:513)
[junit] at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:478)
[junit] at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
[junit] at java.util.concurrent.FutureTask.run(FutureTask.java:166)
[junit] at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
[junit] at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
[junit] at java.util.concurrent.FutureTask.run(FutureTask.java:166)
[junit] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
[junit] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
[junit] at java.lang.Thread.run(Thread.java:636)
[junit] Caused by: org.apache.solr.client.solrj.SolrServerException: 
java.net.ConnectException: Operation timed out
[junit] at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
[junit] at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
[junit] at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:274)
[junit] ... 10 more
[junit] Caused by: java.net.ConnectException: Operation timed out
[junit] at 

[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-1979:
--

Attachment: SOLR-1979.patch

I took Jan's and Tommaso's patches and reworked them a bit.  It seems to me 
that there isn't much point in merely identifying the language if you aren't 
going to do something about it.  So, this patch builds on what Jan and Tommaso 
did and then will remap the input fields to new per-language fields (note, we 
could make this optional).  I also tried to standardize the input parameters a 
bit.  I dropped the outputField setting and a number of other settings, and I 
made the language detection per input field.  The basic gist of it is 
that if you input two fields, name and subject, it will detect the language of 
each field and then attempt to map them to a new field.  The new field is made 
by concatenating the original field name with "_" + the ISO 639 code.  For 
example, if en is the detected language, then the new field for name would be 
name_en.  If that field doesn't exist, it will fall back to the original field 
(i.e. name).
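
A minimal sketch of that remapping rule (illustrative only, not taken from the
attached patch; schemaFields stands for the set of declared field names):

{code}
import java.util.Set;

// "name" detected as English -> "name_en" if the schema declares it,
// otherwise the content stays in the original "name" field.
static String mapField(Set<String> schemaFields, String field, String langCode) {
  String langField = field + "_" + langCode;
  return schemaFields.contains(langField) ? langField : field;
}
{code}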

Left to do:
# Fix the tests.  I don't like how we currently test UpdateProcessorChains.  
It should not require writing your own little piece of update mechanism.  You 
should be able to simply setup the appropriate configuration, hook it into an 
update handler and then hit that update handler.  
# Need to check the license headers, builds, etc.

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966978#action_12966978
 ] 

Robert Muir commented on SOLR-1979:
---

We really need to not be using ISO 639-1 here. 

For example,
it's not expressive enough: it doesn't differentiate between Simplified and 
Traditional Chinese, yet SmartChineseAnalyzer only works on Simplified.

I would like to see RFC 3066 used instead.

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Changes Mess

2010-12-05 Thread Mattmann, Chris A (388J)
Hi Mark,

RE: the credit system. JIRA provides a contribution report here, like this one 
that I generated for Lucene 3.1:

http://s.apache.org/BpL

Just click on Reports > Contribution Report in the upper right of JIRA on the 
main project summary page.

We've been using this in Tika since the beginning to indicate contributions 
from folks and it's worked well.

Cheers,
Chris

On Dec 4, 2010, at 10:03 PM, Mark Miller wrote:

 I like this idea myself - it would encourage better JIRA summaries and 
 reduce duplication.
 
 It's easy to keep a mix of old and new too - keep the things that Grant 
 mentions in CHANGES.txt (back compat migration, misc info), but you can 
 also just export a text Changes from JIRA at release and add that (along 
 with a link). Certainly nice to have a 'hard' copy.
 
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12315147&styleName=Text&projectId=12310110&Create=Create
 
 The only thing I don't like is the loss of the current credit system - I 
 like that better than the crawl through JIRA method. I think prominent 
 credits are a good encouragement for new contributors.
 
 Any comments on that?
 
 - Mark
 
 On 12/2/10 11:46 AM, Grant Ingersoll wrote:
 I think we should drop the item by item change list and instead focus on 3 
 things:
 1. Prose describing the new features (see Tika's changes file for instance) 
 and things users should pay special attention to such as when they might 
 need to re-index.
 2. Calling out explicit compatibility breaks
 3. A Pointer to full list of changes in JIRA.  Alternatively, I believe 
 there is a way in JIRA to export/generate a summary of all issues fixed.
 
 #1 can be done right before release simply by going through #3 and doing the 
 appropriate wordsmithing.  #2 should be tracked as it is found.
 
 It's kind of silly that we have all this duplication of effort built in, not 
 to mention having to track it across two branches.
 
 We do this over in Mahout and I think it works pretty well and reduces the 
 duplication quite a bit since everything is already in JIRA and JIRA 
 produces nice summaries too.  It also encourages people to track things 
 better in JIRA.  #1 above also lends itself well as the basis of press 
 releases/blogs/etc.
 
 -Grant
 
 
 On Dec 1, 2010, at 11:54 AM, Michael McCandless wrote:
 
 So, going forward...
 
 When committing an issue that needs a changes entry, where are we
 supposed to put it?
 
 EG if it's a bug fix that we'll backport all the way to 2.9.x... where
 does it go?
 
 If it's a new feature/API that's going to 3.x and trunk... only in
 3.x's CHANGES?
 
 Mike
 
 On Wed, Dec 1, 2010 at 9:22 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi all,
 
 when merging changes done in 2.9.4/3.0.3 with current 3.x and trunk, I found
 out that the 3.x changes differ immensely between the trunk CHANGES.txt and the
 3.x CHANGES.txt. Some entries are missing in the 3.x branch but are
 available in trunk's 3.x part, and other entries using new trunk class names
 appear among the 3.x changes in trunk.
 
 I copied the 3.x branch CHANGES.txt over trunk's 3.x section and
 attached a patch of this. What should we do? It's messy :( Most parts seem to
 be merge failures. We should go through all those diff'ed issues and check
 where they were really fixed (3.x or trunk) and move the entries
 accordingly. After that, the 3.x branch and trunk's 3.x section of
 CHANGES.txt should contain identical text!
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 --
 Grant Ingersoll
 http://www.lucidimagination.com
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Changes Mess

2010-12-05 Thread Robert Muir
On Sun, Dec 5, 2010 at 12:08 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Mark,

 RE: the credit system. JIRA provides a contribution report here, like this 
 one that I generated for Lucene 3.1:


My concern with this is that it leaves out important email contributors.

For example, if a user reports a bug, we typically include their name
in CHANGES.txt.
The user who reports the bug does the hard work of finding that
there is a bug and reporting it to us.
Additionally, sometimes they do extra stuff: boiling the problem down
to a certain piece of code, turning it into a test case, etc., even if they
don't know how to fix the bug.
Then again, maybe they are a solr user who doesn't even know the java
programming language but finds a nasty bug in lucene.

In all cases I think if a user finds a bug and we fix it, it's
important we credit them, as we should encourage people to find bugs :)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (SOLR-2266) java.lang.ArrayIndexOutOfBoundsException in field cache when using a tdate field in a boost function with rord()

2010-12-05 Thread Peter Wolanin (JIRA)
java.lang.ArrayIndexOutOfBoundsException in field cache when using a tdate 
field in a boost function with rord()


 Key: SOLR-2266
 URL: https://issues.apache.org/jira/browse/SOLR-2266
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4.1
 Environment: Mac OS 10.6
java version 1.6.0_22
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)

Reporter: Peter Wolanin



I have been testing a switch to long and tdate instead of int and date fields 
in the schema.xml for our Drupal integration.  This indexes fine, but search 
fails with a 500 error.

{code}
INFO: [d7] webapp=/solr path=/select 
params={spellcheck=true&facet=true&facet.mincount=1&indent=1&spellcheck.q=term&json.nl=map&wt=json&rows=10&version=1.2&fl=id,entity_id,entity,bundle,bundle_name,nid,title,comment_count,type,created,changed,score,path,url,uid,name&start=0&facet.sort=true&q=term&bf=recip(rord(created),4,19,19)^200.0}
 status=500 QTime=4 
Dec 5, 2010 11:52:28 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 39
at 
org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:721)
at 
org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at 
org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:692)
at 
org.apache.solr.search.function.ReverseOrdFieldSource.getValues(ReverseOrdFieldSource.java:61)
at 
org.apache.solr.search.function.TopValueSource.getValues(TopValueSource.java:57)
at 
org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:61)
at 
org.apache.solr.search.function.FunctionQuery$AllScorer.init(FunctionQuery.java:123)
at 
org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:93)
at 
org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:250)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1101)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:880)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at com.acquia.search.HmacFilter.doFilter(HmacFilter.java:62)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
{code}

The exception goes away if I remove the boost function param 
bf=recip(rord(created),4,19,19)^200.0

Omitting the recip() doesn't help, so just bf=rord(created)^200.0 still causes 
the exception.

In 

[jira] Resolved: (LUCENE-1541) Trie range - make trie range indexing more flexible

2010-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1541.
---

Resolution: Won't Fix

I don't think a fix is needed anymore.

 Trie range - make trie range indexing more flexible
 ---

 Key: LUCENE-1541
 URL: https://issues.apache.org/jira/browse/LUCENE-1541
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Ning Li
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-1541.patch, LUCENE-1541.patch


 In the current trie range implementation, a single precision step is 
 specified. With a large precision step (say 8), a value is indexed in fewer 
 terms (8) but the number of terms for a range can be large. With a small 
 precision step (say 2), the number of terms for a range is smaller but a 
 value is indexed in more terms (32).
 We want to add an option that different precision steps can be set for 
 different precisions. An expert can use this option to keep the number of 
 terms for a range small and at the same time index a value in a small number 
 of terms. See the discussion in LUCENE-1470 that results in this issue.
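 As a concrete illustration of the trade-off, a sketch against the NumericField/NumericRangeQuery API that replaced the contrib trie classes (assuming the 3.0 API; the field name, value, and step are arbitrary):
{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

public class PrecisionStepDemo {
  public static void main(String[] args) {
    // precisionStep 8 on a long value: 64/8 = 8 indexed terms per value
    Document doc = new Document();
    doc.add(new NumericField("price", 8, Field.Store.NO, true).setLongValue(1234L));

    // a larger step means fewer terms per value, but more terms have to
    // be visited per range query (and vice versa)
    NumericRangeQuery<Long> q =
        NumericRangeQuery.newLongRange("price", 8, 1000L, 2000L, true, true);
    System.out.println(q);
  }
}
{code}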

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Path to jquery?

2010-12-05 Thread Eric Pugh
You are quite right.  I put a bug into JIRA; basically the layout.vm was 
referring to an older version of jquery than what was in the Solr.war file!  I 
do think, though, that having everything all in the /velocity directory would 
make it easier for someone who is new to Solr to grok how to customize the 
/browse interface!  Most folks do NOT want to be adding/hacking files in the 
solr.war, they just want to use what is distributed!

Eric



On Dec 2, 2010, at 4:45 PM, Ryan McKinley wrote:

 jquery is actually in the .war file, so you read it directly from the server.
 
 The file?file=/velocity... request streams content from inside your
 solr configuration directory
 
 
 
 On Thu, Dec 2, 2010 at 10:35 AM, Eric Pugh
 ep...@opensourceconnections.com wrote:
 Hi all,
 
 Looking at Solr 3.x, it seems like that path to jquery fails if you are 
 using multicore.
 
 In layout.vm there is:
 
 <script type="text/javascript" src="#{url_for_solr}/admin/jquery-1.2.3.min.js"></script>
 
 However, for other files it is specified via:
 
  <script type="text/javascript" src="#{url_for_solr}/admin/file?file=/velocity/jquery.autocomplete.js&contentType=text/javascript"></script>
 
 
 Thinking that the URL for jquery should be built the same way as the one for 
 jquery.autocomplete.js, and that jquery should be packaged in the /velocity directory as well???
 
 Eric
 
 
 -
 Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
 http://www.opensourceconnections.com
 Co-Author: Solr 1.4 Enterprise Search Server available from 
 http://www.packtpub.com/solr-1-4-enterprise-search-server
 Free/Busy: http://tinyurl.com/eric-cal
 
 
 
 
 
 
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 

-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Co-Author: Solr 1.4 Enterprise Search Server available from 
http://www.packtpub.com/solr-1-4-enterprise-search-server
Free/Busy: http://tinyurl.com/eric-cal









-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Changes Mess

2010-12-05 Thread Steven A Rowe
On 12/5/2010 at 12:19 PM, Robert Muir wrote:
 On Sun, Dec 5, 2010 at 12:08 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
  Hi Mark,
 
  RE: the credit system. JIRA provides a contribution report here, like
  this one that I generated for Lucene 3.1:
 
 
 My concern with this is that it leaves out important email contributors.

I agree, this is a serious problem.

My additional problems with JIRA-generated changes:

1. Huge undifferentiated change lists are frightening and nearly useless, 
regardless of the quality of the descriptions.

JIRA's issue types are:
 
Bug, New Feature, Improvement, Test, Wish, Task

Even if we used JIRA's issue types to group issues, they
are not the same as Lucene's CHANGES.txt issue types:

Changes in backwards compatibility policy, 
Changes in runtime behavior, 
API Changes, Documentation, Bug fixes, New features,
Optimizations, Build, Test Cases, Infrastructure

(I left out Requirements, last used in 2006 under release
1.9 RC1, since Build seems to have replaced it.)

2. There are now four separate CHANGES.txt files in the Lucene code base, 
excluding Solr and its modules (each of which has one of them).  This number 
will only grow as more Lucene contribs become modules.

The JIRA project components list is outdated / incomplete
/ has different granularity than the CHANGES.txt locations,
so using it to group JIRA issues would not work because
they don't align with Lucene/Solr components.

3. Some of the CHANGES.txt entries draw from multiple JIRA issues.

From dev/trunk/lucene/CHANGES.txt:

Trunk: 9 out of 56 include multiple JIRA issues
3.X: 7/94
3.0.0: 3/29
2.9.0: 9/153

I'm assuming a JIRA dump can't do this.

4. Some JIRA issues appear under multiple change categories in CHANGES.txt.

From dev/trunk/lucene/CHANGES.txt:

Trunk: 3 out of 68 multiply categorized
3.X: 9/102
3.0.0: 1/53
2.9.0: 20/166

A JIRA dump would not allow for multiple issue 
categorization, since JIRA only allows a single issue
type to be assigned - I guess they are assumed to be
mutually exclusive.


Maybe our use of JIRA could be changed to address some of these problems, 
through addition of new fields and/or modification of existing fields' 
allowable values?

Steve



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967010#action_12967010
 ] 

Grant Ingersoll commented on SOLR-1979:
---

bq. I would like to see RFC 3066 instead

Yeah, that makes sense, however, I believe Tika returns 639. (Tika doesn't 
recognize Chinese yet at all).  One approach is we could normalize, I suppose.  
Another is to fix Tika.  I'd really like to see Tika support more languages, 
too.

Longer term, I'd like to not do the fieldName_LangCode thing at all and instead 
let the user supply a string that could have variable substitution if they 
want, something like fieldName_${langCode}, or it could be 
${langCode}_fieldName or it could just be another literal.

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
 class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
 <str name="inputFields">name,subject</str>
 <str name="outputField">language_s</str>
 <str name="idField">id</str>
 <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967011#action_12967011
 ] 

Grant Ingersoll commented on SOLR-1979:
---

Another thought, here, is that, over time, this class becomes a base class and 
it becomes easy to replace the language detection piece, that way one gets all 
the infrastructure of this class, but can plugin their own detection.  In fact, 
I'm going to do that right now.

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
 class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
 <str name="inputFields">name,subject</str>
 <str name="outputField">language_s</str>
 <str name="idField">id</str>
 <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Monitoring the UI's mem usage

2010-12-05 Thread Mark Miller
This shouldn't normally be something that you need to do with jruby I'd 
think - but Avram asked about this on the call back when there were UI 
running-out-of-memory issues.


Since we require java 6, this is actually really easy.

Java itself comes with jconsole. It should be on your path. You just 
start it, and it lists running java processes that you can connect to. 
Choose the one with jruby-complete-1.5.3.jar in the name for the UI. The 
back end is the one with start.jar in the name.


I usually prefer visualvm over jconsole (kind of a souped-up version of 
jconsole with a mem/cpu profiler). It's free and simple to use at 
https://visualvm.dev.java.net/.


That makes it very easy to see how the UI and back end are using memory, 
their garbage collection activity, cpu usage, etc.


I often run one on my laptop screen as I test LWE.

- Mark

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Testing UpdateProcessorChain

2010-12-05 Thread Grant Ingersoll
Anyone have any thoughts on testing UpdateProcessorChain (and Factory)?  In 
looking at the Signature (dedup) tests, it seems a little clunky, yet the Solr 
base test class adoc (and related methods) don't seem to support specifying the 
Update handler to hit.

Thoughts?

-Grant
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Monitoring the UI's mem usage

2010-12-05 Thread Mark Miller

Gotto love a wrong email address autocomplete.

On 12/5/10 3:26 PM, Mark Miller wrote:

This shouldn't normally be something that you need to do with jruby I'd
think - but Avram asked about this on the call back when there were UI
running-out-of-memory issues.

Since we require java 6, this is actually really easy.

Java itself comes with jconsole. It should be on your path. You just
start it, and it lists running java processes that you can connect to.
Choose the one with jruby-complete-1.5.3.jar in the name for the UI. The
back end is the one with start.jar in the name.

I usually prefer visualvm over jconsole (kind of a souped-up version of
jconsole with a mem/cpu profiler). It's free and simple to use at
https://visualvm.dev.java.net/.

That makes it very easy to see how the UI and back end are using memory,
their garbage collection activity, cpu usage, etc.

I often run one on my laptop screen as I test LWE.

- Mark

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Testing UpdateProcessorChain

2010-12-05 Thread Yonik Seeley
On Sun, Dec 5, 2010 at 3:28 PM, Grant Ingersoll gsing...@apache.org wrote:
 Anyone have any thoughts on testing UpdateProcessorChain (and Factory).  In 
 looking at the Signature (dedup) tests, it seems a little clunky, yet the 
 Solr base test class adoc (and related methods) don't seem to support 
 specifying the Update handler to hit.

You can specify an alternate update processor with any update command.
SolrTestCaseJ4 has this:
  public static String add(XmlDoc doc, String... args) {

so... you should be able to do something like
add(doc("id","10"), "update.processor", "foo")

-Yonik
http://www.lucidimagination.com
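
A fuller sketch of this, assuming the SolrTestCaseJ4 API quoted above (the chain name "mychain" and the config file names are hypothetical):

{code}
import org.apache.solr.SolrTestCaseJ4;
import org.junit.BeforeClass;
import org.junit.Test;

public class CustomChainTest extends SolrTestCaseJ4 {
  @BeforeClass
  public static void beforeClass() throws Exception {
    // solrconfig.xml is assumed to define an
    // updateRequestProcessorChain named "mychain"
    initCore("solrconfig.xml", "schema.xml");
  }

  @Test
  public void testChainIsApplied() {
    // extra varargs to add() become request parameters,
    // routing the add through the named chain
    assertU(add(doc("id", "10"), "update.processor", "mychain"));
    assertU(commit());
    assertQ(req("id:10"), "//result[@numFound='1']");
  }
}
{code}

(As Grant notes in his follow-up, assertU() goes through doLegacyUpdate, which may drop the chain parameter - treat this as a sketch of the intended usage, not a verified recipe.)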

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread DM Smith
Thanks Uwe (and others). We'll adapt.

Is there any interest here in knowing if there are any other problems regarding 
Lucene on Android? From what I see, it is the first mobile platform on which 
Lucene can run.

-- DM

On Dec 5, 2010, at 5:16 AM, Uwe Schindler wrote:

 Hi DM,
 
 In Lucene 3.0.3, NativeFSLockFactory no longer acquires a test lock and does
 not need the process ID anymore, so the java.lang.management package is no
 longer used.
 
 In general, Lucene Java is compatible with the Java 5 SE specification.
 Android uses Harmony, and therefore we cannot guarantee compatibility, as
 Harmony is not TCK tested (but we do test with the latest versions; soon there
 will also be tests on Hudson with Harmony). But only the latest versions of
 Harmony are really compatible with Lucene - previous versions fail lots of
 tests (ask Robert) - and Android phones use very antique versions of Harmony.
 It is not even sure that the Java 5 Memory Model is correctly implemented in
 Dalvik!
 
 About 3.0.2: Of course this version even works with the latest Harmony, so
 Harmony has the java.lang.management package (which is java.lang!!!), so the
 bug is in Android, simply by excluding an SE package. So you should open a bug
 report at Google and then hope that they fix it and that all the phone
 manufacturers like Motor-Roller will update their Android versions.
 
 For your problem: The easy workaround is using Lucene 3.0.3, or simply use
 another LockFactory (Android is single user, so even NoLockFactory would be
 fine in most cases). These are the same limitations as with the NFS
 filesystem. Just use FSDir.open(dir, lockFactory).
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 -Original Message-
 From: DM Smith [mailto:dm-sm...@woh.rr.com]
 Sent: Sunday, December 05, 2010 12:16 AM
 To: dev@lucene.apache.org
 Subject: Exception in migrating from 2.9.x to 3.0.2 on Android
 
 The current code works on Android with 2.9.1, but fails with 3.0.2:
 
 Directory dir = FSDirectory.open(file);
 ...
 do something with directory
 ...
 
 The error we're seeing is:
 12-04 21:34:41.629: WARN/System.err(23160): java.lang.NoClassDefFoundError: java.lang.management.ManagementFactory
 12-04 21:34:41.639: WARN/System.err(23160): at org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:87)
 12-04 21:34:41.639: WARN/System.err(23160): at org.apache.lucene.store.NativeFSLockFactory.makeLock(NativeFSLockFactory.java:142)
 12-04 21:34:41.649: WARN/System.err(23160): at org.apache.lucene.store.Directory.makeLock(Directory.java:106)
 12-04 21:34:41.649: WARN/System.err(23160): at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1058)
 
 Turns out Android does not have java.lang.management.ManagementFactory.
 
 There are several workarounds in client code, but not sure what is best.
 
 The bigger question is whether and how Lucene should be modified to
 accommodate?
 
 Ultimately FSDirectory.open does the following:
    if (Constants.WINDOWS) {
      return new SimpleFSDirectory(path, lockFactory);
    } else {
      return new NIOFSDirectory(path, lockFactory);
    }
 
 Should Android be a supported client OS?
 
 If so, wouldn't it be better not to have OS specific if-then-else and use
 reflection or something else?
 
 Thanks,
  DM
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
 commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
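
A minimal sketch of the LockFactory workaround Uwe describes above, assuming the Lucene 3.0.x store API; the wrapper class name is hypothetical, and NoLockFactory is only appropriate under the single-process assumption he states:

{code}
import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NoLockFactory;

public class AndroidDirectory {
  // Avoids NativeFSLockFactory, whose test lock needs
  // java.lang.management (missing on Android). NoLockFactory is only
  // safe because a single process accesses the index on the device.
  public static Directory open(File indexDir) throws IOException {
    return FSDirectory.open(indexDir, NoLockFactory.getNoLockFactory());
  }
}
{code}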



Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Mark Miller
I have an interest - don't really care if it uses true java or not. I 
say keep it coming. Where/if it makes sense, why not make lucene work 
better with it. Perhaps that is not possible or too difficult in every 
case - but I'd still like to see the cases pop up. Better than those 
spam wiki update emails.


- Mark

On 12/5/10 3:36 PM, DM Smith wrote:

Thanks Uwe (and others). We'll adapt.

Is there any interest here in knowing if there are any other problems regarding 
Lucene on Android? From what I see, it is the first mobile platform on which 
Lucene can run.

-- DM

On Dec 5, 2010, at 5:16 AM, Uwe Schindler wrote:


Hi DM,

In Lucene 3.0.3, NativeFSLockFactory no longer acquires a test lock and does
not need the process ID anymore, so the java.lang.management package is no
longer used.

In general, Lucene Java is compatible with the Java 5 SE specification.
Android uses Harmony, and therefore we cannot guarantee compatibility, as
Harmony is not TCK tested (but we do test with the latest versions; soon there
will also be tests on Hudson with Harmony). But only the latest versions of
Harmony are really compatible with Lucene - previous versions fail lots of
tests (ask Robert) - and Android phones use very antique versions of Harmony.
It is not even sure that the Java 5 Memory Model is correctly implemented in
Dalvik!

About 3.0.2: Of course this version even works with the latest Harmony, so
Harmony has the java.lang.management package (which is java.lang!!!), so the
bug is in Android, simply by excluding an SE package. So you should open a bug
report at Google and then hope that they fix it and that all the phone
manufacturers like Motor-Roller will update their Android versions.

For your problem: The easy workaround is using Lucene 3.0.3, or simply use
another LockFactory (Android is single user, so even NoLockFactory would be
fine in most cases). These are the same limitations as with the NFS
filesystem. Just use FSDir.open(dir, lockFactory).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


-Original Message-
From: DM Smith [mailto:dm-sm...@woh.rr.com]
Sent: Sunday, December 05, 2010 12:16 AM
To: dev@lucene.apache.org
Subject: Exception in migrating from 2.9.x to 3.0.2 on Android

The current code works on Android with 2.9.1, but fails with 3.0.2:

Directory dir = FSDirectory.open(file);
...
do something with directory
...

The error we're seeing is:
12-04 21:34:41.629: WARN/System.err(23160): java.lang.NoClassDefFoundError: java.lang.management.ManagementFactory
12-04 21:34:41.639: WARN/System.err(23160): at org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:87)
12-04 21:34:41.639: WARN/System.err(23160): at org.apache.lucene.store.NativeFSLockFactory.makeLock(NativeFSLockFactory.java:142)
12-04 21:34:41.649: WARN/System.err(23160): at org.apache.lucene.store.Directory.makeLock(Directory.java:106)
12-04 21:34:41.649: WARN/System.err(23160): at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1058)

Turns out Android does not have java.lang.management.ManagementFactory.

There are several workarounds in client code, but not sure what is best.

The bigger question is whether and how Lucene should be modified to
accommodate?

Ultimately FSDirectory.open does the following:
if (Constants.WINDOWS) {
  return new SimpleFSDirectory(path, lockFactory);
} else {
  return new NIOFSDirectory(path, lockFactory);
}

Should Android be a supported client OS?

If so, wouldn't it be better not to have OS specific if-then-else and use
reflection or something else?

Thanks,
DM
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967016#action_12967016
 ] 

Yonik Seeley commented on SOLR-1979:


bq. The new field is made by concatenating the original field name with _ + 
the ISO 639 code. 

This could be problematic given a large set of language codes since they could 
collide with existing dynamic field definitions.
Perhaps something with text in the name also?

Perhaps fieldName_${langCode}Text

Examples:
name_enText
name_frText

It would probably also be nice to be able to map a number of languages to a 
single field - say you have a single analyzer that can handle CJK - then you 
may want that whole collection of languages mapped to a single _cjk field.

And just because you can detect a language doesn't mean you know how to handle 
it differently... so also have an optional catchall that handles all languages 
not specifically mapped.
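
A hypothetical schema.xml sketch of that naming scheme (all the field type names here are placeholders):

{code:xml}
<!-- per-language fields, e.g. name_enText, name_frText -->
<dynamicField name="*_enText" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_frText" type="text_fr" indexed="true" stored="true"/>
<!-- one shared field for the whole CJK collection -->
<dynamicField name="*_cjk" type="text_cjk" indexed="true" stored="true"/>
<!-- catchall for detected-but-unmapped languages -->
<dynamicField name="*_generalText" type="text_general" indexed="true" stored="true"/>
{code}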




 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
 class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
 <str name="inputFields">name,subject</str>
 <str name="outputField">language_s</str>
 <str name="idField">id</str>
 <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967019#action_12967019
 ] 

Robert Muir commented on SOLR-1979:
---

bq. Yeah, that makes sense, however, I believe Tika returns 639.

Right, but 639 is just a subset of 3066 etc. 

So, ignore what Tika does - its 639 identifiers are also valid 3066.

Our API should at least be 3066, Java7/ICU already support BCP47 locale 
identifiers etc, so you get the normalization there for free.

{quote}
It would probably also be nice to be able to map a number of languages to a 
single field say you have a single analyzer that can handle CJK, then you 
may want that whole collection of languages mapped to a single _cjk field.

And just because you can detect a language doesn't mean you know how to handle 
it differently... so also have an optional catchall that handles all languages 
not specifically mapped.
{quote}

Both of these are good reasons why we must avoid 639-1.
We should be able to use things like macrolanguages and undetermined language.





 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
 class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
 <str name="inputFields">name,subject</str>
 <str name="outputField">language_s</str>
 <str name="idField">id</str>
 <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-2266) java.lang.ArrayIndexOutOfBoundsException in field cache when using a tdate field in a boost function with rord()

2010-12-05 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967022#action_12967022
 ] 

Yonik Seeley commented on SOLR-2266:


OK, here's my guess: it's probably due to multiple indexed values per field 
value.  ord/rord uses the StringIndex to get the ord values, which can't handle 
multiple indexed tokens per field value.

 The tdate type has a precisionStep > 0, meaning it will index multiple values 
per field value to speed up range queries.
If you don't need faster range queries on this type, then use date instead of 
tdate.

But the ideal fix here is to eliminate the use of ord/rord since they also use 
up more memory... sorting by created will instantiate a per-segment long[] 
FieldCache entry.
It would be nice if that could be reused for the function queries too.  This is 
the case if you use ms().
http://wiki.apache.org/solr/FunctionQuery#ms
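
For instance, the date-boosting recipe from that wiki page, adapted to the created field here (the constant 3.16e-11 is roughly 1 divided by the number of milliseconds in a year):

{code}
bf=recip(ms(NOW,created),3.16e-11,1,1)
{code}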

 java.lang.ArrayIndexOutOfBoundsException in field cache when using a tdate 
 field in a boost function with rord()
 

 Key: SOLR-2266
 URL: https://issues.apache.org/jira/browse/SOLR-2266
 Project: Solr
  Issue Type: Bug
Affects Versions: 1.4.1
 Environment: Mac OS 10.6
 java version 1.6.0_22
 Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
 Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
Reporter: Peter Wolanin

 I have been testing a switch to long and tdate instead of int and date fields 
 in the schema.xml for our Drupal integration.  This indexes fine, but search 
 fails with a 500 error.
 {code}
 INFO: [d7] webapp=/solr path=/select 
 params={spellcheck=true&facet=true&facet.mincount=1&indent=1&spellcheck.q=term&json.nl=map&wt=json&rows=10&version=1.2&fl=id,entity_id,entity,bundle,bundle_name,nid,title,comment_count,type,created,changed,score,path,url,uid,name&start=0&facet.sort=true&q=term&bf=recip(rord(created),4,19,19)^200.0}
  status=500 QTime=4 
 Dec 5, 2010 11:52:28 AM org.apache.solr.common.SolrException log
 SEVERE: java.lang.ArrayIndexOutOfBoundsException: 39
 at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:721)
 at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
 at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:692)
 at org.apache.solr.search.function.ReverseOrdFieldSource.getValues(ReverseOrdFieldSource.java:61)
 at org.apache.solr.search.function.TopValueSource.getValues(TopValueSource.java:57)
 at org.apache.solr.search.function.ReciprocalFloatFunction.getValues(ReciprocalFloatFunction.java:61)
 at org.apache.solr.search.function.FunctionQuery$AllScorer.<init>(FunctionQuery.java:123)
 at org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:93)
 at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:297)
 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:250)
 at org.apache.lucene.search.Searcher.search(Searcher.java:171)
 at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1101)
 at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:880)
 at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
 at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
 at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
 at com.acquia.search.HmacFilter.doFilter(HmacFilter.java:62)
 at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
 at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
 at 
 

Re: Testing UpdateProcessorChain

2010-12-05 Thread Grant Ingersoll

On Dec 5, 2010, at 3:34 PM, Yonik Seeley wrote:

 On Sun, Dec 5, 2010 at 3:28 PM, Grant Ingersoll gsing...@apache.org wrote:
 Anyone have any thoughts on testing UpdateProcessorChain (and Factory).  In 
 looking at the Signature (dedup) tests, it seems a little clunky, yet the 
 Solr base test class adoc (and related methods) don't seem to support 
 specifying the Update handler to hit.
 
 You can specify an alternate update processor with any update command.
 SolrTestCaseJ4 has this:
  public static String add(XmlDoc doc, String... args) {
 
 so... you should be able to do something like
 add(doc("id","10"), "update.processor", "foo")

Yeah, I am calling that.  I think the problem is that assertU() calls 
doLegacyUpdate, which doesn't handle getting the chain.
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Robert Muir
what I am saying is that this is a java project, and I don't want to write
to some least common denominator/intersection of java and android. if an api
doesn't exist in android, I could care less. instead, why can't interested
parties have a little project where we port lucene java (perhaps a trivial
patch), set up automated tests etc. this I would be interested in, but let's
keep lucene java as java
On Dec 5, 2010 9:40 PM, Mark Miller markrmil...@gmail.com wrote:
 I have an interest - don't really care if it uses true java or not. I
 say keep it coming. Where/if it makes sense, why not make lucene work
 better with it. Perhaps that is not possible or too difficult in every
 case - but I'd still like to see the cases pop up. Better than those
 spam wiki update emails.

 - Mark

 On 12/5/10 3:36 PM, DM Smith wrote:
 Thanks Uwe (and others). We'll adapt.

 Is there any interest here in knowing if there are any other problems
regarding Lucene on Android? From what I see, it is the first mobile
platform on which Lucene can run.

 -- DM

 On Dec 5, 2010, at 5:16 AM, Uwe Schindler wrote:

 Hi DM,

 In Lucene 3.0.3, NativeFSLockFactory no longer acquires a test lock and does
 not need the process ID anymore, so the java.lang.management package is no
 longer used.

 In general, Lucene Java is compatible with the Java 5 SE specification.
 Android uses Harmony, and therefore we cannot guarantee compatibility, as
 Harmony is not TCK tested (but we do test with the latest versions; soon
 there will also be tests on Hudson with Harmony). But only the latest
 versions of Harmony are really compatible with Lucene - previous versions
 fail lots of tests (ask Robert) - and Android phones use very antique
 versions of Harmony. It is not even sure that the Java 5 Memory Model is
 correctly implemented in Dalvik!

 About 3.0.2: Of course this version even works with the latest Harmony, so
 Harmony has the java.lang.management package (which is java.lang!!!), so the
 bug is in Android, simply by excluding an SE package. So you should open a
 bug report at Google and then hope that they fix it and that all the phone
 manufacturers like Motor-Roller will update their Android versions.

 For your problem: The easy workaround is using Lucene 3.0.3, or simply use
 another LockFactory (Android is single user, so even NoLockFactory would be
 fine in most cases). These are the same limitations as with the NFS
 filesystem. Just use FSDir.open(dir, lockFactory).

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de

 -Original Message-
 From: DM Smith [mailto:dm-sm...@woh.rr.com]
 Sent: Sunday, December 05, 2010 12:16 AM
 To: dev@lucene.apache.org
 Subject: Exception in migrating from 2.9.x to 3.0.2 on Android

 The current code works on Android with 2.9.1, but fails with 3.0.2:

 Directory dir = FSDirectory.open(file);
 ...
 do something with directory
 ...

 The error we're seeing is:
 12-04 21:34:41.629: WARN/System.err(23160): java.lang.NoClassDefFoundError: java.lang.management.ManagementFactory
 12-04 21:34:41.639: WARN/System.err(23160): at org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:87)
 12-04 21:34:41.639: WARN/System.err(23160): at org.apache.lucene.store.NativeFSLockFactory.makeLock(NativeFSLockFactory.java:142)
 12-04 21:34:41.649: WARN/System.err(23160): at org.apache.lucene.store.Directory.makeLock(Directory.java:106)
 12-04 21:34:41.649: WARN/System.err(23160): at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1058)

 Turns out Android does not have java.lang.management.ManagementFactory.

 There are several workarounds in client code, but not sure what is best.

 The bigger question is whether and how Lucene should be modified to
 accommodate?

 Ultimately FSDirectory.open does the following:
 if (Constants.WINDOWS) {
   return new SimpleFSDirectory(path, lockFactory);
 } else {
   return new NIOFSDirectory(path, lockFactory);
 }

 Should Android be a supported client OS?

 If so, wouldn't it be better not to have OS specific if-then-else and
use
 reflection or something else?

 Thanks,
 DM
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For
additional
 commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967032#action_12967032
 ] 

Jan Høydahl commented on SOLR-1979:
---

@Robert: Yes, there must be a way to tell whether or not the language even has 
a profile, through some well defined method. It's not important HOW we improve 
detection certainty, but comparing the top n distances could help. I'm also a 
fan of including other metrics than profile similarity if that can help, 
however for unique scripts that will automatically be covered by profile 
similarity. Detailed solution discussions should continue in TIKA-369.

Macro languages: See TIKA-493

It makes sense to allow for detecting languages outside 639-1, and I believe 
RFC3066 and BCP47 are both re-using the 639 codes, so that if there is a 
2-letter code for a language it will be used. 639-1 is what everyone already 
knows.

In general, improvements should be done in Tika space, then use those in Solr, 
thus building one strong language detection library.

@Grant: I actually planned to do the regEx based field name mapping in a 
separate UpdateProcessor, to make things more flexible. Example:
{code:xml} 
  <processor 
class="org.apache.solr.update.processor.LanguageFieldMapperUpdateProcessor">
<str name="languageField">language</str>
<str name="fromRegEx">(.*?)_lang</str>
<str name="toRegEx">$1_$lang</str>
<str name="notSupportedLanguageToRegEx">$1_t</str>
<str name="supportedLanguages">de,en,fr,it,es,nl</str>
  </processor>
{code} 

Your thought of allowing to detect language for individual fields in one go is 
also interesting. I'd love to see metadata support in SolrInputDocument, so 
that one processor could annotate a @language on the fields analyzed. Then the 
next processor could act on that metadata to rename the field...

@Yonik: By allowing regex naming of field names, we give users a generic tool 
to avoid field name clashes, by picking the pattern. Mapping multiple 
languages to the same suffix also makes sense.


 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
 class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
 <str name="inputFields">name,subject</str>
 <str name="outputField">language_s</str>
 <str name="idField">id</str>
 <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1048) Ids parameter and fl=score throws an exception for wt=json

2010-12-05 Thread Jon Bodner (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967035#action_12967035
 ] 

Jon Bodner commented on SOLR-1048:
--

The issue is still present in the 1.4.1 code base for Solr.  I found the source 
of the problem.  In the ids stage for sharding, the score is not calculated (it 
was returned in the previous stage), so the DocSlice's scores float array is 
still null.  XMLWriter and BinaryResponseWriter include lines like:

includeScore = includeScore && ids.hasScores();

but JSONWriter does not. 

This issue is only going to present itself when you are debugging, since I 
think the ids parameter is only used for sharding, and Solr uses the javabin 
wire protocol instead of json.
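
A minimal sketch reproducing the NPE, assuming the Solr 1.4 DocSlice constructor; passing scores == null mimics the ids stage described above:

{code}
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSlice;

public class DocSliceScoreDemo {
  public static void main(String[] args) {
    // null scores array, as in the "ids" stage of a sharded request
    DocSlice ids = new DocSlice(0, 1, new int[] { 42 }, null, 1, 0.0f);
    System.out.println(ids.hasScores()); // false: the guard the writers should check
    DocIterator it = ids.iterator();
    it.nextDoc();
    it.score(); // NullPointerException, as in the stack trace below
  }
}
{code}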


 Ids parameter and fl=score throws an exception for wt=json
 --

 Key: SOLR-1048
 URL: https://issues.apache.org/jira/browse/SOLR-1048
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.3
Reporter: Laurent Chavet

 http://yourHost:8080/solr/select/?ids=YourDocId&version=2.2&start=0&rows=10&indent=on&fl=score,id&q=%2B*:*
 shows that when using ids= the score for docs is null; when using wt=json:
 http://yourHost:8080/solr/select/?ids=YourDocId&version=2.2&start=0&rows=10&indent=on&fl=score,id&q=%2B*:*&wt=json
 that throws a NullPointerException:
 HTTP Status 500 - null java.lang.NullPointerException
 at org.apache.solr.search.DocSlice$1.score(DocSlice.java:120)
 at org.apache.solr.request.JSONWriter.writeDocList(JSONResponseWriter.java:490)
 at org.apache.solr.request.TextResponseWriter.writeVal(TextResponseWriter.java:140)
 at org.apache.solr.request.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:175)
 at org.apache.solr.request.JSONWriter.writeNamedList(JSONResponseWriter.java:288)
 at org.apache.solr.request.JSONWriter.writeResponse(JSONResponseWriter.java:88)
 at org.apache.solr.request.JSONResponseWriter.write(JSONResponseWriter.java:49)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
 at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:847)
 at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
 at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
 at java.lang.Thread.run(Thread.java:619)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Mark Miller

On 12/5/10 5:05 PM, Robert Muir wrote:

what I am saying, is that this is a java project, and I don't want to
write to some least common denominator/intersection of java and android.


So don't - DM submitting cases that don't work and you not giving a shit 
are not mutually exclusive.


- Mark


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



FieldCache usage for custom field collapse in solr 1.4

2010-12-05 Thread Adam H.
Hey,
I'm trying to use the lucene FieldCache for some custom field collapsing
implementation: basically i'm collapsing on a non-stored field,
and so am using the fieldcache to retrieve field value instances during run.

I noticed I'm getting some OOMs after deploying it, and after looking into
it for a bit, figured that it might have to do with using a call like this:

StringIndex fieldCacheVals = FieldCache.DEFAULT.getStringIndex(reader,
collapseField);

where 'reader' is the instance of the SolrIndexReader passed along to the
component with the ResponseBuilder.SolrQueryRequest object.

As I understand it, this can double memory usage due to (re)loading this
fieldcache on a reader-wide basis rather than on a per-segment basis?
If so, what would be a way to migrate this code to use a per-segment cache?
I'm not sure I understand the semantics there at all...

Any help will be greatly appreciated, thanks a lot!

Adam
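
A sketch of the per-segment pattern, assuming the Lucene 2.9 reader API that ships with Solr 1.4 (getSequentialSubReaders() returns null when the reader is already a single leaf); the collapse logic itself is elided:

{code}
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.FieldCache.StringIndex;

public class PerSegmentCollapse {
  public static void walk(IndexReader topReader, String collapseField) throws IOException {
    IndexReader[] leaves = topReader.getSequentialSubReaders();
    if (leaves == null) {
      leaves = new IndexReader[] { topReader };
    }
    int docBase = 0; // maps segment-local ids back to top-level doc ids
    for (IndexReader leaf : leaves) {
      // cached per segment: after a reopen(), unchanged segments reuse
      // their entries instead of rebuilding one big top-level array
      // (the doubled memory usage described above)
      StringIndex idx = FieldCache.DEFAULT.getStringIndex(leaf, collapseField);
      for (int doc = 0; doc < leaf.maxDoc(); doc++) {
        String value = idx.lookup[idx.order[doc]];
        // collapse on (docBase + doc, value) here
      }
      docBase += leaf.maxDoc();
    }
  }
}
{code}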


Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Robert Muir
On Sun, Dec 5, 2010 at 6:10 PM, Mark Miller markrmil...@gmail.com wrote:
 On 12/5/10 5:05 PM, Robert Muir wrote:

 what I am saying, is that this is a java project, and I don't want to
 write to some least common denominator/intersection of java and android.

 So don't - DM submitting cases that don't work and you not giving a shit are
 not mutually exclusive.


Just trying to say, I don't think we should change the programming
language of the project without a proper vote.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Mark Miller

bq.  Perhaps that is not possible or too difficult in every case

To clarify - it sounds like I could be saying: well, perhaps we can't 
improve every case, but some we can. I'm saying *too difficult in every 
case* - even if we don't try and *fix a single case* - it's still 
beneficial for you to report and discuss these issues IMO. And as I 
said, I'll remain interested.



- Mark

On 12/5/10 3:40 PM, Mark Miller wrote:

I have an interest - don't really care if it uses true java or not. I
say keep it coming. Where/if it makes sense, why not make lucene work
better with it. Perhaps that is not possible or too difficult in every
case - but I'd still like to see the cases pop up. Better than those
spam wiki update emails.

- Mark

On 12/5/10 3:36 PM, DM Smith wrote:

Thanks Uwe (and others). We'll adapt.

Is there any interest here in knowing if there are any other problems
regarding Lucene on Android? From what I see, it is the first mobile
platform on which Lucene can run.

-- DM

On Dec 5, 2010, at 5:16 AM, Uwe Schindler wrote:


Hi DM,

In Lucene 3.0.3, NativeFSLockFactory no longer acquires a test lock and
does not need the process ID anymore, so the java.lang.management package is
no longer used.

In general, Lucene Java is compatible with the Java 5 SE specification.
Android uses Harmony, and therefore we cannot guarantee compatibility, as
Harmony is not TCK tested (but we do test with the latest versions; soon
there will also be tests on Hudson with Harmony). But only the latest
versions of Harmony are really compatible with Lucene - previous versions
fail lots of tests (ask Robert) - and Android phones use very antique
versions of Harmony. It is not even sure that the Java 5 Memory Model is
correctly implemented in Dalvik!

About 3.0.2: Of course this version even works with the latest Harmony, so
Harmony has the java.lang.management package (which is java.lang!!!), so
the bug is in Android, simply by excluding an SE package. So you should open
a bug report at Google and then hope that they fix it and that all the phone
manufacturers like Motor-Roller will update their Android versions.

For your problem: The easy workaround is using Lucene 3.0.3, or simply
use another LockFactory (Android is single user, so even NoLockFactory
would be fine in most cases). These are the same limitations as with the NFS
filesystem. Just use FSDir.open(dir, lockFactory).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


-Original Message-
From: DM Smith [mailto:dm-sm...@woh.rr.com]
Sent: Sunday, December 05, 2010 12:16 AM
To: dev@lucene.apache.org
Subject: Exception in migrating from 2.9.x to 3.0.2 on Android

The current code works on Android with 2.9.1, but fails with 3.0.2:

Directory dir = FSDirectory.open(file);
...
do something with directory
...

The error we're seeing is:
12-04 21:34:41.629: WARN/System.err(23160): java.lang.NoClassDefFoundError: java.lang.management.ManagementFactory
12-04 21:34:41.639: WARN/System.err(23160): at org.apache.lucene.store.NativeFSLockFactory.acquireTestLock(NativeFSLockFactory.java:87)
12-04 21:34:41.639: WARN/System.err(23160): at org.apache.lucene.store.NativeFSLockFactory.makeLock(NativeFSLockFactory.java:142)
12-04 21:34:41.649: WARN/System.err(23160): at org.apache.lucene.store.Directory.makeLock(Directory.java:106)
12-04 21:34:41.649: WARN/System.err(23160): at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1058)

Turns out Android does not have java.lang.management.ManagementFactory.

There are several workarounds in client code, but not sure what is best.

The bigger question is whether and how Lucene should be modified to
accommodate?

Ultimately FSDirectory.open does the following:
if (Constants.WINDOWS) {
  return new SimpleFSDirectory(path, lockFactory);
} else {
  return new NIOFSDirectory(path, lockFactory);
}

Should Android be a supported client OS?

If so, wouldn't it be better not to have OS specific if-then-else
and use
reflection or something else?

Thanks,
DM
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For
additional
commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org






-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Mark Miller

On 12/5/10 6:15 PM, Robert Muir wrote:

On Sun, Dec 5, 2010 at 6:10 PM, Mark Millermarkrmil...@gmail.com  wrote:

On 12/5/10 5:05 PM, Robert Muir wrote:


what I am saying, is that this is a java project, and I don't want to
write to some least common denominator/intersection of java and android.


So don't - DM submitting cases that don't work and you not giving a shit are
not mutually exclusive.



Just trying to say, i dont think we should change the programming
language of the project without a proper vote.



Then you're just overreacting again.

Allow me to sum up for you:

DM: hey, we are trying to use lucene on android - this is not working
Uwe and someone: that's not real java, we don't support it
Rmuir : **%$$!! (kidding - I don't remember what you said)
DM: Oh, pardon me. Well okay - but would anyone be interested in us 
reporting what doesn't work as we go through this? Android is the only 
mobile platform lucene works on I think.

Mark: Oh yeah - interesting - please do. I'd be interested in seeing.
Rmuir: don't change the lucene impl language without a vote! Gr!
Mark: ??
Native Police: why are you so aggressive?


- Mark

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967046#action_12967046
 ] 

Grant Ingersoll commented on SOLR-1979:
---

bq. @Grant: I actually planned to do the regEx based field name mapping in a 
separate UpdateProcessor, to make things more flexible

I don't really see that it makes it any more flexible.  If it were a general 
purpose mapper, maybe, but since it is tied to the language field, why not just 
put it in the language processor?  I've already got the method that chooses the 
output field as a protected method.  With that, one merely needs to extend it 
to provide an alternate method from what you have proposed.
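
For example (a hypothetical sketch - the actual class and method names in the patch may differ), a subclass overriding such a protected hook:

{code}
// Assumes the patch exposes a protected getOutputField(fieldName, langCode)
// on the processor; both names here are hypothetical.
public class MyLangIdProcessor extends LanguageIdentifierUpdateProcessor {
  @Override
  protected String getOutputField(String fieldName, String langCode) {
    // ${langCode}_fieldName instead of the default fieldName_${langCode}
    return langCode + "_" + fieldName;
  }
}
{code}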

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
 class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
 <str name="inputFields">name,subject</str>
 <str name="outputField">language_s</str>
 <str name="idField">id</str>
 <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-1979:
--

Attachment: SOLR-1979.patch

Here's a patch that passes the tests.  Note, I modified the Solr base test case 
to have some new methods to properly call update handlers and then validate the 
results.

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect language of some random text in order to act 
 upon it, such as indexing the content into language aware fields. Another 
 usecase is to be able to filter/facet on language on random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
 class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
 <str name="inputFields">name,subject</str>
 <str name="outputField">language_s</str>
 <str name="idField">id</str>
 <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from inputFields name and subject, perform 
 language identification and output the ISO code for the detected language in 
 the outputField. If no language was detected, fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967048#action_12967048
 ] 

Grant Ingersoll commented on SOLR-1979:
---

Note, the patch still needs more tests and needs to check headers, etc. as well 
as the better field mapping and the proper language support that Robert is 
talking about.

 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect the language of some random text in order to act 
 upon it, such as indexing the content into language-aware fields. Another 
 use case is to be able to filter/facet on language for random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
     class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from the input fields name and subject, perform 
 language identification, and output the ISO code for the detected language in 
 the outputField. If no language was detected, the fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2235) implement PerFieldAnalyzerWrapper.getOffsetGap

2010-12-05 Thread Nick Pellow (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12967057#action_12967057
 ] 

Nick Pellow commented on LUCENE-2235:
-

I just upgraded to 3.0.3 and we started getting NullPointerExceptions coming 
from PerFieldAnalyzerWrapper.
We have a PerFieldAnalyzerWrapper that has a null defaultAnalyzer:
{code}
private final PerFieldAnalyzerWrapper analyzer = new 
PerFieldAnalyzerWrapper(null);
{code}

We add analyzers for all fields that are analyzed, i.e. field.isAnalyzed() == 
true.
getOffsetGap() on PerFieldAnalyzerWrapper is being called even for the 
non-analyzed fields. Is this expected behaviour?

Lines 200-203 of DocInverterPerField are: 
{code}
if (anyToken)
  fieldState.offset += docState.analyzer.getOffsetGap(field);
fieldState.boost *= field.getBoost();
  }

{code}
Should this check that the field is indeed analyzed before calling 
getOffsetGap()?
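
For illustration, the guard being asked about might look something like this (a hypothetical sketch, not committed code; the exact accessor on the field, e.g. isTokenized() vs. isAnalyzed(), is an assumption):

{code}
// Hypothetical guard (an assumption, not a committed patch): only apply the
// offset gap for fields whose text actually went through an analyzer.
if (anyToken && field.isTokenized()) {
  fieldState.offset += docState.analyzer.getOffsetGap(field);
}
fieldState.boost *= field.getBoost();
{code}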


 implement PerFieldAnalyzerWrapper.getOffsetGap
 --

 Key: LUCENE-2235
 URL: https://issues.apache.org/jira/browse/LUCENE-2235
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.0
 Environment: Any
Reporter: Javier Godoy
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9.4, 3.0.3, 3.1, 4.0

 Attachments: LUCENE-2235.patch, PerFieldAnalyzerWrapper.patch


 PerFieldAnalyzerWrapper does not delegate calls to getOffsetGap(Fieldable); 
 instead it returns the default values from the base Analyzer implementation. 
 (Similar to LUCENE-659, PerFieldAnalyzerWrapper fails to implement 
 getPositionIncrementGap.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2599) Deprecate Spatial Contrib

2010-12-05 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12967064#action_12967064
 ] 

Chris Male commented on LUCENE-2599:


I just noticed that Solr depends upon some methods in DistanceUtils. We'll 
need to move those into the module before removing the contrib from 4.x.

 Deprecate Spatial Contrib
 -

 Key: LUCENE-2599
 URL: https://issues.apache.org/jira/browse/LUCENE-2599
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/spatial
Affects Versions: 4.0
Reporter: Chris Male
 Attachments: LUCENE-2599.patch, LUCENE-2599.patch


 The spatial contrib is blighted by bugs. The latest series, found by Grant 
 and discussed 
 [here|http://search.lucidimagination.com/search/document/c32e81783642df47/spatial_rethinking_cartesian_tiers_implementation], 
 shows that we need to re-think the cartesian tier implementation.
 Given the need to create a spatial module containing code taken from both 
 lucene and Solr, it makes sense to deprecate the spatial contrib, and start 
 from scratch in the new module.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Exception in migrating from 2.9.x to 3.0.2 on Android

2010-12-05 Thread Robert Muir
On Sun, Dec 5, 2010 at 9:12 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 I personally consider android a valid platform for lucene and we
 should try to reduce the pain for android folks as much as possible.
 Changing supported platforms is a totally different thing to me.


Good, you can start a separate subproject as a port then.

But until then, Android isn't supported by lucene-java.
Android is a different programming language, and by supporting it, we
change the programming language of the lucene-java project.

This requires a vote; until then, it's not supported, by definition,
since our documented programming language is Java, not Android.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor

2010-12-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12967076#action_12967076
 ] 

Robert Muir commented on SOLR-1979:
---

{quote}
It makes sense to allow for detecting languages outside 639-1, and I believe 
RFC3066 and BCP47 are both re-using the 639 codes, so that if there is a 
2-letter code for a language it will be used. 639-1 is what everyone already 
knows.

In general, improvements should be done in Tika space, then use those in Solr, 
thus building one strong language detection library.
{quote}

Yes they do; the 639-1 codes that Tika outputs are also valid BCP47 codes :)

But in Solr, when designing up front, I was just saying we shouldn't limit any 
abstract portion to 639-1 when another implementation might support RFC 3066 or 
BCP47... we should make sure we allow that.


 Create LanguageIdentifierUpdateProcessor
 

 Key: SOLR-1979
 URL: https://issues.apache.org/jira/browse/SOLR-1979
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Jan Høydahl
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch


 We need the ability to detect the language of some random text in order to act 
 upon it, such as indexing the content into language-aware fields. Another 
 use case is to be able to filter/facet on language for random unstructured 
 content.
 To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The 
 processor is configurable like this:
 {code:xml} 
   <processor 
     class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
     <str name="inputFields">name,subject</str>
     <str name="outputField">language_s</str>
     <str name="idField">id</str>
     <str name="fallback">en</str>
   </processor>
 {code} 
 It will then read the text from the input fields name and subject, perform 
 language identification, and output the ISO code for the detected language in 
 the outputField. If no language was detected, the fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Changes Mess

2010-12-05 Thread Mattmann, Chris A (388J)
Hi Steven,

Yep, like you state below JIRA *could* be configured to deal with this. 

In all honesty, putting tons of thought and effort into how to precisely deal 
with the changes you specify below might be somewhat overkill.

Cheers,
Chris

On Dec 5, 2010, at 12:17 PM, Steven A Rowe wrote:

 On 12/5/2010 at 12:19 PM, Robert Muir wrote:
 On Sun, Dec 5, 2010 at 12:08 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Mark,
 
 RE: the credit system. JIRA provides a contribution report here, like
 this one that I generated for Lucene 3.1:
 
 
 My concern with this is that it leaves out important email contributors.
 
 I agree, this is a serious problem.
 
 My additional problems with JIRA-generated changes:
 
 1. Huge undifferentiated change lists are frightening and nearly useless, 
 regardless of the quality of the descriptions.
 
   JIRA's issue types are:

   Bug, New Feature, Improvement, Test, Wish, Task
 
   Even if we used JIRA's issue types to group issues, they
   are not the same as Lucene's CHANGES.txt issue types:
 
   Changes in backwards compatibility policy, 
   Changes in runtime behavior, 
   API Changes, Documentation, Bug fixes, New features,
   Optimizations, Build, Test Cases, Infrastructure
 
   (I left out Requirements, last used in 2006 under release
   1.9 RC1, since Build seems to have replaced it.)
 
 2. There are now four separate CHANGES.txt files in the Lucene code base, 
 excluding Solr and its modules (each of which has one of them).  This number 
 will only grow as more Lucene contribs become modules.
 
   The JIRA project components list is outdated / incomplete
   / has different granularity than the CHANGES.txt locations,
   so using it to group JIRA issues would not work because
   they don't align with Lucene/Solr components.
 
 3. Some of the CHANGES.txt entries draw from multiple JIRA issues.
 
   From dev/trunk/lucene/CHANGES.txt:
 
   Trunk: 9 out of 56 include multiple JIRA issues
   3.X: 7/94
   3.0.0: 3/29
   2.9.0: 9/153
 
   I'm assuming a JIRA dump can't do this.
 
 4. Some JIRA issues appear under multiple change categories in CHANGES.txt.
 
   From dev/trunk/lucene/CHANGES.txt:
 
   Trunk: 3 out of 68 multiply categorized
   3.X: 9/102
   3.0.0: 1/53
   2.9.0: 20/166
 
   A JIRA dump would not allow for multiple issue 
   categorization, since JIRA only allows a single issue
   type to be assigned - I guess they are assumed to be
   mutually exclusive.
 
 
 Maybe our use of JIRA could be changed to address some of these problems, 
 through addition of new fields and/or modification of existing fields' 
 allowable values?
   
 Steve
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Changes Mess

2010-12-05 Thread Steven A Rowe
Hi Chris,

On 12/5/2010 at 10:36 PM, Chris Mattman wrote:
 Yep, like you state below JIRA *could* be configured to deal with this.
 
 In all honesty, putting tons of thought and effort into how to precisely
 deal with the changes you specify below might be somewhat overkill.

I think dumping CHANGES.txt in favor of output from a badly misconfigured issue 
tracking system would be foolish.  

One way to deal with the problem is to stay with CHANGES.txt.  (We've been down 
this road before, and this is where we landed in the past.)

Another would be to fix the issue tracking system.

Yet another way would be to declare the problem non-existent and screw our 
users by insulting them with a honking great mass of changes without any 
indication about what they are or how they are inter-related.  (You won't be 
surprised at this point, I think, by my -1 to this.)

Steve



Re: Changes Mess

2010-12-05 Thread Mattmann, Chris A (388J)
 
 Yet another way would be to declare the problem non-existent and screw our 
 users by insulting them with a honking great mass of changes without any 
 indication about what they are or how they are inter-related.  (You won't be 
 surprised at this point, I think, by my -1 to this.)

Right, I'm one of those users (I have been in the past, and somewhat still am), 
as well as a former member of the PMC. So acting as if I'm suggesting we screw 
them over (them including me), simply because I suggest that solving this mess 
completely is intractable and you have to go with a heuristic (which I'd argue 
isn't worth spending oodles of time on), is also a bit insulting.

I suggested that JIRA can handle this. We're using it in, oh, about 2-3 Apache 
projects I'm on, and it's working great. If you think it's a mess for all the 
reasons you put in the email, great, that's your prerogative. I'm just saying 
that in my experience it hasn't been that bad.

Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1395) Integrate Katta

2010-12-05 Thread JohnWu (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12967086#action_12967086
 ] 

JohnWu commented on SOLR-1395:
--

TomLiu:

I am still stuck at the query dispatch to the subproxy!

SEVERE: Error calling public abstract org.apache.solr.katta.KattaResponse 
org.apache.solr.katta.ISolrServer.request(java.lang.String[],org.apache.solr.katta.KattaRequest)
 throws java.lang.Exception on pc-slave02:2 (try # 1 of 3) (id=0)
java.lang.reflect.InvocationTargetException

So here is my proxy configuration; please review it:

1) solrHome - solrconfig.xml:

<config>
  <requestHandler name="standard" class="solr.KattaRequestHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="shards">*</str>
    </lst>
  </requestHandler>
</config>

OK, all the shards are watched and held in ZooKeeper, as seen through zkCli.sh:

[zk: pc-master(CONNECTED) 11] ls /katta/shard-to-nodes
[SPIndex05#1287138886138-99384445, SPIndex04#1287138886138-99384445]

2) In the proxy, katta.node.properties:

node.server.class=net.sf.katta.lib.lucene.LuceneServer

3) The query:

http://localhost:8080/solr-1395-katta-0.6.2-2patch/select/?q=lovealice&version=2.2&start=0&rows=10&indent=on&isShard=false&distrib=true

Is that right? Especially step 2?

Thanks!

JohnWu



 Integrate Katta
 ---

 Key: SOLR-1395
 URL: https://issues.apache.org/jira/browse/SOLR-1395
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Next

 Attachments: back-end.log, front-end.log, hadoop-core-0.19.0.jar, 
 katta-core-0.6-dev.jar, katta-solrcores.jpg, katta.node.properties, 
 katta.zk.properties, log4j-1.2.13.jar, solr-1395-1431-3.patch, 
 solr-1395-1431-4.patch, solr-1395-1431-katta0.6.patch, 
 solr-1395-1431-katta0.6.patch, solr-1395-1431.patch, 
 solr-1395-katta-0.6.2-1.patch, solr-1395-katta-0.6.2-2.patch, 
 solr-1395-katta-0.6.2-3.patch, solr-1395-katta-0.6.2.patch, SOLR-1395.patch, 
 SOLR-1395.patch, SOLR-1395.patch, test-katta-core-0.6-dev.jar, 
 zkclient-0.1-dev.jar, zookeeper-3.2.1.jar

   Original Estimate: 336h
  Remaining Estimate: 336h

 We'll integrate Katta into Solr so that:
 * Distributed search uses Hadoop RPC
 * Shard/SolrCore distribution and management
 * Zookeeper based failover
 * Indexes may be built using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2471) Supporting bulk copies in Directory

2010-12-05 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12967100#action_12967100
 ] 

Shai Erera commented on LUCENE-2471:


At some point IndexInput/Output.copyBytes did use a FileChannel optimization in 
FSDirectory, but that caused trouble, I think, when the copying thread was 
interrupted. So it was removed and we were left with the default impl.
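
For illustration, the default buffered copy being discussed could look roughly like this. It is a sketch under my own assumptions (buffer size, the method's exact home on IndexOutput), not the committed implementation; only readBytes/writeBytes are existing APIs.

{code}
// Rough sketch (an assumption, not the committed code) of the default
// buffered copy the issue describes; this would live on IndexOutput.
public void copyBytes(IndexInput input, long numBytes) throws IOException {
  byte[] buffer = new byte[4096];  // intermediate buffer an override can skip
  while (numBytes > 0) {
    int chunk = (int) Math.min(buffer.length, numBytes);
    input.readBytes(buffer, 0, chunk); // pull from the source
    writeBytes(buffer, 0, chunk);      // push into this output
    numBytes -= chunk;
  }
}
{code}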

 Supporting bulk copies in Directory
 ---

 Key: LUCENE-2471
 URL: https://issues.apache.org/jira/browse/LUCENE-2471
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Earwin Burrfoot
 Fix For: 3.1, 4.0


 A method can be added to IndexOutput that accepts an IndexInput and writes 
 bytes using it as a source.
 This should be used for bulk-merge cases (offhand: norms, docstores?). Some 
 Directories can then override the default impl and skip the intermediate 
 buffers (NIO, MMap, RAM?).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2800) Search Index Generation fails

2010-12-05 Thread Sunitha Belavagi (JIRA)
Search Index Generation fails
-

 Key: LUCENE-2800
 URL: https://issues.apache.org/jira/browse/LUCENE-2800
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.0.0
 Environment: Windows Server 2003 
Reporter: Sunitha Belavagi


Hi,

We are using Lucene 2.0.0 for the search index in our Comergent application.
It had been working fine for more than 3 years.
Since this week, it has been throwing an exception while creating a new index 
and also for the incremental index.
Below is the exception:


com.comergent.api.appservices.productService.ProductServiceException: 
java.io.IOException: Cannot delete 
...\searchIndex\en_US\MasterIndex_602580\segments 
at 
com.comergent.reference.appservices.productService.search.indexBuilder.CatalogIndexSetBuilder.indexPCFromCache(CatalogIndexSetBuilder.java:634)
 
at 
com.comergent.reference.appservices.productService.search.indexBuilder.CatalogIndexSetBuilder.buildIndexSet(CatalogIndexSetBuilder.java:276)
 
at 
com.comergent.appservices.search.indexBuilder.IndexSetBuilder$BuilderThread.run(IndexSetBuilder.java:469)
 
Caused by: java.io.IOException: Cannot delete 
searchIndex\en_US\MasterIndex_602580\segments 
at org.apache.lucene.store.FSDirectory.renameFile(FSDirectory.java:268) 
at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:95) 
at org.apache.lucene.index.IndexWriter$4.doBody(IndexWriter.java:726) 
at org.apache.lucene.store.Lock$With.run(Lock.java:99) 
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:724) 
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:686) 
at 
org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:674) 
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:479) 
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:462) 
at 
com.comergent.reference.appservices.productService.search.indexBuilder.CatalogIndexSetBuilder.indexPCFromCache(CatalogIndexSetBuilder.java:630)
 
... 2 more 
2010.12.05 06:25:13:532 Env/Thread-21961:ERROR:CatalogIndexSetBuilder 
CatalogIndexSetBuilder: [MasterIndex_602580] - Exception: 
com.comergent.api.appservices.productService.ProductServiceException: 
java.io.IOException: Cannot delete ...\MasterIndex_602580\segments
2010.12.05 06:25:13:532 Env/Thread-21961:INFO:CMGT_SEARCH 
IndexSetBuilder$BuilderThread: error building the index for: MasterIndex_602580
com.comergent.api.exception.ComergentException: 
com.comergent.api.appservices.productService.ProductServiceException: 
java.io.IOException: Cannot delete 
\searchIndex\en_US\MasterIndex_602580\segments
at 
com.comergent.reference.appservices.productService.search.indexBuilder.CatalogIndexSetBuilder.buildIndexSet(CatalogIndexSetBuilder.java:305)
at 
com.comergent.appservices.search.indexBuilder.IndexSetBuilder$BuilderThread.run(IndexSetBuilder.java:469)
Caused by: 
com.comergent.api.appservices.productService.ProductServiceException: 
java.io.IOException: Cannot delete ...\MasterIndex_602580\segments
at 
com.comergent.reference.appservices.productService.search.indexBuilder.CatalogIndexSetBuilder.indexPCFromCache(CatalogIndexSetBuilder.java:634)
at 
com.comergent.reference.appservices.productService.search.indexBuilder.CatalogIndexSetBuilder.buildIndexSet(CatalogIndexSetBuilder.java:276)
... 1 more
Caused by: java.io.IOException: Cannot delete ...\MasterIndex_602580\segments
at org.apache.lucene.store.FSDirectory.renameFile(FSDirectory.java:268)
at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:95)
at org.apache.lucene.index.IndexWriter$4.doBody(IndexWriter.java:726)
at org.apache.lucene.store.Lock$With.run(Lock.java:99)
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:724)
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:686)
at 
org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:674)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:479)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:462)
at 
com.comergent.reference.appservices.productService.search.indexBuilder.CatalogIndexSetBuilder.indexPCFromCache(CatalogIndexSetBuilder.java:630)
... 2 more

2010.12.05 06:25:13:938 Env/http-8080-Processor75:INFO:CMGT_SEARCH 
IndexSetBuilder: error building the index: 
com.comergent.api.appservices.search.exception.IndexingException: Error in 
executing some builder threads...
at 
com.comergent.appservices.search.indexBuilder.IndexSetBuilder.monitor(IndexSetBuilder.java:440)
at 
com.comergent.appservices.search.indexBuilder.IndexSetBuilder.build(IndexSetBuilder.java:185)
at 

[jira] Commented: (LUCENE-2235) implement PerFieldAnalyzerWrapper.getOffsetGap

2010-12-05 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12967115#action_12967115
 ] 

Uwe Schindler commented on LUCENE-2235:
---

Hi Nick,
thanks for reporting this. Your problem only occurs because the previously 
missing method was added (before, PFAW just returned some default; now it throws 
an NPE in that case).

In general, Lucene does not support *null* analyzers anywhere (not as a ctor 
argument in IW/IWC, nor e.g. here). You should always pass a simple analyzer 
(WhitespaceAnalyzer, SimpleAnalyzer, KeywordAnalyzer) to IndexWriter or to 
other methods taking an Analyzer.

To really fix this, we would have to review all places that don't need to call 
analyzers. There are other such places, too: e.g., when you pass a TokenStream 
directly to the Field with new Field(name, TokenStream), it also calls the 
analyzer, so you have to implement it.
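
As an illustration of that advice, a minimal sketch of the null-free setup (the field name and analyzer choices here are just placeholders, not anything from the patch):

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

public class NonNullWrapperExample {
  static Analyzer buildAnalyzer() {
    // Use a harmless concrete default instead of null, as suggested above.
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(new KeywordAnalyzer());
    // Per-field overrides still work as before ("body" is a placeholder).
    wrapper.addAnalyzer("body", new WhitespaceAnalyzer());
    return wrapper;
  }
}
{code}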

 implement PerFieldAnalyzerWrapper.getOffsetGap
 --

 Key: LUCENE-2235
 URL: https://issues.apache.org/jira/browse/LUCENE-2235
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.0
 Environment: Any
Reporter: Javier Godoy
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9.4, 3.0.3, 3.1, 4.0

 Attachments: LUCENE-2235.patch, PerFieldAnalyzerWrapper.patch


 PerFieldAnalyzerWrapper does not delegate calls to getOffsetGap(Fieldable); 
 instead it returns the default values from the base Analyzer implementation. 
 (Similar to LUCENE-659, PerFieldAnalyzerWrapper fails to implement 
 getPositionIncrementGap.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1395) Integrate Katta

2010-12-05 Thread tom liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12967118#action_12967118
 ] 

tom liu commented on SOLR-1395:
---

In the proxy, katta.node.properties should be:

#node.server.class=net.sf.katta.lib.lucene.LuceneServer
node.server.class=org.apache.solr.katta.DeployableSolrKattaServer

You must put apache-solr-core-XXX.jar into katta's lib directory, along with 
the related jars.

 Integrate Katta
 ---

 Key: SOLR-1395
 URL: https://issues.apache.org/jira/browse/SOLR-1395
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: Next

 Attachments: back-end.log, front-end.log, hadoop-core-0.19.0.jar, 
 katta-core-0.6-dev.jar, katta-solrcores.jpg, katta.node.properties, 
 katta.zk.properties, log4j-1.2.13.jar, solr-1395-1431-3.patch, 
 solr-1395-1431-4.patch, solr-1395-1431-katta0.6.patch, 
 solr-1395-1431-katta0.6.patch, solr-1395-1431.patch, 
 solr-1395-katta-0.6.2-1.patch, solr-1395-katta-0.6.2-2.patch, 
 solr-1395-katta-0.6.2-3.patch, solr-1395-katta-0.6.2.patch, SOLR-1395.patch, 
 SOLR-1395.patch, SOLR-1395.patch, test-katta-core-0.6-dev.jar, 
 zkclient-0.1-dev.jar, zookeeper-3.2.1.jar

   Original Estimate: 336h
  Remaining Estimate: 336h

 We'll integrate Katta into Solr so that:
 * Distributed search uses Hadoop RPC
 * Shard/SolrCore distribution and management
 * Zookeeper based failover
 * Indexes may be built using Hadoop

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org