Re: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-04 Thread Toke Eskildsen
On Wed, 2008-04-02 at 09:30 -0400, Mark Miller wrote:
> > - replacement of indexed for loops with for each constructs
> 
> Is this always the best idea? Doesn't the for loop construct make an
> iterator, which can be much slower than an indexed for loop?

Only when iterating over collections. For arrays, the foreach is
syntactic sugar for an indexed for-loop.
http://java.sun.com/docs/books/jls/third_edition/html/statements.html#14.14.2

Whether using an iterator for e.g. ArrayList is better than an indexed
for-loop is another question. This is the old problem of balancing
general code vs. performance: an indexed for-loop might be faster than
an iterator for ArrayList, but it is definitely slower for LinkedList,
where each get(i) has to walk the list.
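
To make the trade-off concrete, here is a minimal sketch of the three
loop shapes discussed above. The class and method names (LoopStyles,
sumArray, sumForEach, sumIndexed) are made up for illustration and are
not part of Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class LoopStyles {

    // Array: the enhanced for loop is equivalent to an indexed loop,
    // no Iterator object is created (JLS 14.14.2).
    static long sumArray(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Collection: the enhanced for loop calls list.iterator().
    // One pass, O(n) in total for both ArrayList and LinkedList.
    static long sumForEach(List<Integer> list) {
        long sum = 0;
        for (int v : list) {
            sum += v;
        }
        return sum;
    }

    // Indexed access: get(i) is O(1) on ArrayList, but O(n) on
    // LinkedList, which makes the whole loop O(n^2) there.
    static long sumIndexed(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<Integer>();
        List<Integer> linkedList = new LinkedList<Integer>();
        int[] array = new int[1000];
        for (int i = 0; i < array.length; i++) {
            array[i] = i;
            arrayList.add(i);
            linkedList.add(i);
        }
        // All variants compute the same sum; only the cost differs.
        System.out.println(sumArray(array));
        System.out.println(sumForEach(arrayList));
        System.out.println(sumIndexed(arrayList));
        System.out.println(sumForEach(linkedList));
        System.out.println(sumIndexed(linkedList));
    }
}
```

So the foreach form is the safe default when the concrete List type is
not known; the indexed form only pays off when the list is known to
support fast random access.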


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1121) Use nio.transferTo when copying large blocks of bytes

2008-04-04 Thread Raghu Angadi (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585677#action_12585677 ]

Raghu Angadi commented on LUCENE-1121:
--------------------------------------

The only savings I would expect from transferTo() is reduced CPU usage. Does
the benchmark above measure wall clock time or CPU time? Btw, the Windows
results are pretty... strange.

HADOOP-3164 shows the expected CPU benefit. I still need to do more extensive
tests where I max out the CPU with and without the patch and compare the wall
clock time. The initial test just compares the CPU time reported in
/proc/pid/stat, using a test that is disk bound.
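
For anyone who wants to reproduce the comparison, here is a rough sketch that
copies a file both ways and reports wall clock time and per-thread CPU time
side by side. It is not the attached testIO.java and not the patch itself;
the class and method names (CopyComparison, copyBuffered, copyTransferTo,
report) and the 8 MB file size are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CopyComparison {

    interface Copy {
        void run(Path src, Path dst) throws IOException;
    }

    // Copy through an intermediate buffer in user space.
    static void copyBuffered(Path src, Path dst) throws IOException {
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Files.newOutputStream(dst)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    // Copy with FileChannel.transferTo, which lets the OS move the bytes
    // when it can. transferTo may move fewer bytes than requested, so loop.
    static void copyTransferTo(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE,
                     StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long pos = 0;
            long size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
    }

    // Report wall clock time and CPU time (user + system) of this thread.
    static void report(String label, Copy copy, Path src, Path dst) throws IOException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = mx.isCurrentThreadCpuTimeSupported();
        long wallStart = System.nanoTime();
        long cpuStart = cpuSupported ? mx.getCurrentThreadCpuTime() : 0;
        copy.run(src, dst);
        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = cpuSupported ? (mx.getCurrentThreadCpuTime() - cpuStart) / 1_000_000 : -1;
        System.out.println(label + ": wall=" + wallMs + " ms, cpu=" + cpuMs + " ms");
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("copy-src", ".bin");
        Path dst = Files.createTempFile("copy-dst", ".bin");
        try {
            Files.write(src, new byte[8 * 1024 * 1024]); // 8 MB test file
            report("buffered", CopyComparison::copyBuffered, src, dst);
            report("transferTo", CopyComparison::copyTransferTo, src, dst);
        } finally {
            Files.deleteIfExists(src);
            Files.deleteIfExists(dst);
        }
    }
}
```

A small file that sits in the page cache will mostly show the CPU difference;
a disk-bound run is where wall clock time and CPU time are expected to
diverge, which is why it matters which of the two the benchmark reports.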

> Use nio.transferTo when copying large blocks of bytes
> -
>
> Key: LUCENE-1121
> URL: https://issues.apache.org/jira/browse/LUCENE-1121
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1121.patch, LUCENE-1121.patch, testIO.java
>
>
> When building a CFS file, and also when merging stored fields (and
> term vectors, with LUCENE-1120), we copy large blocks of bytes at
> once.
> We currently do this with an intermediate buffer.
> But, nio.transferTo should be somewhat faster on OS's that offer low
> level IO APIs for moving blocks of bytes between files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store

2008-04-04 Thread Karl Wettin (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585700#action_12585700 ]

Karl Wettin commented on LUCENE-1039:
-------------------------------------

Cuong Hoang - 03/Apr/08 06:28 PM
>>Each document must only contain one token in the class field
>Does that mean each document in the training set can only belong to one class?

You can have multiple class fields, but you can only classify an instance into
one class at a time. Currently the class field and the classes buffer are set
in Instances. I think it should be possible to move that code to
NaiveBayesClassifier to allow classification on multiple classes on the same
Instances.

Instances.java:
{code:java}
  private String classField;
  private String[] classes;
{code}

>I try to run the test case but get NullPointerException:

> at org.apache.lucene.classifier.BayesianClassifier.classFeatureFrequency(BayesianClassifier.java:92)

The tests pass here. Did you perhaps alter the content in some way?

In BayesianClassifier.java, add the following at line 92:

{code:java}
classDocs.seek(new Term(instances.getClassField(), _class));
+classDocs.next();
while (featureDocs.next()) {
{code}

Does that help?
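
For background on why the extra next() matters: a TermDocs cursor is not
positioned on a document after seek(); doc() is only valid once next() has
returned true. The toy below shows the same pattern without any Lucene
dependency. DocCursor and countShared are hypothetical stand-ins made up for
this mail; they are not the code in BayesianClassifier, and the toy fails
with an ArrayIndexOutOfBoundsException rather than the NullPointerException
reported above.

```java
public class CursorDemo {

    /** Hypothetical stand-in for a postings cursor over sorted doc ids. */
    static final class DocCursor {
        private final int[] docs;
        private int pos = -1; // before the first entry until next() is called

        DocCursor(int[] sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Advances to the next entry; returns false when exhausted. */
        boolean next() {
            if (pos + 1 >= docs.length) {
                pos = docs.length;
                return false;
            }
            pos++;
            return true;
        }

        /** Only valid after next() has returned true. */
        int doc() {
            return docs[pos];
        }
    }

    /** Counts the doc ids that occur in both cursors. */
    static int countShared(DocCursor classDocs, DocCursor featureDocs) {
        int count = 0;
        // Position the class cursor on its first entry before reading doc().
        // Without this call, classDocs.doc() below would fail.
        boolean hasClass = classDocs.next();
        while (hasClass && featureDocs.next()) {
            // Let the class cursor catch up with the feature cursor.
            while (hasClass && classDocs.doc() < featureDocs.doc()) {
                hasClass = classDocs.next();
            }
            if (hasClass && classDocs.doc() == featureDocs.doc()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        DocCursor classDocs = new DocCursor(new int[] {1, 3, 5, 7});
        DocCursor featureDocs = new DocCursor(new int[] {2, 3, 7, 9});
        System.out.println(countShared(classDocs, featureDocs)); // prints 2
    }
}
```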

> Bayesian classifiers using Lucene as data store
> ---
>
> Key: LUCENE-1039
> URL: https://issues.apache.org/jira/browse/LUCENE-1039
> Project: Lucene - Java
> Issue Type: New Feature
> Reporter: Karl Wettin
> Priority: Minor
> Attachments: LUCENE-1039.txt
>
>
> Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and 
> Fisher method algorithms as described by Toby Segaran in "Programming 
> Collective Intelligence", ISBN 978-0-596-52932-1. 
> Have fun.
> Poor java docs, but the TestCase shows how to use it:
> {code:java}
> public class TestClassifier extends TestCase {
>   public void test() throws Exception {
>     InstanceFactory instanceFactory = new InstanceFactory() {
>       public Document factory(String text, String _class) {
>         Document doc = new Document();
>         doc.add(new Field("class", _class, Field.Store.YES, Field.Index.NO_NORMS));
>         doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, Field.TermVector.NO));
>         doc.add(new Field("text/ngrams/start", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
>         doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
>         doc.add(new Field("text/ngrams/end", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
>         return doc;
>       }
>       Analyzer analyzer = new Analyzer() {
>         private int minGram = 2;
>         private int maxGram = 3;
>         public TokenStream tokenStream(String fieldName, Reader reader) {
>           TokenStream ts = new StandardTokenizer(reader);
>           ts = new LowerCaseFilter(ts);
>           if (fieldName.endsWith("/ngrams/start")) {
>             ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram);
>           } else if (fieldName.endsWith("/ngrams/inner")) {
>             ts = new NGramTokenFilter(ts, minGram, maxGram);
>           } else if (fieldName.endsWith("/ngrams/end")) {
>             ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, minGram, maxGram);
>           }
>           return ts;
>         }
>       };
>       public Analyzer getAnalyzer() {
>         return analyzer;
>       }
>     };
>     Directory dir = new RAMDirectory();
>     new IndexWriter(dir, null, true).close();
>     Instances instances = new Instances(dir, instanceFactory, "class");
>     instances.addInstance("hello world", "en");
>     instances.addInstance("hallå världen", "sv");
>     instances.addInstance("this is london calling", "en");
>     instances.addInstance("detta är london som ringer", "sv");
>     instances.addInstance("john has a long mustache", "en");
>     instances.addInstance("john har en lång mustache", "sv");
>     instances.addInstance("all work and no play makes jack a dull boy", "en");
>     instances.addInstance("att bara arbeta och aldrig leka gör jack en trist gosse", "sv");
>     instances.addInstance("shrimp sandwich", "en");
>     instances.addInstance("räksmörgås", "sv");
>     instances.addInstance("it's now or never", "en");
>     instances.addInstance("det är nu eller aldrig", "sv");
>     instances.addInstance("to tie up at a landing-stage", "en");
>     instances.addInstance("att angöra en brygga", "sv");
>     instances.addInstance("it's now time for the children's television shows", "en");
>     instances.addInstance("nu är det dags för barnprogram", "sv");
>     instances.flush();
>     testClassifier(instances, new NaiveBayesClassifier());
>     testClassifier(instances, new