Hi,
The sequence file should use Text keys and VectorWritable values.
Suppose you have test documents named 1, 2, 3 and 4.
Then the sequence file can have entries such as:
    Key: /test/1  Value: <vectors1>
    Key: /test/2  Value: <vectors2>
See this line in BayesTestMapper:
    // the key is the expected value
    context.write(new Text(SLASH.split(key.toString())[1]), new VectorWritable(result));
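To make the key handling concrete, here is a minimal sketch in plain Java (no Mahout classes) of what splitting a key such as /test/1 on "/" yields. The class name KeySplitSketch and the helper parts() are hypothetical; the SLASH pattern is assumed to be a plain "/" regex, as the name in the mapper line suggests.

```java
import java.util.regex.Pattern;

class KeySplitSketch {
    // Assumption: SLASH in the mapper is a pattern matching a single "/".
    private static final Pattern SLASH = Pattern.compile("/");

    // Hypothetical helper: returns the components of a key such as "/test/1".
    static String[] parts(String key) {
        return SLASH.split(key);
    }

    public static void main(String[] args) {
        String[] p = parts("/test/1");
        // p[0] is the empty string in front of the leading slash.
        System.out.println("index 1 (label slot): " + p[1]);
        System.out.println("index 2 (document name): " + p[2]);
    }
}
```

With the key /test/1, index 1 holds the placeholder "test" and index 2 holds the document name "1", so the placeholder only needs to be present for the split to succeed.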
TestNaiveBayesDriver.java might also help you. If you remove the following
part from that code, you will not get a confusion matrix, and the initial
labels are not required:
    if (bestIdx != Integer.MIN_VALUE) {
      ClassifierResult classifierResult =
          new ClassifierResult(labelMap.get(bestIdx), bestScore);
      analyzer.addInstance(pair.getFirst().toString(), classifierResult);
    }
Your output file will contain your document name (for example 1) and the
label vector with its values.
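As a rough illustration of reading that label vector, the sketch below picks the index with the highest score and looks up its label. It uses a plain double[] instead of a Mahout vector, and the class name BestLabelSketch, the label names and the scores are all made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class BestLabelSketch {
    // Returns the index of the highest score, or Integer.MIN_VALUE if there are no scores.
    static int bestIndex(double[] scores) {
        int bestIdx = Integer.MIN_VALUE;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        return bestIdx;
    }

    public static void main(String[] args) {
        // Hypothetical mapping from label index to label name.
        Map<Integer, String> labelMap = new HashMap<>();
        labelMap.put(0, "negative");
        labelMap.put(1, "neutral");
        labelMap.put(2, "positive");

        // Made-up scores for document "1", one entry per label index.
        double[] scores = {-12.5, -3.2, -7.9};

        int bestIdx = bestIndex(scores);
        if (bestIdx != Integer.MIN_VALUE) {
            System.out.println("document 1 -> " + labelMap.get(bestIdx));
        }
    }
}
```

No expected label is needed anywhere in this step, which is why the placeholder in the key is enough for unlabeled data.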
Hope this helps.
Thanks,
Vaibhav
On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi wrote:
> I am using Mahout 0.9. Which part of the source code should I look at?
>
> My problem is that I don't know how the sequence file without the label
> should be structured.
>
> Do you have any hints?
>
> On 29 Jul 2014, at 15:24, vaibhav srivastava wrote:
>
> > Hi,
> > If you want to create a test set and do not want to measure accuracy,
> > you can create an instance of the classifier, load your model into it,
> > and then find the best score.
> > Look at the naive Bayes test code.
> > Hope this helps. Thanks.
> > On 29 Jul 2014 12:53, "Luca Filipponi" wrote:
> >
> >> Hi, I am trying to develop sentiment analysis on Italian tweets from
> >> Twitter using the naive Bayes classifier, but I'm having some trouble.
> >>
> >> My idea was to classify a lot of tweets as positive, negative or neutral,
> >> and use them as the training set for the classifier. To do that I've
> >> written a sequence file in the format <Text,Text>, where the key is
> >> /label/tweetID and the value is the text. The text of the whole
> >> dataset is then converted into TF-IDF vectors using the Mahout utilities.
> >>
> >> Then I'm using the commands
> >>
> >> ./mahout trainnb and ./mahout testnb to check the classifier, and the
> >> score looks right (I get nearly 100% because the test set is the same as
> >> the training set).
> >>
> >> My question is: if I want to use a test set that is unlabeled, how should
> >> it be created? If the key is not in the format
> >>
> >> key = /label/
> >>
> >> the classifier can't find the label and I get an exception.
> >>
> >> But a new dataset will obviously be unlabeled, because I need to
> >> classify it, so I don't know what to put in the key of the sequence file.
> >>
> >> I've searched online for examples, but the only ones I've found
> >> use the split command on the original dataset and then test on part of
> >> it, which isn't my case.
> >>
> >>
> >> Any ideas for developing a better sentiment analysis are welcome. Thanks
> >> in advance for the help.
> >>
> >>
>
>
--
Thanks and Regards,
Vaibhav Srivastava