Good luck. BTW, if you have to manage a lot of documents, I think you need to merge them into a MapFile or SequenceFile on HDFS, stored as (document ID, document content) key-value pairs. Apache Nutch will be helpful. Then you can create an inverted index MR program by editing a few lines of the word-count MR example.
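To make the word-count comparison concrete, here is a rough in-memory sketch in plain Python. It is not Hadoop or Hama code, and the names (map_phase, reduce_phase, build_inverted_index) and the sample documents are made up for illustration. The only real change from word count is that the map step emits (term, document ID) instead of (word, 1), and the reduce step counts per document instead of summing everything.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs; word count would emit (term, 1) here."""
    for token in text.lower().split():
        term = token.strip(".,;:!?()\"'")  # crude tokenizer, an assumption
        if term:
            yield term, doc_id


def reduce_phase(term, doc_ids):
    """Turn all doc IDs seen for one term into a postings list.

    Each posting is (doc_id, term_frequency), sorted by doc_id.
    """
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, sorted(counts.items())


def build_inverted_index(documents):
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_phase(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical input: document ID (e.g. the filename) -> document content,
# i.e. the same key-value layout suggested for the merged file on HDFS.
docs = {
    "doc1.txt": "hama runs bsp tasks",
    "doc2.txt": "bsp tasks exchange messages between bsp peers",
}
index = build_inverted_index(docs)
```

With the document ID stored as the key of each record, the task never needs to ask the framework for the input filename, which sidesteps the original question.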
On Wed, May 22, 2013 at 4:42 PM, Steven van Beelen <[email protected]> wrote:
> For a project I'm trying to implement an Inverted Indexing algorithm, which
> has a 'term' and 'postingslist', in which the postings list consists of a
> 'document id' and 'payload' (in my case term frequency per document).
> I was thinking of inserting multiple different documents and taking the
> filename as documentID, hence the necessity.
> But I've found a way to work around this problem of mine by using different
> input which does not require the filename to be retrievable in a BSP task.
>
> If I will be needing it later on in my project and am working on it, I'll
> let you know.
>
> Thanks for the help thus far!
>
> On Wed, May 22, 2013 at 1:16 AM, Edward J. Yoon <[email protected]> wrote:
>
>> Hi,
>>
>> Short answer is no, we don't provide API for what you are trying to do.
>>
>> However, it can be added easily. See BSPPeerImpl.initInput() method,
>> InputSplit interface and FileSplit classes.
>>
>> Why do you need that function? If there's reasonable necessity, Let's
>> add it together.
>>
>> On Tue, May 21, 2013 at 7:04 PM, Steven van Beelen <[email protected]>
>> wrote:
>> > Hi all,
>> >
>> > The title says it: is there a way to retrieve the filename of the
>> > input/inputsplit a BSP Task is working on? I've been looking for some time
>> > in the docs and source files, but cannot seem to find if one is able to
>> > retrieve the filename/pathname from the input used.
>> >
>> > Cheers
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon

--
Best Regards, Edward J. Yoon
@eddieyoon
