Yes, there is. Each document has a UUID as its identifier. The actual output of my MapReduce job that produces the list of person names looks like this:

docId                                 Name       Type    length  offset
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Lea        PERSON  3       10858
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Lea        PERSON  3       11063
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Ken        PERSON  3       11186
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Marottoli  PERSON  9       11234
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Berkowitz  PERSON  9       17073
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Lea        PERSON  3       17095
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Stephanie  PERSON  9       17330
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Putt       PERSON  4       17340
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Stephanie  PERSON  9       17347
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Stephanie  PERSON  9       17480
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Putt       PERSON  4       17490
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Berkowitz  PERSON  9       19498
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4  Stephanie  PERSON  9       19530

Use the following code to produce a table inside Hive.

DROP TABLE IF EXISTS entities_extract;

CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING, len INT, offset BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/research/45924/hive/entities_extract';

LOAD DATA LOCAL INPATH '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt'
OVERWRITE INTO TABLE entities_extract;

On Feb 3, 2013, at 8:07 PM, John Omernik <j...@omernik.com> wrote:

> Is there something akin to a document ID so we can ensure all rows
> belonging to the same document are sent to one mapper?
>
> On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <icodesh...@gmail.com> wrote:
> Hi John,
>
> Here is some background about my data and what I want as output.
>
> I have 215K documents containing text. From those text files I extract
> the names of persons, organisations and locations using the Stanford NER
> library (see http://nlp.stanford.edu/software/CRF-NER.shtml).
>
> Looking at the following line:
>
> Jan Janssen was on this way to Klaas to sell vehicle Jan Janssen stole from
> his father.
>
> When the classifier is done annotating, the line looks like this:
>
> <PERSON>Jan<PERSON><OFFSET>0<OFFSET> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET>
> was on this way to <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the
> vehicle <PERSON>Jan<PERSON><OFFSET>48<OFFSET>
> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.
>
> When looping through this annotated line you can save the persons and their
> offsets (please note that the offset is a LONG value) inside a Map, for
> example:
>
> MAP<STRING, LONG> entities
>
> Jan, 0
> Janssen, 5
> Klaas, 26
> Jan, 48
> Janssen, 50
>
> Jan Janssen in the line is actually one person, not two. Jan occurs at
> offset 0. To determine whether Janssen belongs to Jan, I could subtract the
> length of Jan (3) + 1 (whitespace) from Janssen's offset (5), and if the
> outcome isn't greater than 1, combine the two persons into one person.
>
> (offset Janssen) - (offset Jan + length of Jan + whitespace) not greater than 1
>
> If this is true, combine the two persons and save the result in a new
> MAP<STRING, LONG[]> like
> Jan Janssen, [ 0 ].
>
> The next time we come across Jan Janssen in the text, just save the
> offset, which produces the following MAP<STRING, LONG[]>:
>
> Jan Janssen, [0, 48]
>
> I hope this clarifies my question.
> If things are still unclear please don't hesitate to ask me to clarify my
> question further.
>
> Kind regards,
> Martijn
>
> On Feb 3, 2013, at 1:05 PM, John Omernik <j...@omernik.com> wrote:
>
>> Well there are some methods that may work, but I'd have to understand your
>> data and your constraints more. You want to be able to (as it sounds) sort
>> by offset, and then look at one row, and then the next row, to determine
>> if the two items should be joined. It "looks" like you are doing a
>> string comparison between numbers ("100" to "104"): there is only one
>> "position" out of three that is different (0 vs 4). Trouble is, look at id
>> 3 and id 4.
>> 150 to 160 is only one position different as well; are you
>> looking for Klaas Jan? Also, are the ID fields filled from the first match?
>> It seems like you have some very odd data here. I don't think you've
>> provided enough information on the data for us to be able to help you.
>>
>> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen <icodesh...@gmail.com>
>> wrote:
>> Hi all,
>>
>> I am new to Apache Hive and I am doing some tests to see if it fits my
>> needs. One of the questions I have is whether it is possible to "peek" at
>> the next row in order to find out if the values should be combined. Let me
>> explain with an example.
>>
>> Let's say my data looks like this:
>>
>> Id  name     offset
>> 1   Jan      100
>> 2   Janssen  104
>> 3   Klaas    150
>> 4   Jan      160
>> 5   Janssen  164
>>
>> And my output to another table should be this:
>>
>> Id  fullname     offsets
>> 1   Jan Janssen  [ 100, 160 ]
>>
>> I would like to combine the name values from two rows where the offsets of
>> the two rows are no more than 1 character apart.
>>
>> Is this type of data manipulation possible, and if it is, could someone
>> point me in the right direction, hopefully with some explanation?
>>
>> Kind regards
>> Martijn
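
For what it's worth, the "peek at the next row" rule discussed in this thread can be sketched outside Hive as a small per-document pass over rows sorted by offset. This is only an illustration in Python under a few assumptions: the rows have the (doc_id, name, type, len, offset) shape of entities_extract, the function name merge_adjacent_persons, the max_gap parameter and the "doc-1" identifier are made up for the example, and at most two consecutive names are joined, as in the Jan Janssen case. The same logic could presumably run in a reducer keyed on doc_id.

```python
from collections import defaultdict
from operator import itemgetter


def merge_adjacent_persons(rows, max_gap=1):
    """Join two consecutive PERSON rows of one document into a full name.

    rows: iterable of (doc_id, name, type, length, offset) tuples.
    Returns {doc_id: {name_or_full_name: [offsets]}}.
    max_gap is a hypothetical knob for the "not greater than 1" rule.
    """
    # Group PERSON rows per document so names never merge across documents.
    by_doc = defaultdict(list)
    for doc_id, name, etype, length, offset in rows:
        if etype == "PERSON":
            by_doc[doc_id].append((offset, length, name))

    result = {}
    for doc_id, entities in by_doc.items():
        entities.sort(key=itemgetter(0))  # process in text order
        names = defaultdict(list)
        i = 0
        while i < len(entities):
            offset, length, name = entities[i]
            if i + 1 < len(entities):
                next_offset, _, next_name = entities[i + 1]
                # Next offset minus (this offset + this length + 1 whitespace).
                gap = next_offset - (offset + length + 1)
                if gap <= max_gap:
                    # Treat both rows as one person; keep the first offset.
                    names[f"{name} {next_name}"].append(offset)
                    i += 2
                    continue
            names[name].append(offset)
            i += 1
        result[doc_id] = dict(names)
    return result


# The five sample rows from the original question; "doc-1" is an assumed id.
rows = [
    ("doc-1", "Jan", "PERSON", 3, 100),
    ("doc-1", "Janssen", "PERSON", 7, 104),
    ("doc-1", "Klaas", "PERSON", 5, 150),
    ("doc-1", "Jan", "PERSON", 3, 160),
    ("doc-1", "Janssen", "PERSON", 7, 164),
]
merged = merge_adjacent_persons(rows)
print(merged)
```

For the five sample rows this gives Jan Janssen with offsets [100, 160] and leaves Klaas on its own at [150], because the gap between Klaas and the following Jan is 4. Names that stay unmerged are kept in the result rather than dropped; filtering them out would be a one-line change if only full names are wanted.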