Hi John,

Here is some background about my data and what I want as output.

I have 215K documents containing text. From those text files I extract the
names of persons, organisations and locations using the Stanford NER library
(see http://nlp.stanford.edu/software/CRF-NER.shtml).

Take the following line:

Jan Janssen was on this way to Klaas to sell the vehicle Jan Janssen stole
from his father.

When the classifier is done annotating, the line looks like this:

<PERSON>Jan<PERSON><OFFSET>0<OFFSET> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET>
was on this way to <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle
<PERSON>Jan<PERSON><OFFSET>48<OFFSET> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET>
stole from his father.

When looping through this annotated line you can save the persons and their
offsets (please note that the offset is a LONG value), inside a Map for
example:

MAP<STRING, LONG> entities

Jan, 0
Janssen, 5
Klaas, 26
Jan, 48
Janssen, 50

Jan Janssen in the line is actually one person, not two. Jan occurs at offset
0. To determine whether Janssen belongs to Jan, I could subtract the length of
Jan (3) + 1 (whitespace) from Janssen's offset (5), and if the outcome isn't
greater than 1, combine the two persons into one person:

(offset Janssen) - (offset Jan + length of Jan + whitespace) not greater than 1

If this is true, combine the two persons and save the result inside a new
MAP<STRING, LONG[]>, like:

Jan Janssen, [ 0 ]

The next time we come across Jan Janssen inside the text, just save the
offset. This produces the following MAP<STRING, LONG[]>:

Jan Janssen, [0, 48]

I hope this clarifies my question. If things are still unclear, please don't
hesitate to ask me to clarify further.

Kind regards,
Martijn

On Feb 3, 2013, at 1:05 PM, John Omernik <[email protected]> wrote:

> Well, there are some methods that may work, but I'd have to understand your
> data and your constraints more. You want to be able to (as it sounds) sort by
> offset, and then look at one row, and then the next row, to determine if
> the two items should be joined.
> It "looks" like you are doing a string comparison between numbers ("100" to
> "104"): only one "position" out of three is different (0 vs 4). Trouble is,
> look at id 3 and id 4. 150 to 160 is only one position different as well; are
> you looking for Klaas Jan? Also, are the ID fields filled from the first
> match? It seems like you have some very odd data here. I don't think you've
> provided enough information on the data for us to be able to help you.
>
> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen <[email protected]>
> wrote:
> Hi all,
>
> I am new to Apache Hive and I am doing some tests to see if it fits my needs.
> One of the questions I have is whether it is possible to "peek" at the next
> row in order to find out if the values should be combined. Let me explain
> with an example.
>
> Let's say my data looks like this:
>
> Id  name     offset
> 1   Jan      100
> 2   Janssen  104
> 3   Klaas    150
> 4   Jan      160
> 5   Janssen  164
>
> And my output to another table should be this:
>
> Id  fullname     offsets
> 1   Jan Janssen  [ 100, 160 ]
>
> I would like to combine the name values from two rows where the offsets of
> the two rows are no more than 1 character apart.
>
> Is this type of data manipulation possible, and if it is, could someone
> point me in the right direction, hopefully with some explanation?
>
> Kind regards
> Martijn
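
For reference, the merge rule described at the top of this message (combine
two PERSON tokens when the second starts no more than 1 character after the
first one ends plus a space) could be sketched in plain Java roughly as
follows. This is only an illustration under assumptions: the names
EntityMerger, Entity, MAX_GAP and merge are hypothetical and not part of
Stanford NER or Hive, the tokens are assumed to arrive in document order, and
names that have no adjacent neighbour (such as Klaas) are assumed to be kept
as they are. The check follows the rule as worded, so it has no lower bound
on the gap.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntityMerger {

    /** One annotated token: its text and the offset reported for it (hypothetical type). */
    static final class Entity {
        final String name;
        final long offset;

        Entity(String name, long offset) {
            this.name = name;
            this.offset = offset;
        }
    }

    /** Largest gap still treated as "adjacent", taken from the rule in the mail. */
    static final long MAX_GAP = 1;

    /**
     * Merges runs of adjacent entities into one full name and collects the
     * offset at which each full name starts. Input must be in document order.
     */
    static Map<String, List<Long>> merge(List<Entity> entities) {
        Map<String, List<Long>> result = new LinkedHashMap<>();
        int i = 0;
        while (i < entities.size()) {
            Entity first = entities.get(i);
            Entity last = first;
            StringBuilder fullName = new StringBuilder(first.name);

            // Extend the name while the next token starts right after the last one.
            while (i + 1 < entities.size()) {
                Entity next = entities.get(i + 1);
                // offset of next - (offset of last + length of last + whitespace)
                long gap = next.offset - (last.offset + last.name.length() + 1);
                if (gap > MAX_GAP) {
                    break;
                }
                fullName.append(' ').append(next.name);
                last = next;
                i++;
            }

            // Record the start offset under the (possibly merged) name.
            result.computeIfAbsent(fullName.toString(), k -> new ArrayList<>())
                  .add(first.offset);
            i++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Offsets copied from the example in the mail.
        List<Entity> entities = List.of(
                new Entity("Jan", 0), new Entity("Janssen", 5),
                new Entity("Klaas", 26),
                new Entity("Jan", 48), new Entity("Janssen", 50));
        System.out.println(merge(entities));
    }
}
```

With the offsets from the example, this prints
{Jan Janssen=[0, 48], Klaas=[26]}. A list of tokens is used as input instead
of a MAP<STRING, LONG>, because a map keyed by name could hold only one
offset for Jan.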
