Hi John,

Here is some background about my data and what I want as output.

I have 215K documents containing text. From those text files I extract names 
of persons, organisations and locations using the Stanford NER library (see 
http://nlp.stanford.edu/software/CRF-NER.shtml).

Looking at the following line:

Jan Janssen was on this way to Klaas to sell the vehicle Jan Janssen stole from 
his father.

When the classifier is done annotating, the line looks like this:

<PERSON>Jan<PERSON><OFFSET>0<OFFSET> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> 
was on this way to <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle 
<PERSON>Jan<PERSON><OFFSET>48<OFFSET> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> 
stole from his father.

When looping through this annotated line you can save each person and its 
offset (please note that the offset is a LONG value) inside a Map, for example:

MAP<STRING, LONG> entities

Jan, 0
Janssen, 5 
Klaas, 26
Jan, 48
Janssen, 50
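
A minimal sketch of that loop, written in Python purely for illustration. The 
regular expression and the name extract_persons are assumptions based on the 
annotated line above, not part of Stanford NER. The pairs go into a list, 
because a plain map keyed on the name would keep only one offset for Jan, 
which occurs twice.

```python
import re

# Assumed tag layout, copied from the annotated example line:
# <PERSON>name<PERSON><OFFSET>number<OFFSET>
PERSON_PATTERN = re.compile(r"<PERSON>([^<]+)<PERSON><OFFSET>(\d+)<OFFSET>")


def extract_persons(annotated_line):
    """Return (name, offset) pairs in the order they appear in the line."""
    return [(name, int(offset))
            for name, offset in PERSON_PATTERN.findall(annotated_line)]


line = (
    "<PERSON>Jan<PERSON><OFFSET>0<OFFSET> "
    "<PERSON>Janssen<PERSON><OFFSET>5<OFFSET> "
    "was on this way to <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> "
    "to sell the vehicle "
    "<PERSON>Jan<PERSON><OFFSET>48<OFFSET> "
    "<PERSON>Janssen<PERSON><OFFSET>50<OFFSET> "
    "stole from his father."
)

persons = extract_persons(line)
```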

Jan Janssen in the line is actually one person and not two. Jan occurs at 
offset 0. To determine whether Janssen belongs to Jan, I could subtract the 
offset of Jan (0) plus the length of Jan (3) plus 1 (whitespace) from Janssen's 
offset (5), and if the outcome isn't greater than 1, combine the two persons 
into one person.

(offset Janssen) - (offset Jan + length Jan + whitespace) not greater than 1

If this is true then combine the two persons and save the result inside a new 
MAP<STRING, LONG[]>, like
Jan Janssen, [ 0 ].

The next time we come across Jan Janssen inside the text, we just save the 
offset, which produces the following MAP<STRING, LONG[]>:

Jan Janssen, [0, 48] 
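
The whole idea could be sketched roughly as follows, again in Python just for 
illustration. The function name merge_adjacent and the max_gap parameter are 
made up for this example, and the input is assumed to be sorted by offset. 
Note that a single name such as Klaas also ends up in the result, so it would 
have to be filtered out if only combined names are wanted.

```python
def merge_adjacent(persons, max_gap=1):
    """Join consecutive name parts and collect the start offset of each name.

    persons: (name, offset) pairs sorted by offset.
    A part is joined to the previous part when
    offset - (previous offset + previous length + 1 whitespace) <= max_gap.
    """
    groups = []  # each group: [list of parts, start offset, offset of last part]
    for name, offset in persons:
        if groups:
            parts, _, last_offset = groups[-1]
            gap = offset - (last_offset + len(parts[-1]) + 1)
            if gap <= max_gap:
                parts.append(name)       # same person, extend the full name
                groups[-1][2] = offset   # remember where the last part starts
                continue
        groups.append([[name], offset, offset])

    # Collect the start offset of every occurrence under the full name.
    offsets_by_name = {}
    for parts, start, _ in groups:
        offsets_by_name.setdefault(" ".join(parts), []).append(start)
    return offsets_by_name


persons = [("Jan", 0), ("Janssen", 5), ("Klaas", 26), ("Jan", 48), ("Janssen", 50)]
result = merge_adjacent(persons)
```

With the offsets from my example, result maps Jan Janssen to [0, 48] and Klaas 
to [26].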

I hope this clarifies my question. 
If things are still unclear please don't hesitate to ask me to clarify my 
question further.

Kind regards,
Martijn

On Feb 3, 2013, at 1:05 PM, John Omernik <[email protected]> wrote:

> Well there are some methods that may work, but I'd have to understand your 
> data and your constraints more. You want to be able to (as it sounds) sort by 
> offset, and then look at one row, and then the next row, to determine if 
> the two items should be joined. It "looks" like you are doing a string 
> comparison between numbers ("100" to "104": there is only one "position" out 
> of three that is different (0 vs 4)). Trouble is, look at id 3 and id 4. 150 
> to 160 is only one position different as well; are you looking for Klaas Jan? 
> Also, are the ID fields filled from the first match? It seems like you have 
> some very odd data here. I don't think you've provided enough information on 
> the data for us to be able to help you. 
> 
> 
> 
> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen <[email protected]> 
> wrote:
> Hi all,
> 
> I am new to Apache Hive and I am doing some tests to see if it fits my needs. 
> One of the questions I have is whether it is possible to "peek" at the next 
> row in order to find out if the values should be combined. Let me explain 
> with an example.
> 
> Let's say my data looks like this:
> 
> Id name offset
> 1 Jan 100
> 2 Janssen 104
> 3 Klaas 150
> 4 Jan 160
> 5 Janssen 164
> 
> And my output to another table should be this:
> 
> Id fullname offsets
> 1 Jan Janssen [ 100, 160 ]
> 
> I would like to combine the name values from two rows where the offsets of 
> the two rows are no more than 1 character apart.
> 
> Is this type of data manipulation possible, and if it is, could someone 
> point me in the right direction, hopefully with some explanation?
> 
> Kind regards
> Martijn
> 
