Yes, there is. Each document has a UUID as its identifier. The actual output of
my MapReduce job that produces the list of person names looks like this:

docId                                    Name     Type     length     offset
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     10858
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     11063
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Ken     PERSON     3     11186
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Marottoli     PERSON     9     11234
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz     PERSON     9     17073
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     17095
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17330
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt     PERSON     4     17340
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17347
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17480
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt     PERSON     4     17490
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz     PERSON     9     19498
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     19530
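
A minimal sketch of how rows like these could be grouped per document ID and
ordered by offset, so that neighbouring names end up next to each other. It is
plain Python run outside Hive, not part of the MapReduce job; the function name
group_by_doc is hypothetical, and it assumes the file is tab-separated with the
five columns shown above.

```python
from collections import defaultdict


def group_by_doc(lines):
    """Group tab-separated rows (docId, name, type, length, offset) by docId.

    Hypothetical helper: assumes exactly five tab-separated fields per row.
    """
    docs = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        doc_id, name, etype, length, offset = line.split("\t")
        docs[doc_id].append((name, etype, int(length), int(offset)))
    # Sort each document's entities by offset so neighbours are adjacent.
    for rows in docs.values():
        rows.sort(key=lambda r: r[3])
    return dict(docs)
```

Each value is then a list of (name, type, length, offset) tuples for a single
document, which is the unit a "peek at the next row" step would work on.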

Use the following statements to create the table in Hive and load the data:

DROP TABLE IF EXISTS entities_extract;

CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING,
                               len INT, offset BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/research/45924/hive/entities_extract';

LOAD DATA LOCAL INPATH
'/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt'
OVERWRITE INTO TABLE entities_extract;
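
To make the combining rule from the quoted mail below concrete, here is a
rough sketch in plain Python (outside Hive). It is only one possible reading
of the rule: the function name combine_adjacent and the max_gap parameter are
hypothetical, and it assumes a name belongs to the previous one when the gap
between the end of the previous name (offset + length) and the start of the
next is at most one character.

```python
def combine_adjacent(rows, max_gap=1):
    """Merge adjacent names of ONE document into full names.

    rows: iterable of (name, length, offset) tuples.
    Returns a dict mapping full name -> list of start offsets.
    Assumption: two names are adjacent when
    0 <= next_offset - (prev_offset + prev_length) <= max_gap.
    """
    result = {}
    current = None  # [full_name, start_offset, end_offset]
    for name, length, offset in sorted(rows, key=lambda r: r[2]):
        if current is not None and 0 <= offset - current[2] <= max_gap:
            # Close enough to the previous name: extend the current full name.
            current[0] += " " + name
            current[2] = offset + length
        else:
            # Gap too large: store the finished name and start a new one.
            if current is not None:
                result.setdefault(current[0], []).append(current[1])
            current = [name, offset, offset + length]
    if current is not None:
        result.setdefault(current[0], []).append(current[1])
    return result


rows = [("Stephanie", 9, 17330), ("Putt", 4, 17340), ("Stephanie", 9, 17347),
        ("Stephanie", 9, 17480), ("Putt", 4, 17490), ("Berkowitz", 9, 19498)]
print(combine_adjacent(rows))
```

With the sample rows above, "Stephanie" at 17330 ends at 17339 and "Putt"
starts at 17340, so they are combined, while "Stephanie" at 17347 stays on its
own. Because the end offset is extended on every merge, names of three or more
parts would be chained as well.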



On Feb 3, 2013, at 8:07 PM, John Omernik <j...@omernik.com> wrote:

> Is there something akin to a document ID, so we can ensure that all rows
> belonging to the same document are sent to one mapper?
> 
> On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <icodesh...@gmail.com> wrote:
> Hi John,
> 
> Here is some background about my data and what I want as output.
> 
> I have 215K documents containing text. From those text files I extract the
> names of persons, organisations and locations using the Stanford NER
> library (see http://nlp.stanford.edu/software/CRF-NER.shtml).
> 
> Looking at the following line:
> 
> Jan Janssen was on this way to Klaas to sell the vehicle Jan Janssen stole
> from his father.
> 
> When the classifier is done annotating, the line looks like this:
> 
> <PERSON>Jan<PERSON><OFFSET>0<OFFSET> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> 
> was on this way to <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the 
> vehicle <PERSON>Jan<PERSON><OFFSET>48<OFFSET> 
> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.
> 
> When looping through this annotated line you can save each person and its
> offset (please note that the offset is a LONG value) inside a Map, for example:
> 
> MAP<STRING, LONG> entities
> 
> Jan, 0
> Janssen, 5 
> Klaas, 26
> Jan, 48
> Janssen, 50
> 
> Jan Janssen in the line is actually one person, not two. Jan occurs at
> offset 0. To determine whether Janssen belongs to Jan, I could subtract the
> offset of Jan (0) plus its length (3) plus 1 (whitespace) from Janssen's
> offset (5); if the outcome isn't greater than 1, combine the two persons
> into one person.
> 
> (offset Janssen) - (offset Jan + length Jan + whitespace) not greater than 1
> 
> If this is true, then combine the two persons and save the result inside a
> new MAP<STRING, LONG[]>, like
> Jan Janssen, [ 0 ].
> 
> The next time we come across Jan Janssen inside the text, we just save the
> offset, which produces the following MAP<STRING, LONG[]>:
> 
> Jan Janssen, [0, 48] 
> 
> I hope this clarifies my question. 
> If things are still unclear please don't hesitate to ask me to clarify my 
> question further.
> 
> Kind regards,
> Martijn
> 
> On Feb 3, 2013, at 1:05 PM, John Omernik <j...@omernik.com> wrote:
> 
>> Well, there are some methods that may work, but I'd have to understand your
>> data and your constraints more. You want to be able (as it sounds) to sort
>> by offset, and then look at one row, and then the next row, to determine
>> if the two items should be joined. It "looks" like you are doing a
>> string comparison between numbers ("100" to "104": there is only one
>> "position" out of three that is different, 0 vs 4). Trouble is, look at id
>> 3 and id 4. 150 to 160 is only one position different as well; are you
>> looking for Klaas Jan? Also, are the ID fields filled from the first match?
>> It seems like you have some very odd data here. I don't think you've
>> provided enough information on the data for us to be able to help you.
>> 
>> 
>> 
>> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen <icodesh...@gmail.com> 
>> wrote:
>> Hi all,
>> 
>> I am new to Apache Hive and I am doing some tests to see if it fits my
>> needs. One of the questions I have is whether it is possible to "peek" at
>> the next row in order to find out if the values should be combined. Let me
>> explain with an example.
>> 
>> Let's say my data looks like this:
>> 
>> Id name offset
>> 1 Jan 100
>> 2 Janssen 104
>> 3 Klaas 150
>> 4 Jan 160
>> 5 Janssen 164
>> 
>> And my output to another table should be this:
>> 
>> Id fullname offsets
>> 1 Jan Janssen [ 100, 160 ]
>> 
>> I would like to combine the name values from two rows where the offsets of
>> the two rows are no more than 1 character apart.
>> 
>> Is this type of data manipulation possible, and if it is, could someone
>> point me in the right direction, hopefully with some explanation?
>> 
>> Kind regards
>> Martijn
>> 
> 
