Thank you guys! I will have a look at this.

Kind regards,
Martijn

On Feb 3, 2013, at 8:36 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> You may want to look at SORT BY, DISTRIBUTE BY, and CLUSTER BY. This
> syntax controls which reducers the data ends up on and how it is sorted
> on each reducer.
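
To make the effect of these clauses concrete, here is a small Python model (not Hive code) of what `DISTRIBUTE BY doc_id SORT BY doc_id, offset` does conceptually: every row with the same key goes to the same reducer, and each reducer sorts only its own rows. The function name `distribute_and_sort`, the reducer count, the made-up document ids, and the use of CRC32 as the hash are all assumptions for this sketch; Hive uses its own hash function.

```python
import zlib
from collections import defaultdict


def distribute_and_sort(rows, num_reducers=2):
    """Model DISTRIBUTE BY doc_id SORT BY doc_id, offset.

    rows: list of (doc_id, name, offset) tuples.
    Returns {reducer_index: rows assigned to that reducer, sorted}.
    """
    reducers = defaultdict(list)
    for row in rows:
        doc_id = row[0]
        # Stand-in hash: CRC32 is deterministic across runs, unlike hash().
        bucket = zlib.crc32(doc_id.encode("utf-8")) % num_reducers
        reducers[bucket].append(row)
    # Each reducer sorts only its own rows; there is no global order.
    return {
        bucket: sorted(bucket_rows, key=lambda r: (r[0], r[2]))
        for bucket, bucket_rows in reducers.items()
    }


# Hypothetical sample rows; "doc-a" and "doc-b" are made-up document ids.
rows = [
    ("doc-a", "Janssen", 104),
    ("doc-b", "Ken", 7),
    ("doc-a", "Jan", 100),
    ("doc-b", "Lea", 1),
]
for bucket, bucket_rows in sorted(distribute_and_sort(rows).items()):
    print(bucket, bucket_rows)
```

`CLUSTER BY doc_id` is shorthand for `DISTRIBUTE BY doc_id SORT BY doc_id`. Once all rows of one document sit on the same reducer in offset order, a row-to-row comparison can run there without seeing any other document's rows.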
> 
> On Sun, Feb 3, 2013 at 2:27 PM, Martijn van Leeuwen
> <icodesh...@gmail.com> wrote:
>> Yes, there is. Each document has a UUID as its identifier. The actual output
>> of my MapReduce job that produces the list of person names looks like this:
>> 
>> docId     name     type     length     offset
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     10858
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     11063
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Ken     PERSON     3     11186
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Marottoli     PERSON     9     11234
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz     PERSON     9     17073
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea     PERSON     3     17095
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17330
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt     PERSON     4     17340
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17347
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     17480
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt     PERSON     4     17490
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz     PERSON     9     19498
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie     PERSON     9     19530
>> 
>> I use the following code to create the table in Hive:
>> 
>> DROP TABLE IF EXISTS entities_extract;
>> 
>>    CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING,
>> len INT, offset BIGINT)
>>    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>    LINES TERMINATED BY '\n'
>>    STORED AS TEXTFILE
>>    LOCATION '/research/45924/hive/entities_extract';
>> 
>> LOAD DATA LOCAL INPATH
>> '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt'
>> OVERWRITE INTO TABLE entities_extract;
>> 
>> 
>> 
>> On Feb 3, 2013, at 8:07 PM, John Omernik <j...@omernik.com> wrote:
>> 
>> Is there something akin to a document ID, so we can ensure all rows
>> belonging to the same document are sent to one mapper?
>> 
>> On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <icodesh...@gmail.com> wrote:
>>> 
>>> Hi John,
>>> 
>>> Here is some background about my data and what I want as output.
>>> 
>>> I have 215K documents containing text. From those text files I extract
>>> names of persons, organisations and locations using the Stanford NER
>>> library (see http://nlp.stanford.edu/software/CRF-NER.shtml).
>>> 
>>> Looking at the following line:
>>> 
>>> Jan Janssen was on his way to Klaas to sell the vehicle Jan Janssen stole
>>> from his father.
>>> 
>>> When the classifier is done annotating, the line looks like this:
>>> 
>>> <PERSON>Jan<PERSON><OFFSET>0<OFFSET>
>>> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> was on his way to
>>> <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle
>>> <PERSON>Jan<PERSON><OFFSET>48<OFFSET>
>>> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.
>>> 
>>> When looping through this annotated line you can save the persons and their
>>> offsets (note that an offset is a LONG value) inside a map, for example:
>>> 
>>> MAP<STRING, LONG> entities
>>> 
>>> Jan, 0
>>> Janssen, 5
>>> Klaas, 26
>>> Jan, 48
>>> Janssen, 50
>>> 
>>> Jan Janssen in the line is actually one person and not two. Jan occurs
>>> at offset 0. To determine whether Janssen belongs to Jan, I could subtract
>>> Jan's offset (0) plus the length of Jan (3) + 1 (whitespace) from Janssen's
>>> offset (5), and if the outcome isn't greater than 1, combine the two
>>> persons into one person.
>>> 
>>> (offset Janssen) - (offset Jan + length Jan + whitespace) not greater than 1
>>> 
>>> If this is true, then combine the two persons and save the result inside a
>>> new MAP<STRING, LONG[]>, like:
>>> Jan Janssen, [0]
>>> 
>>> The next time we come across Jan Janssen inside the text, we just save
>>> the offset, which produces the following MAP<STRING, LONG[]>:
>>> 
>>> Jan Janssen, [0, 48]
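
The merge rule described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `merge_adjacent_names` and the `max_gap` parameter are made up for this example, the (name, offset) pairs are the ones from the map above, and the gap is computed as the next offset minus (previous offset + previous length + 1 for the whitespace) so that the rule also works for names that do not start at offset 0.

```python
from collections import defaultdict


def merge_adjacent_names(entities, max_gap=1):
    """Merge consecutive name tokens that sit next to each other in the text.

    entities: list of (name, offset) pairs, sorted by position in the text.
    Returns a dict mapping each (possibly merged) name to its start offsets.
    merge_adjacent_names and max_gap are hypothetical names for this sketch.
    """
    merged = defaultdict(list)
    current_name = None   # name being built up, e.g. "Jan" -> "Jan Janssen"
    current_start = None  # offset where the current name starts
    prev_end = None       # previous offset + previous length + 1 whitespace

    for name, offset in entities:
        if current_name is not None and offset - prev_end <= max_gap:
            # The token follows the previous one closely: same person.
            current_name += " " + name
        else:
            # Gap too large: store the finished name and start a new one.
            if current_name is not None:
                merged[current_name].append(current_start)
            current_name, current_start = name, offset
        prev_end = offset + len(name) + 1

    # Store the last name that was still being built.
    if current_name is not None:
        merged[current_name].append(current_start)
    return dict(merged)


entities = [("Jan", 0), ("Janssen", 5), ("Klaas", 26), ("Jan", 48), ("Janssen", 50)]
result = merge_adjacent_names(entities)
print(result)
```

Note that the sketch follows the rule exactly as written ("not greater than 1"), so a negative gap, as between offsets 48 and 50 in the sample, also counts as a match. Names that are never merged, such as Klaas, are kept with their own offset list.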
>>> 
>>> I hope this clarifies my question.
>>> If things are still unclear please don't hesitate to ask me to clarify my
>>> question further.
>>> 
>>> Kind regards,
>>> Martijn
>>> 
>>> On Feb 3, 2013, at 1:05 PM, John Omernik <j...@omernik.com> wrote:
>>> 
>>> Well, there are some methods that may work, but I'd have to understand your
>>> data and your constraints more. You want to be able to (as it sounds) sort
>>> by offset, and then look at one row, and then the next row, to determine
>>> if the two items should be joined. It "looks" like you are doing a
>>> string comparison between numbers: from "100" to "104" there is only one
>>> "position" out of three that is different (0 vs 4). Trouble is, look at id
>>> 3 and id 4. 150 to 160 is only one position different as well; are you
>>> looking for Klaas Jan? Also, is the ID field filled from the first match?
>>> It seems like you have some very odd data here. I don't think you've
>>> provided enough information on the data for us to be able to help you.
>>> 
>>> 
>>> 
>>> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen <icodesh...@gmail.com>
>>> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> I am new to Apache Hive and I am doing some tests to see if it fits my
>>>> needs. One of the questions I have is whether it is possible to "peek" at
>>>> the next row in order to find out if the values should be combined. Let me
>>>> explain with an example.
>>>> 
>>>> Let's say my data looks like this:
>>>> 
>>>> Id name offset
>>>> 1 Jan 100
>>>> 2 Janssen 104
>>>> 3 Klaas 150
>>>> 4 Jan 160
>>>> 5 Janssen 164
>>>> 
>>>> And my output to another table should be this:
>>>> 
>>>> Id fullname offsets
>>>> 1 Jan Janssen [ 100, 160 ]
>>>> 
>>>> I would like to combine the name values from two rows where the end of
>>>> the first name and the offset of the second are no more than 1 character
>>>> apart.
>>>> 
>>>> Is this type of data manipulation possible, and if it is, could someone
>>>> point me in the right direction, hopefully with some explanation?
>>>> 
>>>> Kind regards
>>>> Martijn
>>> 
>>> 
>>> 
>> 
