Thank you guys! I will have a look at this.

Kind regards,
Martijn

On Feb 3, 2013, at 8:36 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> You may want to look at sort by, distribute by, and cluster by. This
> syntax controls which reducers the data end up on and how it is sorted
> on each reducer.
>
> On Sun, Feb 3, 2013 at 2:27 PM, Martijn van Leeuwen
> <icodesh...@gmail.com> wrote:
>> Yes, there is. Each document has a UUID as its identifier. The actual
>> output of my map reduce job that produces the list of person names
>> looks like this:
>>
>> docId                                  name       type    length  offset
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Lea        PERSON  3       10858
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Lea        PERSON  3       11063
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Ken        PERSON  3       11186
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Marottoli  PERSON  9       11234
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Berkowitz  PERSON  9       17073
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Lea        PERSON  3       17095
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Stephanie  PERSON  9       17330
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Putt       PERSON  4       17340
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Stephanie  PERSON  9       17347
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Stephanie  PERSON  9       17480
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Putt       PERSON  4       17490
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Berkowitz  PERSON  9       19498
>> f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Stephanie  PERSON  9       19530
>>
>> I use the following code to produce a table inside Hive:
>>
>> DROP TABLE IF EXISTS entities_extract;
>>
>> CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING,
>>   len INT, offset BIGINT)
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>> LINES TERMINATED BY '\n'
>> STORED AS TEXTFILE
>> LOCATION '/research/45924/hive/entities_extract';
>>
>> LOAD DATA LOCAL INPATH
>> '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt'
>> OVERWRITE INTO TABLE entities_extract;
>>
>>
>> On Feb 3, 2013, at 8:07 PM, John Omernik <j...@omernik.com> wrote:
>>
>> Is there something akin to a document id so we can assure all rows
>> belonging to the same document can be sent to one mapper?
>>
>> On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <icodesh...@gmail.com> wrote:
>>>
>>> Hi John,
>>>
>>> Here is some background about my data and what I want as output.
>>>
>>> I have 215K documents containing text. From those text files I extract
>>> names of persons, organisations and locations using the Stanford NER
>>> library (see http://nlp.stanford.edu/software/CRF-NER.shtml).
>>>
>>> Looking at the following line:
>>>
>>> Jan Janssen was on his way to Klaas to sell the vehicle Jan Janssen
>>> stole from his father.
>>>
>>> When the classifier is done annotating, the line looks like this:
>>>
>>> <PERSON>Jan<PERSON><OFFSET>0<OFFSET>
>>> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> was on his way to
>>> <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle
>>> <PERSON>Jan<PERSON><OFFSET>48<OFFSET>
>>> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.
>>>
>>> When looping through this annotated line you can save the persons and
>>> their offsets (note that offset is a LONG value) inside a map, for
>>> example:
>>>
>>> MAP<STRING, LONG> entities
>>>
>>> Jan, 0
>>> Janssen, 5
>>> Klaas, 26
>>> Jan, 48
>>> Janssen, 50
>>>
>>> Jan Janssen in the line is actually one person and not two. Jan occurs
>>> at offset 0; to determine whether Janssen belongs to Jan, I can subtract
>>> the length of Jan (3) + 1 (whitespace) from Janssen's offset (5), and if
>>> the outcome isn't greater than 1, combine the two persons into one:
>>>
>>> (offset Janssen) - (offset Jan + whitespace) not greater than 1
>>>
>>> If this is true then combine the two persons and save the result inside
>>> a new MAP<STRING, LONG[]> like
>>>
>>> Jan Janssen, [ 0 ]
>>>
>>> The next time we come across Jan Janssen inside the text we just save
>>> its offset, which produces the following MAP<STRING, LONG[]>:
>>>
>>> Jan Janssen, [0, 48]
>>>
>>> I hope this clarifies my question.
>>> If things are still unclear please don't hesitate to ask me to clarify
>>> my question further.
>>>
>>> Kind regards,
>>> Martijn
>>>
>>> On Feb 3, 2013, at 1:05 PM, John Omernik <j...@omernik.com> wrote:
>>>
>>> Well, there are some methods that may work, but I'd have to understand
>>> your data and your constraints more. You want to be able to (as it
>>> sounds) sort by offset, then look at one row and the next row to
>>> determine whether the two items should be joined. It "looks" like you
>>> are doing a string comparison between numbers: from "100" to "104"
>>> there is only one "position" out of three that is different (0 vs 4).
>>> Trouble is, look at id 3 and id 4: 150 to 160 is only one position
>>> different as well; are you looking for Klaas Jan? Also, are the id
>>> fields filled from the first match? It seems like you have some very
>>> odd data here. I don't think you've provided enough information on the
>>> data for us to be able to help you.
>>>
>>> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen
>>> <icodesh...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I am new to Apache Hive and I am doing some tests to see if it fits my
>>>> needs. One of the questions I have is whether it is possible to "peek"
>>>> at the next row in order to find out if the values should be combined.
>>>> Let me explain with an example.
>>>>
>>>> Let's say my data looks like this:
>>>>
>>>> id  name     offset
>>>> 1   Jan      100
>>>> 2   Janssen  104
>>>> 3   Klaas    150
>>>> 4   Jan      160
>>>> 5   Janssen  164
>>>>
>>>> And my output to another table should be this:
>>>>
>>>> id  fullname     offsets
>>>> 1   Jan Janssen  [ 100, 160 ]
>>>>
>>>> I would like to combine the name values from two rows where the
>>>> offsets of the two rows are no more than 1 character apart.
>>>>
>>>> Is this type of data manipulation possible, and if it is, could
>>>> someone point me in the right direction, hopefully with some
>>>> explanation?
>>>>
>>>> Kind regards
>>>> Martijn
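
A minimal sketch of how the merge described in this thread could be
expressed directly in HiveQL against the entities_extract table above,
without peeking at the next row. It precomputes where an adjacent token
would start (offset + len + 1) so the self-join stays a plain equi-join,
then groups the merged names and collects their offsets per document.
The staging table name is made up here, it assumes a Hive version that
provides collect_set(), and the adjacency arithmetic may need adjusting
to match how the offsets were actually produced; treat it as a sketch,
not a tested solution.

-- Sketch only: pair each PERSON token with the token that starts right
-- after it ends, then collect the offsets of the merged full name.
DROP TABLE IF EXISTS person_candidates;
CREATE TABLE person_candidates AS
SELECT doc_id,
       name,
       len,
       offset,
       offset + len + 1 AS next_offset  -- where an adjacent token would start
FROM entities_extract
WHERE type = 'PERSON';

SELECT a.doc_id,
       concat(a.name, ' ', b.name) AS fullname,
       collect_set(a.offset)       AS offsets  -- offsets of the first token
FROM person_candidates a
JOIN person_candidates b
  ON a.doc_id = b.doc_id
 AND a.next_offset = b.offset
GROUP BY a.doc_id, concat(a.name, ' ', b.name);

With the sample rows from the original question, Jan at offset 100 with
length 3 gives next_offset 104, which matches Janssen's offset, so the
join pairs them and the GROUP BY yields "Jan Janssen" with offsets
[100, 160].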
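
And a minimal sketch of the sort by / distribute by / cluster by route
Edward mentions, for the case where a custom streaming script does the
row-by-row peeking instead: DISTRIBUTE BY sends every row of a document
to the same reducer and SORT BY orders the rows by offset there, so the
script can compare each row with the one that follows it. The script
name merge_names.py is a hypothetical placeholder, not something from
this thread.

-- Sketch only: route all rows of one document to a single reducer,
-- sorted by offset, and let a streaming script do the merging.
ADD FILE merge_names.py;

SELECT TRANSFORM (t.doc_id, t.name, t.type, t.len, t.offset)
  USING 'python merge_names.py'
  AS (doc_id STRING, fullname STRING, offsets STRING)
FROM (
  SELECT doc_id, name, type, len, offset
  FROM entities_extract
  DISTRIBUTE BY doc_id        -- all rows of one document on one reducer
  SORT BY doc_id, offset      -- ordered by position within that reducer
) t;

CLUSTER BY doc_id would be shorthand for DISTRIBUTE BY doc_id SORT BY
doc_id, but it cannot add offset as a secondary sort key, which is why
the two separate clauses are used here.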