Re: Long ids from String

Martin Junghanns Tue, 03 Nov 2015 01:58:37 -0800

Hi,

Sounds like RDF problems to me :)


To build an index you could do the following:

triplet := <a,b,c>

(0) build set of all triplets (with strings)
triplets = triplets1.union(triplets2)

(1) assign unique long ids to each vertex
vertices = triplets.flatMap() => [<a>,<c>,...].distinct()
vertexWithID = vertices.zipWithUniqueID() => [<1,a>,<2,c>]

(2) for each triplet dataset

// update source and target identifier in triplet dataset
triplets1.join(vertexWithID)
        .where(0) // <a,b,c>
        .equalTo(1) // <1,a>
        .with(/* replace source string with unique id */)
        .join(vertexWithId)
        .where(2) // <1,b,c>
        .equalTo(1) // <2,c>
        .with(/* replace target string with unique id */)

=> <1,b,2>

(3) store updated triplet sets for later processing

Of course this is a lot of computational effort, but it needs to be done
once and you have an index of your graphs which you can use for further
processing.

If your job contains only one join between the triplet datasets, this is
clearly not an option.

Just an idea :)

Best,
Martin


On 03.11.2015 10:10, Flavio Pompermaier wrote:
> Hi Martin,
> thanks for the suggestion but unfortunately in my use case I have another
> problem: I have to join triplets when f2==f0..is there any way to translate
> also references? i.e. if I have 2 tuples <a,b,c> <c,d,e>, when I apply that
> function I obtain something like <1,<a,b,c>>,<2,<c,d,e>>.
> I'd like to be able to join the 1st tuple with the 2nd (so I should know
> that c=>2).
> Which strategy do you think it could be the best option to achieve that?
> At the moment I was thinking to persist the ids and use a temporary table
> with an autoincrement long id but maybe there's a simpler solution..
> 
> Best,
> Flavio
> 
> On Tue, Nov 3, 2015 at 9:59 AM, Martin Junghanns <m.jungha...@mailbox.org>
> wrote:
> 
>> Hi Flavio,
>>
>> If you just want to assign a unique Long identifier to each element in
>> your dataset, you can use the DataSetUtils.zipWithUniqueId() method [1].
>>
>> Best,
>> Martin
>>
>> [1]
>>
>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L131
>>
>> On 03.11.2015 09:42, Flavio Pompermaier wrote:
>>> Hi to all,
>>>
>>> I was reading the thread about the Neo4j connector and an old question
>> came
>>> to my mind.
>>>
>>> In my Flink job I have Tuples with String ids that I use to join on that
>>> I'd like to convert to long (because Flink should improve quite a lot the
>>> memory usage and the processing time if I'm not wrong).
>>> Is there any recommended way to do that conversion in Flink?
>>>
>>> Best,
>>> Flavio
>>>
>>
> 
> 
>

Re: Long ids from String

Reply via email to