subject:"\[racket\-users\] Best data structure for ordered data set with insertion and reordering\?"

Re: [racket-users] Best data structure for ordered data set with insertion and reordering?

2020-07-17 Thread David Storrs

Thanks George.  Much appreciated.

On Thu, Jul 16, 2020 at 11:21 PM George Neuner  wrote:

>
> Hi David,
>
> On 7/16/2020 11:44 AM, David Storrs wrote:
>
> On Thu, Jul 16, 2020 at 10:09 AM George Neuner 
> wrote:
>
>>
>> The problem seems under-specified.  Can you say more about the real
>> purpose?
>>
>
> Basic version:  It's a peer-to-peer encrypted swarmed file sharing system
> that presents like Dropbox on the front end (i.e. "make a change to the
> filesystem on peer A and peers B-Z will replicate that change") and works
> something like Bittorrent on the back end in that files are sent in chunks
> but it offers functionality that Bittorrent does not, such as encrypted
> transfer, WoT authentication, etc.
>
>
> Interesting.  So I'm guessing your problem is to (compactly) represent the
> state of the shared space.
>
> Do you plan on having index servers, or are you aiming for a fully
> distributed solution?  And, if distributed, do you want each node to
> maintain its own state picture of the shared space, or were you thinking
> that nodes could just snoop admin broadcasts looking for mention of data
> they don't currently have?  [Your question about how to pair / collapse
> messages suggests you might be considering a snoopy solution.]
>
> Asking because keeping a state picture has scalability issues, a snoopy
> solution has complexity issues, and (depending on latency) both have issues
> with performing unnecessary work.  In any event, I have some suggestions.
>
>
>
> Snoopy is the more interesting case.  You start with a queue of file
> operations to be done as gleaned from the admin messages - mkdir, rmdir,
> fetch a file, delete a file, etc. - in whatever order the messages were
> received.
>
> Separately, you maintain a (hash table) mapping from pathnames to a list
> of queue nodes that operate on that object.  The map should use weak
> references so that nodes can safely be removed from the queue and discarded
> without also needing to update the map.  If queue processing gets to some
> operation first, any map reference to it will dissolve (be replaced by
> #f).
>
> When a message is received, you lookup the pathname in the map, and if a
> complementary operation is found in the queue, you remove and discard it.
> [You can also remove references in the map or just let them dissolve
> depending on your handling.]   Then simply discard the message.
>
> Otherwise you queue whatever operation the message indicates and add a
> reference to the queue node under the object's pathname in the map.
>
> Extra complexity comes in having to notice that map entries (pathnames)
> have no operations left in the queue.  Weak references don't just disappear
> - they are changed to #f when the referenced object is no longer reachable
> - however AFAICT there is no hashtable variant that permits weak reference
> values, so you have to use weak-boxes and those continue to exist even if
> the objects they reference are gone.   Useless map entries will need to be
> identified and removed somehow.
>
>
> Modeling the filesystem can be done rather simply with a trie in which
> folders are represented by mutable hash tables and files by structures.
> You can combine this with the operation queue above, but in this case
> lookups can be done in the trie and queue references kept in the trie
> nodes.  And the trie provides a snapshot of the current state which may be
> useful for other purposes.
>
>
> The trick in either case is processing latency: you don't want to wait too
> long, but if you really want to avoid unnecessary work you need to delay
> performing file operations long enough that complementary messages are
> likely to be received.
>
>
>
> What if messages are lost permanently, e.g., due to hardware crash?
>>
>
>> What it you receive a create but a corresponding delete or update is
>> lost - then your information / picture of the file system state is wrong.
>>
>> What if you receive a file delete without a corresponding create? In the
>> absence of other information, can you even assume there *was* a create?
>> If these messages are sent in response to user actions, can they ever be
>> sent mistakenly?
>>
>>
> The ultimate answer to these questions is "If things get out of sync in a
> way that the system cannot resolve, it will be flagged for a human to
> resolve." There are things we do that mitigate them -- for example, a
> write-ahead log for messages received from peers -- but we acknowledge that
> we cannot resolve 100% of situations automatically.  Neither can any other
> file replication service.  (Dropbox, Box.com, etc)
>
> Also relevantly, differences are reconciled across multiple peers.  If
> there's 5 peers in your replication set and the other 4 agree that there
> should be a file at path P but you don't have one then it's safe to assume
> that you missed a File-Create message.  And yes, that comes with issues of
> its own (Q: What if it was deleted on your machine and none of the others
> got you

Re: [racket-users] Best data structure for ordered data set with insertion and reordering?

2020-07-16 Thread George Neuner



Hi David,

On 7/16/2020 11:44 AM, David Storrs wrote:
On Thu, Jul 16, 2020 at 10:09 AM George Neuner > wrote:



The problem seems under-specified.  Can you say more about the
real purpose?


Basic version:  It's a peer-to-peer encrypted swarmed file sharing 
system that presents like Dropbox on the front end (i.e. "make a 
change to the filesystem on peer A and peers B-Z will replicate that 
change") and works something like Bittorrent on the back end in that 
files are sent in chunks but it offers functionality that Bittorrent 
does not, such as encrypted transfer, WoT authentication, etc.


Interesting.  So I'm guessing your problem is to (compactly) represent 
the state of the shared space.


Do you plan on having index servers, or are you aiming for a fully 
distributed solution?  And, if distributed, do you want each node to 
maintain its own state picture of the shared space, or were you thinking 
that nodes could just snoop admin broadcasts looking for mention of data 
they don't currently have?  [Your question about how to pair / collapse 
messages suggests you might be considering a snoopy solution.]


Asking because keeping a state picture has scalability issues, a snoopy 
solution has complexity issues, and (depending on latency) both have 
issues with performing unnecessary work.  In any event, I have some 
suggestions.




Snoopy is the more interesting case.  You start with a queue of file 
operations to be done as gleaned from the admin messages - mkdir, rmdir, 
fetch a file, delete a file, etc. - in whatever order the messages were 
received.


Separately, you maintain a (hash table) mapping from pathnames to a list 
of queue nodes that operate on that object.  The map should use weak 
references so that nodes can safely be removed from the queue and 
discarded without also needing to update the map.  If queue processing 
gets to some operation first, any map reference to it will dissolve (be 
replaced by #f).


When a message is received, you lookup the pathname in the map, and if a 
complementary operation is found in the queue, you remove and discard 
it.  [You can also remove references in the map or just let them 
dissolve depending on your handling.]   Then simply discard the message.


Otherwise you queue whatever operation the message indicates and add a 
reference to the queue node under the object's pathname in the map.


Extra complexity comes in having to notice that map entries (pathnames) 
have no operations left in the queue.  Weak references don't just 
disappear - they are changed to #f when the referenced object is no 
longer reachable - however AFAICT there is no hashtable variant that 
permits weak reference values, so you have to use weak-boxes and those 
continue to exist even if the objects they reference are gone.   Useless 
map entries will need to be identified and removed somehow.



Modeling the filesystem can be done rather simply with a trie in which 
folders are represented by mutable hash tables and files by structures.  
You can combine this with the operation queue above, but in this case 
lookups can be done in the trie and queue references kept in the trie 
nodes.  And the trie provides a snapshot of the current state which may 
be useful for other purposes.



The trick in either case is processing latency: you don't want to wait 
too long, but if you really want to avoid unnecessary work you need to 
delay performing file operations long enough that complementary messages 
are likely to be received.





What if messages are lost permanently, e.g., due to hardware crash?


What it you receive a create but a corresponding delete or update is
lost - then your information / picture of the file system state is
wrong.

What if you receive a file delete without a corresponding create?
In the
absence of other information, can you even assume there *was* a
create?
If these messages are sent in response to user actions, can they
ever be
sent mistakenly?


The ultimate answer to these questions is "If things get out of sync 
in a way that the system cannot resolve, it will be flagged for a 
human to resolve." There are things we do that mitigate them -- for 
example, a write-ahead log for messages received from peers -- but we 
acknowledge that we cannot resolve 100% of situations automatically.  
Neither can any other file replication service.  (Dropbox, Box.com, etc)


Also relevantly, differences are reconciled across multiple peers.  If 
there's 5 peers in your replication set and the other 4 agree that 
there should be a file at path P but you don't have one then it's safe 
to assume that you missed a File-Create message.  And yes, that comes 
with issues of its own (Q: What if it was deleted on your machine and 
none of the others got your File-Delete because you crashed before 
sending it? A: Worst case, the file gets recreated and the user 
deletes it again.  Also, move fi

Re: [racket-users] Best data structure for ordered data set with insertion and reordering?

2020-07-16 Thread David Storrs

On Thu, Jul 16, 2020 at 10:09 AM George Neuner  wrote:

>
> On 7/16/2020 4:29 AM, David Storrs wrote:
>
> The problem seems under-specified.  Can you say more about the real
> purpose?
>

Basic version:  It's a peer-to-peer encrypted swarmed file sharing system
that presents like Dropbox on the front end (i.e. "make a change to the
filesystem on peer A and peers B-Z will replicate that change") and works
something like Bittorrent on the back end in that files are sent in chunks
but it offers functionality that Bittorrent does not, such as encrypted
transfer, WoT authentication, etc.

What if messages are lost permanently, e.g., due to hardware crash?
>

> What it you receive a create but a corresponding delete or update is
> lost - then your information / picture of the file system state is wrong.
>
> What if you receive a file delete without a corresponding create? In the
> absence of other information, can you even assume there *was* a create?
> If these messages are sent in response to user actions, can they ever be
> sent mistakenly?
>
>
The ultimate answer to these questions is "If things get out of sync in a
way that the system cannot resolve, it will be flagged for a human to
resolve." There are things we do that mitigate them -- for example, a
write-ahead log for messages received from peers -- but we acknowledge that
we cannot resolve 100% of situations automatically.  Neither can any other
file replication service.  (Dropbox, Box.com, etc)

Also relevantly, differences are reconciled across multiple peers.  If
there's 5 peers in your replication set and the other 4 agree that there
should be a file at path P but you don't have one then it's safe to assume
that you missed a File-Create message.  And yes, that comes with issues of
its own (Q: What if it was deleted on your machine and none of the others
got your File-Delete because you crashed before sending it? A: Worst case,
the file gets recreated and the user deletes it again.  Also, move files to
a Trash folder in response to a File-Delete, don't actually delete them for
a certain period of time) but again we fall back to human resolution.

>

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/CAE8gKof5ivZSN_nbPFtv0YJKyH8V-SexL1j6hqTeq0DYrY_8Rw%40mail.gmail.com.

Re: [racket-users] Best data structure for ordered data set with insertion and reordering?

2020-07-16 Thread George Neuner




On 7/16/2020 4:29 AM, David Storrs wrote:

tl;dr
Can anyone recommend a data structure that is ordered and supports 
efficient reordering, insertion at arbitrary location, and deletion?


Long form:

I'm working on an operation-log reconciliation problem, where each 
operation is one of:


  File-Create    P H
  File-Update   P H
  File-Delete    P H
  Folder-Create P
  Folder-Delete P

P = path
H = hash of the file (e.g. md5)

Operation messages travel over the network, meaning that they could 
arrive out of order.  (They could also be missed entirely, but that's 
a separate problem that I'm not working on yet.)


Shouldn't be a problem:  "at-least-once" and "at-most-once" semantics 
both are pretty easy.  It's the "exactly-once" that is the problem.  
Hopefully you have no need for that.



Specifically, I want to be able to take a series of messages and 
collapse them where appropriate.  For example:


  File-Update P H1 -> H2
  File-Create P1 H1
Result after collapse:
  '(File-Create P1 H2)

  File-Create P H1
  File-Delete P H1
Result after collapse:
  '()

  File-Delete P X
  File-Create P X
Result after collapse:
  '()
  File-Delete P1 H1
  File-Create P2 H2
  File-Create P1 H1
Result after collapse:
  '(File-Create P2 H2)

I've been mulling over various ways to handle all of this and digging 
around to find examples of other people doing it, but I'm wondering if 
there's a data structure or existing algorithm that will handle it 
cleanly.  Does anyone know of such a thing?


What if messages are lost permanently, e.g., due to hardware crash?

What it you receive a create but a corresponding delete or update is 
lost - then your information / picture of the file system state is wrong.


What if you receive a file delete without a corresponding create? In the 
absence of other information, can you even assume there *was* a create?  
If these messages are sent in response to user actions, can they ever be 
sent mistakenly?


The problem seems under-specified.  Can you say more about the real purpose?


I can think of some ways to keep an index of the file system and track 
changes made to it ... but at best that could provide a snapshot in time 
rather than an ongoing audit log.  And it seems like a log would not be 
terribly useful without all the information.


George

--
You received this message because you are subscribed to the Google Groups "Racket 
Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/e5bfead8-20cc-47b5-3f10-6a568aa04ba9%40comcast.net.

[racket-users] Best data structure for ordered data set with insertion and reordering?

2020-07-16 Thread David Storrs

tl;dr
Can anyone recommend a data structure that is ordered and supports
efficient reordering, insertion at arbitrary location, and deletion?

Long form:

I'm working on an operation-log reconciliation problem, where each
operation is one of:

  File-CreateP H
  File-Update   P H
  File-DeleteP H
  Folder-Create P
  Folder-Delete P

P = path
H = hash of the file (e.g. md5)

Operation messages travel over the network, meaning that they could arrive
out of order.  (They could also be missed entirely, but that's a separate
problem that I'm not working on yet.)

Specifically, I want to be able to take a series of messages and collapse
them where appropriate.  For example:

  File-Update P H1 -> H2
  File-Create P1 H1
Result after collapse:
  '(File-Create P1 H2)

  File-Create P H1
  File-Delete P H1
Result after collapse:
  '()

  File-Delete P X
  File-Create P X
Result after collapse:
  '()

  File-Delete P1 H1
  File-Create P2 H2
  File-Create P1 H1
Result after collapse:
  '(File-Create P2 H2)

I've been mulling over various ways to handle all of this and digging
around to find examples of other people doing it, but I'm wondering if
there's a data structure or existing algorithm that will handle it
cleanly.  Does anyone know of such a thing?

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/CAE8gKoe-GHZvVFu7vr%2B6%2BJ27TmF0LgzaxFpC1yzDRzaO6tqhTw%40mail.gmail.com.

Re: [racket-users] Best data structure for ordered data set with insertion and reordering?

Re: [racket-users] Best data structure for ordered data set with insertion and reordering?

Re: [racket-users] Best data structure for ordered data set with insertion and reordering?

Re: [racket-users] Best data structure for ordered data set with insertion and reordering?

[racket-users] Best data structure for ordered data set with insertion and reordering?

5 matches

Site Navigation

Mail list logo

Footer information