[Revctrl] Algebraic manipulation of patches vs. dependencies

Discussion:

Mark Seaborn

19 years ago

I've been trying out Darcs recently, finding it useful, and trying to
understand how it works.

What puzzles me is that Darcs does not store a dependency graph of all
the patches in a repository. Instead it keeps patches in a list, and
manipulates the list algebraically, reordering patches that commute.
When two patches modifying the same file are reordered, the line
numbers they refer to must be rewritten. In some cases, Darcs groups
together patches that can be applied in parallel.

Is there any reason why Darcs doesn't go a step further by storing a
full dependency tree? This would have the advantage of being a
canonical representation. Patches' line numbers would not need to be
rewritten according to where the patch appears in the sequence. The
line numbers of a patch could be given relative to the file contents
defined by all the patches it depends on.

Mark

Zooko O'Whielacronx

19 years ago

Permalink

...

Mark Seaborn:

I think this is an excellent question. The major jumping off point that
separates darcs from my own personal vaporous idle thought experiment of
a tool (which is named "zxc") is that in my tool dependencies are
explicitly calculated (greedily) and explicitly stored (persistently).

I do not know for sure, but I harbor the suspicion that the reason darcs
does otherwise is that Haskell programmers (like many smart programmers)
are enamoured of timeless, lazy, declarative, mathematical algorithms.
Therefore, they naturally think of writing a purely functional algorithm
which evaluates to the right result, and that they naturally don't think
of algorithms which are imperative and which explicitly and greedily
store intermediate results to disk.

Regards,

Zooko

David Roundy

19 years ago

Permalink

...

If we stored patches as a full dependency tree, then recording a new
change would be an O(N) process, where N is the number of patches in
the history. That's a scary thought, to me anyways. Similarly,
pulling a single change would also be an O(N) process in general,
since you'd have to do a merge in order to find out how to apply the
patch to the current file.

--
David Roundy

Zooko O'Whielacronx

19 years ago

Permalink

Post by David Roundy
If we stored patches as a full dependency tree, then recording a new
change would be an O(N) process, where N is the number of patches in
the history. That's a scary thought, to me anyways. Similarly,
pulling a single change would also be an O(N) process in general,
since you'd have to do a merge in order to find out how to apply the
patch to the current file.

I'm not sure this is correct, since the algorithm to determine
dependencies can use cached information and knowledge about the kind of
the patch. For example, suppose the patch being recorded is a "hunk"
patch which alters the contents of a file. Suppose there is a table
(already present when this algorithm begins) mapping each line of that
file to the patch which has most recently touched that line. Then the
cost of determining the dependencies of the patch is simply a table
lookup for each line of the hunk, which is less than O(N) (and also is
plenty fast for practical purposes).

Note that this is has some similarities with Codeville...

Regards,

Zooko

Mark Seaborn

19 years ago

Permalink

Post by Zooko O'Whielacronx

...

Right. You would cache annotated versions of the pristine copies of
the files. The annotated versions would contain information similar
to what is produced by "cvs/svn/darcs annotate": Ranges of lines would
be associated with the patch that inserted those lines. You would
also need to record deletions: Indexes between line ranges would be
marked with patches that had removed lines from that point to produce
this version of the file.

Here's what I have in mind:

The contents of a file would be specified by a set of non-conflicting
patches.

Suppose for example the first patch to insert into a file is p1, and it
produces the following annotated file:

File {p1}:
lines from p1:
0 >int f()
1 >{
2 > foo();
3 > bar();
4 > return 0;
5 >}

With {p1} as the starting point, you record a patch p2, defined as:

Patch p2 is defined as: from {p1} remove range 3..4

which gives the annotated file:

File {p1,p2}:
lines from p1:
0 >int f()
1 >{
2 > foo();
lines were deleted by p2 at this point
lines from p1:
3 > return 0;
4 >}

If you then replace lines 2 and 3, the resulting patch will depend on

Post by Zooko O'Whielacronx
return 1;

giving the annotated file:

File {p1,p2,p3}:
lines from p1:
0 >int f()
1 >{
lines from p3:
2 > return 1;
lines from p1:
3 >}

If you choose to edit the first line of {p1,p2,p3}, the resulting

Post by Zooko O'Whielacronx
int function()

{p1,p4} specifies a valid file, and so does {p1,p2,p3,p4}.

When representing {p1,p2,p3,p4}, an implementation only needs to store
{p3,p4}, because p1 and p2 are implied by the definitions of p3 and
p4, which list their dependencies. The closure of the set under the
dependency relation gets back p1 and p2:
Closure{p3,p4} = {p1,p2,p3,p4}

Is this anything like what you had in mind, Zooko?

Cheers,
Mark