[Linuxptp-devel] [PATCH] ptp4l: flush old cached packets

Discussion:

Jiri Benc

2013-07-18 15:14:47 UTC

This patch fixes a bug with time mysteriously jumping back and forth:

ptp4l[930.687]: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
ptp4l[931.687]: master offset 17 s2 freq +33014 path delay 2728
ptp4l[932.687]: master offset -74 s2 freq +32928 path delay 2734
ptp4l[933.687]: master offset 2 s2 freq +32982 path delay 2734
ptp4l[934.687]: master offset -3 s2 freq +32977 path delay 2728
ptp4l[935.687]: master offset 17 s2 freq +32996 path delay 2729
ptp4l[936.687]: master offset -10 s2 freq +32974 path delay 2729
ptp4l[937.687]: master offset 35 s2 freq +33016 path delay 2727
ptp4l[938.686]: master offset 60001851388 s2 freq +62499999 path delay 2728
ptp4l[939.687]: master offset -62464938 s2 freq -62431946 path delay 2728

The last follow up message arriving out of order is cached. Before the state
machine changes to UNCALIBRATED, all sync and follow up messages are discarded.
If we get into that state between a sync and follow up message, the latter is
cached. When there's no real roerdering happening, it's kept cached forever.

When we restart the master, it starts numbering the messages from zero again.
The initial synchronization doesn't take always the same amount of time, so it
can happen that we get into UNCALIBRATED a little bit faster than before,
managing to get the sync message with the sequenceId that we missed last time.
As it has the same sequenceId as the cached (old) follow up message, it's
incorrectly assumed those two belong together.

Flush the cache when changing to UNCALIBRATED. Also, do similar thing for other
cached packets.

Signed-off-by: Jiri Benc <***@redhat.com>
---
port.c | 28 +++++++++++++++++++++++++---
1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/port.c b/port.c
index d29a70c..e097a81 100644
--- a/port.c
+++ b/port.c
@@ -1115,10 +1115,8 @@ static int port_is_enabled(struct port *p)
return 1;
}

-static void port_disable(struct port *p)
+static void flush_last_sync(struct port *p)
{
- int i;
-
if (p->last_follow_up) {
msg_put(p->last_follow_up);
p->last_follow_up = NULL;
@@ -1127,10 +1125,18 @@ static void port_disable(struct port *p)
msg_put(p->last_sync);
p->last_sync = NULL;
}
+}
+
+static void flush_delay_req(struct port *p)
+{
if (p->delay_req) {
msg_put(p->delay_req);
p->delay_req = NULL;
}
+}
+
+static void flush_peer_delay(struct port *p)
+{
if (p->peer_delay_req) {
msg_put(p->peer_delay_req);
p->peer_delay_req = NULL;
@@ -1143,6 +1149,15 @@ static void port_disable(struct port *p)
msg_put(p->peer_delay_fup);
p->peer_delay_fup = NULL;
}
+}
+
+static void port_disable(struct port *p)
+{
+ int i;
+
+ flush_last_sync(p);
+ flush_delay_req(p);
+ flush_peer_delay(p);

p->best = NULL;
free_foreign_masters(p);
@@ -1782,6 +1797,9 @@ static void port_e2e_transition(struct port *p, enum port_state next)
port_set_announce_tmo(p);
break;
case PS_UNCALIBRATED:
+ flush_last_sync(p);
+ flush_delay_req(p);
+ /* fall through */
case PS_SLAVE:
port_set_announce_tmo(p);
port_set_delay_tmo(p);
@@ -1819,6 +1837,9 @@ static void port_p2p_transition(struct port *p, enum port_state next)
port_set_announce_tmo(p);
break;
case PS_UNCALIBRATED:
+ flush_last_sync(p);
+ flush_peer_delay(p);
+ /* fall through */
case PS_SLAVE:
port_set_announce_tmo(p);
break;
@@ -1857,6 +1878,7 @@ int port_dispatch(struct port *p, enum fsm_event event, int mdiff)
port_show_transition(p, next, event);
p->state = next;
if (next == PS_LISTENING && p->delayMechanism == DM_P2P) {
+ flush_delay_req(p);
port_set_delay_tmo(p);
}
return 1;

--
1.7.6.5

Keller, Jacob E

2013-07-18 21:04:33 UTC

Permalink

Jiri,

-----Original Message-----
Sent: Thursday, July 18, 2013 8:15 AM
Subject: [Linuxptp-devel] [PATCH] ptp4l: flush old cached packets
ptp4l[930.687]: port 1: UNCALIBRATED to SLAVE on
MASTER_CLOCK_SELECTED
ptp4l[931.687]: master offset 17 s2 freq +33014 path delay 2728
ptp4l[932.687]: master offset -74 s2 freq +32928 path delay 2734
ptp4l[933.687]: master offset 2 s2 freq +32982 path delay 2734
ptp4l[934.687]: master offset -3 s2 freq +32977 path delay 2728
ptp4l[935.687]: master offset 17 s2 freq +32996 path delay 2729
ptp4l[936.687]: master offset -10 s2 freq +32974 path delay 2729
ptp4l[937.687]: master offset 35 s2 freq +33016 path delay 2727
ptp4l[938.686]: master offset 60001851388 s2 freq +62499999 path delay 2728
ptp4l[939.687]: master offset -62464938 s2 freq -62431946 path delay
2728
The last follow up message arriving out of order is cached. Before the state
machine changes to UNCALIBRATED, all sync and follow up messages are discarded.
If we get into that state between a sync and follow up message, the latter is
cached. When there's no real roerdering happening, it's kept cached forever.
When we restart the master, it starts numbering the messages from zero again.
The initial synchronization doesn't take always the same amount of time, so it
can happen that we get into UNCALIBRATED a little bit faster than before,
managing to get the sync message with the sequenceId that we missed last time.
As it has the same sequenceId as the cached (old) follow up message, it's
incorrectly assumed those two belong together.
Flush the cache when changing to UNCALIBRATED. Also, do similar thing for other
cached packets.

Nice catch!

---
port.c | 28 +++++++++++++++++++++++++---
1 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/port.c b/port.c
index d29a70c..e097a81 100644
--- a/port.c
+++ b/port.c
@@ -1115,10 +1115,8 @@ static int port_is_enabled(struct port *p)
return 1;
}
-static void port_disable(struct port *p)
+static void flush_last_sync(struct port *p)
{
- int i;
-

Boy this diff is ugly.... I see though you are blocking out parts of port_disable into functions.

if (p->last_follow_up) {
msg_put(p->last_follow_up);
p->last_follow_up = NULL;
@@ -1127,10 +1125,18 @@ static void port_disable(struct port *p)
msg_put(p->last_sync);
p->last_sync = NULL;
}
+}
+
+static void flush_delay_req(struct port *p)
+{
if (p->delay_req) {
msg_put(p->delay_req);
p->delay_req = NULL;
}
+}
+
+static void flush_peer_delay(struct port *p)
+{
if (p->peer_delay_req) {
msg_put(p->peer_delay_req);
p->peer_delay_req = NULL;
@@ -1143,6 +1149,15 @@ static void port_disable(struct port *p)
msg_put(p->peer_delay_fup);
p->peer_delay_fup = NULL;
}
+}
+
+static void port_disable(struct port *p)
+{
+ int i;
+
+ flush_last_sync(p);
+ flush_delay_req(p);
+ flush_peer_delay(p);
p->best = NULL;
free_foreign_masters(p);
@@ -1782,6 +1797,9 @@ static void port_e2e_transition(struct port *p, enum port_state next)
port_set_announce_tmo(p);
break;
+ flush_last_sync(p);
+ flush_delay_req(p);
+ /* fall through */
port_set_announce_tmo(p);
port_set_delay_tmo(p);
@@ -1819,6 +1837,9 @@ static void port_p2p_transition(struct port *p, enum port_state next)
port_set_announce_tmo(p);
break;
+ flush_last_sync(p);
+ flush_peer_delay(p);
+ /* fall through */
port_set_announce_tmo(p);
break;
@@ -1857,6 +1878,7 @@ int port_dispatch(struct port *p, enum
fsm_event event, int mdiff)
port_show_transition(p, next, event);
p->state = next;
if (next == PS_LISTENING && p->delayMechanism ==
DM_P2P) {
+ flush_delay_req(p);
port_set_delay_tmo(p);
}
return 1;
--
1.7.6.5

Looks good to me, though it was a little hard to process that diff.

- Jake

Richard Cochran

2013-07-19 03:42:18 UTC

Permalink

Post by Keller, Jacob E
Nice catch!

Yep.

Post by Keller, Jacob E
Boy this diff is ugly.... I see though you are blocking out parts of port_disable into functions.

But its not Jiri's fault. Some unified diffs just turn out that way.

Thanks,
Richard

Keller, Jacob E

2013-07-19 20:48:49 UTC

Permalink

-----Original Message-----
Sent: Thursday, July 18, 2013 8:42 PM
To: Keller, Jacob E
Subject: Re: [Linuxptp-devel] [PATCH] ptp4l: flush old cached packets

Post by Keller, Jacob E
Nice catch!

Yep.

Post by Keller, Jacob E
Boy this diff is ugly.... I see though you are blocking out parts of

port_disable into functions.
But its not Jiri's fault. Some unified diffs just turn out that way.

This is correct. I was just processing as I typed the email.

- Jake

Thanks,
Richard

Richard Cochran

2013-07-19 03:31:10 UTC

Permalink

Post by Jiri Benc
When we restart the master, it starts numbering the messages from zero again.
The initial synchronization doesn't take always the same amount of time, so it
can happen that we get into UNCALIBRATED a little bit faster than before,
managing to get the sync message with the sequenceId that we missed last time.
As it has the same sequenceId as the cached (old) follow up message, it's
incorrectly assumed those two belong together.

Out of curiosity, did you find this by analysis (just looking at the
code), or did the problem appear through testing?

Thanks,
Richard

Jiri Benc

2013-07-19 08:40:26 UTC

Permalink

Post by Richard Cochran
Out of curiosity, did you find this by analysis (just looking at the
code), or did the problem appear through testing?

Hit that while testing. Debugging it took me a while, it was not clear
what was going on. At first, it even looked like a driver bug.

Jiri

--
Jiri Benc

Richard Cochran

2013-07-19 05:04:22 UTC

Permalink

Post by Jiri Benc
@@ -1857,6 +1878,7 @@ int port_dispatch(struct port *p, enum fsm_event event, int mdiff)
port_show_transition(p, next, event);
p->state = next;
if (next == PS_LISTENING && p->delayMechanism == DM_P2P) {
+ flush_delay_req(p);
port_set_delay_tmo(p);
}
return 1;

Can you explain why this one is needed?

Just before this we have:

if (port_is_enabled(p)) {
port_disable(p);
}

Thanks,
Richard

Jiri Benc

2013-07-19 08:35:19 UTC

Permalink

Post by Richard Cochran

Can you explain why this one is needed?

It's not, you're right :-) I'll send v2.

Thanks,

Jiri

--
Jiri Benc