mbox series

[net-next,v2,0/4] Implement classifier-action terse dump mode

Message ID 20200515114014.3135-1-vladbu@mellanox.com
Headers show
Series Implement classifier-action terse dump mode | expand

Message

Vlad Buslov May 15, 2020, 11:40 a.m. UTC
Output rate of current upstream kernel TC filter dump implementation if
relatively low (~100k rules/sec depending on configuration). This
constraint impacts performance of software switch implementation that
rely on TC for their datapath implementation and periodically call TC
filter dump to update rules stats. Moreover, TC filter dump output a lot
of static data that don't change during the filter lifecycle (filter
key, specific action details, etc.) which constitutes significant
portion of payload on resulting netlink packets and increases amount of
syscalls necessary to dump all filters on particular Qdisc. In order to
significantly improve filter dump rate this patch sets implement new
mode of TC filter dump operation named "terse dump" mode. In this mode
only parameters necessary to identify the filter (handle, action cookie,
etc.) and data that can change during filter lifecycle (filter flags,
action stats, etc.) are preserved in dump output while everything else
is omitted.

Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
individual classifier support (new tcf_proto_ops->terse_dump()
callback). Support for action terse dump is implemented in act API and
don't require changing individual action implementations.

The following table provides performance comparison between regular
filter dump and new terse dump mode for two classifier-action profiles:
one minimal config with L2 flower classifier and single gact action and
another heavier config with L2+5tuple flower classifier with
tunnel_key+mirred actions.

 Classifier-action type      |        dump |  terse dump | X improvement 
                             | (rules/sec) | (rules/sec) |               
-----------------------------+-------------+-------------+---------------
 L2 with gact                |       141.8 |       293.2 |          2.07 
 L2+5tuple tunnel_key+mirred |        76.4 |       198.8 |          2.60 

Benchmark details: to measure the rate tc filter dump and terse dump
commands are invoked on ingress Qdisc that have one million filters
configured using following commands.

> time sudo tc -s filter show dev ens1f0 ingress >/dev/null

> time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null

Value in results table is calculated by dividing 1000000 total rules by
"real" time reported by time command.

Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory

Vlad Buslov (4):
  net: sched: introduce terse dump flag
  net: sched: implement terse dump support in act
  net: sched: cls_flower: implement terse dump support
  selftests: implement flower classifier terse dump tests

 include/net/act_api.h                         |  2 +-
 include/net/pkt_cls.h                         |  1 +
 include/net/sch_generic.h                     |  4 ++
 include/uapi/linux/rtnetlink.h                |  6 ++
 net/sched/act_api.c                           | 30 +++++++--
 net/sched/cls_api.c                           | 67 ++++++++++++++++---
 net/sched/cls_flower.c                        | 43 ++++++++++++
 .../tc-testing/tc-tests/filters/tests.json    | 38 +++++++++++
 8 files changed, 174 insertions(+), 17 deletions(-)

Comments

Jakub Kicinski May 15, 2020, 5:04 p.m. UTC | #1
On Fri, 15 May 2020 14:40:10 +0300 Vlad Buslov wrote:
> Output rate of current upstream kernel TC filter dump implementation if
> relatively low (~100k rules/sec depending on configuration). This
> constraint impacts performance of software switch implementation that
> rely on TC for their datapath implementation and periodically call TC
> filter dump to update rules stats. Moreover, TC filter dump output a lot
> of static data that don't change during the filter lifecycle (filter
> key, specific action details, etc.) which constitutes significant
> portion of payload on resulting netlink packets and increases amount of
> syscalls necessary to dump all filters on particular Qdisc. In order to
> significantly improve filter dump rate this patch sets implement new
> mode of TC filter dump operation named "terse dump" mode. In this mode
> only parameters necessary to identify the filter (handle, action cookie,
> etc.) and data that can change during filter lifecycle (filter flags,
> action stats, etc.) are preserved in dump output while everything else
> is omitted.

Please keep the review tags you already got when making minor changes.

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
David Miller May 15, 2020, 5:25 p.m. UTC | #2
From: Vlad Buslov <vladbu@mellanox.com>
Date: Fri, 15 May 2020 14:40:10 +0300

> Output rate of current upstream kernel TC filter dump implementation if
> relatively low (~100k rules/sec depending on configuration). This
> constraint impacts performance of software switch implementation that
> rely on TC for their datapath implementation and periodically call TC
> filter dump to update rules stats. Moreover, TC filter dump output a lot
> of static data that don't change during the filter lifecycle (filter
> key, specific action details, etc.) which constitutes significant
> portion of payload on resulting netlink packets and increases amount of
> syscalls necessary to dump all filters on particular Qdisc. In order to
> significantly improve filter dump rate this patch sets implement new
> mode of TC filter dump operation named "terse dump" mode. In this mode
> only parameters necessary to identify the filter (handle, action cookie,
> etc.) and data that can change during filter lifecycle (filter flags,
> action stats, etc.) are preserved in dump output while everything else
> is omitted.
> 
> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
> individual classifier support (new tcf_proto_ops->terse_dump()
> callback). Support for action terse dump is implemented in act API and
> don't require changing individual action implementations.
 ...

This looks fine, so series applied.

But really if people just want an efficient stats dump there is probably
a better way to efficiently encode just the IDs and STATs.  Maybe even
put the stats in pages that userland can mmap() and avoid all of this
system call overhead and locking altogether.
Cong Wang May 17, 2020, 7:13 p.m. UTC | #3
On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Output rate of current upstream kernel TC filter dump implementation if
> relatively low (~100k rules/sec depending on configuration). This
> constraint impacts performance of software switch implementation that
> rely on TC for their datapath implementation and periodically call TC
> filter dump to update rules stats. Moreover, TC filter dump output a lot
> of static data that don't change during the filter lifecycle (filter
> key, specific action details, etc.) which constitutes significant
> portion of payload on resulting netlink packets and increases amount of
> syscalls necessary to dump all filters on particular Qdisc. In order to
> significantly improve filter dump rate this patch sets implement new
> mode of TC filter dump operation named "terse dump" mode. In this mode
> only parameters necessary to identify the filter (handle, action cookie,
> etc.) and data that can change during filter lifecycle (filter flags,
> action stats, etc.) are preserved in dump output while everything else
> is omitted.
>
> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
> individual classifier support (new tcf_proto_ops->terse_dump()
> callback). Support for action terse dump is implemented in act API and
> don't require changing individual action implementations.

Sorry for being late.

Why terse dump needs a new ops if it only dumps a subset of the
regular dump? That is, why not just pass a boolean flag to regular
->dump() implementation?

I guess that might break user-space ABI? At least some netlink
attributes are not always dumped anyway, so it does not look like
a problem?

Thanks.
Vlad Buslov May 18, 2020, 6:44 a.m. UTC | #4
On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>> Output rate of current upstream kernel TC filter dump implementation if
>> relatively low (~100k rules/sec depending on configuration). This
>> constraint impacts performance of software switch implementation that
>> rely on TC for their datapath implementation and periodically call TC
>> filter dump to update rules stats. Moreover, TC filter dump output a lot
>> of static data that don't change during the filter lifecycle (filter
>> key, specific action details, etc.) which constitutes significant
>> portion of payload on resulting netlink packets and increases amount of
>> syscalls necessary to dump all filters on particular Qdisc. In order to
>> significantly improve filter dump rate this patch sets implement new
>> mode of TC filter dump operation named "terse dump" mode. In this mode
>> only parameters necessary to identify the filter (handle, action cookie,
>> etc.) and data that can change during filter lifecycle (filter flags,
>> action stats, etc.) are preserved in dump output while everything else
>> is omitted.
>>
>> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
>> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
>> individual classifier support (new tcf_proto_ops->terse_dump()
>> callback). Support for action terse dump is implemented in act API and
>> don't require changing individual action implementations.
>
> Sorry for being late.
>
> Why terse dump needs a new ops if it only dumps a subset of the
> regular dump? That is, why not just pass a boolean flag to regular
> ->dump() implementation?
>
> I guess that might break user-space ABI? At least some netlink
> attributes are not always dumped anyway, so it does not look like
> a problem?
>
> Thanks.

Hi Cong,

I considered adding a flag to ->dump() callback but decided against it
for following reasons:

- It complicates fl_dump() code by adding additional conditionals. Not a
  big problem but it seemed better for me to have a standalone callback
  because with combined implementation it is even hard to deduce what
  does terse dump actually output.

- My initial implementation just called regular dump for classifiers
  that don't support terse dump, but in internal review Jiri insisted
  that cls API should fail if it can't satisfy user's request and having
  dedicated callback allows implementation to return an error if
  classifier doesn't define ->terse_dump(). With flag approach it would
  be not trivial to determine if implementation actually uses the flag.
  I guess I could have added new tcf_proto_ops->flags value to designate
  terse dump support, but checking for dedicated callback existence
  seemed like obvious approach.

Regards,
Vlad
Vlad Buslov May 18, 2020, 6:46 a.m. UTC | #5
On Fri 15 May 2020 at 20:04, Jakub Kicinski <kuba@kernel.org> wrote:
> On Fri, 15 May 2020 14:40:10 +0300 Vlad Buslov wrote:
>> Output rate of current upstream kernel TC filter dump implementation if
>> relatively low (~100k rules/sec depending on configuration). This
>> constraint impacts performance of software switch implementation that
>> rely on TC for their datapath implementation and periodically call TC
>> filter dump to update rules stats. Moreover, TC filter dump output a lot
>> of static data that don't change during the filter lifecycle (filter
>> key, specific action details, etc.) which constitutes significant
>> portion of payload on resulting netlink packets and increases amount of
>> syscalls necessary to dump all filters on particular Qdisc. In order to
>> significantly improve filter dump rate this patch sets implement new
>> mode of TC filter dump operation named "terse dump" mode. In this mode
>> only parameters necessary to identify the filter (handle, action cookie,
>> etc.) and data that can change during filter lifecycle (filter flags,
>> action stats, etc.) are preserved in dump output while everything else
>> is omitted.
>
> Please keep the review tags you already got when making minor changes.

Thanks. I generally hesitant to retain other people's tags when making
any changes to the code. I'll retain your tags when making trivial
changes from now on.

>
> Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Vlad Buslov May 18, 2020, 6:50 a.m. UTC | #6
On Fri 15 May 2020 at 20:25, David Miller <davem@davemloft.net> wrote:
> From: Vlad Buslov <vladbu@mellanox.com>
> Date: Fri, 15 May 2020 14:40:10 +0300
>
>> Output rate of current upstream kernel TC filter dump implementation if
>> relatively low (~100k rules/sec depending on configuration). This
>> constraint impacts performance of software switch implementation that
>> rely on TC for their datapath implementation and periodically call TC
>> filter dump to update rules stats. Moreover, TC filter dump output a lot
>> of static data that don't change during the filter lifecycle (filter
>> key, specific action details, etc.) which constitutes significant
>> portion of payload on resulting netlink packets and increases amount of
>> syscalls necessary to dump all filters on particular Qdisc. In order to
>> significantly improve filter dump rate this patch sets implement new
>> mode of TC filter dump operation named "terse dump" mode. In this mode
>> only parameters necessary to identify the filter (handle, action cookie,
>> etc.) and data that can change during filter lifecycle (filter flags,
>> action stats, etc.) are preserved in dump output while everything else
>> is omitted.
>>
>> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
>> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
>> individual classifier support (new tcf_proto_ops->terse_dump()
>> callback). Support for action terse dump is implemented in act API and
>> don't require changing individual action implementations.
>  ...
>
> This looks fine, so series applied.
>
> But really if people just want an efficient stats dump there is probably
> a better way to efficiently encode just the IDs and STATs.  Maybe even
> put the stats in pages that userland can mmap() and avoid all of this
> system call overhead and locking altogether.

Thanks! Adding such API will be my next step, if terse dump performance
proves insufficient.
Edward Cree May 18, 2020, 3:37 p.m. UTC | #7
On 15/05/2020 12:40, Vlad Buslov wrote:
> In order to
> significantly improve filter dump rate this patch sets implement new
> mode of TC filter dump operation named "terse dump" mode. In this mode
> only parameters necessary to identify the filter (handle, action cookie,
> etc.) and data that can change during filter lifecycle (filter flags,
> action stats, etc.) are preserved in dump output while everything else
> is omitted.
I realise I'm a bit late, but isn't this the kind of policy that shouldn't
 be hard-coded in the kernel?  I.e. if next year it turns out that some
 user needs one parameter that's been omitted here, but not the whole dump,
 are they going to want to add another mode to the uapi?
Should this not instead have been done as a set of flags to specify which
 pieces of information the caller wanted in the dump, rather than a mode
 flag selecting a pre-defined set?

-ed
Cong Wang May 18, 2020, 6:46 p.m. UTC | #8
On Sun, May 17, 2020 at 11:44 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >>
> >> Output rate of current upstream kernel TC filter dump implementation if
> >> relatively low (~100k rules/sec depending on configuration). This
> >> constraint impacts performance of software switch implementation that
> >> rely on TC for their datapath implementation and periodically call TC
> >> filter dump to update rules stats. Moreover, TC filter dump output a lot
> >> of static data that don't change during the filter lifecycle (filter
> >> key, specific action details, etc.) which constitutes significant
> >> portion of payload on resulting netlink packets and increases amount of
> >> syscalls necessary to dump all filters on particular Qdisc. In order to
> >> significantly improve filter dump rate this patch sets implement new
> >> mode of TC filter dump operation named "terse dump" mode. In this mode
> >> only parameters necessary to identify the filter (handle, action cookie,
> >> etc.) and data that can change during filter lifecycle (filter flags,
> >> action stats, etc.) are preserved in dump output while everything else
> >> is omitted.
> >>
> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
> >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
> >> individual classifier support (new tcf_proto_ops->terse_dump()
> >> callback). Support for action terse dump is implemented in act API and
> >> don't require changing individual action implementations.
> >
> > Sorry for being late.
> >
> > Why terse dump needs a new ops if it only dumps a subset of the
> > regular dump? That is, why not just pass a boolean flag to regular
> > ->dump() implementation?
> >
> > I guess that might break user-space ABI? At least some netlink
> > attributes are not always dumped anyway, so it does not look like
> > a problem?
> >
> > Thanks.
>
> Hi Cong,
>
> I considered adding a flag to ->dump() callback but decided against it
> for following reasons:
>
> - It complicates fl_dump() code by adding additional conditionals. Not a
>   big problem but it seemed better for me to have a standalone callback
>   because with combined implementation it is even hard to deduce what
>   does terse dump actually output.

This is not a problem, at least you can add a big if in fl_dump(),
something like:

if (terse) {
  // do terse dump
  return 0;
}
// normal dump

>
> - My initial implementation just called regular dump for classifiers
>   that don't support terse dump, but in internal review Jiri insisted
>   that cls API should fail if it can't satisfy user's request and having
>   dedicated callback allows implementation to return an error if
>   classifier doesn't define ->terse_dump(). With flag approach it would
>   be not trivial to determine if implementation actually uses the flag.

Hmm? For those not support terse dump, we can just do:

if (terse)
  return -EOPNOTSUPP;
// normal dump goes here

You just have to pass 'terse' flag to all implementations and let them
to decide whether to support it or not.


>   I guess I could have added new tcf_proto_ops->flags value to designate
>   terse dump support, but checking for dedicated callback existence
>   seemed like obvious approach.

This does not look necessary, as long as we can just pass the flag
down to each ->dump().

Thanks.
Cong Wang May 18, 2020, 6:50 p.m. UTC | #9
On Mon, May 18, 2020 at 8:38 AM Edward Cree <ecree@solarflare.com> wrote:
>
> On 15/05/2020 12:40, Vlad Buslov wrote:
> > In order to
> > significantly improve filter dump rate this patch sets implement new
> > mode of TC filter dump operation named "terse dump" mode. In this mode
> > only parameters necessary to identify the filter (handle, action cookie,
> > etc.) and data that can change during filter lifecycle (filter flags,
> > action stats, etc.) are preserved in dump output while everything else
> > is omitted.
> I realise I'm a bit late, but isn't this the kind of policy that shouldn't
>  be hard-coded in the kernel?  I.e. if next year it turns out that some
>  user needs one parameter that's been omitted here, but not the whole dump,
>  are they going to want to add another mode to the uapi?
> Should this not instead have been done as a set of flags to specify which
>  pieces of information the caller wanted in the dump, rather than a mode
>  flag selecting a pre-defined set?

Excellent point!

I agree, this is more elegant, although potentially needs more work.

I am not sure whether we can simply pass those flags to cb->args[],
if not, that will need more work at netlink layer.

Thanks.
Vlad Buslov May 19, 2020, 9:04 a.m. UTC | #10
On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote:
> On 15/05/2020 12:40, Vlad Buslov wrote:
>> In order to
>> significantly improve filter dump rate this patch sets implement new
>> mode of TC filter dump operation named "terse dump" mode. In this mode
>> only parameters necessary to identify the filter (handle, action cookie,
>> etc.) and data that can change during filter lifecycle (filter flags,
>> action stats, etc.) are preserved in dump output while everything else
>> is omitted.
> I realise I'm a bit late, but isn't this the kind of policy that shouldn't
>  be hard-coded in the kernel?  I.e. if next year it turns out that some
>  user needs one parameter that's been omitted here, but not the whole dump,
>  are they going to want to add another mode to the uapi?

Why not just extend terse dump? I won't break user land unless you are
removing something from it.

> Should this not instead have been done as a set of flags to specify which
>  pieces of information the caller wanted in the dump, rather than a mode
>  flag selecting a pre-defined set?
>
> -ed

I considered that approach initially but decided against it for
following reasons:

- Generic data is covered by current terse dump implementation.
  Everything else will be act or cls specific which would result long
  list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST,
  TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ...,
  TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of
  these would require a lot of dedicated logic in act and cls dump
  callbacks. Also, it would be quite a challenge to test all possible
  combinations.

- It is hard to come up with proper validation for such implementation.
  In case of terse dump I just return an error if classifier doesn't
  implement the callback (and since current implementation only outputs
  generic action info, it doesn't even require support from
  action-specific dump callbacks). But, for example, how do we validate
  a case where user sets some flower and tunnel_key act dump flags from
  previous paragraph, but Qdisc contains some other classifier? Or
  flower classifier points to other types of actions? Or when flower
  classifier has and tunnel_key actions but also mirred? Should the
  implementation return an error on encountering any classifier or
  action that doesn't have any flags set for its type or just print all
  data like regular dump? What if user asks to dump some specific option
  that wasn't set for particular filter of action instance?

Overall, the more I think about such implementation the more it looks
like a mess to me.
Vlad Buslov May 19, 2020, 9:10 a.m. UTC | #11
On Mon 18 May 2020 at 21:46, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Sun, May 17, 2020 at 11:44 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >>
>> >> Output rate of current upstream kernel TC filter dump implementation if
>> >> relatively low (~100k rules/sec depending on configuration). This
>> >> constraint impacts performance of software switch implementation that
>> >> rely on TC for their datapath implementation and periodically call TC
>> >> filter dump to update rules stats. Moreover, TC filter dump output a lot
>> >> of static data that don't change during the filter lifecycle (filter
>> >> key, specific action details, etc.) which constitutes significant
>> >> portion of payload on resulting netlink packets and increases amount of
>> >> syscalls necessary to dump all filters on particular Qdisc. In order to
>> >> significantly improve filter dump rate this patch sets implement new
>> >> mode of TC filter dump operation named "terse dump" mode. In this mode
>> >> only parameters necessary to identify the filter (handle, action cookie,
>> >> etc.) and data that can change during filter lifecycle (filter flags,
>> >> action stats, etc.) are preserved in dump output while everything else
>> >> is omitted.
>> >>
>> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
>> >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
>> >> individual classifier support (new tcf_proto_ops->terse_dump()
>> >> callback). Support for action terse dump is implemented in act API and
>> >> don't require changing individual action implementations.
>> >
>> > Sorry for being late.
>> >
>> > Why terse dump needs a new ops if it only dumps a subset of the
>> > regular dump? That is, why not just pass a boolean flag to regular
>> > ->dump() implementation?
>> >
>> > I guess that might break user-space ABI? At least some netlink
>> > attributes are not always dumped anyway, so it does not look like
>> > a problem?
>> >
>> > Thanks.
>>
>> Hi Cong,
>>
>> I considered adding a flag to ->dump() callback but decided against it
>> for following reasons:
>>
>> - It complicates fl_dump() code by adding additional conditionals. Not a
>>   big problem but it seemed better for me to have a standalone callback
>>   because with combined implementation it is even hard to deduce what
>>   does terse dump actually output.
>
> This is not a problem, at least you can add a big if in fl_dump(),
> something like:
>
> if (terse) {
>   // do terse dump
>   return 0;
> }
> // normal dump

That is what I was trying to prevent with my implementation: having big
"superfunctions" that implement multiple things with branching. Why not
just have dedicated callbacks that do exactly one thing?

>
>>
>> - My initial implementation just called regular dump for classifiers
>>   that don't support terse dump, but in internal review Jiri insisted
>>   that cls API should fail if it can't satisfy user's request and having
>>   dedicated callback allows implementation to return an error if
>>   classifier doesn't define ->terse_dump(). With flag approach it would
>>   be not trivial to determine if implementation actually uses the flag.
>
> Hmm? For those not support terse dump, we can just do:
>
> if (terse)
>   return -EOPNOTSUPP;
> // normal dump goes here
>
> You just have to pass 'terse' flag to all implementations and let them
> to decide whether to support it or not.

But why duplicate the same code to all existing cls dump implementations
instead of having such check nicely implemented in cls API (via callback
existence or a flag)?

>
>
>>   I guess I could have added new tcf_proto_ops->flags value to designate
>>   terse dump support, but checking for dedicated callback existence
>>   seemed like obvious approach.
>
> This does not look necessary, as long as we can just pass the flag
> down to each ->dump().
>
> Thanks.
Edward Cree May 19, 2020, 2:30 p.m. UTC | #12
On 19/05/2020 10:04, Vlad Buslov wrote:
> On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote:
>> I.e. if next year it turns out that some
>>  user needs one parameter that's been omitted here, but not the whole dump,
>>  are they going to want to add another mode to the uapi?
> Why not just extend terse dump? I won't break user land unless you are
> removing something from it.
But then all terse dump users pay the performance cost for thatone
 app's extra need.

> - Generic data is covered by current terse dump implementation.
>   Everything else will be act or cls specific
Fair point.
I don't suppose something something BPF mumble solve this? I haven't
 been following the BPF dumping work in detail but it sounds like it
 might be a cheap way to get the 'more performant next step' that
 was mentioned in the subthread with David.  Just a thought.

-ed
Vlad Buslov May 19, 2020, 3:17 p.m. UTC | #13
On Tue 19 May 2020 at 17:30, Edward Cree <ecree@solarflare.com> wrote:
> On 19/05/2020 10:04, Vlad Buslov wrote:
>> On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote:
>>> I.e. if next year it turns out that some
>>>  user needs one parameter that's been omitted here, but not the whole dump,
>>>  are they going to want to add another mode to the uapi?
>> Why not just extend terse dump? I won't break user land unless you are
>> removing something from it.
> But then all terse dump users pay the performance cost for thatone
>  app's extra need.

Yes. However reducing amount of data per filter is only part of
optimization. Skipping fl_dump_key() in flower and completely omitting
calling any of the tc_action_ops callbacks is also a major improvement.
So as long as outputting new parameter doesn't require calling one of
those (and it shouldn't, otherwise the parameter represent something
very cls or action type specific) it shouldn't impact performance
significantly.

>
>> - Generic data is covered by current terse dump implementation.
>>   Everything else will be act or cls specific
> Fair point.
> I don't suppose something something BPF mumble solve this? I haven't
>  been following the BPF dumping work in detail but it sounds like it
>  might be a cheap way to get the 'more performant next step' that
>  was mentioned in the subthread with David.  Just a thought.

I'm very interested in ideas how to use BPF to improve dump performance
so any comments/suggestions from resident BPF experts are welcome. Don't
think it will as fast as what David suggested though.
Cong Wang May 19, 2020, 6:39 p.m. UTC | #14
On Tue, May 19, 2020 at 2:10 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Mon 18 May 2020 at 21:46, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Sun, May 17, 2020 at 11:44 PM Vlad Buslov <vladbu@mellanox.com> wrote:
> >>
> >>
> >> On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >> > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >> >>
> >> >> Output rate of current upstream kernel TC filter dump implementation if
> >> >> relatively low (~100k rules/sec depending on configuration). This
> >> >> constraint impacts performance of software switch implementation that
> >> >> rely on TC for their datapath implementation and periodically call TC
> >> >> filter dump to update rules stats. Moreover, TC filter dump output a lot
> >> >> of static data that don't change during the filter lifecycle (filter
> >> >> key, specific action details, etc.) which constitutes significant
> >> >> portion of payload on resulting netlink packets and increases amount of
> >> >> syscalls necessary to dump all filters on particular Qdisc. In order to
> >> >> significantly improve filter dump rate this patch sets implement new
> >> >> mode of TC filter dump operation named "terse dump" mode. In this mode
> >> >> only parameters necessary to identify the filter (handle, action cookie,
> >> >> etc.) and data that can change during filter lifecycle (filter flags,
> >> >> action stats, etc.) are preserved in dump output while everything else
> >> >> is omitted.
> >> >>
> >> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
> >> >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
> >> >> individual classifier support (new tcf_proto_ops->terse_dump()
> >> >> callback). Support for action terse dump is implemented in act API and
> >> >> don't require changing individual action implementations.
> >> >
> >> > Sorry for being late.
> >> >
> >> > Why terse dump needs a new ops if it only dumps a subset of the
> >> > regular dump? That is, why not just pass a boolean flag to regular
> >> > ->dump() implementation?
> >> >
> >> > I guess that might break user-space ABI? At least some netlink
> >> > attributes are not always dumped anyway, so it does not look like
> >> > a problem?
> >> >
> >> > Thanks.
> >>
> >> Hi Cong,
> >>
> >> I considered adding a flag to ->dump() callback but decided against it
> >> for following reasons:
> >>
> >> - It complicates fl_dump() code by adding additional conditionals. Not a
> >>   big problem but it seemed better for me to have a standalone callback
> >>   because with combined implementation it is even hard to deduce what
> >>   does terse dump actually output.
> >
> > This is not a problem, at least you can add a big if in fl_dump(),
> > something like:
> >
> > if (terse) {
> >   // do terse dump
> >   return 0;
> > }
> > // normal dump
>
> That is what I was trying to prevent with my implementation: having big
> "superfunctions" that implement multiple things with branching. Why not
> just have dedicated callbacks that do exactly one thing?

1. Saving one unnecessary ops.
2. Easier to trace the code: all dumpings are in one implementation.

>
> >
> >>
> >> - My initial implementation just called regular dump for classifiers
> >>   that don't support terse dump, but in internal review Jiri insisted
> >>   that cls API should fail if it can't satisfy user's request and having
> >>   dedicated callback allows implementation to return an error if
> >>   classifier doesn't define ->terse_dump(). With flag approach it would
> >>   be not trivial to determine if implementation actually uses the flag.
> >
> > Hmm? For those not support terse dump, we can just do:
> >
> > if (terse)
> >   return -EOPNOTSUPP;
> > // normal dump goes here
> >
> > You just have to pass 'terse' flag to all implementations and let them
> > to decide whether to support it or not.
>
> But why duplicate the same code to all existing cls dump implementations
> instead of having such check nicely implemented in cls API (via callback
> existence or a flag)?

I am confused, your fl_terse_dump() also has duplication with fl_dump()...

Thanks.
Cong Wang May 19, 2020, 6:58 p.m. UTC | #15
On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> I considered that approach initially but decided against it for
> following reasons:
>
> - Generic data is covered by current terse dump implementation.
>   Everything else will be act or cls specific which would result long
>   list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST,
>   TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ...,
>   TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of
>   these would require a lot of dedicated logic in act and cls dump
>   callbacks. Also, it would be quite a challenge to test all possible
>   combinations.

Well, if you consider netlink dump as a database query, what Edward
proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather
than "select * from cls_db".

No one said it is easy to implement, it is just more elegant than you
select a hardcoded set of columns for the user.

Think about it, what if another user wants a less terse dump but still
not a full dump? Would you implement ops->terse_dump2()? Or
what if people still think your terse dump is not as terse as she wants?
ops->mini_dump()? How many ops's we would end having?


>
> - It is hard to come up with proper validation for such implementation.
>   In case of terse dump I just return an error if classifier doesn't
>   implement the callback (and since current implementation only outputs
>   generic action info, it doesn't even require support from
>   action-specific dump callbacks). But, for example, how do we validate
>   a case where user sets some flower and tunnel_key act dump flags from
>   previous paragraph, but Qdisc contains some other classifier? Or
>   flower classifier points to other types of actions? Or when flower
>   classifier has and tunnel_key actions but also mirred? Should the

Each action should be able to dump selectively too. If you think it
as a database, it is just a different table with different schemas.


>   implementation return an error on encountering any classifier or
>   action that doesn't have any flags set for its type or just print all
>   data like regular dump? What if user asks to dump some specific option
>   that wasn't set for particular filter of action instance?

Undefined or error.

>
> Overall, the more I think about such implementation the more it looks
> like a mess to me.

This is what I think about your current implementation. You know once
we add this we can't remove it any longer, right? This is why we should
make it right and better in the first place, not after carrying it for even one
release.

Thanks.
Vlad Buslov May 20, 2020, 7:24 a.m. UTC | #16
On Tue 19 May 2020 at 21:58, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> I considered that approach initially but decided against it for
>> following reasons:
>>
>> - Generic data is covered by current terse dump implementation.
>>   Everything else will be act or cls specific which would result long
>>   list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST,
>>   TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ...,
>>   TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of
>>   these would require a lot of dedicated logic in act and cls dump
>>   callbacks. Also, it would be quite a challenge to test all possible
>>   combinations.
>
> Well, if you consider netlink dump as a database query, what Edward
> proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather
> than "select * from cls_db".
>
> No one said it is easy to implement, it is just more elegant than you
> select a hardcoded set of columns for the user.

As I explained to Edward, having denser netlink packets with more
filters per packet is only part of optimization. Another part is not
executing some code at all. Consider fl_dump_key() which is 200 lines
function with bunch of conditionals like that:

static int fl_dump_key(struct sk_buff *skb, struct net *net,
		       struct fl_flow_key *key, struct fl_flow_key *mask)
{
	if (mask->meta.ingress_ifindex) {
		struct net_device *dev;

		dev = __dev_get_by_index(net, key->meta.ingress_ifindex);
		if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, dev->name))
			goto nla_put_failure;
	}

	if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
			    mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
			    sizeof(key->eth.dst)) ||
	    fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC,
			    mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK,
			    sizeof(key->eth.src)) ||
	    fl_dump_key_val(skb, &key->basic.n_proto, TCA_FLOWER_KEY_ETH_TYPE,
			    &mask->basic.n_proto, TCA_FLOWER_UNSPEC,
			    sizeof(key->basic.n_proto)))
		goto nla_put_failure;

	if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls))
		goto nla_put_failure;

	if (fl_dump_key_vlan(skb, TCA_FLOWER_KEY_VLAN_ID,
			     TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan, &mask->vlan))
		goto nla_put_failure;
    ...


Now imagine all of these are extended with additional if (flags &
TCA_DUMP_XXX). All gains from not outputting some other minor stuff into
netlink packet will be negated by it.


>
> Think about it, what if another user wants a less terse dump but still
> not a full dump? Would you implement ops->terse_dump2()? Or
> what if people still think your terse dump is not as terse as she wants?
> ops->mini_dump()? How many ops's we would end having?

User can discard whatever he doesn't need in user land code. The goal of
this change is performance optimization, not designing a generic
kernel-space data filtering mechanism.

>
>
>>
>> - It is hard to come up with proper validation for such implementation.
>>   In case of terse dump I just return an error if classifier doesn't
>>   implement the callback (and since current implementation only outputs
>>   generic action info, it doesn't even require support from
>>   action-specific dump callbacks). But, for example, how do we validate
>>   a case where user sets some flower and tunnel_key act dump flags from
>>   previous paragraph, but Qdisc contains some other classifier? Or
>>   flower classifier points to other types of actions? Or when flower
>>   classifier has and tunnel_key actions but also mirred? Should the
>
> Each action should be able to dump selectively too. If you think it
> as a database, it is just a different table with different schemas.

How is designing custom SQL-like query language (according to your
example at the beginning of the mail) for filter dump is going to
improve performance? If there is a way to do it in fast a generic manner
with BPF, then I'm very interested to hear the details. But adding
hundred more hardcoded conditionals is just not a solution considering
main motivations for this change is performance.
Vlad Buslov May 20, 2020, 7:33 a.m. UTC | #17
On Tue 19 May 2020 at 21:39, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, May 19, 2020 at 2:10 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Mon 18 May 2020 at 21:46, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Sun, May 17, 2020 at 11:44 PM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >>
>> >>
>> >> On Sun 17 May 2020 at 22:13, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> >> > On Fri, May 15, 2020 at 4:40 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >> >>
>> >> >> Output rate of current upstream kernel TC filter dump implementation if
>> >> >> relatively low (~100k rules/sec depending on configuration). This
>> >> >> constraint impacts performance of software switch implementation that
>> >> >> rely on TC for their datapath implementation and periodically call TC
>> >> >> filter dump to update rules stats. Moreover, TC filter dump output a lot
>> >> >> of static data that don't change during the filter lifecycle (filter
>> >> >> key, specific action details, etc.) which constitutes significant
>> >> >> portion of payload on resulting netlink packets and increases amount of
>> >> >> syscalls necessary to dump all filters on particular Qdisc. In order to
>> >> >> significantly improve filter dump rate this patch sets implement new
>> >> >> mode of TC filter dump operation named "terse dump" mode. In this mode
>> >> >> only parameters necessary to identify the filter (handle, action cookie,
>> >> >> etc.) and data that can change during filter lifecycle (filter flags,
>> >> >> action stats, etc.) are preserved in dump output while everything else
>> >> >> is omitted.
>> >> >>
>> >> >> Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
>> >> >> available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
>> >> >> individual classifier support (new tcf_proto_ops->terse_dump()
>> >> >> callback). Support for action terse dump is implemented in act API and
>> >> >> don't require changing individual action implementations.
>> >> >
>> >> > Sorry for being late.
>> >> >
>> >> > Why terse dump needs a new ops if it only dumps a subset of the
>> >> > regular dump? That is, why not just pass a boolean flag to regular
>> >> > ->dump() implementation?
>> >> >
>> >> > I guess that might break user-space ABI? At least some netlink
>> >> > attributes are not always dumped anyway, so it does not look like
>> >> > a problem?
>> >> >
>> >> > Thanks.
>> >>
>> >> Hi Cong,
>> >>
>> >> I considered adding a flag to ->dump() callback but decided against it
>> >> for following reasons:
>> >>
>> >> - It complicates fl_dump() code by adding additional conditionals. Not a
>> >>   big problem but it seemed better for me to have a standalone callback
>> >>   because with combined implementation it is even hard to deduce what
>> >>   does terse dump actually output.
>> >
>> > This is not a problem, at least you can add a big if in fl_dump(),
>> > something like:
>> >
>> > if (terse) {
>> >   // do terse dump
>> >   return 0;
>> > }
>> > // normal dump
>>
>> That is what I was trying to prevent with my implementation: having big
>> "superfunctions" that implement multiple things with branching. Why not
>> just have dedicated callbacks that do exactly one thing?
>
> 1. Saving one unnecessary ops.
> 2. Easier to trace the code: all dumpings are in one implementation.

Okay, I see your point.

>
>>
>> >
>> >>
>> >> - My initial implementation just called regular dump for classifiers
>> >>   that don't support terse dump, but in internal review Jiri insisted
>> >>   that cls API should fail if it can't satisfy user's request and having
>> >>   dedicated callback allows implementation to return an error if
>> >>   classifier doesn't define ->terse_dump(). With flag approach it would
>> >>   be not trivial to determine if implementation actually uses the flag.
>> >
>> > Hmm? For those not support terse dump, we can just do:
>> >
>> > if (terse)
>> >   return -EOPNOTSUPP;
>> > // normal dump goes here
>> >
>> > You just have to pass 'terse' flag to all implementations and let them
>> > to decide whether to support it or not.
>>
>> But why duplicate the same code to all existing cls dump implementations
>> instead of having such check nicely implemented in cls API (via callback
>> existence or a flag)?
>
> I am confused, your fl_terse_dump() also has duplication with fl_dump()...
>
> Thanks.

I meant duplicating the "if terse not supported return -EOPNOTSUPP" in
dump callback of every classifier implementation. With current
implementation cls API handles such case by checking whether classifier
implementation has ->terse_dump() defined and returns error otherwise.
This can also be achieved by having a new classifier flag, in case we
decide to proceed with folding both dump and terse_dump into single
->dump(bool terse) callback.
Vlad Buslov May 21, 2020, 2:36 p.m. UTC | #18
Hi Edward, Cong,

On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote:
> On 15/05/2020 12:40, Vlad Buslov wrote:
>> In order to
>> significantly improve filter dump rate this patch sets implement new
>> mode of TC filter dump operation named "terse dump" mode. In this mode
>> only parameters necessary to identify the filter (handle, action cookie,
>> etc.) and data that can change during filter lifecycle (filter flags,
>> action stats, etc.) are preserved in dump output while everything else
>> is omitted.
> I realise I'm a bit late, but isn't this the kind of policy that shouldn't
>  be hard-coded in the kernel?  I.e. if next year it turns out that some
>  user needs one parameter that's been omitted here, but not the whole dump,
>  are they going to want to add another mode to the uapi?
> Should this not instead have been done as a set of flags to specify which
>  pieces of information the caller wanted in the dump, rather than a mode
>  flag selecting a pre-defined set?
>
> -ed

I've been thinking some more about this. While the idea of making
fine-grained dump where user controls exact contents field-by-field is
unfeasible due to performance considerations, we can try to come up with
something more coarse-grained but not fully hardcoded (like current terse
dump implementation). Something like having a set of flags that allows
to skip output of groups of attributes.

For example, CLS_SKIP_KEY flag would skip the whole expensive classifier
key dump without having to go through all 200 lines of conditionals in
fl_dump_key() while ACT_SKIP_OPTIONS would skip outputting TCA_OPTIONS
compound attribute (and expensive call to tc_action_ops->dump()). This
approach would also leave the door open for further more fine-grained
flags, if the need arises. For example, new flags
CLS_SKIP_KEY_{L2,L3,L4} can be introduced to more precisely control
which parts of cls key should be skipped.

The main drawback of such approach is that it is impossible to come up
with universal set of flags that would be applicable for all
classifiers. Key (in some form) is applicable to most classifiers, but
it still doesn't make sense for matchall or bpf. Some classifiers have
'flags', some don't. Hardware-offloaded classifiers have in_hw_count.
Considering this, initial set of flags will be somewhat flower-centric.

What do you think?

Regards,
Vlad
Jakub Kicinski May 21, 2020, 5:02 p.m. UTC | #19
On Thu, 21 May 2020 17:36:12 +0300 Vlad Buslov wrote:
> Hi Edward, Cong,
> 
> On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote:
> > On 15/05/2020 12:40, Vlad Buslov wrote:  
> >> In order to
> >> significantly improve filter dump rate this patch sets implement new
> >> mode of TC filter dump operation named "terse dump" mode. In this mode
> >> only parameters necessary to identify the filter (handle, action cookie,
> >> etc.) and data that can change during filter lifecycle (filter flags,
> >> action stats, etc.) are preserved in dump output while everything else
> >> is omitted.  
> > I realise I'm a bit late, but isn't this the kind of policy that shouldn't
> >  be hard-coded in the kernel?  I.e. if next year it turns out that some
> >  user needs one parameter that's been omitted here, but not the whole dump,
> >  are they going to want to add another mode to the uapi?
> > Should this not instead have been done as a set of flags to specify which
> >  pieces of information the caller wanted in the dump, rather than a mode
> >  flag selecting a pre-defined set?
> >
> > -ed  
> 
> I've been thinking some more about this. While the idea of making
> fine-grained dump where user controls exact contents field-by-field is
> unfeasible due to performance considerations, we can try to come up with
> something more coarse-grained but not fully hardcoded (like current terse
> dump implementation). Something like having a set of flags that allows
> to skip output of groups of attributes.
> 
> For example, CLS_SKIP_KEY flag would skip the whole expensive classifier
> key dump without having to go through all 200 lines of conditionals in

Do you really need to dump classifiers? If you care about stats 
the actions could be sufficient if the offload code was fixed
appropriately... Sorry I had to say that.

> fl_dump_key() while ACT_SKIP_OPTIONS would skip outputting TCA_OPTIONS
> compound attribute (and expensive call to tc_action_ops->dump()). This
> approach would also leave the door open for further more fine-grained
> flags, if the need arises. For example, new flags
> CLS_SKIP_KEY_{L2,L3,L4} can be introduced to more precisely control
> which parts of cls key should be skipped.

L2, L3, etc. will be meaningless for a lot of classifiers.

> The main drawback of such approach is that it is impossible to come up
> with universal set of flags that would be applicable for all
> classifiers. Key (in some form) is applicable to most classifiers, but
> it still doesn't make sense for matchall or bpf. Some classifiers have
> 'flags', some don't. Hardware-offloaded classifiers have in_hw_count.
> Considering this, initial set of flags will be somewhat flower-centric.
> 
> What do you think?

Simplest heuristic is to dump everything that can't get changed without
a notification. Which I think you're quite close to already..
Vlad Buslov May 22, 2020, 4:16 p.m. UTC | #20
On Thu 21 May 2020 at 20:02, Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 21 May 2020 17:36:12 +0300 Vlad Buslov wrote:
>> Hi Edward, Cong,
>> 
>> On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote:
>> > On 15/05/2020 12:40, Vlad Buslov wrote:  
>> >> In order to
>> >> significantly improve filter dump rate this patch sets implement new
>> >> mode of TC filter dump operation named "terse dump" mode. In this mode
>> >> only parameters necessary to identify the filter (handle, action cookie,
>> >> etc.) and data that can change during filter lifecycle (filter flags,
>> >> action stats, etc.) are preserved in dump output while everything else
>> >> is omitted.  
>> > I realise I'm a bit late, but isn't this the kind of policy that shouldn't
>> >  be hard-coded in the kernel?  I.e. if next year it turns out that some
>> >  user needs one parameter that's been omitted here, but not the whole dump,
>> >  are they going to want to add another mode to the uapi?
>> > Should this not instead have been done as a set of flags to specify which
>> >  pieces of information the caller wanted in the dump, rather than a mode
>> >  flag selecting a pre-defined set?
>> >
>> > -ed  
>> 
>> I've been thinking some more about this. While the idea of making
>> fine-grained dump where user controls exact contents field-by-field is
>> unfeasible due to performance considerations, we can try to come up with
>> something more coarse-grained but not fully hardcoded (like current terse
>> dump implementation). Something like having a set of flags that allows
>> to skip output of groups of attributes.
>> 
>> For example, CLS_SKIP_KEY flag would skip the whole expensive classifier
>> key dump without having to go through all 200 lines of conditionals in
>
> Do you really need to dump classifiers? If you care about stats 
> the actions could be sufficient if the offload code was fixed
> appropriately... Sorry I had to say that.

Technically I need neither classifier nor action. All I need is cookie
and stats of single terminating action attached to filter (redirect,
drop, etc.). This can be achieved by making terse dump output that data
for last extension on filter. However, when I discussed my initial terse
dump idea with Jamal he asked me not to ossify such behavior to allow
for implementation of offloaded shared actions in future.

Speaking about shared action offload, I remember looking at some RFC
patches by Edward implementing such functionality and allowing action
stats update directly from act, as opposed to current design that relies
on classifier to update action stats from hardware. Is that what you
mean by appropriately fixing offload code? With such implementation,
just dumping relevant action types would also work. My only concern is
that the only way to dump actions is per-namespace as opposed to
per-Qdisc of filters, which would clash with any other cls/act users on
same machine/hypervisor.


[...]
Cong Wang May 22, 2020, 7:33 p.m. UTC | #21
On Wed, May 20, 2020 at 12:24 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Tue 19 May 2020 at 21:58, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >> I considered that approach initially but decided against it for
> >> following reasons:
> >>
> >> - Generic data is covered by current terse dump implementation.
> >>   Everything else will be act or cls specific which would result long
> >>   list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST,
> >>   TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ...,
> >>   TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of
> >>   these would require a lot of dedicated logic in act and cls dump
> >>   callbacks. Also, it would be quite a challenge to test all possible
> >>   combinations.
> >
> > Well, if you consider netlink dump as a database query, what Edward
> > proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather
> > than "select * from cls_db".
> >
> > No one said it is easy to implement, it is just more elegant than you
> > select a hardcoded set of columns for the user.
>
> As I explained to Edward, having denser netlink packets with more
> filters per packet is only part of optimization. Another part is not
> executing some code at all. Consider fl_dump_key() which is 200 lines
> function with bunch of conditionals like that:
>
> static int fl_dump_key(struct sk_buff *skb, struct net *net,
>                        struct fl_flow_key *key, struct fl_flow_key *mask)
> {
>         if (mask->meta.ingress_ifindex) {
>                 struct net_device *dev;
>
>                 dev = __dev_get_by_index(net, key->meta.ingress_ifindex);
>                 if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, dev->name))
>                         goto nla_put_failure;
>         }
>
>         if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
>                             mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
>                             sizeof(key->eth.dst)) ||
>             fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC,
>                             mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK,
>                             sizeof(key->eth.src)) ||
>             fl_dump_key_val(skb, &key->basic.n_proto, TCA_FLOWER_KEY_ETH_TYPE,
>                             &mask->basic.n_proto, TCA_FLOWER_UNSPEC,
>                             sizeof(key->basic.n_proto)))
>                 goto nla_put_failure;
>
>         if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls))
>                 goto nla_put_failure;
>
>         if (fl_dump_key_vlan(skb, TCA_FLOWER_KEY_VLAN_ID,
>                              TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan, &mask->vlan))
>                 goto nla_put_failure;
>     ...
>
>
> Now imagine all of these are extended with additional if (flags &
> TCA_DUMP_XXX). All gains from not outputting some other minor stuff into
> netlink packet will be negated by it.

Interesting, are you saying a bit test is as expensive as appending
an actual netlink attribution to the dumping? I am surprised.


>
>
> >
> > Think about it, what if another user wants a less terse dump but still
> > not a full dump? Would you implement ops->terse_dump2()? Or
> > what if people still think your terse dump is not as terse as she wants?
> > ops->mini_dump()? How many ops's we would end having?
>
> User can discard whatever he doesn't need in user land code. The goal of
> this change is performance optimization, not designing a generic
> kernel-space data filtering mechanism.

You optimize the performance by reducing the dump size, which is
already effectively a data filtering. This doesn't have to be your goal,
you are implementing it anyway.


>
> >
> >
> >>
> >> - It is hard to come up with proper validation for such implementation.
> >>   In case of terse dump I just return an error if classifier doesn't
> >>   implement the callback (and since current implementation only outputs
> >>   generic action info, it doesn't even require support from
> >>   action-specific dump callbacks). But, for example, how do we validate
> >>   a case where user sets some flower and tunnel_key act dump flags from
> >>   previous paragraph, but Qdisc contains some other classifier? Or
> >>   flower classifier points to other types of actions? Or when flower
> >>   classifier has and tunnel_key actions but also mirred? Should the
> >
> > Each action should be able to dump selectively too. If you think it
> > as a database, it is just a different table with different schemas.
>
> How is designing custom SQL-like query language (according to your
> example at the beginning of the mail) for filter dump is going to
> improve performance? If there is a way to do it in fast a generic manner
> with BPF, then I'm very interested to hear the details. But adding
> hundred more hardcoded conditionals is just not a solution considering
> main motivations for this change is performance.

I still wonder how a bit test is as expensive as you claim, it does
not look like you actually measure it. This of course depends on the
size of the dump, but if you look at other netlink dump in kernel,
not just tc filters, we already dump a lot of attributes per record.

Thanks.
Cong Wang May 22, 2020, 7:41 p.m. UTC | #22
On Thu, May 21, 2020 at 7:36 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Hi Edward, Cong,
>
> On Mon 18 May 2020 at 18:37, Edward Cree <ecree@solarflare.com> wrote:
> > On 15/05/2020 12:40, Vlad Buslov wrote:
> >> In order to
> >> significantly improve filter dump rate this patch sets implement new
> >> mode of TC filter dump operation named "terse dump" mode. In this mode
> >> only parameters necessary to identify the filter (handle, action cookie,
> >> etc.) and data that can change during filter lifecycle (filter flags,
> >> action stats, etc.) are preserved in dump output while everything else
> >> is omitted.
> > I realise I'm a bit late, but isn't this the kind of policy that shouldn't
> >  be hard-coded in the kernel?  I.e. if next year it turns out that some
> >  user needs one parameter that's been omitted here, but not the whole dump,
> >  are they going to want to add another mode to the uapi?
> > Should this not instead have been done as a set of flags to specify which
> >  pieces of information the caller wanted in the dump, rather than a mode
> >  flag selecting a pre-defined set?
> >
> > -ed
>
> I've been thinking some more about this. While the idea of making
> fine-grained dump where user controls exact contents field-by-field is
> unfeasible due to performance considerations, we can try to come up with
> something more coarse-grained but not fully hardcoded (like current terse
> dump implementation). Something like having a set of flags that allows
> to skip output of groups of attributes.
>
> For example, CLS_SKIP_KEY flag would skip the whole expensive classifier
> key dump without having to go through all 200 lines of conditionals in
> fl_dump_key() while ACT_SKIP_OPTIONS would skip outputting TCA_OPTIONS
> compound attribute (and expensive call to tc_action_ops->dump()). This
> approach would also leave the door open for further more fine-grained
> flags, if the need arises. For example, new flags
> CLS_SKIP_KEY_{L2,L3,L4} can be introduced to more precisely control
> which parts of cls key should be skipped.
>
> The main drawback of such approach is that it is impossible to come up
> with universal set of flags that would be applicable for all
> classifiers. Key (in some form) is applicable to most classifiers, but
> it still doesn't make sense for matchall or bpf. Some classifiers have
> 'flags', some don't. Hardware-offloaded classifiers have in_hw_count.
> Considering this, initial set of flags will be somewhat flower-centric.
>
> What do you think?

This looks like a reverse filtering to me, so essentially the same.
Please give me some time to think about this, it is definitely not
easy.

The only thing I worry is that once you add terse dump, we cannot
simply remove it any more. (Otherwise I wouldn't even want to push
you on this.)

Thanks.
Jamal Hadi Salim May 23, 2020, 11:06 a.m. UTC | #23
On 2020-05-22 12:16 p.m., Vlad Buslov wrote:
> On Thu 21 May 2020 at 20:02, Jakub Kicinski <kuba@kernel.org> wrote:
>> On Thu, 21 May 2020 17:36:12 +0300 Vlad Buslov wrote:
>>> Hi Edward, Cong,
>>>

>> Do you really need to dump classifiers? If you care about stats
>> the actions could be sufficient if the offload code was fixed
>> appropriately... Sorry I had to say that.
> 
> Technically I need neither classifier nor action. All I need is cookie
> and stats of single terminating action attached to filter (redirect,
> drop, etc.). This can be achieved by making terse dump output that data
> for last extension on filter. However, when I discussed my initial terse
> dump idea with Jamal he asked me not to ossify such behavior to allow
> for implementation of offloaded shared actions in future.
> 


Trying to recollect our discussion (please forgive me if i am
rehashing). old skule hardware model typically is ACL style with one
action - therefore concept of tying a counter with with
a classifier is common.

Other hardware i am familiar with tends to have a table of counters.
More an array with indices.
In the shared case using the same counter index from multiple
tables implies it is shared. Note "old skule" does not have
a concept of sharing.

So i was more worried about assuming the "old skule" model
at the expense of other hardware models.
We should be able to dump different tables from hardware.
Mostly these could be tables of actions. And counter tables
look like a gact action.

 From s/w:
If what is needed is to just dump explicit stats a gact
action with a cookie and an index would suffice.
i.e tc match foobar \
action continue cookie blah index 15 \
action ...
action mirred redirect ...

of course this now adds extra cycles in the s/w datapath but
advantage is it means you can cheaply either get
individual counters (get action gact index 15) or dump all gact actions
and filter in user space for your cookie. Or introduce a dump
cookie filter in the kernel (similar to the "time since" action
dump filter).

cheers,
jamal

> Speaking about shared action offload, I remember looking at some RFC
> patches by Edward implementing such functionality and allowing action
> stats update directly from act, as opposed to current design that relies
> on classifier to update action stats from hardware. Is that what you
> mean by appropriately fixing offload code? With such implementation,
> just dumping relevant action types would also work. My only concern is
> that the only way to dump actions is per-namespace as opposed to
> per-Qdisc of filters, which would clash with any other cls/act users on
> same machine/hypervisor.
> 
> 
> [...]
>
Vlad Buslov May 25, 2020, 11:38 a.m. UTC | #24
On Fri 22 May 2020 at 22:33, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, May 20, 2020 at 12:24 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Tue 19 May 2020 at 21:58, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Tue, May 19, 2020 at 2:04 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >> I considered that approach initially but decided against it for
>> >> following reasons:
>> >>
>> >> - Generic data is covered by current terse dump implementation.
>> >>   Everything else will be act or cls specific which would result long
>> >>   list of flag values like: TCA_DUMP_FLOWER_KEY_ETH_DST,
>> >>   TCA_DUMP_FLOWER_KEY_ETH_DST, TCA_DUMP_FLOWER_KEY_VLAN_ID, ...,
>> >>   TCA_DUMP_TUNNEL_KEY_ENC_KEY_ID, TCA_DUMP_TUNNEL_KEY_ENC_TOS. All of
>> >>   these would require a lot of dedicated logic in act and cls dump
>> >>   callbacks. Also, it would be quite a challenge to test all possible
>> >>   combinations.
>> >
>> > Well, if you consider netlink dump as a database query, what Edward
>> > proposed is merely "select COLUMN1 COLUMN2 from cls_db" rather
>> > than "select * from cls_db".
>> >
>> > No one said it is easy to implement, it is just more elegant than you
>> > select a hardcoded set of columns for the user.
>>
>> As I explained to Edward, having denser netlink packets with more
>> filters per packet is only part of optimization. Another part is not
>> executing some code at all. Consider fl_dump_key() which is 200 lines
>> function with bunch of conditionals like that:
>>
>> static int fl_dump_key(struct sk_buff *skb, struct net *net,
>>                        struct fl_flow_key *key, struct fl_flow_key *mask)
>> {
>>         if (mask->meta.ingress_ifindex) {
>>                 struct net_device *dev;
>>
>>                 dev = __dev_get_by_index(net, key->meta.ingress_ifindex);
>>                 if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, dev->name))
>>                         goto nla_put_failure;
>>         }
>>
>>         if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
>>                             mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
>>                             sizeof(key->eth.dst)) ||
>>             fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC,
>>                             mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK,
>>                             sizeof(key->eth.src)) ||
>>             fl_dump_key_val(skb, &key->basic.n_proto, TCA_FLOWER_KEY_ETH_TYPE,
>>                             &mask->basic.n_proto, TCA_FLOWER_UNSPEC,
>>                             sizeof(key->basic.n_proto)))
>>                 goto nla_put_failure;
>>
>>         if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls))
>>                 goto nla_put_failure;
>>
>>         if (fl_dump_key_vlan(skb, TCA_FLOWER_KEY_VLAN_ID,
>>                              TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan, &mask->vlan))
>>                 goto nla_put_failure;
>>     ...
>>
>>
>> Now imagine all of these are extended with additional if (flags &
>> TCA_DUMP_XXX). All gains from not outputting some other minor stuff into
>> netlink packet will be negated by it.
>
> Interesting, are you saying a bit test is as expensive as appending
> an actual netlink attribution to the dumping? I am surprised.

It is not just adding a clause to all those conditionals. Some functions
are not called at all with current terse dump design. In the case of
fl_dump_key() it is just a bunch of conditionals (and maybe price of
cache misses to access struct fl_flow_key in a first place). In case of
tc_action_ops->dump() it is also obtaining a spinlock, some atomic ops,
etc. But I agree, "negated" is too strong of a word, "significantly
impacted" is more correct.

>
>
>>
>>
>> >
>> > Think about it, what if another user wants a less terse dump but still
>> > not a full dump? Would you implement ops->terse_dump2()? Or
>> > what if people still think your terse dump is not as terse as she wants?
>> > ops->mini_dump()? How many ops's we would end having?
>>
>> User can discard whatever he doesn't need in user land code. The goal of
>> this change is performance optimization, not designing a generic
>> kernel-space data filtering mechanism.
>
> You optimize the performance by reducing the dump size, which is
> already effectively a data filtering. This doesn't have to be your goal,
> you are implementing it anyway.
>
>
>>
>> >
>> >
>> >>
>> >> - It is hard to come up with proper validation for such implementation.
>> >>   In case of terse dump I just return an error if classifier doesn't
>> >>   implement the callback (and since current implementation only outputs
>> >>   generic action info, it doesn't even require support from
>> >>   action-specific dump callbacks). But, for example, how do we validate
>> >>   a case where user sets some flower and tunnel_key act dump flags from
>> >>   previous paragraph, but Qdisc contains some other classifier? Or
>> >>   flower classifier points to other types of actions? Or when flower
>> >>   classifier has and tunnel_key actions but also mirred? Should the
>> >
>> > Each action should be able to dump selectively too. If you think it
>> > as a database, it is just a different table with different schemas.
>>
>> How is designing custom SQL-like query language (according to your
>> example at the beginning of the mail) for filter dump is going to
>> improve performance? If there is a way to do it in fast a generic manner
>> with BPF, then I'm very interested to hear the details. But adding
>> hundred more hardcoded conditionals is just not a solution considering
>> main motivations for this change is performance.
>
> I still wonder how a bit test is as expensive as you claim, it does
> not look like you actually measure it. This of course depends on the
> size of the dump, but if you look at other netlink dump in kernel,
> not just tc filters, we already dump a lot of attributes per record.
>
> Thanks.

I agree that I didn't specify which parts of the change constitute what
fraction of the dump rate increase. Lets stage a simple test to verify
the cost of calling just two functions (fl_dump_key() and
tc_act_ops->dump() callback) and instantly discarding their results from
packet (patch attached).
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 8ac7eb0a8309..267ee76d3ddb 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -771,6 +771,9 @@ tcf_action_dump_terse(struct sk_buff *skb, struct tc_action *a)
 {
 	unsigned char *b = skb_tail_pointer(skb);
 	struct tc_cookie *cookie;
+	unsigned char *c;
+	struct nlattr *nest;
+	int err;
 
 	if (nla_put_string(skb, TCA_KIND, a->ops->kind))
 		goto nla_put_failure;
@@ -787,6 +790,16 @@ tcf_action_dump_terse(struct sk_buff *skb, struct tc_action *a)
 	}
 	rcu_read_unlock();
 
+	c = skb_tail_pointer(skb);
+	nest = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (nest == NULL)
+		goto nla_put_failure;
+	err = tcf_action_dump_old(skb, a, 0, 0);
+	if (err > 0) {
+		nla_nest_end(skb, nest);
+	}
+	nlmsg_trim(skb, c);
+
 	return 0;
 
 nla_put_failure:
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 0c574700da75..1bc6294c5c9b 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -2771,8 +2771,10 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
 static int fl_terse_dump(struct net *net, struct tcf_proto *tp, void *fh,
 			 struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
+	struct fl_flow_key *key, *mask;
 	struct cls_fl_filter *f = fh;
 	struct nlattr *nest;
+	unsigned char *b;
 	bool skip_hw;
 
 	if (!f)
@@ -2786,8 +2788,15 @@ static int fl_terse_dump(struct net *net, struct tcf_proto *tp, void *fh,
 
 	spin_lock(&tp->lock);
 
+	key = &f->key;
+	mask = &f->mask->key;
 	skip_hw = tc_skip_hw(f->flags);
 
+	b = skb_tail_pointer(skb);
+	if (fl_dump_key(skb, net, key, mask))
+		goto nla_put_failure_locked;
+	nlmsg_trim(skb, b);
+
 	if (f->flags && nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags))
 		goto nla_put_failure_locked;
Result for terse dumping 1m simple rules (flower with L2 key + gact
drop) on current net-next:

$ time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null

real    0m3.445s
user    0m2.087s
sys     0m1.298s

With patch applied:

$ time sudo tc -s filter show terse dev ens1f0_0 ingress >/dev/null

real    0m5.035s
user    0m3.289s
sys     0m1.687s

As we can see this leads to 30% overhead in kernel space execution time.

Now with more complex rules (flower 5tuple + act tunnel key + act
mirred) on current net-next:

$ time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null

real    0m4.052s
user    0m2.065s
sys     0m1.937s

Same rules with patch applied:

$ time sudo tc -s filter show terse dev ens1f0_0 ingress >/dev/null

real    0m6.346s
user    0m3.166s
sys     0m3.108s

With more complex rules performance impact on kernel space execution
time get more severe (60%). Overall, this looks like significant
degradation.