STABILITY OF NEURAL ORDINARY DIFFERENTIAL EQUATIONS WITH POWER NONLINEARITIES

The article presents a study of solutions of a system of ODEs with a special nonlinear part, which is a continuous analogue of an arbitrary recurrent neural network (neural ODEs). As the nonlinear part of the mentioned system of differential equations, we use sums of piecewise continuous functions, where each term is a power function. (These are the activation functions.) The use of power activation functions (PAFs) in neural networks is a generalization of the well-known rectified linear units (ReLU). At present, ReLU are commonly used to increase the depth of trained neural networks. Therefore, the introduction of PAFs into neural networks significantly expands the possibilities of ReLU. Note that the purpose of introducing power activation functions is that they allow one to obtain verifiable Lyapunov stability conditions for solutions of the system of differential equations simulating the corresponding dynamic processes. In turn, Lyapunov stability is one of the guarantees of the adequacy of the neural network model for the process under study. In addition, the global stability (or at least the boundedness) of solutions of the continuous analogue implies that the learning process of the corresponding neural network will not diverge for any training sample.


Introduction
One of the fundamental goals of machine learning is modeling and understanding real-world phenomena from observations. In the process of modeling, two main problems are considered: (i) the problem of minimizing deviations between the trajectory of the model and the real dynamics demonstrated by the system under study; (ii) the problem of increasing the rate of convergence of the deviation-minimization procedure solved in problem (i). However, there is also a third problem: (iii) if a neural network models a certain dynamic process, how can one guarantee the stability or boundedness of solutions of the system of differential equations describing a continuous analogue of the aforementioned neural network?
The solution of all three problems substantially depends on the choice of activation functions included in the architecture of a neural network. In this work we primarily study problem (iii) with respect to the choice of activation functions in the context of modeling dynamical systems.
A wide variety of naturally occurring phenomena can be approximated by power laws: from the frequency distribution of words in natural language [11] to the size distribution of neuronal avalanches in cortical networks [8,9,15]. Moreover, it is common for the equations of chaotic processes to contain unbounded nonlinearities while describing a bounded and stable process [1,3,4,10,15]. Therefore, it is only natural to question the fit and expressivity of power functions as activation functions. In order to solve the numerous problems associated with the stability of neural networks, it is necessary to expand the class of activation functions. At present, this set already includes unbounded activation functions [13]. In this regard, the creation of neural networks with unbounded activation functions requires guaranteeing the stability or boundedness conditions of solutions of the equations describing the neural networks [6]-[9].
The key contributions of this article are:
• Section 3 introduces the concept of a piecewise continuous power activation function (PAF) and presents its main properties;
• the conditions for the global stability of neural ODEs with PAFs, found with the help of the well-known method of Lyapunov functions, are given in Sections 4 and 5. Note that checking these conditions reduces to verifying the negative definiteness of two known matrices (a numerical sketch of such a check is given below).
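As a minimal illustration of how such a negative-definiteness check can be carried out numerically (the matrices themselves, for example the matrix S of Section 4, are defined later in the article; the matrix used below is an arbitrary placeholder):

```python
import numpy as np

def is_negative_definite(S: np.ndarray, tol: float = 0.0) -> bool:
    """Check negative definiteness of a (symmetric) matrix via its spectrum."""
    S = np.asarray(S, dtype=float)
    S_sym = (S + S.T) / 2.0               # symmetrize to guard against round-off
    return bool(np.linalg.eigvalsh(S_sym).max() < -tol)

# Placeholder example: diagonally dominant with a negative diagonal.
S = np.array([[-2.0, 0.5, 0.3],
              [ 0.5, -3.0, 0.4],
              [ 0.3,  0.4, -1.5]])
print(is_negative_definite(S))            # True
```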
In the future, these results will be used to construct new types of chaotic neural networks.

The relationship between neural networks and differential equations
In recent years, an interesting idea has appeared to interpret a system of ordinary differential equations in the form of a suitable neural network (residual network) [3] - [6]. The essence of this idea is as follows.
Consider the following neural network (this is a system of difference equations): x(t + 1) = x(t) + H(x(t), Ω), x(0) = x_0; t = 1, ..., N. (1.1) Here x ∈ R^n is a vector of states, Ω ∈ R^k is a vector of parameters, and H(x, Ω) : R^n × R^k → R^n is a vector field of continuous functions. (The number N in neurodynamics denotes the number of "layers" in the neural network (1.1).) Now we rewrite relation (1.1) in the following form: x(t + 1) − x(t) = H(x(t), Ω).
If we consider the function x(t) as a function of a continuous argument on some interval [x_0, x_N], then the last equation can be rewritten in the following form: (x(t + ∆t) − x(t))/∆t = H(x(t), Ω).
If we now let the number of "layers" N → ∞ and assume ∆t → 0, then we get the following system of ordinary differential equations: ẋ(t) = H(x(t), Ω), x(0) = x_0. (1.2) So we can say that neural network (1.1) is the result of the well-known Euler discretization procedure applied to system (1.2): x(t + ∆t) − x(t) = ∆t · H(x(t), Ω), x(0) = x_0, (1.3) where ∆t is the discretization step. It is clear that sequence (1.1) can be viewed as a neural network with N − 1 hidden layers, input layer x_0 and output layer x_N. The architecture of such a neural network is determined by the vector H(x(t), Ω) [4,5].
Thus, in some cases, we can replace the study of neural network (1.1) with the study of its continuous analogue (1.2). In this case, the stability of neural network (1.1) as N → ∞ will follow from the stability of the system of differential equations (1.2). (If it is known that the process described by equation (1.2) converges, then by increasing the number of layers N of neural network (1.1) or decreasing the step ∆t in equation (1.3), we can achieve convergence of the learning process of this neural network [4,5].) Therefore, in what follows, this article is devoted to the study of the stability of system (1.2), which in neurodynamics is called a system of neural ODEs [4,5]. (Note that a new application of neural ODEs using invariant theory was proposed in [2].)
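For readers who prefer code, the correspondence between the residual network (1.1) and the Euler scheme (1.3) can be sketched as follows; this is a minimal illustration only, and the vector field H used here is a generic stand-in (a tanh layer), not the power-nonlinear right-hand sides studied later in the article.

```python
import numpy as np

def H(x, W):
    """A toy vector field H(x, Omega); here Omega is a single weight matrix W."""
    return np.tanh(W @ x)

def euler_residual_net(x0, W, dt=0.05, n_layers=200):
    """Forward Euler for x' = H(x, Omega), i.e. the residual update (1.1)/(1.3):
    each "layer" performs x <- x + dt * H(x, Omega)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_layers):             # n_layers plays the role of N
        x = x + dt * H(x, W)
    return x

W = np.array([[-1.0, 0.5],
              [ 0.2, -0.8]])
print(euler_residual_net([1.0, -2.0], W))
```

As dt → 0 and n_layers → ∞ (with dt · n_layers fixed), the output of this loop approaches the value of the solution of (1.2) at the time dt · n_layers.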

Mathematical preliminaries
Now we give three well-known results of the nonlinear theory of systems, which will be used later.
Let ẋ(t) = G(x(t)) (2.1) be a system of ordinary autonomous differential equations and let x(t, x_0) be a trajectory of this system with initial data x_0 ∈ R^n. Here G(x) : R^n → R^n is a continuous vector function. In what follows, we will assume that the point x = 0 is an equilibrium point of system (2.1).
Definition 2.1. [9] The equilibrium point x = 0 of system (2.1) is said to be stable if, for each ε > 0, there is δ = δ(ε) > 0 such that ||x_0|| < δ implies ||x(t, x_0)|| < ε for all t ≥ 0; asymptotically stable if it is stable and δ can be chosen such that ||x_0|| < δ ⇒ lim_{t→∞} x(t, x_0) = 0.
Theorem 2.1. (Lyapunov's Theorem [9]). Let x = 0 be an equilibrium point for system (2.1) and let D ⊂ R^n be a domain containing x = 0. Let V : R^n → R be a continuously differentiable function such that V(0) = 0, V(x) > 0 in D − {0}, and V̇(x) ≤ 0 in D. Then x = 0 is stable. Moreover, if V̇(x) < 0 in D − {0}, then the equilibrium point x = 0 is asymptotically stable, and if it is unique, then x = 0 is globally asymptotically stable.
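As a simple illustration with a power nonlinearity: for the scalar equation ẋ = −x^3, the function V(x) = x^2/2 satisfies V(0) = 0, V(x) > 0 for x ≠ 0, and V̇(x) = x · ẋ = −x^4 < 0 for x ≠ 0; since x = 0 is the unique equilibrium point, it is globally asymptotically stable.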
Definition 2.2. [9] A continuously differentiable function V(x) satisfying the conditions of Lyapunov's Theorem is called a Lyapunov function.
A set M ⊂ R^n is said to be a positively invariant set with respect to (2.1) if x_0 ∈ M implies x(t, x_0) ∈ M for all t ≥ 0.
A point s ∈ R^n is said to be a positive limit point of x(t, x_0) if there is a sequence {t_k}, with t_k → ∞ as k → ∞, such that x(t_k, x_0) → s as k → ∞. The set L^+ of all positive limit points of x(t, x_0) is called the positive limit set of x(t, x_0).
Let D ⊂ R^n be a compact set and let the solution x(t, x_0) be bounded and belong to D. Then its positive limit set L^+ is a nonempty, compact, invariant set. Moreover, x(t, x_0) → L^+ as t → ∞ (see Lemma 3.1 in [9]).
From here the following result can be obtained.
Theorem 2.2. (LaSalle's Theorem [9]). Let H ⊂ R^n be a compact set that is positively invariant with respect to (2.1). Let V : R^n → R be a continuously differentiable function such that V̇(x) ≤ 0 in H. Let E be the set of all points in H where V̇(x) = 0. Let M be the largest invariant set in E. Then every solution starting in H approaches M as t → +∞.
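For example (a standard illustration), for the planar system ẋ_1 = x_2, ẋ_2 = −x_1 − x_2^3, which again contains a power nonlinearity, the function V(x) = (x_1^2 + x_2^2)/2 gives V̇(x) = x_1x_2 − x_1x_2 − x_2^4 = −x_2^4 ≤ 0, so every sublevel set of V can serve as the compact positively invariant set H; here E = {x : x_2 = 0}, and on an invariant subset of E we must have ẋ_2 = −x_1 = 0, so M = {0} and every solution approaches the origin.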
In the future, one more important result will be used.
Theorem 2.3. (Comparison Principle [9]). Consider the real scalar differential equation v̇(t) = g(t, v), v(t_0) = v_0, where the function g(t, v) is continuous in t and locally Lipschitz in v for all t ≥ 0 and all v ∈ R. Let [t_0, ∞) be the interval of existence of the solution v(t). Let w(t) be a differentiable function that satisfies the differential inequality ẇ(t) ≤ g(t, w(t)) with w(t_0) ≤ v_0. Then w(t) ≤ v(t) for all t ∈ [t_0, ∞).
Several special lemmas will also be used in the proof of the main results.
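As a simple illustration of the Comparison Principle (Theorem 2.3): if a differentiable function w(t) satisfies ẇ(t) ≤ −w(t) with w(0) = w_0, then comparison with the solution v(t) = w_0 · e^{−t} of v̇ = −v, v(0) = w_0, gives w(t) ≤ w_0 · e^{−t} for all t ≥ 0; estimates of this type are applied below to the Lyapunov function V along trajectories.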
Lemma 2.1. Suppose that real numbers A, B, α, β, p, q, x, and y satisfy the following conditions; then inequality (2.2) is true. In addition, if p = 0 (or q = 0), then inequality (2.2) simplifies accordingly.
Introduce the variable u = x/y > 0. Then inequality (2.2) can be rewritten as an inequality for a function f(u). It is easy to check that the function f(u) has only one positive root: u = 1. Simple calculations show that the point u = 1 is the single minimum of the function f(u) on the interval (0, ∞). The last statement leads to the inequality u^{α+β} + 1 ≥ u^α + u^β ≥ p_1 u^α + q_1 u^β, where 0 ≤ p_1 ≤ 1, 0 ≤ q_1 ≤ 1.
2. Now let u = y/x ≤ 1. Divide both sides of inequality (2.5) by the positive quantity x^r. It is easy to check that if r ≥ k_1 + m_1, then r ≤ k_2 + m_2. Since u = y/x ≤ 1, we have 1 − u^{m_2} ≥ 1 − u^{r−k_2}, and inequality (2.7) is obvious. The proof of the case u = y/x ≥ 1 is similar to the proof given in item 1.
Let u ≠ 0. Then we can consider the equation p(v) = q(v). Since p(0) = 0 and q(0) = γ_2 > 0, v = 1 is the only positive solution of the equation p(v) = q(v). This means that u^{γ_3} = 1, and the point u = 1 is the single minimum of the function f(u) on the interval (0, ∞). Inequality (2.9) for i = 1, j = 2 is proved. The proof of the remaining inequalities for 1 ≤ i < j ≤ n is similar.
Definition 3.1. The fraction r = m/n is called odd if the numbers m and n are odd. (Otherwise, the fraction r is called even.) The function f(x, α ∨ β) is called a strictly odd activation function if α and β are odd. (If at least one of the numbers α or β is even, then the function f(x, α ∨ β) is called a strictly even activation function.) From an applied point of view, the representation of the function f(x, α ∨ β) in the form (3.1) is not very convenient. In order to make this representation more suitable for applications, we prove the following lemma.
Lemma 3.1. The fraction r can be approximated with any preassigned accuracy by either an odd or even fraction.
Thus, we again obtain a fraction of the required form. Further reasoning is obvious. This completes the proof of Lemma 3.1.
Thus, any irreducible rational even fraction can be represented, with any preassigned accuracy, by a rational odd fraction and vice versa.
Since any finite or periodic decimal fraction can be represented as a rational fraction, Lemma 3.1 allows us to consider odd and even activation functions f(x, α ∨ β) with powers in the form of decimal fractions.
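A small computational sketch in the spirit of Lemma 3.1 (the search routine below is a hypothetical illustration, not the constructive procedure from the lemma's proof): given a decimal exponent, one can look for a nearby fraction with an odd numerator and an odd denominator.

```python
from fractions import Fraction

def odd_fraction_approx(p: float, eps: float = 1e-3, max_den: int = 10**6):
    """Search for a fraction m/n with m and n odd such that |m/n - p| < eps."""
    n = 1
    while n <= max_den:                   # iterate over odd denominators only
        m = round(p * n)
        if m % 2 == 0:                    # shift to the nearest odd numerator
            m += 1 if p * n >= m else -1
        if abs(m / n - p) < eps:
            return Fraction(m, n)
        n += 2
    return None

print(odd_fraction_approx(2.0))           # -> 2003/1001, an odd fraction within 1e-3 of 2
```

With a looser tolerance one obtains smaller odd fractions close to 2, in the spirit of the comparison of x^2 and x^{201/99} discussed below.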
Thus, in computer simulation, instead of (3.1), we can use representation (3.3) of the activation function, where α ≥ 0, β ≥ 0 are finite decimal fractions. Similarly, instead of (3.2), we can use representation (3.4). It is clear that representations (3.3) and (3.4) generalize representations (3.1) and (3.2). Therefore, the following definition is justified.
Definition 3.2. A function of the form (3.3) (respectively, (3.4)) is called an odd (even) activation function.
The last remark allows us to use the tools of well-known mathematical packages to calculate power functions. Below, these packages were used to plot several activation functions. (It should be noted that these graphs differ substantially from each other for 0 < α < 1, 0 < β < 1 and for α > 1, β > 1. This difference is the main reason for the different approaches used when proving stability in the case 0 < α < 1, 0 < β < 1 and in the case α > 1, β > 1.)
Important note. Functions (3.3) and (3.4) may have properties that functions (3.1) and (3.2) do not have. For example, f(x) = x^2 is an even function, while the function f(x) = x^{201/99} close to it is odd. The classical view of whether a function f(x) = x^p is even or odd is that the number p must be a positive integer. We have tried to extend these concepts to arbitrary (not necessarily integer) positive powers. In our work, evenness or oddness is defined as follows: any power function f(x) = x^p is nonnegative at x ≥ 0, while at x < 0 the function f(x) = x^p is either positive (even) or negative (odd). A natural question arises: why are such functions the subject of research in this article? In our opinion, the most comprehensive answer is given in [13]. It was said in [13] that ReLU [14]-[16] became a new building block of deep neural networks instead of traditional bounded activation functions. Therefore, we hope that generalizations (3.1) (or (3.3)) and (3.2) (or (3.4)) will lead to deeper results than those obtained using ReLU.
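For computer experiments, one possible implementation of a two-exponent power activation is sketched below. The piecewise rule used here (exponent α on the nonnegative half-line, exponent β on the negative half-line, with the sign of the negative branch fixed by the odd/even convention) is an assumed reading of (3.3) and (3.4) consistent with the Important note above, not a verbatim transcription of those formulas.

```python
import numpy as np

def paf(x, alpha, beta, odd=True):
    """Hypothetical piecewise power activation f(x, alpha v beta):
    |x|**alpha for x >= 0 and +/-|x|**beta for x < 0, the sign being chosen
    by the odd/even convention. An assumed illustration of (3.3)/(3.4)."""
    x = np.asarray(x, dtype=float)
    pos = np.where(x >= 0, np.abs(x) ** alpha, 0.0)
    neg = np.where(x < 0, np.abs(x) ** beta, 0.0)
    return pos + (-neg if odd else neg)

x = np.linspace(-2.0, 2.0, 5)
print(paf(x, 2.0, 2.0, odd=False))        # behaves like x**2 (an even function)
print(paf(x, 201/99, 201/99, odd=True))   # close to it but odd, cf. the remark above
```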

Even and odd activation functions
The concepts of even and odd activation functions introduced above can be extended as follows.
Let g(x) and h(x) be two nonnegative increasing functions continuously differentiable on the interval (r, ∞). Suppose also that each of these functions has a single root

Consider the following piecewise continuous functions (3.5) and (3.6) on an open interval.
Definition 3.3. A function of the form (3.5) (respectively, (3.6)) is called an extended odd (extended even) activation function.
Let the root x * = r be the same for all odd and even functions. Then it is easy to check the following properties of these functions: (i) the product of two odd or even functions is even.
(ii) the product of even and odd functions is odd.
(iii) the derivative of an even function is odd and vice versa.
In what follows, we will assume that r = 0 and that g(x) = ax^α, h(x) = bx^β are power activation functions. Here a > 0, b > 0, α > 0, and β > 0. In the future, systems ẋ(t) = H(x) (x ∈ R^n, H : R^n → R^n) of neural ODEs of order n of the following two types will be considered.
(T1) Each of the n equations of systemẋ(t) = H(x) includes no more than n power functions. These functions are the same for each equation of the system.
(T2) Each of the n equations of systemẋ(t) = H(x) includes only one power function. Moreover, all the functions mentioned are, generally speaking, different.
Thus, in both cases (T1) and (T2), the systemẋ(t) = H(x) contains no more than n different power functions.
The rationale for introducing such types of ODEs is as follows. All neurons of a neural network located in one layer are divided into n groups, each of which contains no more than n neurons.
In the case of (T1), the signals entering the neurons of each group are transformed according to different power laws. At the same time, these laws are the same for different groups.
In the case of (T2), the signals entering the neurons of each group are transformed only according to one power law. At the same time, these laws are, generally speaking, different for different groups.
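A schematic sketch of the two architectures may help fix ideas. The concrete systems (4.1) and (5.1) are introduced below; the particular right-hand-side form used here (a weighted sum of power activations of the state components, with a weight matrix B) is only an assumed illustration of the grouping described above.

```python
import numpy as np

def phi(x, k, exponents):
    """A single-exponent odd power activation, a stand-in for f(., alpha_k v beta_k)."""
    return np.sign(x) * np.abs(x) ** exponents[k]

def rhs_T1(x, B, exponents):
    """(T1): every equation uses the same n activations phi_1, ..., phi_n,
    e.g. dx_i/dt = sum_j B[i, j] * phi_j(x_j)."""
    return B @ np.array([phi(x[j], j, exponents) for j in range(len(x))])

def rhs_T2(x, B, exponents):
    """(T2): the i-th equation uses only its own activation phi_i,
    e.g. dx_i/dt = sum_j B[i, j] * phi_i(x_j)."""
    return np.array([B[i] @ phi(x, i, exponents) for i in range(len(x))])

exps = np.array([1.5, 0.5, 3.0])
B = -np.eye(3) + 0.1 * np.ones((3, 3))
x0 = np.array([0.2, -1.0, 0.5])
print(rhs_T1(x0, B, exps))
print(rhs_T2(x0, B, exps))
```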
In the future, two different algorithms for learning neural networks will be designed. One of them will be based on property (T1) and the other on property (T2). Here it is important to find out for which of the algorithms the learning problem is solved more efficiently. (For example, for which of the algorithms a higher rate of convergence can be achieved.) The rest of the article will be devoted to the study of the stability problem.
Then any solution of system (4.1) is bounded. In addition, if in case (i) A = 0 and c = 0 or in case (ii) B = 0 and c = 0, then the equilibrium point 0 is globally asymptotically stable.
We define the candidate for the role of the Lyapunov function for system (4.1) with c = 0 as the real function V(x_1, ..., x_n) given by the following rule, where ∀i ∈ {1, ..., n}:
(i1) The case of a strictly odd function f(x_i, α_i ∨ β_i) and c = 0; i = 1, ..., n.
Since the fractions β_i + 1 and α_i + 1 have an even numerator and an odd denominator, the function V(x_1, ..., x_n) is positive definite.
Further, from the definition of the function V(x_1, ..., x_n) and system (4.1) it follows that V̇ ... Introduce the norm of the matrix Q = {q_ij} ∈ R^{n×n} by the following formula: ... Similarly, we define the norm of the vector u = (u_1, ..., u_n)^T: ... Now we estimate the derivative V̇_t(x_1(t), ..., x_n(t)) of the function V(x_1(t), ..., x_n(t)), taking into account the fact that the matrix S is negative definite: ... where λ_max(S) denotes the maximal eigenvalue of the symmetric matrix S.
Denote by S the ball centered at the origin whose surface passes through the point x(T_u). In this case H ⊂ S. Therefore, by virtue of (4.6) and according to LaSalle's Theorem, there is a moment T_s > T_u such that x(T_s) ∈ H. Again we get that a solution x(t) of system (4.2) starting in S remains in S. Now we show that system (4.2) with the strictly odd activation functions f(x_i, α_i ∨ β_i), i = 1, ..., n, can be reduced to the case c = 0.
Then instead of inequality (4.4) we have inequality (4.11). To study the solutions of inequality (4.11), we use the same reasoning as in the study of inequality (4.4). Then, taking into account (4.10), in the integrand of formula (4.5) we obtain the corresponding estimate. Now the completion of the proof of item (i) in the case c ≠ 0 completely repeats the proof in the case c = 0.
Consider the differential equation ẋ(t) = x^γ(t), where the function x^γ is odd. Introduce the function V(x) = x^{γ+1} = x^γ · x, where the function x is odd. It is clear that the function V(x) is even. We have V̇_t = (γ + 1)x^γ · ẋ(t) = (γ + 1)x^{2γ}, i.e., a product of two odd functions. Thus, the function V̇_t is also even. From this it follows that all the results obtained in item (i1) for strictly odd functions remain valid for odd functions.
We return to equation (4.7). We assume that the activation functions in this equation are odd. Then, according to Lemma 3.1, any odd function with parameters α_o and β_o can be approximated with any given accuracy ε > 0 by a strictly odd function whose parameters are ordinary fractions α_so and β_so. Let x*_so(α_so ∨ β_so) ∈ R^n be a root of system (4.7) with strictly odd activation functions. Then, by the principle of continuous dependence of the solution of system (4.7) on the parameters α and β, there must be a solution x*_o(α_o ∨ β_o) ∈ R^n of system (4.7) with odd activation functions that is arbitrarily close to x*_so(α_so ∨ β_so). This completes the proof of item (i4) and of the whole item (i).
It is clear that the set H ⊂ R^n of all points satisfying the condition −aV(x_1, ..., x_n) + bH(x_1, ..., x_n) ≤ 0 is a compact positively invariant set with respect to (4.2). Therefore, according to LaSalle's Theorem, if a solution x(t) of system (4.2) belongs to H, then it is bounded. This means that the solution V(x_1, ..., x_n) of equation (4.13) must also be bounded.
We use the Comparison Principle. Now it again remains to compare the solution V(x_1, ..., x_n) of equation (4.12) with the similar solution of equation (4.13). From this, the boundedness of the solution x(t) of system (4.2) follows for any initial condition x_0 ∈ R^n. Finally, if B = 0 and c = 0, then H = 0. In this case system (4.2) has only the zero equilibrium point, and any trajectory of this system is attracted to the origin (see again Lyapunov's Theorem). This completes the proof of case (ii1).
The further proof for c ≠ 0 repeats the proof of item (i3) in Theorem 4.1.
If we replace α_i < 1 and β_i < 1 with α_i > 1 and β_i > 1, then the proofs of these items repeat verbatim the proofs of items (i3) and (i4).
(iii) In general (with the exception of a few minor details), the proof of item (iii) repeats the proofs of items (i) and (ii). As before, we again define the candidate for the role of the Lyapunov function for system (4.2) with c = 0 as a real function V(x_1, ..., x_n) constructed from two functions V_1(x_1, ..., x_n) and V_2(x_1, ..., x_n). It is clear that the functions V_1(x_1, ..., x_n) and V_2(x_1, ..., x_n) are positive definite. Then, as follows from the conditions of Theorem 4.1, the functions V̇_1t(x_1(t), ..., x_n(t)) and V̇_2t(x_1(t), ..., x_n(t)) are negative definite. Hence the function V̇_t(x_1(t), ..., x_n(t)) will also be negative definite. Therefore, the function V(x_1, ..., x_n) is a Lyapunov function. Thus, item (iii) of Theorem 4.1 is a simple corollary of items (i) and (ii).
If we follow the logic of the proofs of items (i) and (ii), then the following positive definite function U(x_1, ..., x_n) could claim the role of the Lyapunov function. It is clear that all n terms of the function U(x_1, ..., x_n) are some of the terms of the function V(x_1, ..., x_n). However, the negative definiteness of the function V̇_t(x_1(t), ..., x_n(t)) does not imply the negative definiteness of the function U̇_t(x_1(t), ..., x_n(t)). (Only the inequality V̇_t(x_1(t), ..., x_n(t)) ≤ U̇_t(x_1(t), ..., x_n(t)) holds, but not the inequality V̇_t(x_1(t), ..., x_n(t)) ≤ U̇_t(x_1(t), ..., x_n(t)) ≤ 0.) Therefore, the negative definiteness of the function U̇_t(x_1(t), ..., x_n(t)) still needs to be proved. But since the moments of switching are unknown, a constructive verification of the negative definiteness of the function U̇_t(x_1(t), ..., x_n(t)) is impossible. Therefore, the conditions of item (iii), which guarantee the existence of the Lyapunov function V(x_1, ..., x_n), are more restrictive than the conditions for the existence of the Lyapunov function U(x_1, ..., x_n). However, these conditions are easily verified for the coefficients of system (4.2), in contrast to the non-constructive conditions for checking the existence of the Lyapunov function U(x_1, ..., x_n).
The proof of all parts of Theorem 4.1 is complete.

Generalization of Theorem 4.1 for classical neural ODEs
If we construct a continuous analogue of some recurrent neural network (see Subsection 1.1), then we can obtain neural ODEs of the form (4.16), where f_i(...) is an activation function, i = 1, ..., n. (This is one of the common forms of neural ODEs [6,7].) Suppose that for system (4.1) A = 0 and det B ≠ 0. Then, using the change of variables x(t) → B^{-1}x(t), we can pass from system (4.1) to a system of type (4.16).
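For instance, if (as the notation suggests) system (4.1) with A = 0 has the form ẋ(t) = B·F(x(t)) + c, where F(x) is the vector of power activations, then for y(t) = B^{-1}x(t) we get ẏ(t) = B^{-1}ẋ(t) = F(By(t)) + B^{-1}c, i.e. a system in which each activation is applied to a linear combination of the state variables, which is exactly the structure of (4.16).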
Second use of power functions in neural ODEs: Architecture (T2)
Consider the following system (5.1) of real differential equations with initial values x_i(0) = x_i0. Here α_i ≥ 0, β_i ≥ 0; c_i, a_ij, b_ij ∈ R; i, j = 1, ..., n.
Let a_ij = 0 and c_i = 0. In this case we rewrite system (5.1) in the form (5.2); i, j = 1, ..., n. Thus, each right-hand side of the equations of system (5.1) is composed only of odd functions.
Let n be an even number. We will assume that all equations of system (5.2) are composed of antisymmetric odd functions f i (x j ): α i = β i = γ i ; i, j = 1, ..., n.
Corollary. The assertion of Theorem 5.1 is preserved if condition (5.5) is replaced by the inequality b 12 b 21 < 0.
Corollary. Suppose that under the conditions of Theorem 5.3 all diagonal elements b_ii of the matrix B are approximately equal to each other: b_ii ≈ −∆, where ∆ > 0. We also assume that all the degrees of the activation functions are greater than or equal to 1 and are approximately equal to each other. Then system (5.2) is stable, provided that |b_ij| < ∆/(n − 1); i, j = 1, ..., n; i ≠ j.
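A minimal numerical check of the sufficient condition in this corollary (a sketch; the tolerance used to interpret "approximately equal" and the example matrix are placeholders):

```python
import numpy as np

def satisfies_corollary_condition(B: np.ndarray, rel_tol: float = 0.1) -> bool:
    """Check that all b_ii are approximately equal to some -Delta < 0 and that
    every off-diagonal entry satisfies |b_ij| < Delta / (n - 1)."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    diag = np.diag(B)
    if np.any(diag >= 0):
        return False
    delta = -diag.mean()
    if np.max(np.abs(diag + delta)) > rel_tol * delta:   # b_ii approximately -Delta
        return False
    off_diag = B - np.diag(diag)
    return bool(np.all(np.abs(off_diag) < delta / (n - 1)))

B = np.array([[-2.0,  0.4,  0.3],
              [ 0.2, -2.1,  0.5],
              [-0.4,  0.1, -1.9]])
print(satisfies_corollary_condition(B))   # True: Delta is about 2 and |b_ij| < 1
```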

Conclusion
Two variants of neural ODEs with power nonlinearities, corresponding to two different architectures of a recurrent neural network, are considered. Sufficient conditions for the global stability of the mentioned neural ODEs are found. Note that the obtained stability conditions can be of interest in themselves. For the