38

I'm not sure I quite understand the extent to which undefined behavior can jeopardize a program.

Let's say I have this code:

#include <stdio.h>

int main()
{
    int v = 0;
    scanf("%d", &v);
    if (v != 0)
    {
        int *p;
        *p = v;  // Oops
    }
    return v;
}

Is the behavior of this program undefined for only those cases in which v is nonzero, or is it undefined even if v is zero?

user541686
  • 205,094
  • 128
  • 528
  • 886
  • why does the value of v matter? – Jim Rhodes Oct 31 '11 at 23:53
  • 5
    @JimRhodes: because if `v` is zero the offending piece of code is not executed. – Matteo Italia Oct 31 '11 at 23:54
  • @MatteoItalia: I know that but the code is bad so who cares if v is 0 – Jim Rhodes Oct 31 '11 at 23:57
  • 5
    @JimRhodes: the matter here is not whether the code is good or bad, it is whether, as far as the standard is concerned, it exhibits undefined behavior regardless of the value entered by the user. – Matteo Italia Nov 01 '11 at 00:01
  • @MatteoItalia: what is the point of this question? The code presented has a path that (I believe no one argues with this) is invalid/ill-formed. Arguing whether the program is valid _in circumstances that don't exercise this path_ doesn't give you anything. Those invalid paths exist. The program is, in its whole, invalid. – Mat Nov 01 '11 at 01:29
  • 8
    @Mat: as with all "language-lawyer" questions the point is about standard nitpickery, not usefulness. Also, `int a, b, c; scanf("%d %d", &a, &b); c=a+b;`. Is this code valid? You'd say so, but *in particular circumstances* (where a+b overflows) this exhibits undefined behavior. Does this mean that the program, as a whole, is invalid in every circumstance? – Matteo Italia Nov 01 '11 at 01:33
  • 3
    The behavior of the program is undefined only when the statement invoking undefined behavior is executed. Yes, it is bad programming practice to have reachable code paths which invoke UB, but no UB is invoked as long as zero or non-numeric data is read from stdin. Keep in mind **many** real-world programs have similar cases of conditional invocation of UB due to failure to check the return value of `malloc`, or similar issues (UB is invoked when the pointer is dereferenced **only if** `malloc` returned 0). – R.. GitHub STOP HELPING ICE Nov 01 '11 at 05:02
  • @R..: The behavior becomes undefined as soon as the execution state is such that a compiler would be entitled to assume that Undefined Behavior will be invoked. If a loop has no side-effects, a compiler is allowed to propagate the effects of Undefined Behavior that would be inevitable after the loop to code before the loop without having to show that the loop itself terminates. – supercat Apr 24 '15 at 03:20

8 Answers

16

I'd say that the behavior is undefined only if the user enters a number different from 0. After all, if the offending code section is not actually run, the conditions for UB aren't met (i.e. the uninitialized pointer is neither created nor dereferenced).

A hint of this can be found in the standard, at 3.4.3:

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

This seems to imply that, if such "erroneous data" were instead correct, the behavior would be perfectly defined - which seems pretty much applicable to our case.


Additional example: integer overflow. Any program that performs an addition on user-provided data without extensive checks on it is subject to this kind of undefined behavior - but the addition is UB only when the user provides such particular data.
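
Something along these lines (a made-up sketch, not code from the question) has the same property:

#include <stdio.h>

int main(void)
{
    int a, b;
    if (scanf("%d %d", &a, &b) != 2)
        return 0;
    /* Signed integer overflow is undefined behavior, and whether this
       addition overflows depends entirely on the two numbers the user
       typed in. */
    int c = a + b;
    printf("%d\n", c);
    return 0;
}

The addition is reached on every run, yet the program's behavior is defined for some inputs and undefined for others.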

Matteo Italia
  • 123,740
  • 17
  • 206
  • 299
  • The pointer may still be allocated, depending on the compiler. However, this isn't a problem as, yes, it won't be dereferenced. – chacham15 Nov 01 '11 at 00:06
  • 1
    Is it just me, or does it seem contradictory to say it's UB only if the user enters 0? If the user entered 0, the "offending code" wouldn't be run, and the "conditions for UB aren't met" – Nick Rolando Nov 01 '11 at 00:15
  • 1
    I got confused for a sec about your example but I see what you mean now, thanks. +1 – user541686 Nov 01 '11 at 01:25
  • 1
    Another great example: any program that calls `gets` when not at EOF on a stream with unknown content. Depending on the input, this program may have undefined behavior. – R.. GitHub STOP HELPING ICE Nov 01 '11 at 15:50
  • I think this is correct in the sense that the program itself is well-defined when the user enters 0. However, the section with undefined behaviour could still cause unexpected things at compile time, e.g., the compiler could notice that `v != 0` leads to UB and remove the entire `if` under the assumption that therefore always `v == 0`. – Arkku Sep 17 '18 at 17:11
14

Since this has the language-lawyer tag, I have an extremely nitpicking argument that the program's behavior is undefined regardless of user input, but not for the reasons you might expect -- though it can be well-defined (when v==0) depending on the implementation.

The program defines main as

int main()
{
    /* ... */
}

C99 5.1.2.2.1 says that the main function shall be defined either as

int main(void) { /* ... */ }

or as

int main(int argc, char *argv[]) { /* ... */ }

or equivalent; or in some other implementation-defined manner.

int main() is not equivalent to int main(void). The former, as a declaration, says that main takes a fixed but unspecified number and type of arguments; the latter says it takes no arguments. The difference is that a recursive call to main such as

main(42);

is a constraint violation if you use int main(void), but not if you use int main().

For example, these two programs:

int main() {
    if (0) main(42); /* not a constraint violation */
}


int main(void) {
    if (0) main(42); /* constraint violation, requires a diagnostic */
}

are not equivalent.

If the implementation documents that it accepts int main() as an extension, then this doesn't apply for that implementation.

This is an extremely nitpicking point (about which not everyone agrees), and is easily avoided by declaring int main(void) (which you should do anyway; all functions should have prototypes, not old-style declarations/definitions).
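
For an ordinary function, the same distinction looks roughly like this (a sketch under C99 rules, with made-up names; add uses an old-style declaration and definition, add2 a prototype):

int add();                        /* old-style declaration: says nothing about the parameters */
int add2(int a, int b);           /* prototype: every call is checked against it              */

int add(a, b) int a; int b; { return a + b; }   /* old-style (K&R) definition */
int add2(int a, int b) { return a + b; }

int main(void)
{
    if (0) add(1);   /* wrong argument count: no diagnostic required, UB if ever evaluated */
    /* add2(1);         constraint violation: a diagnostic is required                     */
    return 0;
}

The call through the old-style declaration compiles silently even though it is wrong; the prototype forces the compiler to catch the equivalent mistake.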

In practice, every compiler I've seen accepts int main() without complaint.

To answer the question that was intended:

Once that change is made, the program's behavior is well defined if v==0, and is undefined if v!=0. Yes, the definedness of the program's behavior depends on user input. There's nothing particularly unusual about that.
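
For comparison, here is a hypothetical program (not from the question) whose behavior is likewise defined or undefined depending on what the user types:

#include <stdio.h>

int main(void)
{
    int d = 1;
    scanf("%d", &d);
    /* Division by zero is undefined behavior; it occurs here only
       when the user actually enters 0. */
    printf("%d\n", 100 / d);
    return 0;
}

Its behavior is well defined for every input except 0.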

Keith Thompson
  • 254,901
  • 44
  • 429
  • 631
  • Whoa, I never knew that. So how is `int main()` different from `int main(...)`? – user541686 Nov 01 '11 at 00:56
  • 3
    @Mehrdad: `int main(...)` is a syntax error; a variadic function must have at least one named parameter. More generally, variadic functions (like `printf`) may legitimately be called with variable numbers and types of arguments. Functions with old-style non-prototype declarations must be called with the exact number and type(s) of arguments specified by the definition -- but the compiler won't diagnose calls with bad arguments. That's why prototypes were added to the language. – Keith Thompson Nov 01 '11 at 01:14
  • Huh, interesting; +1 from me. – user541686 Nov 01 '11 at 01:21
  • `int main()` is equivalent to `int main(void)` except that the former does not provide a prototype. Both declare a function taking no arguments, and `int main()` is perfectly valid by the "or equivalent". – R.. GitHub STOP HELPING ICE Nov 01 '11 at 04:59
  • 1
    @R..: "... except that the former does not provide a prototype". Then *they're not equivalent*. See the examples I just added to my answer. Why do you assume that the phrase "or equivalent" permits this difference? – Keith Thompson Nov 01 '11 at 07:59
  • If you call `main(42)`, the version used determines whether it's a constraint violation or simply UB, but either way it's an erroneous program. Defining `main` with `int main()` is still perfectly valid, and it's "equivalent" because either way `main` is defined as a function taking no arguments and returning an `int`. – R.. GitHub STOP HELPING ICE Nov 01 '11 at 15:47
  • @R..: Two programs that differ only in whether `main` has a `void` parameter declaration are not equivalent; one violates a constraint, the other does not. The two definitions are *similar* in that they both define a function taking no arguments and returning an int, but they have other differences. How do you know that the phrase "or equivalent" refers *only* to arguments and return type? Is `int foo(void)` "equivalent" to `int main(void)`? (I've updated the examples.) – Keith Thompson Nov 01 '11 at 18:54
  • OK let's look at the actual language: `main` "shall be defined with a return type of int and with no parameters... or...". The footnote ("Thus, int can be replaced by a typedef name defined as int, or the type of argv can be written as char ** argv, and so on.") for "equivalent" indicates that it's irrelevant to our discussion. If I define `main` as `int main() {...}`, I have, as required by the standard, defined `main` as a function returning `int` and taking no parameters. – R.. GitHub STOP HELPING ICE Nov 01 '11 at 19:12
  • BTW, as a thought experiment for you, what would you say about `int main(void); int main() { }`? – R.. GitHub STOP HELPING ICE Nov 01 '11 at 19:12
  • @R..: You're ignoring the `int main(void) { /* ... */ }` By the same argument, you can ignore the `int main(int argc, char *argv[]) { /* ... */ }` and define `int main(char *argc, double argv) { /* ... */ }`. The definitions shown aren't just examples, they're requirements. As for your thought experiment, `main` is not *defined* in one of the two explicitly permitted manners. The combination of the declaration and the definition is, as far as I can tell, equivalent to a proper definition -- but the *definition* by itself isn't, and that's what the standard talks about. – Keith Thompson Nov 01 '11 at 19:55
  • The only difference in `int main() {...}` and `int main(void) {...}` is the *declarations* they provide. As definitions, they are identical. If you want to push this further, go bother someone on the committee. I'm considering the matter closed. – R.. GitHub STOP HELPING ICE Nov 01 '11 at 20:02
  • @R..: I'll grant you that the standard *could* be interpreted in the way you describe; you're not the only one to do so. I believe my interpretation is more consistent with the actual wording, and at the very least is not contradicted by it. As for the intent, I suspect the committee just didn't think about old-style definitions when they wrote that section; if they had, they probably would have covered that case explicitly. – Keith Thompson Nov 01 '11 at 20:31
  • More than 3 years ago, I wrote: "Two programs that differ only in whether `main` has a `void` parameter declaration are not equivalent; one violates a constraint, the other does not." That was unclear. I think what I meant is that two such programs *with a call such as `main(42)`* differ in that one violates a constraint and the other does not. – Keith Thompson Nov 19 '14 at 18:17
  • To be more nit-picky, your nit-pick is actually wrong. The distiction between `()` and `(void)` only applies "in a function declarator that is not part of a definition of that function" (C99 section 6.7.5.3 paragraph 14). In this case, it *is* part of a function definition, so `int main() { ... }` and `int main(void) { ... }` are identical. – Chris Dodd Sep 18 '16 at 20:19
  • @ChrisDodd: No, they're not. They're identical *as definitions*, but the definition also provides a declaration, and the declarations differ. This: `int main(void) { return main(42); }` is a constraint violation. This: `int main() { return main(42); }` is not, but it has undefined behavior. – Keith Thompson Sep 18 '16 at 21:14
  • @KeithThompson: The second is a constraint violation as well, as `()` in declaration that is also a definition specifies no argumemnts, the same as `(void)`. It *only* specifies an unspecified number of arguments in a declaration that is not also a definition -- see 6.7.5.3 – Chris Dodd Sep 18 '16 at 23:23
  • @ChrisDodd: I think you're referring to C99 6.7.5.3p14 (C11/[N1570](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf)) 6.7.6.3p14). Yes, it specifies that the function has no *parameters*, but like a standalone declaration `int foo()`, it's not a prototype and it doesn't specify that the function expects no *arguments* in a call. And `gcc -std=c11 -pedantic` doesn't diagnose `int main() { return main(42); }`, but it does diagnose `int main(void) { return main(42); }`; likewise for clang. Apparently the authors of gcc and clang share my interpretation. – Keith Thompson Sep 19 '16 at 15:08
9

Let me give an argument for why I think this is still undefined.

First, the responders saying this is "mostly defined" or somesuch, based on their experience with some compilers, are just wrong. A small modification of your example will serve to illustrate:

#include <stdio.h>

int
main()
{
    int v = 0;
    scanf("%d", &v);
    if (v != 0)
    {
        printf("Hello\n");
        int *p;
        *p = v;  // Oops
    }
    return v;
}

What does this program do if you provide "1" as input? If your answer is "It prints Hello and then crashes", you are wrong. "Undefined behavior" does not mean the behavior of some specific statement is undefined; it means the behavior of the entire program is undefined. The compiler is allowed to assume that you do not engage in undefined behavior, so in this case it may assume that v is zero and simply not emit any of the bracketed code at all, including the printf.
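
In other words, the optimizer is entitled to act as if the source had been written like this (a sketch of one legal transformation; no particular compiler is guaranteed to do exactly this):

#include <stdio.h>

int main(void)
{
    int v = 0;
    scanf("%d", &v);
    /* The branch is reachable only when v != 0, but taking it would
       dereference an uninitialized pointer (undefined behavior), so the
       compiler may treat the entire if-body, printf included, as dead code. */
    return v;
}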

If you think this is unlikely, think again. GCC may not perform this analysis exactly, but it does perform very similar ones. My favorite example that actually illustrates the point for real:

int test(int x) { return x+1 > x; }

Try writing a little test program to print out INT_MAX, INT_MAX+1, and test(INT_MAX). (Be sure to enable optimization.) A typical implementation might show INT_MAX to be 2147483647, INT_MAX+1 to be -2147483648, and test(INT_MAX) to be 1.
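
Such a test program might look like this (a sketch; build it with optimization enabled, and expect a compiler warning on the INT_MAX + 1 line, since that overflow is deliberate):

#include <limits.h>
#include <stdio.h>

int test(int x) { return x + 1 > x; }

int main(void)
{
    printf("INT_MAX       = %d\n", INT_MAX);
    printf("INT_MAX + 1   = %d\n", INT_MAX + 1);   /* undefined: signed overflow */
    printf("test(INT_MAX) = %d\n", test(INT_MAX));
    return 0;
}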

In fact, GCC compiles this function to return a constant 1. Why? Because integer overflow is undefined behavior, therefore the compiler may assume you are not doing that, therefore x cannot equal INT_MAX, therefore x+1 is greater than x, therefore this function can return 1 unconditionally.

Undefined behavior can and does result in variables that are not equal to themselves, negative numbers that compare greater than positive numbers (see above example), and other bizarre behavior. The smarter the compiler, the more bizarre the behavior.

OK, I admit I cannot quote chapter and verse of the standard to answer the exact question you asked. But people who say "Yeah yeah, but in real life dereferencing NULL just gives a seg fault" are more wrong than they can possibly imagine, and they get more wrong with every compiler generation.

And in real life, if the code is dead you should remove it; if it is not dead, you must not invoke undefined behavior. So that is my answer to your question.

Nemo
  • 70,042
  • 10
  • 116
  • 153
  • 2
    I don't see how your example supports the point you are trying to make. Your example executes unconditionally. My code executes conditionally. How are the two related? – user541686 Nov 01 '11 at 00:44
  • My `printf` example is more to the point. The undefined construct can affect the behavior of the program _before it even executes_. (In this case, it can effectively prevent the `printf` from executing at all.) I do not think it is too much of a stretch from "before it even executes" to "even if it does not execute". Although I do not think anybody has answered your question conclusively with reference to the spec so far. – Nemo Nov 01 '11 at 04:34
  • 4
    @Nemo: It is a stretch. Nearly every program ever written in C (for example, every one using pointers) has unreachable code paths that invoke UB. But the program's behavior is well-defined as long as those paths are never taken. The only thing confusing the issue in this question is that whether the path is taken depends on input. This makes the program *insecure*, but the UB is still conditional on receiving a "bad" input. – R.. GitHub STOP HELPING ICE Nov 01 '11 at 05:09
  • 2
    @Nemo: A compiler can produce code to do anything it wants as soon as a program has received a combination of inputs which will, unavoidably, cause undefined behavior. As far as the C standard is concerned, the code could even build a time machine and erase everyone's memory of the program having done anything sensible before such input was received. Nonetheless, if there exists any combination of input values which will yield defined behavior, the compiler's generated code must produce well-defined behavior if those input values are in fact supplied. – supercat Nov 04 '11 at 21:06
  • Would the behavior of a program like this be undefined if its output were fed to a broken pipe and a flush was done after the printf, causing the program to terminate before the UB was executed? – supercat Apr 24 '15 at 03:22
  • Necropost but @supercat, I believe the compiler would still be allowed to reorder the print (and flush) to take place after the UB, since in absence of UB (which the compiler may assume) the difference would not be detectable. The reordering could be done first and the UB then detected, e.g., causing the entire `if` body to be removed. Now, I'm not entirely sure if you had the `int *p` in _global_ scope, then in theory you could assign a valid address externally (e.g., a debugger that does it upon seeing the "hello" from output), but I think it might still be UB. At least without `volatile`. – Arkku Sep 17 '18 at 17:01
  • @Arkku: I would think the legitimacy of the reordering would be determined by whether implementation-defined aspects of how I/O is done recognize the possibility of I/O raising a synchronous signal or otherwise disrupting program execution. Given `sig_atomic_t flag;` at global scope, the a compiler given `printf("Raises SIGPIPE"); flag=1;` would be allowed to reorder the write to `flag` before the `printf` if it does not recognize the possibility of a synchronous signal from `printf`, but if it does recognize that possibility, and the signal from `printf` occurs... – supercat Sep 17 '18 at 17:16
  • @supercat Actually it seems that by the standard 5.1.2.3.2 calling a function that modifies a file or accesses a `volatile` object is a side-effect, so if `printf` does either (and `stdout` probably counts as a file, so it does), the reordering would not technically be allowed even in your earlier case. (In practice I think compilers might take liberties, especially with a standard library function like `printf`.) – Arkku Sep 17 '18 at 17:25
  • ...and causes an `exit(1);` never returns, then anything following the `printf` should be regarded like code that follows an `if(hardToRecognizeCondition) exit(1)`--as code which might be required execute, but might not be allowed to execute. Unfortunately, reordering operations across actions that might raise implementation-defined synchronous signals falls in the category of behaviors that some compiler writers would regard as sufficiently dumb that they feel no need to explicitly say they refrain from such foolishness, but which other compiler writers view as a "useful optimization". – supercat Sep 17 '18 at 17:26
  • @Arkku: The Standard does not in general require that implementations recognize the possibility that a volatile-qualified access or I/O operation might affect any aspects of the abstract machine state, including raising a synchronous signal, because such a requirement would needlessly impair optimizations in cases where such effects could not actually occur, and it expects that compiler writers should be familiar enough with their target platforms and application fields to know whether such recognition would be appropriate, whether the Standard mandates it or not. – supercat Sep 17 '18 at 17:32
  • 1
    @Arkku: To put things another way, if volatile write operations were to record the addresses and data on some medium whose content could not be observed by the running program (e.g. a printout), the Standard would require that optimizations not affect the sequence of records produced, but would not require that implementations recognize any side-effects beyond that. – supercat Sep 17 '18 at 17:36
  • @supercat Yes, I think you are right, and simply having a volatile operation before UB would therefore not suffice if the UB itself is non-volatile and can thus be ordered before. But what if it was a global volatile pointer (with only `NULL` assigned in the program), and preceded by some volatile operation that would trigger an external program to change the pointer's value to a valid address? =) – Arkku Sep 17 '18 at 17:40
  • @Arkku: So far as I can tell, unless optimizations are *completely* disabled, neither gcc nor clang will allow for the possibility that an operation on a `volatile`-qualified object might interact with the values of objects that were accessed previously or will be accessed afterward, even when targeting hardware platforms which define mechanisms via which that could occur. IMHO, a quality implementation suitable for low-level programming on a single-core machine should not require anything beyond `volatile` to implement an interleaved-access mutex that guards non-qualified objects, but... – supercat Sep 17 '18 at 17:48
  • ...gcc and clang don't support such semantics except when optimizations are disabled. – supercat Sep 17 '18 at 17:48
  • @supercat: You are kind of missing my point. In this example, the compiler is allowed to to assume that `v` *is* zero... Because otherwise, it can see you would invoke UB, which it is allowed to assume never happens. This has nothing to do with re-ordering of operations. – Nemo Sep 17 '18 at 22:46
  • @Nemo: An implementation which is conforming but not intended to be suitable for purposes involving low-level programming may behave in ways that make it unsuitable for low-level programming. A good quality implementation intended for low-level programming cannot do so [since that would make it unsuitable for that purpose]. If an action raises a synchronous signal, and the signal handler calls `exit` or `longjmp`, then code following the action that triggered the signal **will not execute** and thus cannot affect any aspect of program behavior. An implementation that recognizes... – supercat Sep 17 '18 at 22:52
  • ...the possibility that an action might raise a synchronous signal cannot blindly assume that such a signal won't occur. An implementation can be conforming without requiring the possibility of synchronous signals, but that doesn't mean that it's possible for a compiler to be a *high-quality implementation suitable for low-level programming* on a platform where such signals may occur without such recognition. – supercat Sep 17 '18 at 22:55
2

If v is 0, your random pointer assignment never gets executed, and the function will return zero, so it is not undefined behaviour.

Peter
  • 29,498
  • 21
  • 89
  • 122
1

When you declare variables (especially explicit pointers), a piece of memory is allocated (usually an int). That piece of memory is handed to you without being cleared, so the old value stored there remains (this depends on how the compiler implements allocation; it might fill the place with zeroes), and your int *p will hold a random (junk) value, which gets interpreted as an address. That address is the place in memory where p points (p's pointee). When you try to dereference it (i.e. access that piece of memory), it will almost every time be occupied by another process/program, so trying to alter/modify some other process's memory will result in access-violation issues from the memory manager.

So in this example, any value other than 0 will result in undefined behavior, because no one knows where p will point at that moment.

I hope this explanation is of some help.

Edit: Ah, sorry, again few answers ahead of me :)

ludesign
  • 1,353
  • 7
  • 12
  • It's not a matter of how the compiler actually implements pointers & co.; here we are discussing the standard's "abstract machine". By the way, on any modern OS with memory isolation between processes the "random address" won't be "occupied by another process", because each process has *its own* virtual address space, and pointers can only refer to it. – Matteo Italia Nov 01 '11 at 00:13
  • What do you mean by "(usually an int)"? – Keith Thompson Nov 01 '11 at 00:38
  • Sorry, I wanted to say "all variables are actually pointers, all pointers are actually integer variables holding the address of their pointee"; however, I should stop browsing stackoverflow late at night, trying to express myself in a way I am not able to. :) – ludesign Nov 01 '11 at 12:07
1

It is simple: if a piece of code doesn't execute, it doesn't have any behavior at all, whether defined or not.

If the input is 0, the code inside the if doesn't run, so it depends on the rest of the program whether the behavior is defined (in this case, it is).

If the input is not 0, you execute code that we all know is a case of undefined behavior.

Shahbaz
  • 46,337
  • 19
  • 116
  • 182
  • 2
    I'm not sure that's necessarily true. What about "if (input==9) {int foo[1000000000]; foo[0]=1;}" That could cause undefined behavior regardless of the value of 'input'. – supercat Nov 04 '11 at 21:02
0

I would say it makes the whole program undefined.

The key to undefined behavior is that it is undefined. The compiler can do whatever it wants when it sees that statement. Now, every compiler will handle it as expected, but they still have every right to do whatever they want - including changing parts unrelated to it.

For example, a compiler may choose to add a message "this program may be dangerous" to the program if it detects undefined behavior. This would change the output whether or not v is 0.

Pubby
  • 51,882
  • 13
  • 139
  • 180
  • 3
    "_whole program undefined_" That doesn't mean anything. **A program execution can have undefined behaviour.** It only means that the behaviour of this program execution is not defined by the standard. "_including changing parts unrelated to it._" Pure nonsense. – curiousguy Nov 01 '11 at 02:01
-1

Your program is pretty well defined. If v == 0 then it returns zero. If v != 0 then it splatters over some random point in memory.

p is a pointer; its initial value could be anything, since you don't initialise it. The actual value depends on the operating system (some zero memory before giving it to your process, some don't), your compiler, your hardware, and what was in memory before you ran your program.

The pointer assignment is just writing into a random memory location. It might succeed, it might corrupt other data or it might segfault - it depends on all of the above factors.

As far as C goes, it's pretty well defined that uninitialised variables do not have a known value, and your program (though it might compile) will not be correct.

Adam Hawes
  • 5,439
  • 1
  • 23
  • 30
  • "_unintialised variables do not have a known value_" The uninitialised variable does not have a value **at all**. – curiousguy Nov 01 '11 at 02:02
  • Therein, you are wrong. In C, all variables have a value. Uninitialized ones have "random" (or unknown) values. Consider this fragment: "int i; printf("%d\n", i);". Compile it with warnings disabled and see what it prints! – Adam Hawes Nov 14 '11 at 02:25
  • 1
    No, _you_ are wrong. In C and C++, variables that have not been written to cannot be read. The behaviour of a read of an uninitialised variable **is not defined**. "_In C, all variables have value._" What is the value of an uninitialised variable? **An uninitialised variable does not have a value.** You cannot say that its value is either zero or non-zero because it has no value. – curiousguy Nov 19 '11 at 04:32