How do I/O streams store information?


Please explain why, when I type a letter such as 't' in the console, both System.in.read() and new BufferedReader(new InputStreamReader(System.in)).read() return the same value when printed as an int, namely 116. After all, read() on a byte stream should read one byte at a time and so return zero (the first byte of the two-byte representation of 't'), while the character-oriented BufferedReader should return the whole character correctly.
What is going on?

Author: default locale, 2018-06-06

2 answers

The console sends bytes that are already encoded.

System.in is the input stream of the Java process, managed by the OS. The console passes already-encoded bytes to this stream.

What happens when you enter data in the console:

  • The console converts the entered characters to bytes according to the encoding configured for the console.
    How that encoding is set depends on the particular console and OS.
    Most likely, your console uses an encoding that represents t as a single byte.
  • System.in.read() simply returns the byte it received.
  • InputStreamReader converts the received bytes to characters using the default encoding, because you chose a constructor that does not specify one. The default encoding is determined by the JVM parameters.
    You can check the encoding of your reader with the following code:

        InputStreamReader reader = new InputStreamReader(System.in);
        System.out.println("Encoding: "+reader.getEncoding());
    

    The encoding can be set explicitly using one of the other constructors.
    When reading, the reader takes bytes from the input stream and converts them, according to the encoding, to the numeric value of the corresponding character (char). According to the Java Language Specification (§3.1 Unicode), characters are represented in UTF-16. Suppose we use a single-byte encoding, enter the character "ы", and process it:

        char value = (char) reader.read();  
    

    The reader will read one byte from the input stream, determine which character it represents, look up that character's UTF-16 code, and return its numeric value. As a result, this code is equivalent to:

       char value = 'ы';
    

    This holds regardless of the input encoding.
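The point that the resulting char does not depend on the input encoding can be checked without a console. The following is a minimal sketch, not the answerer's code: the class DecodeDemo and the helper firstCharCode are hypothetical names, and 'é' is used instead of 'ы' only because ISO-8859-1 and UTF-8 are both guaranteed to be available on every JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Encodes the text with the given charset, then reads one char back through
    // an InputStreamReader that is told to use the same charset.
    static int firstCharCode(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(encoded), charset)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // 'é': one byte in ISO-8859-1, two bytes in UTF-8
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 2
        // The decoded char code is the same in both cases: 233 (U+00E9).
        System.out.println(firstCharCode(text, StandardCharsets.ISO_8859_1));
        System.out.println(firstCharCode(text, StandardCharsets.UTF_8));
    }
}
```

The number of bytes on the wire differs between the two charsets, but the value returned by read() on the reader is the UTF-16 code unit in both cases.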

Two-byte encodings

To see a two-byte encoding on standard input, the input must actually be in such an encoding. To save space, most encodings represent the letters of the English alphabet with a single byte.

... should read one byte at a time, and output zero (as the first byte of a two-byte record 't')

That is how characters are laid out in two-byte encodings with big-endian byte order, most significant byte first (for example, UTF-16BE). Accordingly, to see this through System.in.read(), you need to supply the input data in that encoding.

This can be done in a number of ways, for example:

  1. Read the console documentation and find out how to set its encoding and whether UTF-16BE is supported. If it is, set that encoding. If not, this method will not work.

  2. Instead of typing in the console, redirect a text file to standard input. Save the file in UTF-16BE encoding, then run a command like:

        java Main < encoded.txt
    

    Note that when the file is saved, its first character may be a BOM; if so, skip it.
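The behaviour the question expected (a zero byte followed by 116) can be reproduced without a console or a file. This is a minimal sketch under one assumption: an in-memory ByteArrayInputStream stands in for System.in, and the class Utf16Demo and its helper readAllBytes are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf16Demo {
    // Reads every byte with the single-byte read(), the same call the
    // question makes on System.in, and collects the returned int values.
    static List<Integer> readAllBytes(InputStream in) {
        List<Integer> values = new ArrayList<>();
        try {
            int b;
            while ((b = in.read()) != -1) {
                values.add(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        // UTF_16BE encodes without a BOM, so "t" becomes exactly two bytes.
        byte[] encoded = "t".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(readAllBytes(new ByteArrayInputStream(encoded))); // [0, 116]
    }
}
```

Passing System.in to readAllBytes and redirecting a UTF-16BE file to the process should show the same pair of values for each English letter in the file.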

 4
Author: default locale, 2018-06-06 13:50:50

When explicitly casting a string to an array of bytes:

import java.util.Arrays;

public class App {
    public static void main(String[] args) {
        String str = "t";
        // Encode the string using the platform's default charset
        byte[] b = str.getBytes();
        System.out.println(b.length); // array length
        System.out.println(Arrays.toString(b)); // array contents
    }
}

The console shows:

1
[116]

The byte array for the string "t" has length 1, and its only byte is 116.

Indeed, as @default locale answered above, read() simply returns this byte.

P.S.

For a character that takes two bytes in UTF-8, such as "ы", getBytes() returns an array of two bytes, [-47, -117], while reading the same character from the stream with System.in.read() yields the two values 209 and 139. The difference comes from the following (quoting "Programming in Java", Patrick Niemeyer and Daniel Leuck, 4th edition, 2014, p. 568):

On the Java platform, this method uses a special value to indicate the end of the stream, following the convention of the C language. Bytes are returned as unsigned integers in the range from 0 to 255.

The relationship between the two sets of values is simple: an explicit cast from int to byte.

(byte) 209 and (byte) 139 give -47 and -117, respectively.
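The conversion works in both directions. This is a short sketch; the class SignedByteDemo and its two helpers are hypothetical names used only for illustration.

```java
class SignedByteDemo {
    // Narrowing cast: keeps the low 8 bits and reinterprets them as a signed byte.
    static byte toSigned(int unsigned) {
        return (byte) unsigned;
    }

    // Masking with 0xFF recovers the 0..255 value that InputStream.read() reports.
    static int toUnsigned(byte signed) {
        return signed & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(toSigned(209));           // -47
        System.out.println(toSigned(139));           // -117
        System.out.println(toUnsigned((byte) -47));  // 209
        System.out.println(toUnsigned((byte) -117)); // 139
    }
}
```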

If we call read() on a character stream such as BufferedReader, we get back the character's code, as stated in the documentation:

The character read, as an integer in the range 0 to 65535 (0x00-0xffff)
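To see the two bytes 209 and 139 combined into a single character code, the bytes can be fed to a BufferedReader directly. This is a minimal sketch that assumes the input is UTF-8 and uses an in-memory stream in place of System.in; CharReadDemo and readCharCode are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class CharReadDemo {
    // Decodes UTF-8 bytes through a BufferedReader and returns the first char code.
    static int readCharCode(byte[] utf8Bytes) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8))) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 209, (byte) 139}; // UTF-8 encoding of U+044B
        System.out.println(readCharCode(bytes)); // 1099
    }
}
```

The value 1099 is 0x044B, the UTF-16 code unit of the character, which falls inside the documented 0 to 65535 range.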

 1
Author: Олег Сухих, 2018-06-08 12:03:58