Monday, 21 November 2011

TCP Transport Layer Protocol




What is TCP?

TCP was specifically designed to provide a reliable end-to-end byte stream over an unreliable internetwork. Each machine supporting TCP has a TCP transport entity, either a user process or part of the kernel, that manages TCP streams and interfaces to the IP layer. A TCP entity accepts user data streams from local processes, breaks them up into pieces not exceeding 64 KB, and sends each piece as a separate IP datagram. A client-server mechanism is not necessary for TCP to behave properly.
The IP layer gives no guarantee that a datagram will be delivered properly, so it is up to TCP to time out and retransmit if needed. Duplicate, lost, and out-of-sequence packets are handled using sequence numbers, acknowledgements, retransmissions, timers, etc., to provide a reliable service; a connection is a must for this service. Bit errors are taken care of by the checksum. One difference from the usual sequence numbering is that each byte is given a number instead of each packet. This is done so that, at the time of retransmission after a loss, the data of many small packets can be combined into one larger packet, and hence a smaller overhead.
A TCP connection is a duplex connection. That means there is no difference between the two sides once the connection is established.

TCP Connection establishment

The "three-way handshake" is the procedure used to establish a connection. This procedure normally is initiated by one TCP and responded to by another TCP. The procedure also works if two TCPs simultaneously initiate it. When a simultaneous attempt occurs, each TCP receives a "SYN" segment which carries no acknowledgment after it has sent a "SYN". Of course, the arrival of an old duplicate "SYN" segment can potentially make it appear, to the recipient, that a simultaneous connection initiation is in progress. Proper use of "reset" segments can disambiguate these cases.
The three-way handshake reduces the possibility of false connections. It is the implementation of a trade-off between memory and messages to provide information for this checking.
The simplest three-way handshake is shown in figure below. The figures should be interpreted in the following way. Each line is numbered for reference purposes. Right arrows (-->) indicate departure of a TCP segment from TCP A to TCP B, or arrival of a segment at B from A. Left arrows (<--), indicate the reverse. Ellipsis (...) indicates a segment which is still in the network (delayed). TCP states represent the state AFTER the departure or arrival of the segment (whose contents are shown in the center of each line). Segment contents are shown in abbreviated form, with sequence number, control flags, and ACK field. Other fields such as window, addresses, lengths, and text have been left out in the interest of clarity.

TCP A                                                TCP B

  1.  CLOSED                                               LISTEN

  2.  SYN-SENT    --> <SEQ=100><CTL=SYN>               --> SYN-RECEIVED

  3.  ESTABLISHED <-- <SEQ=300><ACK=101><CTL=SYN,ACK>  <-- SYN-RECEIVED

  4.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK>       --> ESTABLISHED

  5.  ESTABLISHED --> <SEQ=101><ACK=301><CTL=ACK><DATA> --> ESTABLISHED

          Basic 3-Way Handshake for Connection Synchronisation
In line 2 of above figure, TCP A begins by sending a SYN segment indicating that it will use sequence numbers starting with sequence number 100. In line 3, TCP B sends a SYN and acknowledges the SYN it received from TCP A. Note that the acknowledgment field indicates TCP B is now expecting to hear sequence 101, acknowledging the SYN which occupied sequence 100.
At line 4, TCP A responds with an empty segment containing an ACK for TCP B's SYN; and in line 5, TCP A sends some data. Note that the sequence number of the segment in line 5 is the same as in line 4 because the ACK does not occupy sequence number space (if it did, we would wind up ACKing ACK's!).

Simultaneous initiation is only slightly more complex, as is shown in figure below. Each TCP cycles from CLOSED to SYN-SENT to SYN-RECEIVED to ESTABLISHED.

TCP A                                            TCP B

  1.  CLOSED                                           CLOSED

  2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...

  3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT

  4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED

  5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...

  6.  ESTABLISHED  <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED

  7.               ... <SEQ=101><ACK=301><CTL=ACK>     --> ESTABLISHED

                Simultaneous Connection Synchronisation
Question: Why is a three-way handshake needed? What is the problem if we send only two packets and consider the connection established? What will be the problem from the application's point of view? Will the packets be delivered to the wrong application?
Problems with a 2-way handshake
The only real problem with a 2-way handshake is that duplicate packets from a previous connection (which has been closed) between the two nodes might still be floating in the network. After a SYN has been sent to the responder, it might receive a duplicate packet of a previous connection and would regard it as a packet from the current connection, which would be undesirable.
Spoofing is another concern if a two-way handshake is used. Suppose a node C sends a connection request to B claiming to be A. B then sends an ACK to A, which A rejects, asking B to close the connection. Between these two events C can send a lot of packets, which will be delivered to the application.

The first two figures show how a three-way handshake deals with the problems of duplicate/delayed connection requests and duplicate/delayed connection acknowledgements in the network. The third figure highlights the problem of spoofing associated with a two-way handshake.

Some Conventions
1. The ACK contains 'x+1' if the sequence number received is 'x'.
2. If 'ISN' is the sequence number of the connection packet, then the 1st data packet has the sequence number 'ISN+1'.
3. Sequence numbers are 32-bit. They are byte sequence numbers (every byte has a sequence number). With each packet, the first sequence number and the length of the packet are sent.
4. Acknowledgements are cumulative.
5. Acknowledgements have a sequence number of their own, but with a length of 0, so the next data packet has the same sequence number as the ACK.

Connection Establishment


  • The sender sends a SYN packet with a sequence number, say 'x'.
  • The receiver, on receiving the SYN packet, responds with a SYN packet with sequence number 'y' and an ACK with number 'x+1'.
  • On receiving both the SYN and the ACK, the sender responds with an ACK packet with number 'y+1'.
  • When the receiver receives this ACK packet, the connection is established.
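The sequence/ACK arithmetic of these steps can be sketched as follows (a toy listing, not a real TCP stack; the ISNs 100 and 300 are the illustrative values from the figure above):

```python
def three_way_handshake(isn_a, isn_b):
    """Return the (seq, ack) numbers carried by each handshake segment."""
    syn     = {"flags": "SYN",     "seq": isn_a}                        # A -> B
    syn_ack = {"flags": "SYN,ACK", "seq": isn_b, "ack": isn_a + 1}      # B -> A
    ack     = {"flags": "ACK",     "seq": isn_a + 1, "ack": isn_b + 1}  # A -> B
    return syn, syn_ack, ack

syn, syn_ack, ack = three_way_handshake(100, 300)
print(syn, syn_ack, ack, sep="\n")
```

Note that each ACK number is the peer's ISN plus one, since the SYN itself occupies one unit of sequence space.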
Connection Release


  • The initiator sends a FIN with the current sequence and acknowledgement number.
  • The responder on receiving this informs the application program that it will receive no more data and sends an acknowledgement of the packet. The connection is now closed from one side.
  • Now the responder will follow similar steps to close the connection from its side. Once this is done the connection will be fully closed.
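The release steps above can be sketched as the sequence of segments exchanged (a toy listing only; direction labels are illustrative):

```python
def close_sequence():
    """Segments exchanged in a normal symmetric close, initiator A first."""
    return [
        ("A->B", "FIN"),   # A has no more fresh data to send
        ("B->A", "ACK"),   # B acknowledges; connection closed from A's side
        ("B->A", "FIN"),   # later, B closes its side the same way
        ("A->B", "ACK"),   # A acknowledges; connection fully closed
    ]

for direction, flag in close_sequence():
    print(direction, flag)
```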



Salient Features of TCP
  • Piggybacking of acknowledgements: The ACK for the last received packet need not be sent as a new packet, but gets a free ride on the next outgoing data frame (using the ACK field in the frame header). This technique of temporarily delaying outgoing ACKs so that they can be hooked onto the next outgoing data frame is known as piggybacking. However, an ACK cannot be delayed for long if the receiver (of the packet to be acknowledged) has no data to send.
  • Flow and congestion control:TCP takes care of flow control by ensuring that both ends have enough resources and both can handle the speed of data transfer of each other so that none of them gets overloaded with data. The term congestion control is used in almost the same context except that resources and speed of each router is also taken care of. The main concern is network resources in the latter case.
  • Multiplexing / Demultiplexing: Many applications can be sending/receiving data at the same time, and data from all of them has to be multiplexed together. On receiving some data from the lower layer, TCP has to decide which application is the recipient; this is called demultiplexing. TCP uses the concept of port numbers to do this.

TCP segment header:

Explanation of header fields:
  • Source and destination port :These fields identify the local endpoint of the connection. Each host may decide for itself how to allocate its own ports starting at 1024. The source and destination socket numbers together identify the connection.
  • Sequence and ACK number : This field is used to give a sequence number to each and every byte transferred. This has an advantage over giving sequence numbers to packets: the data of many small packets can be combined into one at the time of retransmission, if needed. The ACK signifies the next byte expected from the source, not the last byte received, and the ACKs are cumulative instead of selective. The sequence number space is 32 bits wide, although 17 bits would have been enough if packets were always delivered in order. If packets arrive in order, then according to the following formula:
    (sender's window size) + (receiver's window size) < (sequence number space)

    a 17-bit sequence number space would suffice. But packets may take different routes and arrive out of order, so we need a larger sequence number space; for optimisation, this is 32 bits.
  • Header length :This field tells how many 32-bit words are contained in the TCP header. This is needed because the options field is of variable length.
  • Flags : There are six one-bit flags.
    1. URG : This bit indicates whether the urgent pointer field in this packet is being used.
    2. ACK :This bit is set to indicate the ACK number field in this packet is valid.
    3. PSH : This bit indicates PUSHed data. The receiver is requested to deliver the data to the application upon arrival and not buffer it until a full buffer has been received.
    4. RST : This flag is used to reset a connection that has become confused due to a host crash or some other reason. It is also used to reject an invalid segment or refuse an attempt to open a connection. This causes an abrupt end to the connection, if one existed.
    5. SYN : This bit is used to establish connections. The connection request(1st packet in 3-way handshake) has SYN=1 and ACK=0. The connection reply (2nd packet in 3-way handshake) has SYN=1 and ACK=1.
    6. FIN : This bit is used to release a connection. It specifies that the sender has no more fresh data to transmit. However, it will retransmit any lost or delayed packet. Also, it will continue to receive data from other side. Since SYN and FIN packets have to be acknowledged, they must have a sequence number even if they do not contain any data.
  • Window Size : Flow control in TCP is handled using a variable-size sliding window. The Window Size field tells how many bytes may be sent, starting at the byte acknowledged. The sender can send bytes with sequence numbers between ACK# and (ACK# + window size - 1). A window size of zero is legal and says that the bytes up to and including ACK# - 1 have been received, but the receiver would like no more data for the moment. Permission to send can be granted later by sending a segment with the same ACK number and a nonzero Window Size field.
  • Checksum : This is provided for extreme reliability. It checksums the header, the data, and a conceptual pseudoheader. The pseudoheader contains the 32-bit IP addresses of the source and destination machines, the protocol number for TCP (6), and the byte count for the TCP segment (including the header). Including the pseudoheader in the TCP checksum computation helps detect misdelivered packets, but doing so violates the protocol hierarchy, since the IP addresses in it belong to the IP layer, not the TCP layer.
  • Urgent Pointer : Indicates a byte offset from the current sequence number at which urgent data are to be found. Urgent data continues till the end of the segment. This is not used in practice. The same effect can be had by using two TCP connections, one for transferring urgent data.
  • Options : Provides a way to add extra facilities not covered by the regular header. eg,
    • Maximum TCP payload that sender is willing to handle. The maximum size of segment is called MSS (Maximum Segment Size). At the time of handshake, both parties inform each other about their capacity. Minimum of the two is honoured. This information is sent in the options of the SYN packets of the three way handshake.
    • Window scale option can be used to increase the window size. It is specified by telling the receiver that the window size should be interpreted by shifting it left by the specified number of bits. This header option allows window sizes up to 2^30.
  • Data : This can be of variable size. TCP knows its size by looking at the total length field of the IP header.
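The fixed 20-byte layout described above can be decoded with a short sketch (the example SYN segment and port numbers are made up for illustration):

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the fixed 20-byte TCP header fields described above."""
    (src_port, dst_port, seq, ack,
     off_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = (off_flags >> 12) & 0xF         # header length in 32-bit words
    flags = off_flags & 0x3F                      # URG|ACK|PSH|RST|SYN|FIN bits
    return {
        "src_port": src_port, "dst_port": dst_port,
        "seq": seq, "ack": ack,
        "header_len": data_offset * 4,            # in bytes
        "flags": {
            "URG": bool(flags & 0x20), "ACK": bool(flags & 0x10),
            "PSH": bool(flags & 0x08), "RST": bool(flags & 0x04),
            "SYN": bool(flags & 0x02), "FIN": bool(flags & 0x01),
        },
        "window": window, "checksum": checksum, "urgent": urgent,
    }

# A hand-built SYN segment: port 1024 -> 80, seq 100, header of 5 words.
syn = struct.pack("!HHIIHHHH", 1024, 80, 100, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(syn))
```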

Topics to be Discussed relating TCP

  1. Maximum Segment Size : It refers to the maximum size of segment (MSS) that is acceptable to both ends of the connection. TCP negotiates the MSS using the OPTION field. In an Internet environment the MSS is to be selected optimally. An arbitrarily small segment size will result in poor bandwidth utilization, since the data-to-overhead ratio remains low. On the other hand, an extremely large segment size will necessitate large IP datagrams, which require fragmentation. As there is a finite chance of a fragment getting lost, segment sizes above the "fragmentation threshold" decrease the throughput. Theoretically, the optimum segment size is the one that results in the largest IP datagram that does not require fragmentation anywhere en route from source to destination. However, it is very difficult to find such an optimum segment size. In System V a simple technique is used to identify the MSS: if H1 and H2 are on the same network, use MSS=1024; if on different networks, then MSS=5000.
  2. Flow Control : TCP uses a sliding-window mechanism at the octet level. The window size can vary over time. This is achieved by utilizing the concept of "window advertisement", based on:
    1. Buffer availability at the receiver
    2. Network conditions (traffic load etc.)
    In the former case the receiver varies its window size depending upon the space available in its buffers. This window is referred to as the RECEIVE WINDOW (Recv_Win). When the receiver's buffers begin to fill, it advertises a small Recv_Win so that the sender doesn't send more data than it can accept. If all buffers are full, the receiver sends a "zero"-size advertisement, which stops all transmission. When buffers become available, the receiver advertises a non-zero window to resume transmission. The sender also periodically probes the "zero" window to avoid deadlock in case the non-zero window advertisement from the receiver is lost. The variable-size Recv_Win provides efficient end-to-end flow control.
    The second case arises when some intermediate node (e.g. a router) requires the source to reduce its transmission rate. Here another window, referred to as the CONGESTION WINDOW (C_Win), is utilized. Advertisement of C_Win helps to check and avoid congestion.
  3. Congestion Control : Congestion is a condition of severe delay caused by an overload of datagrams at an intermediate node on the Internet. If unchecked, it may feed on itself, and finally the node may start dropping arriving datagrams. This can further aggravate congestion in the network, resulting in congestion collapse. TCP uses two techniques to check congestion.
    1. Slow Start : At the start of a connection no information about network conditions is available. A Recv_Win size can be agreed upon; however, the C_Win size is not known. An arbitrary C_Win size cannot be used because it may lead to congestion. TCP acts as if the window size is equal to the minimum of (Recv_Win, C_Win). So the following algorithm is used.
      1. Recv_Win=X
      2. SET C_Win=1
      3. for every ACK received C_Win++
    2. Multiplicative decrease : This scheme is used when congestion is encountered (i.e. when a segment is lost). It works as follows: reduce the congestion window by half if a segment is lost, and exponentially back off the timer (double it) for the segments within the reduced window. If the next segment also gets lost, continue the above process. For successive losses this scheme reduces the traffic into the connection exponentially, thus allowing the intermediate nodes to clear their queues. Once congestion ends, SLOW START is used to scale up the transmission.
  4. Congestion Avoidance : This procedure is used at the onset of congestion to minimize its effect on the network. When transmission is to be scaled up, it should be done in such a way that it doesn't lead to congestion again. The following algorithm is used.
    1. At loss of a segment SET C_Win=1
    2. SET SLOW START THRESHOLD (SST) = Send_Win / 2
    3. Send segment
    4. If ACK Received, C_Win++ till C_Win <= SST
    5. else for each ACK C_Win += 1 / C_Win
  5. Time out and Retransmission : Following two schemes are used :
    1. Fast Retransmit
    2. Fast Recovery
    When a source sends a segment, TCP sets a timer. If this value is set too low, it will result in many unnecessary retransmissions. If set too high, it results in wasted bandwidth and hence lower throughput. In the Fast Retransmit scheme the timer value is set fairly higher than the RTT. The sender can therefore detect segment loss before the timer expires. This scheme presumes that the sender will get repeated ACKs for a lost packet.
  6. Round Trip Time (RTT) : In the Internet environment segments may travel across different intermediate networks and through multiple routers. The networks and routers may have different delays, which may vary over time. The RTT is therefore also variable, which makes it difficult to set timers. TCP allows varying timers by using an adaptive retransmission algorithm. It works as follows.
    1. Note the time (t1) when a segment is sent and the time (t2) when its ACK is received.
    2. Compute RTT(sample) = (t2 - t1)
    3. Again Compute RTT(new) for next segment.
    4. Compute Average RTT by weighted average of old and new values of RTT
    5. RTT(est) = a *RTT(old) + (1-a) * RTT (new) where 0 < a < 1
      A high value of 'a' makes the estimated RTT insensitive to changes that last only a short time, so RTT relies on the history of the network. A low value makes it sensitive to the current state of the network. A typical value of 'a' is 0.75.
    6. Compute Timeout = b * RTT(est), where b > 1
      A low value of 'b' ensures quick detection of a packet loss, but any small delay will then cause an unnecessary retransmission. A typical value of 'b' is 2.
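Items 2 to 6 above can be put together in a toy model of one sender (windows counted in whole segments rather than bytes; the constants and the simplified growth rules are illustrative, not a faithful TCP implementation):

```python
class TcpSender:
    """Toy model of the window and timer rules sketched above."""

    def __init__(self, recv_win=64):
        self.recv_win = recv_win
        self.cwnd = 1.0           # slow start: SET C_Win = 1
        self.sst = recv_win / 2   # slow-start threshold
        self.rtt_est = None
        self.a = 0.75             # EWMA weight on RTT history
        self.b = 2.0              # Timeout = b * RTT(est), b > 1

    def effective_window(self):
        # Sender behaves as if the window is min(Recv_Win, C_Win).
        return min(self.recv_win, int(self.cwnd))

    def on_ack(self, rtt_sample):
        # Growth: exponential below the threshold, additive above it.
        if self.cwnd < self.sst:
            self.cwnd += 1                  # slow start: C_Win++ per ACK
        else:
            self.cwnd += 1.0 / self.cwnd    # congestion avoidance
        # Adaptive RTT: RTT(est) = a * RTT(old) + (1 - a) * RTT(new)
        if self.rtt_est is None:
            self.rtt_est = rtt_sample
        else:
            self.rtt_est = self.a * self.rtt_est + (1 - self.a) * rtt_sample

    def on_loss(self):
        # Multiplicative decrease, then start over with slow start.
        self.sst = max(self.effective_window() / 2, 1)
        self.cwnd = 1.0

    def timeout(self):
        return self.b * self.rtt_est
```

Feeding it five ACKs grows the window exponentially (1 becomes 6), and a single loss halves the threshold and resets the window to 1.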




Transport Layer Protocol- Implementation Issues

In this class we discussed TCP from the implementation point of view and addressed various issues, like the state diagram and other details, which the TCP standard does not define but which commercial implementations support.

State Diagram

The state diagram approach to viewing TCP connection establishment and closing simplifies the design of a TCP implementation. The idea is to represent the TCP connection state, which progresses from one state to another as various messages are exchanged. To simplify matters, we considered two state diagrams: one for TCP connection establishment and one for TCP connection closing.
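The transitions in the two diagrams can be sketched as a small lookup table (state and event names are abbreviations, and only the common paths from the figures are included; this is a sketch, not the full standard state machine):

```python
# (state, event) -> next state, for the establishment and close diagrams.
TRANSITIONS = {
    ("CLOSED",      "passive_open"): "LISTEN",
    ("CLOSED",      "send_SYN"):     "SYN_SENT",
    ("LISTEN",      "recv_SYN"):     "SYN_RCVD",
    ("SYN_SENT",    "recv_SYN_ACK"): "ESTABLISHED",
    ("SYN_RCVD",    "recv_ACK"):     "ESTABLISHED",
    ("ESTABLISHED", "close"):        "FIN_WAIT_1",   # active close
    ("FIN_WAIT_1",  "recv_ACK"):     "FIN_WAIT_2",
    ("FIN_WAIT_2",  "recv_FIN"):     "TIME_WAIT",
    ("ESTABLISHED", "recv_FIN"):     "CLOSE_WAIT",   # passive close
    ("CLOSE_WAIT",  "close"):        "LAST_ACK",
    ("LAST_ACK",    "recv_ACK"):     "CLOSED",
}

def run(state, events):
    """Walk the table from a starting state through a list of events."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state

print(run("CLOSED", ["send_SYN", "recv_SYN_ACK"]))
```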
Fig 1 shows the state diagram for TCP connection establishment, and the associated table briefly explains each state.
TCP Connection establishment
The table gives brief description of each state of the above diagram.
State Description Table 1

LISTEN     Waiting for a connection request from any remote host and port. This specifically applies to a server. From this state, the server can close the service or actively open a connection by sending a SYN.

SYN-SENT   Waiting for a matching connection request after having sent a connection request. This applies to both the server and the client side. Even though the server is considered the one with the passive open, it can also send a SYN packet actively.

SYN-RCVD   Waiting for a confirming connection-request acknowledgment after having both received and sent a connection request.

ESTAB      An open connection. Data transfer can take place from this point onwards.
After the connection has been established, the two end-points exchange useful information and then terminate the connection. Fig 2 shows the state diagram for terminating an active connection.

Fig 2. TCP Connection termination
State Description Table 2

FIN-WAIT-1  Waiting for an acknowledgment of the connection termination request previously sent, or for a termination request from the remote TCP. This state is entered when the server issues a close call.

FIN-WAIT-2  Waiting for a connection termination request from the remote TCP.

CLOSING     Waiting for a connection-termination-request acknowledgment from the remote TCP.

TIME-WAIT   Waiting long enough for packets still in the network to reach their destination or die out. This waiting time is usually 4 minutes.

CLOSE-WAIT  The server has received a FIN from the remote TCP, sent an ACK, and will issue a close call of its own, sending a FIN.

LAST-ACK    Waiting for an ACK of the FIN previously sent to the remote TCP.

CLOSED      A closed TCP connection, with all ACKs received.
 

Other implementation details

Quiet Time

It might happen that a host currently in communication crashes and reboots. At startup, all data structures and timers are reset to their initial values. To make sure that packets from earlier connections are gracefully rejected, the local host is not allowed to make any new connection for a small period after startup. This time is set in accordance with the reboot time of the operating system.

Initial Sequence Number

The initial sequence number used in TCP communication is initialized at boot time randomly, rather than to 0. This is to ensure that packets from an old connection do not interfere with a new connection. So the recommended method is to
  • Initialize the ISN at boot time by a random number
  • For every 500 ms, increment ISN by 64K
  • With every SYN received, increment ISN by 64K
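The three rules above can be sketched as a small generator (the clock is passed in explicitly so the sketch is testable; the 32-bit wraparound and 500 ms tick follow the scheme described, not any particular OS):

```python
import random

class IsnGenerator:
    """Sketch of the ISN scheme above: random base at boot,
    +64K every 500 ms, +64K per new connection, modulo 2**32."""

    STEP = 64 * 1024

    def __init__(self, boot_time):
        self.boot_time = boot_time
        self.base = random.getrandbits(32)   # random value chosen at boot
        self.connections = 0

    def next_isn(self, now):
        ticks = int((now - self.boot_time) / 0.5)   # elapsed 500 ms intervals
        self.connections += 1                       # one bump per new connection
        return (self.base + self.STEP * (ticks + self.connections)) % 2 ** 32
```

Because the ISN keeps moving even while the host is idle, sequence numbers from a connection that existed before a crash are unlikely to fall inside the window of a new one.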

Maximum Request backlog at server

As we have seen in Unix network programming, listen(sd, n) sets a maximum on the number of pending connection requests the server will hold at any time. So if there are already n requests for connection and an (n+1)-th request comes, two things can be done.
  • Drop the packet silently.
  • Ask the peer to send the request later.
The first option is recommended because the assumption is that the full queue is a transient condition, and some time later the server should be free to process new requests. Hence, if we drop the packet, the client will go through its timeout and retransmission, and the server will then be free to process it. Also, standard TCP does not define any strategy/option for knowing who requested the connection; only Solaris 2.2 supports such an option.
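The drop-silently choice can be sketched as a toy queue model (a hypothetical helper, not a real socket API):

```python
def accept_or_drop(pending, backlog, request):
    """Queue a connection request, or drop it silently when the
    backlog is full -- the client's own timeout-and-retransmit
    mechanism will retry it later."""
    if len(pending) < backlog:
        pending.append(request)
        return "queued"
    return "dropped"   # no reply is sent at all

pending = []
for req in ("c1", "c2", "c3"):
    print(req, accept_or_drop(pending, 2, req))
```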

Delayed Acknowledgment

TCP piggybacks acknowledgments on its data. But if the peer does not have any data to send at that moment, the acknowledgment should not be delayed too long, so a 200 ms timer is used: every 200 ms, TCP checks for any acknowledgments waiting to be sent and sends them as individual packets.

Small packets

TCP implementations discourage small packets. In particular, if a previous, relatively large packet has been sent and no acknowledgment has been received so far, then a small packet is held in the buffer until the situation improves. But for some applications delayed data is worse than bad data: in telnet, for example, each keystroke is processed by the server, so no delay should be introduced. As we have seen in Unix network programming, a socket option can be set so that small packets are not held back.
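In modern socket APIs this option is spelled TCP_NODELAY; a minimal sketch of setting it (no connection is actually made):

```python
import socket

# Disable the small-packet buffering described above (Nagle's
# algorithm), as an interactive application like telnet would.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))  # nonzero once set
sock.close()
```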

ICMP Source Quench

We have seen in ICMP that an ICMP Source Quench message is sent to ask the peer to slow down. Some implementations discard this message, but a few set the current window size to 1.
But this is not a very good idea.

Retransmission Timeout

In some implementations (e.g. Linux), RTO = RTT + 4 * (delay variance) is used instead of the constant factor 2. Also, instead of calculating RTT(est) from scratch, a cache is used to store the history from which new values are calculated, as discussed in the previous classes.
Standard values for the Maximum Segment Life (MSL) are between 0.5 and 2 minutes, and the time spent in the TIME-WAIT state is a function of the MSL.
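The variance-based timeout can be sketched as follows (the smoothing gains 0.125 and 0.25 are illustrative values, and the initial deviation is seeded from the first sample):

```python
class RtoEstimator:
    """Sketch of RTO = smoothed RTT + 4 * mean deviation."""

    def __init__(self, first_sample):
        self.srtt = first_sample
        self.dev = first_sample / 2   # initial guess for the deviation
        self.g = 0.125                # gain on the RTT error
        self.h = 0.25                 # gain on the deviation error

    def update(self, sample):
        err = sample - self.srtt
        self.srtt += self.g * err
        self.dev += self.h * (abs(err) - self.dev)
        return self.rto()

    def rto(self):
        return self.srtt + 4 * self.dev
```

With a steady network the deviation shrinks, so the timeout tightens toward the true RTT; a single erratic sample inflates it again, which is the point of tracking variance rather than using a fixed multiple.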

Keep Alive Time

Another important timer in TCP is the keep-alive timer. It is basically used by a TCP peer to check whether the other end is up or down: it periodically checks the connection, and if the other end does not respond, the connection is closed.

Persist Timer

As we saw in TCP window management, when the source has sent one full window of packets, its usable window drops to 0 and it expects an ACK from the remote TCP to increase its window size. Suppose such an ACK has been sent and is lost. The source now has a current window size of 0 and cannot send, while the destination is expecting the next byte. To avoid such a deadlock, a persist timer is used. When this timer goes off, the source sends the last byte again, in the hope that the situation has improved and an ACK that increases the current window size will be received.
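The probe loop can be sketched as follows (window advertisements are supplied as a list standing in for what each probe's ACK would report):

```python
def probe_until_open(advertised):
    """Send a one-byte probe each time the persist timer fires;
    stop when the receiver advertises a nonzero window.
    Returns (number of probes sent, window that re-opened)."""
    probes = 0
    for win in advertised:        # one entry per persist-timer firing
        probes += 1               # retransmit the last byte as a probe
        if win > 0:
            return probes, win    # deadlock broken: sending can resume
    return probes, 0              # window never re-opened in this trace

print(probe_until_open([0, 0, 4096]))
```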

References
  • http://www.ssfnet.org/Exchange/tcp/tcpTutorialNotes.html

    ARP,RARP,ICMP Protocols


    Address Resolution Protocol

    If a machine wants to talk to another machine in the same network, it requires the other machine's physical or MAC address. But since the application supplies only the destination's IP address, some mechanism is needed to bind the IP address to its MAC address. This is done through the Address Resolution Protocol (ARP): the IP address of the destination node is broadcast, and the destination node informs the source of its MAC address.
    1. Assume broadcast nature of LAN
    2. Broadcast IP address of the destination
    3. Destination replies it with its MAC address.
    4. Source maintains a cache of IP and MAC address bindings
    But this means that every time machine A wants to send packets to machine B, A has to send an ARP packet to resolve the MAC address of B, which increases the traffic load too much. To reduce the communication cost, computers that use ARP maintain a cache of recently acquired IP-to-MAC address bindings, so they don't have to use ARP repeatedly.

    ARP Refinements

    Several refinements of ARP are possible:
    • When machine A wants to send packets to machine B, it is possible that machine B will send packets to machine A in the near future. So, to save B an ARP exchange, A includes its own IP-to-MAC address binding in the request packet when asking for the MAC address of B.
    • Since A broadcasts its initial request for the MAC address of B, every machine on the network can extract and store in its cache the IP-to-MAC address binding of A.
    • When a new machine appears on the network (e.g. when an operating system reboots), it can broadcast its IP-to-MAC address binding so that all other machines can store it in their caches. This eliminates a lot of ARP packets from all other machines when they later want to communicate with this new machine.
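The cache refinement can be sketched as follows (the addresses and the NETWORK dictionary, which stands in for the real LAN broadcast, are made up for illustration):

```python
# Stand-in for the LAN: who would answer an ARP broadcast for each IP.
NETWORK = {"10.0.0.2": "aa:bb:cc:dd:ee:02"}

arp_cache = {}   # recently acquired IP-to-MAC bindings
broadcasts = 0   # how many ARP broadcasts actually went out

def resolve(ip):
    """Return the MAC for an IP, broadcasting only on a cache miss."""
    global broadcasts
    if ip not in arp_cache:
        broadcasts += 1                 # cache miss: broadcast the request
        arp_cache[ip] = NETWORK[ip]     # destination replies with its MAC
    return arp_cache[ip]

print(resolve("10.0.0.2"), broadcasts)  # miss: one broadcast
print(resolve("10.0.0.2"), broadcasts)  # hit: still one broadcast
```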

    Example displaying the use of Address Resolution Protocol:

    Consider a scenario where a computer tries to contact some remote machine using ping program, assuming that there has been no exchange of IP datagrams previously between the two machines and therefore arp packet must be sent to identify the MAC address of the remote machine.
    The arp request message (who is A.A.A.A tell B.B.B.B where the two are IP addresses) is broadcast on the local area network with an Ethernet protocol type 0x806. The packet is discarded by all the machines except the target machine which responds with an arp response message (A.A.A.A is hh:hh:hh:hh:hh:hh where hh:hh:hh:hh:hh:hh is the Ethernet source address). This packet is unicast to the machine with IP address B.B.B.B. Since the arp request message included the hardware address (Ethernet source address) of the requesting computer, target machine doesn't require another arp message to figure it out.

    Reverse Address Resolution Protocol

    RARP is a protocol by which a physical machine in a local area network can request to learn its IP address from a gateway server's Address Resolution Protocol table or cache. This is needed since the machine may not have a permanently attached disk where it can store its IP address. A network administrator creates a table in a local area network's gateway router that maps the physical machine (or Medium Access Control - MAC) addresses to the corresponding Internet Protocol addresses. When a new machine is set up, its RARP client program requests its IP address from the RARP server on the router. Assuming that an entry has been set up in the router table, the RARP server will return the IP address to the machine, which can store it for future use.

    Detailed Mechanism

    Both the machine that issues the request and the server that responds use physical network addresses during their brief communication. Usually, the requester does not know the server's physical address. So, the request is broadcast to all the machines on the network. Now, the requester must identify itself uniquely to the server. For this, either the CPU serial number or the machine's physical network address can be used, but using the physical address as a unique id has two advantages.
    • These addresses are always available and do not have to be bound into bootstrap code.
    • Because the identifying information depends on the network and not on the CPU vendor, all machines on a given network will supply unique identifiers.
    Request:
    Like an ARP message, a RARP message is sent from one machine to another encapsulated in the data portion of a network frame. An Ethernet frame carrying a RARP request has the usual preamble, Ethernet source and destination addresses, and packet type fields in front of the frame. The frame contains the value 8035 (base 16) to identify its contents as a RARP message. The data portion of the frame contains the 28-octet RARP message. The sender broadcasts a RARP request that specifies itself as both the sender and target machine, and supplies its physical network address in the target hardware address field. All machines on the network receive the request, but only those authorised to supply RARP services process the request and send a reply; such machines are known informally as RARP servers. For RARP to succeed, the network must contain at least one RARP server.
    Reply:
    Servers answer a request by filling in the target protocol address field, changing the message type from request to reply, and sending the reply back directly to the machine making the request.

    Timing RARP Transactions
    Since RARP uses the physical network directly, no other protocol software will time the response or retransmit the request; RARP software must handle these tasks itself. Some workstations that rely on RARP to boot choose to retry indefinitely until they receive a response. Other implementations announce failure after only a few tries, to avoid flooding the network with unnecessary broadcasts.

    Multiple RARP Servers
    Advantage: more reliability. Disadvantage: overloading may result when all servers respond. To avoid this disadvantage, we have primary and secondary servers. Each machine that makes a RARP request is assigned a primary server. Normally the primary server responds, but if it fails, the requester may time out and rebroadcast the request. Whenever a secondary server receives a second copy of the request within a short time of the first, it responds. But there is still the danger that all secondary servers respond, overloading the network. The solution adopted is to avoid having all secondary servers transmit responses simultaneously: each secondary server that receives the request computes a random delay and then sends a response.

    Drawbacks of RARP
    • Since it operates at a low level, it requires direct access to the network hardware, which makes it difficult for an application programmer to build a server.
    • It doesn't fully utilize the capability of a network like Ethernet, which enforces a minimum packet size, since the reply from the server contains only one small piece of information: the 32-bit internet address.
    RARP is formally described in RFC 903.

    ICMP

    This protocol provides a mechanism that gateways and hosts use to communicate control or error information. The Internet Protocol provides an unreliable, connectionless datagram service, and a datagram travels from gateway to gateway until it reaches one that can deliver it directly to its final destination. If a gateway cannot route or deliver a datagram, or if it detects an unusual condition, like network congestion, that affects its ability to forward the datagram, it needs to instruct the original source to take action to avoid or correct the problem. The Internet Control Message Protocol allows gateways to send error or control messages to other gateways or hosts; ICMP provides communication between the Internet Protocol software on one machine and the Internet Protocol software on another. It is a special-purpose message mechanism added by the designers to the TCP/IP protocols to allow gateways in an internet to report errors or provide information about unexpected circumstances; the IP protocol itself contains nothing to help the sender test connectivity or learn about failures.

    Error Reporting vs Error Correction
    ICMP only reports error conditions to the original source; the source must relate errors to individual application programs and take action to correct the problems. ICMP provides a way for a gateway to report an error, but it does not fully specify the action to be taken for each possible error. ICMP is restricted to communicating with the original source, not intermediate gateways.
    ICMP Message Delivery
    ICMP messages travel across the internet in the data portion of an IP datagram, which itself travels across each physical network in the data portion of a frame. Datagrams carrying ICMP messages are routed exactly like datagrams carrying information for users; there is no additional reliability or priority. An exception is made to the error handling procedures when an IP datagram carrying an ICMP message itself causes an error: ICMP messages are not generated for errors that result from datagrams carrying ICMP error messages.
    ICMP Message Format
    It has three fields: an 8-bit integer message TYPE field that identifies the message, an 8-bit CODE field that provides further information about the message type, and a 16-bit CHECKSUM field (ICMP uses the same additive checksum algorithm as IP, but the ICMP checksum only covers the ICMP message). In addition, ICMP messages that report errors always include the header and first 64 data bits of the datagram causing the problem. The ICMP TYPE field defines the meaning of the message as well as its format.
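    As an illustrative sketch of the additive checksum (a 16-bit one's-complement sum over the ICMP message, with the CHECKSUM field taken as zero while computing), here is a small Python version; the function name is ours, not a standard API:

```python
def icmp_checksum(data: bytes) -> int:
    """16-bit one's-complement sum over 16-bit words, complemented.
    Pads odd-length data with a zero octet, as checksum routines do."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:                   # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# An 8-octet echo-request header with the checksum field zeroed:
header = b"\x08\x00\x00\x00\x12\x34\x00\x01"
csum = icmp_checksum(header)
# Re-summing a message that carries its own checksum yields zero:
assert icmp_checksum(header[:2] + csum.to_bytes(2, "big") + header[4:]) == 0
```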
    The types include:

    TYPE FIELD        ICMP MESSAGE TYPE
         0            ECHO REPLY
         3            DESTINATION UNREACHABLE
         4            SOURCE QUENCH
         5            REDIRECT (CHANGE A ROUTE)
         8            ECHO REQUEST
        11            TIME EXCEEDED FOR A DATAGRAM
        12            PARAMETER PROBLEM ON A DATAGRAM
        13            TIMESTAMP REQUEST
        14            TIMESTAMP REPLY
        15            INFORMATION REQUEST (OBSOLETE)
        16            INFORMATION REPLY (OBSOLETE)
        17            ADDRESS MASK REQUEST
        18            ADDRESS MASK REPLY

    Testing Destination Reachability and Status :
    TCP/IP protocols provide facilities to help network managers or users identify network problems. One of the most frequently used debugging tools invokes the ICMP echo request and echo reply messages. A host or gateway sends an ICMP echo request message to a specified destination. Any machine that receives an echo request formulates an echo reply and returns it to the original sender. The request contains an optional data area; the reply contains a copy of the data sent in the request. The echo request and associated reply can be used to test whether a destination is reachable and responding. Because both the request and reply travel in IP datagrams, successful receipt of a reply verifies that major pieces of the transport system work:
    1. IP software on the source must route the datagram.
    2. Intermediate gateways between the source and destination must be operating and must route the datagram correctly.
    3. The destination machine must be running, and both its ICMP and IP software must be working.
    4. Routes in gateways along the return path must be correct.

    Echo Request and Reply
    The field listed OPTIONAL DATA is a variable-length field that contains data to be returned to the sender. An echo reply always returns exactly the same data as was received in the request. The IDENTIFIER and SEQUENCE NUMBER fields are used by the sender to match replies to requests. The value of the TYPE field specifies whether the message is a request (8) or a reply (0).
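    A hedged sketch of building an echo request and matching a reply in Python (the 8-octet header layout is TYPE, CODE, CHECKSUM, IDENTIFIER, SEQUENCE NUMBER; function names are illustrative):

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """The ICMP checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    s = sum(int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2))
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def build_echo_request(identifier: int, sequence: int, data: bytes) -> bytes:
    """TYPE=8 (request), CODE=0; the checksum is computed over the
    whole message with the checksum field initially zero."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    csum = ones_complement_sum(header + data)
    return struct.pack("!BBHHH", 8, 0, csum, identifier, sequence) + data

def matches(request: bytes, reply: bytes) -> bool:
    """A reply (TYPE=0) matches a request when IDENTIFIER and
    SEQUENCE NUMBER agree and it echoes the same optional data."""
    return (reply[0] == 0 and reply[4:8] == request[4:8]
            and reply[8:] == request[8:])

req = build_echo_request(identifier=0x1234, sequence=1, data=b"optional data")
```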

    Reports of Unreachable Destinations

    The CODE field in a destination unreachable message contains an integer that further describes the problem. Possible values are:
    CODE VALUE        MEANING

         0            NETWORK UNREACHABLE
         1            HOST UNREACHABLE
         2            PROTOCOL UNREACHABLE
         3            PORT UNREACHABLE
         4            FRAGMENTATION NEEDED AND DF SET
         5            SOURCE ROUTE FAILED
         6            DESTINATION NETWORK UNKNOWN
         7            DESTINATION HOST UNKNOWN
         8            SOURCE HOST ISOLATED
         9            COMMUNICATION WITH DESTINATION NETWORK ADMINISTRATIVELY PROHIBITED
        10            COMMUNICATION WITH DESTINATION HOST ADMINISTRATIVELY PROHIBITED
        11            NETWORK UNREACHABLE FOR TYPE OF SERVICE
        12            HOST UNREACHABLE FOR TYPE OF SERVICE

    Whenever an error prevents a gateway from routing or delivering a datagram, the gateway sends a destination unreachable message back to the source and then drops the datagram. Network unreachable errors usually imply routing failures; host unreachable errors imply delivery failures. Because the message contains a short prefix of the datagram that caused the problem, the source will know exactly which address is unreachable. Destinations may be unreachable because hardware is temporarily out of service, because the sender specified a nonexistent destination address, or because the gateway does not have a route to the destination network. Although gateways send destination unreachable messages if they cannot route or deliver datagrams, not all such errors can be detected. If the datagram contains a source route option with an incorrect route, it may trigger a source route failure message. If a gateway needs to fragment a datagram but the "don't fragment" bit is set, the gateway sends a fragmentation needed message back to the source.

    Congestion and Datagram Flow Control :
    Gateways cannot reserve memory or communication resources in advance of receiving datagrams because IP is connectionless. As a result, gateways can be overrun with traffic, a condition known as congestion. Congestion arises for two reasons:
    1. A high-speed computer may be able to generate traffic faster than a network can transfer it.
    2. If many computers simultaneously need to send datagrams through a single gateway, the gateway can experience congestion, even though no single source causes the problem.
    When datagrams arrive too quickly for a host or a gateway to process, it enqueues them in memory temporarily. If the traffic continues, the host or gateway eventually exhausts memory and must discard additional datagrams that arrive. A machine uses ICMP source quench messages to relieve congestion. A source quench message is a request for the source to reduce its current rate of datagram transmission.
    There is no ICMP message to reverse the effect of a source quench.
    Source Quench :
    Source quench messages have a field that contains a datagram prefix in addition to the usual ICMP TYPE, CODE and CHECKSUM fields. Congested gateways send one source quench message each time they discard a datagram; the datagram prefix identifies the datagram that was dropped.

    Route Change Requests From Gateways :
    Internet routing tables are initialized by hosts from a configuration file at system startup, and system administrators seldom make routing changes during normal operations. Gateways exchange routing information periodically to accommodate network changes and keep their routes up-to-date. The general rule is: gateways are assumed to know correct routes; hosts begin with minimal routing information and learn new routes from gateways. The GATEWAY INTERNET ADDRESS field contains the address of a gateway that the host is to use to reach the destination mentioned in the datagram header. The INTERNET HEADER field contains the IP header plus the next 64 bits of the datagram that triggered the message. The CODE field of an ICMP redirect message further specifies how to interpret the destination address, based on values assigned as follows:

    Code Value                           Meaning
    0                                   REDIRECT DATAGRAMS FOR THE NET
    1                                   REDIRECT DATAGRAMS FOR THE HOST
    2                                   REDIRECT DATAGRAMS FOR THE TYPE OF SERVICE AND NET
    3                                   REDIRECT DATAGRAMS FOR THE TYPE OF SERVICE AND HOST
    Gateways only send ICMP redirect requests to hosts and not to other gateways.

    Detecting Circular or Excessively Long Routes :
    Internet gateways compute a next hop using local tables, and errors in routing tables can produce a routing cycle for some destination. A routing cycle can consist of two gateways that each route a datagram for a particular destination to the other, or it can consist of several gateways. To prevent datagrams from circling forever in a TCP/IP internet, each IP datagram contains a time-to-live counter, sometimes called a hop count. A gateway decrements the time-to-live counter whenever it processes the datagram and discards the datagram when the count reaches zero. Whenever a gateway discards a datagram, either because its hop count has reached zero or because its reassembly timer expired while waiting for fragments, it sends an ICMP time exceeded message back to the datagram's source.
    The code field explains the nature of the timeout :
    Code Value                           Meaning
    0                                    TIME-TO-LIVE COUNT EXCEEDED
    1                                    FRAGMENT REASSEMBLY TIME EXCEEDED
    Fragment reassembly refers to the task of collecting all the fragments from a datagram.

    Reporting Other Problems :
    When a gateway or host finds problems with a datagram not covered by previous ICMP error messages, it sends a parameter problem message to the original source. To make the message unambiguous, the sender uses the POINTER field in the message header to identify the octet in the datagram that caused the problem. Code 1 is used to report that a required option is missing; the POINTER field is not used for code 1.

    Clock Synchronization and Transit Time Estimation :
    ICMP messages can be used to obtain the time from another machine. A requesting machine sends an ICMP timestamp request message to another machine, asking that the second machine return its current value of the time of day. The receiving machine returns a timestamp reply back to the machine making the request. The TCP/IP protocol suite includes several protocols that can be used to synchronize clocks; this is one of the simplest techniques used by TCP/IP. The TYPE field identifies the message as a request (13) or a reply (14); the IDENTIFIER and SEQUENCE NUMBER fields are used by the source to associate replies with requests. The ORIGINATE TIMESTAMP field is filled in by the original sender just before the packet is transmitted, the RECEIVE TIMESTAMP field is filled immediately upon receipt of a request, and the TRANSMIT TIMESTAMP field is filled immediately before the reply is transmitted. Hosts use the three timestamp fields to compute estimates of the delay time between them and to synchronize their clocks. A host can compute the total time required for a request to travel to a destination, be transformed into a reply, and return. In practice, accurate estimation of round-trip delay can be difficult and can substantially restrict the utility of ICMP timestamp messages; to obtain an accurate estimate of round-trip delay one must take many measurements and average them.
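    The delay computation described above can be sketched as follows; timestamps are in milliseconds, `arrived` is read from the requester's own clock when the reply returns, and the equal-one-way-delay assumption is ours, not part of the protocol:

```python
def rtt_and_offset(originate, receive, transmit, arrived):
    """Round-trip delay and clock-offset estimate from one ICMP
    timestamp exchange. originate/arrived are on the requester's
    clock; receive/transmit are on the responder's clock."""
    # total elapsed time minus the responder's processing time
    rtt = (arrived - originate) - (transmit - receive)
    # assumes the two one-way delays are equal (often inaccurate,
    # which is why many measurements must be averaged)
    offset = ((receive - originate) + (transmit - arrived)) / 2
    return rtt, offset

rtt, offset = rtt_and_offset(originate=1000, receive=1600,
                             transmit=1605, arrived=1015)
# rtt == 10 ms; the responder's clock appears about 595 ms ahead
```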

    Obtaining a Subnet Mask:
    Subnet addressing lets hosts use some bits in the hostid portion of their IP address to identify a physical network. To participate in subnet addressing, hosts need to know which bits of the 32-bit internet address correspond to the physical network and which correspond to host identifiers. The information needed to interpret the address is represented in a 32-bit quantity called the subnet mask. To learn the subnet mask used for the local network, a machine can send an address mask request message to a gateway and receive an address mask reply. The TYPE field in an address mask message specifies whether the message is a request (17) or a reply (18). A reply contains the network's subnet address mask in the ADDRESS MASK field. The IDENTIFIER and SEQUENCE NUMBER fields allow a machine to associate replies with requests.
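    Once a host has the mask, extracting the network and host portions is a bitwise AND; a minimal sketch (the addresses are chosen arbitrarily for illustration):

```python
def split_address(ip: str, mask: str):
    """Return (network address, host identifier) for a dotted-quad
    IP address and subnet mask."""
    to_int = lambda a: int.from_bytes(bytes(int(x) for x in a.split(".")), "big")
    to_str = lambda n: ".".join(str(b) for b in n.to_bytes(4, "big"))
    ip_n, mask_n = to_int(ip), to_int(mask)
    # network bits: AND with the mask; host bits: AND with its complement
    return to_str(ip_n & mask_n), ip_n & (~mask_n & 0xFFFFFFFF)

net, host = split_address("128.10.2.30", "255.255.255.0")
# net == "128.10.2.0", host == 30
```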

    Routing Algorithms


    Non-Hierarchical Routing

    In this type of routing, interconnected networks are viewed as a single network, where bridges, routers and gateways are just additional nodes.
    • Every node keeps information about every other node in the network
    • In case of adaptive routing, the routing calculations are done and updated for all the nodes.
    The above two are also the disadvantages of non-hierarchical routing, since the table sizes and the routing calculations become too large as the networks get bigger. So this type of routing is feasible only for small networks.

    Hierarchical Routing

    This is essentially a 'Divide and Conquer' strategy. The network is divided into different regions and a router for a particular region knows only about its own domain and other routers. Thus, the network is viewed at two levels:
    1. The Sub-network level, where each node in a region has information about its peers in the same region and about the region's interface with other regions. Different regions may have different 'local' routing algorithms. Each local algorithm handles the traffic between nodes of the same region and also directs the outgoing packets to the appropriate interface.
    2. The Network Level, where each region is considered as a single node connected to its interface nodes. The routing algorithms at this level handle the routing of packets between two interface nodes, and are isolated from intra-regional transfer.
    Networks can be organized in hierarchies of many levels; e.g. local networks of a city at one level, the cities of a country at a level above it, and finally the network of all nations. In Hierarchical routing, the interfaces need to store information about:
    • All nodes in its region which are at one level below it.
    • Its peer interfaces.
    • At least one interface at a level above it, for outgoing packets.
    Advantages of Hierarchical Routing :
    • Smaller sizes of routing tables.
    • Substantially fewer calculations and updates of routing tables.
    Disadvantage :
    • Once the hierarchy is imposed on the network, it is followed and the possibility of direct paths is ignored. This may lead to suboptimal routing.

    Source Routing

    Source routing is similar in concept to virtual circuit routing. It is implemented as follows:
    • Initially, a path between nodes wishing to communicate is found out, either by flooding or by any other suitable method.
    • This route is then specified in the header of each packet routed between these two nodes. A route may also be specified partially, or in terms of some intermediate hops.
    Advantages:
    • Bridges do not need to look up their routing tables since the path is already specified in the packet itself.
    • The throughput of the bridges is higher, and this may lead to better utilization of bandwidth, once a route is established.
    Disadvantages:
    • Establishing the route at first needs an expensive search method like flooding.
    • To cope with dynamic relocation of nodes in a network, frequent updates of the tables are required, else all packets would be sent in the wrong direction. This too is expensive.

    Policy Based Routing

    In this type of routing, certain restrictions are put on the type of packets accepted and sent. For example, the IIT-K router may decide to handle traffic pertaining to its departments only, and reject packets from other sources. This kind of routing is used for links with very low capacity or for security purposes.

    Shortest Path Routing

    Here, the central question dealt with is: how to determine the optimal path for routing? Various algorithms are used to determine the optimal routes with respect to some predetermined criteria. A network is represented as a graph, with its terminals as nodes and the links as edges. A 'length' is associated with each edge, which represents the cost of using the link for transmission. The lower the cost, the more suitable the link. The cost is determined depending upon the criteria to be optimized. Some of the important ways of determining the cost are:
    • Minimum number of hops: If each link is given a unit cost, the shortest path is the one with minimum number of hops. Such a route is easily obtained by a breadth first search method. This is easy to implement but ignores load, link capacity etc.
    • Transmission and Propagation Delays: If the cost is fixed as a function of transmission and propagation delays, it will reflect the link capacities and the geographical distances. However these costs are essentially static and do not consider the varying load conditions.
    • Queuing Delays: If the cost of a link is determined through its queuing delays, it takes care of the varying load conditions, but not of the propagation delays.
    Ideally, the cost parameter should consider all the above mentioned factors, and it should be updated periodically to reflect the changes in the loading conditions. However, if the routes are changed according to the load, the load changes again. This feedback effect between routing and load can lead to undesirable oscillations and sudden swings.
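    The minimum-hop criterion above reduces to a breadth-first search on the network graph; a small sketch (the node names and adjacency list are invented for illustration):

```python
from collections import deque

def min_hops(adj, src):
    """Breadth-first search: hop counts from src, treating every
    link as unit cost."""
    hops = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in hops:        # first visit is the shortest
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops

adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
       "D": ["B", "C", "E"], "E": ["D"]}
# min_hops(adj, "A") -> {"A": 0, "B": 1, "C": 1, "D": 2, "E": 3}
```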

    Routing Algorithms

    As mentioned above, the shortest paths are calculated using suitable algorithms on the graph representations of the networks. Let the network be represented by a graph G(V, E) and let the number of nodes be N. For all the algorithms discussed below, the costs associated with the links are assumed to be positive. A node has zero cost with respect to itself. Further, all the links are assumed to be symmetric: if d(i,j) is the cost of the link from node i to node j, then d(i,j) = d(j,i). The graph is assumed to be complete; if there exists no edge between two nodes, then a link of infinite cost is assumed. The algorithms given below find costs of the paths from all nodes to a particular node; the problem is equivalent to finding the cost of paths from a source to all destinations.

    Bellman-Ford Algorithm

    This algorithm iterates on the number of edges in a path to obtain the shortest path. Since the number of hops possible is limited (cycles are implicitly not allowed), the algorithm terminates, giving the shortest path. Notation:
        d(i,j)   =   length of the link between nodes i and j, indicating the cost of the link.
        h        =   number of hops.
        D[i,h]   =   shortest path length from node i to node 1, with up to h hops.
        D[1,h]   =   0 for all h.

    Algorithm :

        Initial condition  :   D[i,0] = infinity, for all i (i != 1)
        Iteration          :   D[i,h+1] = min { d(i,j) + D[j,h] } over all values of j
        Termination        :   the algorithm terminates when
                               D[i,h] = D[i,h+1] for all i
    Principle:
    For zero hops, the minimum length path has a length of infinity, for every node. For one hop, the shortest-path length associated with a node is equal to the length of the edge between that node and node 1. Hereafter, we increment the number of hops allowed (from h to h+1) and find out whether a shorter path exists through each of the other nodes. If it exists, say through node j, then its length must be the sum of the length between these two nodes (i.e. d(i,j)) and the shortest path between j and 1 obtainable in up to h hops. If such a path doesn't exist, then the path length remains the same. The algorithm is guaranteed to terminate, since there are at most N nodes, and so at most N-1 hops in a path. It has a time complexity of O(N^3).
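    Under the assumptions above (symmetric positive costs, infinite cost for missing links), the iteration can be sketched in Python; node 1 of the notation becomes index 0 here, and the example graph is invented:

```python
INF = float("inf")

def bellman_ford(d, dest=0):
    """Shortest-path cost from every node to dest, iterating on the
    allowed hop count h: the list D after the h-th pass is D[i,h]."""
    n = len(d)
    D = [INF] * n
    D[dest] = 0                       # D[1,h] = 0 for all h
    while True:
        new = [min(d[i][j] + D[j] for j in range(n)) for i in range(n)]
        new[dest] = 0
        if new == D:                  # D[i,h] == D[i,h+1] for all i
            return D
        D = new

d = [[0, 1, 4, INF],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [INF, 6, 3, 0]]
# bellman_ford(d) -> [0, 1, 3, 6]
```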

    Dijkstra's Algorithm

    Notation:
        Dj       =   length of the shortest path from node j to node 1.
        d(i,j)   =   length of the link between nodes i and j.
    Algorithm
    Each node j is labeled with Dj, which is an estimate of the cost of the path from node j to node 1. Initially, let the estimates be infinity, indicating that nothing is known about the paths. We now iterate on the length of paths, each time revising our estimates to lower values, as we obtain them. Actually, we divide the nodes into two groups: the first one, called set P, contains the nodes whose shortest distances have been found, and the other, Q, contains all the remaining nodes. Initially P contains only the node 1. At each step, we select the node in Q that has the minimum cost path to node 1 and transfer it to set P; at the first step, this corresponds to shifting the node closest to 1 into P. Its minimum cost to node 1 is now known. After transferring a node i, we update the label of each remaining node j using:
        Dj   =   min [ Dj , Di + d(i,j) ]
    Finally, after N-1 iterations, the  shortest paths for all nodes are known, and the algorithm terminates.
     

    Principle
    Let the closest node to 1 at some step be i. Then i is shifted to P. Now, for each node j, the shortest path to 1 either passes through i or it doesn't. If it passes through i, the revised estimate of Dj is the sum Di + d(i,j); if it doesn't, Dj remains the same. So we take the minimum of these two cases and update Dj accordingly. As each of the nodes gets transferred to set P, the estimates get closer to the lowest possible value. When a node is transferred, its shortest path length is known. So finally all the nodes are in P and the Dj's represent the minimum costs. The algorithm is guaranteed to terminate in N-1 iterations, and its complexity is O(N^2).
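    A sketch of the same procedure in Python, reusing the conventions above (node 1 becomes index 0, INF stands in for missing links, and the example graph is invented):

```python
INF = float("inf")

def dijkstra(d, src=0):
    """Shortest-path costs from src to all nodes. Q holds the nodes
    whose distance is still tentative; a node leaves Q (i.e. enters
    the finished set P) when it is the closest remaining one."""
    n = len(d)
    D = [INF] * n
    D[src] = 0
    Q = set(range(n))
    while Q:
        i = min(Q, key=lambda k: D[k])       # closest remaining node
        Q.remove(i)                           # i moves from Q into P
        for j in Q:
            D[j] = min(D[j], D[i] + d[i][j])  # revise the labels
    return D

d = [[0, 1, 4, INF],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [INF, 6, 3, 0]]
# dijkstra(d) -> [0, 1, 3, 6]
```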

    The Floyd Warshall Algorithm

    This algorithm iterates on the set of nodes that can be used as intermediate nodes on paths. This set grows from a single node (say node 1) at the start to finally all the nodes of the graph. At each iteration, we find the shortest paths using the given set of nodes as intermediate nodes, so that finally all the shortest paths are obtained.
    Notation
        D(i,j)[n]   =   length of the shortest path between nodes i and j using only the nodes 1, 2, ..., n as intermediate nodes.
    Initial Condition
        D(i,j)[0]   =   d(i,j)   for all nodes i, j.
    Algorithm
    Initially n = 0. At each iteration, add the next node to the set of intermediate nodes, i.e. for n = 0, 1, ..., N-1:

        D(i,j)[n+1]   =   min { D(i,j)[n] , D(i,n+1)[n] + D(n+1,j)[n] }
    Principle
    Suppose the shortest path between i and j using nodes 1, 2, ..., n is known. Now, if node n+1 is allowed to be an intermediate node, then the shortest path under the new conditions either passes through node n+1 or it doesn't. If it does not pass through node n+1, then D(i,j)[n+1] is the same as D(i,j)[n]. Else, we find the cost of the new route, which is obtained from the sum D(i,n+1)[n] + D(n+1,j)[n]. So we take the minimum of these two cases at each step. After adding all the nodes to the set of intermediate nodes, we obtain the shortest paths between all pairs of nodes together. The complexity of the Floyd-Warshall algorithm is O(N^3).
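    A sketch of the iteration in Python (the matrix uses the same conventions as before: d[i][i] = 0, INF for missing links, and the example graph is invented):

```python
INF = float("inf")

def floyd_warshall(d):
    """All-pairs shortest paths. After the k-th outer iteration,
    D[i][j] is the shortest path length using only the first k+1
    nodes as intermediates, i.e. D(i,j)[k+1] in the notation above."""
    n = len(d)
    D = [row[:] for row in d]         # D(i,j)[0] = d(i,j)
    for k in range(n):                # grow the set of intermediates
        for i in range(n):
            for j in range(n):
                D[i][j] = min(D[i][j], D[i][k] + D[k][j])
    return D

d = [[0, 1, 4, INF],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [INF, 6, 3, 0]]
# floyd_warshall(d)[0] -> [0, 1, 3, 6]
```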
    It is observed that all the three algorithms mentioned above give comparable performance, depending upon the exact topology of the network.

    Network Layer(cont) - Routing Algorithms Classification

    Network Layer (Continued...)

    The network layer is concerned with getting packets from the source all the way to the destination. The packets may need to make many hops at intermediate routers while reaching the destination. This is the lowest layer that deals with end-to-end transmission. In order to achieve its goals, the network layer must know about the topology of the communication network. It must also take care to choose routes that avoid overloading some of the communication lines while leaving others idle. The main functions performed by the network layer are as follows:
    • Routing
    • Congestion Control
    • Internetworking

    Routing

    Routing is the process of forwarding of a packet in a network so that it reaches its intended destination. The main goals of routing are:
    1. Correctness: The routing should be done properly and correctly so that the packets may reach their proper destination.
    2. Simplicity: The routing should be done in a simple manner so that the overhead is as low as possible. With increasing complexity of the routing algorithms the overhead also increases.
    3. Robustness: Once a major network becomes operative, it may be expected to run continuously for years without any failures. The algorithms designed for routing should be robust enough to handle hardware and software failures and should be able to cope with changes in the topology and traffic without requiring all jobs in all hosts to be aborted and the network rebooted every time some router goes down.
    4. Stability: The routing algorithms should be stable under all possible circumstances.
    5. Fairness: Every node connected to the network should get a fair chance of transmitting its packets. This is generally done on a first come first serve basis.
    6. Optimality: The routing algorithms should be optimal in terms of throughput and minimizing mean packet delays. There is a trade-off between these two goals, and one has to choose depending on the application.

    Classification of Routing Algorithms

    The routing algorithms may be classified as follows:
    1. Adaptive Routing Algorithm: These algorithms change their routing decisions to reflect changes in the topology and in traffic as well. These get their routing information from adjacent routers or from all routers. The optimization parameters are the distance, number of hops and estimated transit time. This can be further classified as follows:
      1. Centralized: In this type some central node in the network gets entire information about the network topology, about the traffic and about other nodes. This then transmits this information to the respective routers. The advantage of this is that only one node is required to keep the information. The disadvantage is that if the central node goes down the entire network is down, i.e. single point of failure.
      2. Isolated: In this method the node decides the routing without seeking information from other nodes. The sending node does not know about the status of a particular link. The disadvantage is that a packet may be sent through a congested route, resulting in delay. Some examples of this type of routing algorithm are:
        • Hot Potato: When a packet comes to a node, it tries to get rid of it as fast as it can, by putting it on the shortest output queue without regard to where that link leads. A variation of this algorithm is to combine static routing with the hot potato algorithm. When a packet arrives, the routing algorithm takes into account both the static weights of the links and the queue lengths.
        • Backward Learning: In this method the routing tables at each node get modified by information from the incoming packets. One way to implement backward learning is to include the identity of the source node in each packet, together with a hop counter that is incremented on each hop. When a node receives a packet on a particular line, it notes down the number of hops the packet has taken to reach it from the source node. If the previous value of the hop count stored in the node is better than the current one then nothing is done, but if the current value is better then it is stored for future use. The problem with this is that when the best route goes down, the node cannot recall the second best route to a particular node. Hence all the nodes have to forget the stored information periodically and start all over again.
      3. Distributed: In this the node receives information from its neighbouring nodes and then takes the decision about which way to send the packet. The disadvantage is that if, in the interval between receiving information and sending the packet, something changes, then the packet may be delayed.
    2. Non-Adaptive Routing Algorithm: These algorithms do not base their routing decisions on measurements and estimates of the current traffic and topology. Instead the route to be taken in going from one node to the other is computed in advance, off-line, and downloaded to the routers when the network is booted. This is also known as static routing. This can be further classified as:
      2. Flooding: Flooding adopts the technique in which every incoming packet is sent on every outgoing line except the one on which it arrived. One problem with this method is that packets may go in a loop. As a result a node may receive several copies of a particular packet, which is undesirable. Some techniques adopted to overcome these problems are as follows:
        • Sequence Numbers: Every packet is given a sequence number. When a node receives the packet it sees its source address and sequence number. If the node finds that it has sent the same packet earlier then it will not transmit the packet and will just discard it.
        • Hop Count: Every packet has a hop count associated with it. This is decremented(or incremented) by one by each node which sees it. When the hop count becomes zero(or a maximum possible value) the packet is dropped.
        • Spanning Tree: The packet is sent only on those links that lead to the destination by constructing a spanning tree rooted at the source. This avoids loops in transmission but is possible only when all the intermediate nodes have knowledge of the network topology.
        Flooding is not practical for general kinds of applications. But in cases where high degree of robustness is desired such as in military applications, flooding is of great help.
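        The sequence-number and hop-count safeguards can be combined in a short sketch. The names and packet representation here are illustrative assumptions, not part of any standard.

```python
def flood(node, packet, seen, neighbours, incoming=None):
    """Forward packet on every line except the one it arrived on.

    seen is this node's set of (source, sequence number) pairs already
    handled; a packet is discarded if it is a duplicate or its hop
    count has run out. Returns the list of (next_node, packet) pairs.
    """
    key = (packet["src"], packet["seq"])
    if key in seen or packet["hops"] <= 0:
        return []                                  # duplicate or expired: discard
    seen.add(key)
    out = dict(packet, hops=packet["hops"] - 1)    # decrement hop count
    return [(n, out) for n in neighbours[node] if n != incoming]

neighbours = {"B": ["A", "C", "D"]}
seen = set()
pkt = {"src": "A", "seq": 1, "hops": 3}
flood("B", pkt, seen, neighbours, incoming="A")    # forwards copies to C and D
flood("B", pkt, seen, neighbours, incoming="A")    # duplicate: returns []
```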
      2. Random Walk: In this method a packet is sent by the node to one of its neighbours randomly. This algorithm is highly robust. When the network is highly interconnected, this algorithm has the property of making excellent use of alternative routes. It is usually implemented by sending the packet onto the least queued link.

    Delta Routing

    Delta routing is a hybrid of the centralized and isolated routing algorithms. Here each node computes the cost of each line (i.e. some function of the delay, queue length, utilization, bandwidth etc.) and periodically sends a packet to the central node giving it these values. The central node then computes the k best paths from node i to node j. Let Cij1 be the cost of the best i-j path, Cij2 the cost of the next best path, and so on. If Cijn - Cij1 < delta (where Cijn is the cost of the n'th best i-j path and delta is some constant), then path n is regarded as equivalent to the best i-j path, since their costs differ by so little. When delta -> 0 this algorithm becomes centralized routing, and when delta -> infinity all the paths become equivalent.
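    The delta rule can be written out directly. This is a toy sketch; the cost values are made up for illustration.

```python
def equivalent_paths(costs, delta):
    """Return indices of paths whose cost is within delta of the best.

    costs holds Cij1, Cij2, ... for the k best i-j paths; a path n is
    equivalent to the best one when Cijn - Cij1 < delta.
    """
    best = min(costs)
    return [i for i, c in enumerate(costs) if c - best < delta]

costs = [10, 12, 19]              # costs of the k best i-j paths
equivalent_paths(costs, 5)        # [0, 1]: the two cheapest are interchangeable
equivalent_paths(costs, 0.001)    # delta -> 0: only the best path (centralized)
equivalent_paths(costs, 1000)     # delta -> infinity: all paths equivalent
```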

    Multipath Routing

    In the above algorithms it has been assumed that there is a single best path between any pair of nodes and that all traffic between them should use it. In many networks, however, there are several paths between pairs of nodes that are almost equally good. Sometimes, in order to improve performance, multiple paths between a single pair of nodes are used. This technique is called multipath routing or bifurcated routing. In this each node maintains a table with one row for each possible destination node. A row gives the best, second best, third best, etc. outgoing line for that destination, together with a relative weight. Before forwarding a packet, the node generates a random number and then chooses among the alternatives, using the weights as probabilities. The tables are worked out manually and loaded into the nodes before the network is brought up, and are not changed thereafter.
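    The weighted random choice is simple to sketch. The table contents and line names below are illustrative assumptions.

```python
import random

def choose_line(table, destination, rng=random):
    """Pick an outgoing line for destination, using weights as probabilities."""
    lines, weights = zip(*table[destination])
    return rng.choices(lines, weights=weights, k=1)[0]

# best, second best, and third best line for destination "D", with relative weights
table = {"D": [("line1", 0.63), ("line2", 0.21), ("line3", 0.16)]}
choose_line(table, "D")   # usually "line1", but sometimes an alternative
```

    Over many packets, traffic splits between the lines roughly in proportion to the weights, which is the whole point of bifurcated routing.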

    Hierarchical Routing

    In this method of routing the nodes are divided into regions based on hierarchy. A particular node can communicate with nodes at the same hierarchical level or with nodes at a lower level directly under it. Here, the path from any source to a destination is fixed, and there is exactly one such path if the hierarchy is a tree.

    Network Layer

    What is Network Layer?

    The network layer is concerned with getting packets from the source all the way to the destination. The packets may need to make many hops at intermediate routers while reaching the destination. This is the lowest layer that deals with end to end transmission. In order to achieve its goals, the network layer must know about the topology of the communication network. It must also take care to choose routes so as to avoid overloading some of the communication lines while leaving others idle. The network layer-transport layer interface frequently is the interface between the carrier and the customer, that is, the boundary of the subnet. The functions of this layer include :
    1. Routing - The process of transferring packets received from the Data Link Layer of the source network to the Data Link Layer of the correct destination network is called routing. It involves decision making at each intermediate node on where to send the packet next so that it eventually reaches its destination. The node which makes this choice is called a router. For routing we require some mode of addressing which is recognized by the Network Layer. This addressing is different from the MAC layer addressing.
    2. Inter-networking - The network layer is the same across all physical networks (such as Token-Ring and Ethernet). Thus, if two physically different networks have to communicate, the packets that arrive at the Data Link Layer of the node which connects these two physically different networks would be stripped of their headers and passed to the Network Layer. The network layer would then pass this data to the Data Link Layer of the other physical network.
    3. Congestion Control - If the incoming rate of packets arriving at a router is more than the outgoing rate, then congestion is said to occur. Congestion may be caused by many factors. If, suddenly, packets begin arriving on many input lines and all need the same output line, then a queue will build up. If there is insufficient memory to hold all of them, packets will be lost. But even if routers have an infinite amount of memory, congestion gets worse, because by the time packets reach the front of the queue, they have already timed out (repeatedly), and duplicates have been sent. All these packets are dutifully forwarded to the next router, increasing the load all the way to the destination. Another cause of congestion is slow processors: if a router's CPU is slow at performing the bookkeeping tasks required of it, queues can build up even though there is excess line capacity. Similarly, low-bandwidth lines can also cause congestion.
    We will now look at these functions one by one.

    Addressing Scheme
    IP addresses are of 4 bytes and consist of :
    i) The network address, followed by
    ii) The host address
    The first part identifies a network on which the host resides and the second part identifies the particular host on the given network. Nodes which have more than one interface to a network must be assigned a separate internet address for each interface. This two-level addressing makes it easier to find and deliver data to the destination. A fixed size for each of these parts would lead to wastage or under-usage: either there will be too many network addresses with few hosts in each (which causes problems for routers that route based on the network address), or there will be very few network addresses and lots of hosts (which is a waste for small network requirements). Thus, we do away with any notion of fixed sizes for the network and host addresses.
    We classify networks as follows:
    1. Large Networks : 8-bit network address and 24-bit host address. There are approximately 16 million hosts per network and a maximum of 126 ( 2^7 - 2 ) Class A networks can be defined. The calculation requires that 2 be subtracted because 0.0.0.0 is reserved for use as the default route and 127.0.0.0 be reserved for the loop back function. Moreover each Class A network can support a maximum of 16,777,214 (2^24 - 2) hosts per network. The host calculation requires that 2 be subtracted because all 0's are reserved to identify the network itself and all 1s are reserved for broadcast addresses. The reserved numbers may not be assigned to individual hosts.
    2. Medium Networks : 16-bit network address and 16-bit host address. There are approximately 65000 hosts per network and a maximum of 16,384 (2^14) Class B networks can be defined with up to (2^16-2) hosts per network.
    3. Small networks : 24-bit network address and 8-bit host address. There are approximately 250 hosts per network.
    You might think that Large and Medium networks are sort of a waste, as few corporations/organizations are large enough to have 65000 different hosts. (By the way, there are very few corporations in the world with even close to 65000 employees, and even in these corporations it is highly unlikely that each employee has his/her own computer connected to the network.) Well, if you think so, you're right. This decision seems to have been a mistake.
    Address Classes
    The IP specifications divide addresses into the following classes :
    • Class A - For large networks
      | 0 | 7 bits of network address | 24 bits of host address |
    • Class B - For medium networks
      | 1 0 | 14 bits of network address | 16 bits of host address |
    • Class C - For small networks
      | 1 1 0 | 21 bits of network address | 8 bits of host address |
    • Class D - For multi-cast messages ( multi-cast to a "group" of networks )
      | 1 1 1 0 | 28 bits of group address |
    • Class E - Currently unused, reserved for potential uses in the future
      | 1 1 1 1 | 28 bits |
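    Since the class is determined entirely by the leading bits, it can be read straight off the first octet of a dotted-quad address. A small hypothetical helper:

```python
def address_class(ip):
    """Return the class (A-E) of a dotted-quad IP address string."""
    first = int(ip.split(".")[0])
    if first < 128:  return "A"   # leading bit  0    : 0   - 127
    if first < 192:  return "B"   # leading bits 10   : 128 - 191
    if first < 224:  return "C"   # leading bits 110  : 192 - 223
    if first < 240:  return "D"   # leading bits 1110 : 224 - 239
    return "E"                    # leading bits 1111 : 240 - 255

address_class("10.0.0.1")      # 'A'
address_class("144.16.0.1")    # 'B'
address_class("224.0.0.5")     # 'D'
```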

    Internet Protocol

    Special Addresses : There are some special IP addresses :
    1. Broadcast Addresses : They are of two types :
      (i) Limited Broadcast : It consists of all 1's, i.e., the address is 255.255.255.255 . It is used only on the LAN, and not for any external network.
      (ii) Directed Broadcast : It consists of the network number + all other bits as 1's. It reaches the router corresponding to the network number, and from there it broadcasts to all the nodes in the network. This method is a major security problem, and is not used anymore. So now if we find that all the bits in the host no. field are 1, then the packet is simply dropped. Therefore, now we can only broadcast in our own network using Limited Broadcast.
    2. Network ID = 0
      It means we are referring to this network and for local broadcast we make the host ID zero.
    3. Host ID = 0
      This is used to refer to the entire network in the routing table.
    4. Loop-back Address
      Here we have addresses of the type 127.x.y.z . A packet sent to such an address goes down to the IP layer and comes back up to the application layer on the same host. This is used to test network applications before they are used commercially.
    Subnetting
    Subnetting means organizing hierarchies within the network by dividing the host ID as per our network. For example, consider the network ID : 150.29.x.y
    We could organize the remaining 16 bits in any way, like :
    4 bits - department
    4 bits - LAN
    8 bits - host
    This gives some structure to the host IDs. This division is not visible to the outside world, which still sees just the network number and the host number (as a whole). The network will have an internal routing table which stores information about which router to send an address to. Now consider the case where we have : 8 bits - subnet number, and 8 bits - host number. Each router on the network must know about all subnet numbers. For this we use a subnet mask: we put the network number and subnet number bits as 1 and the host bits as 0. Therefore, in this example the subnet mask becomes : 255.255.255.0 . The hosts also need to know the subnet mask when they send a packet. To find if two addresses are on the same subnet, we can AND the source address with the subnet mask, and the destination address with the subnet mask, and see if the two results are the same. The basic reason for subnetting was avoiding broadcast. But if at the lower level our switches are smart enough to send directed messages, then we do not need subnetting. However, subnetting has some security related advantages.
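    The AND test described above, written out over 32-bit integers. The addresses reuse the 150.29.x.y example; the helper names are invented for the sketch.

```python
def to_int(ip):
    """Convert a dotted-quad address string to a 32-bit integer."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def same_subnet(src, dst, mask):
    """Two addresses are on the same subnet when (address AND mask) matches."""
    return to_int(src) & to_int(mask) == to_int(dst) & to_int(mask)

same_subnet("150.29.5.10", "150.29.5.200", "255.255.255.0")  # True: same subnet 5
same_subnet("150.29.5.10", "150.29.6.20",  "255.255.255.0")  # False: subnets 5 and 6
```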

    Supernetting
    This is moving towards class-less addressing. We could say that the network number is 21 bits ( covering 8 class C networks ). For example : a.b.c.d / 21 means only look at the first 21 bits as the network address.
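    The /21 notation can be explored with Python's standard-library ipaddress module; the 192.168.8.0 block below is just an example value, not from the notes.

```python
import ipaddress

# a /21 block: the first 21 bits are the network, the remaining 11 are host bits
net = ipaddress.ip_network("192.168.8.0/21")
net.num_addresses                              # 2048 (2^11), i.e. 8 class-C-sized blocks
ipaddress.ip_address("192.168.9.1") in net     # True: first 21 bits match
ipaddress.ip_address("192.168.16.1") in net    # False: outside the /21
```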

    Addressing on IITK Network
    If we do not have a direct connection with the outside world then we could have Private IP addresses ( 172.31 ) which are not to be publicised and routed to the outside world. Switches will make sure that they do not broadcast packets with such addresses to the outside world. The basic reason for implementing subnetting was to avoid broadcast. So in our case we can have some subnets for security and other reasons, although if the switches could do the routing properly, then we would not need subnets. In the IITK network we have three subnets - CC and the CSE building are two subnets and the rest of the campus is one subnet.

    Packet Structure
    Version Number (4 bits)
    Header Length (4 bits)
    Type of Service (8 bits)
    Total Length (16 bits)
    ID (16 bits)
    Flags (3 bits)
    Fragment Offset (13 bits)
    Time To Live (8 bits)
    Protocol (8 bits)
    Header Checksum (16 bits)
    Source (32 bits)
    Destination (32 bits)

    Options
    Version Number : The current version is Version 4 (0100).
    1. Header Length : We could have multiple sized headers, so we need this field. The header will always be a multiple of 4 bytes, and the maximum value of the field is 15, so the maximum size of the header is 60 bytes ( 20 bytes are mandatory ).
    2. Type Of Service (ToS) : This helps the router in taking the right routing decisions. The structure is :
      First three bits : They specify the precedences i.e. the priority of the packets.
      Next three bits :
      • D bit - D stands for delay. If the D bit is set to 1, then this means that the application is delay sensitive, so we should try to route the packet with minimum delay.
      • T bit - T stands for throughput. This tells us that this particular operation is throughput sensitive.
      • R bit - R stands for reliability. This tells us that we should route this packet through a more reliable network.
      Last two bits: The last two bits are never used. Unfortunately, no router in this world looks at these bits and so no application sets them nowadays. The second word of the header is meant for handling fragmentation. If a link cannot transmit large packets, then we fragment the packet and put sufficient information in the header for reassembly at the destination.
    3. ID Field : The source address and the ID field together identify the fragments of a unique packet, so all fragments of the same packet carry the same ID.
    4. Offset : It is a 13 bit field that represents where in the packet the current fragment starts. Each unit of offset represents 8 bytes of the packet, so the packet size can be at most 64 KB. Every fragment except the last one must have its size in bytes as a multiple of 8 in order to comply with this structure. The reason the position of a fragment is given as an offset value, instead of simply numbering each fragment, is that refragmentation may occur somewhere on the path to the other node. Fragmentation, though supported by IPv4, is not encouraged, because if even one fragment is lost the entire packet needs to be discarded. A quantity M.T.U (Maximum Transmission Unit) is defined for each link in the route; it is the size of the largest packet that can be handled by the link. The Path-M.T.U is then the size of the largest packet that can be handled by the path: the smallest of all the MTUs along the path. Given information about the path MTU we can send packets smaller than the path MTU and thus prevent fragmentation. This does not completely prevent it, because routing tables may change, leading to a change in the path.
    5. Flags :It has three bits -
      • M bit : If M is one, then there are more fragments on the way and if M is 0, then it is the last fragment
      • DF bit : If this bit is set to 1, then we should not fragment such a packet.
      • Reserved bit : This bit is not used.
      Reassembly can be done only at the destination and not at any intermediate node. This is because we are considering Datagram Service and so it is not guaranteed that all the fragments of the packet will be sent through the node at which we wish to do reassembly.
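      The offset and M-bit rules above can be sketched as follows. This is a simplified model that splits only the payload and ignores header copying; the numbers are illustrative.

```python
def fragment(payload_len, mtu_payload):
    """Split a payload into fragments for a link that can carry mtu_payload
    bytes of payload per packet.

    Returns (offset_in_8_byte_units, fragment_length, M) tuples: every
    fragment except the last has a length that is a multiple of 8, and
    M (more fragments) is 0 only on the last one.
    """
    size = mtu_payload - mtu_payload % 8      # round non-final fragments down to a multiple of 8
    frags, pos = [], 0
    while pos < payload_len:
        n = min(size, payload_len - pos)
        more = 1 if pos + n < payload_len else 0
        frags.append((pos // 8, n, more))
        pos += n
    return frags

fragment(4000, 1500)   # [(0, 1496, 1), (187, 1496, 1), (374, 1008, 0)]
fragment(1000, 1500)   # [(0, 1000, 0)] - fits in one packet, no fragmentation
```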
    6. Total Length : It includes the IP header and everything that comes after it.
    7. Time To Live (TTL) : Using this field, we can bound how long the packet lives before it is destroyed. Although nominally a time, it is treated strictly as a number of hops: the packet should reach the destination within this number of hops. Every router decreases the value as the packet goes through it, and if the value becomes zero at a particular router, the packet is destroyed.
    8. Protocol : This specifies the module to which we should hand over the packet ( UDP or TCP ). It is the next encapsulated protocol.
      Value   Protocol
      0       IPv6 Hop-by-Hop Option
      1       ICMP, Internet Control Message Protocol
      2       IGMP, Internet Group Management Protocol / RGMP, Router-port Group Management Protocol
      3       GGP, Gateway to Gateway Protocol
      4       IP in IP encapsulation
      5       ST, Internet Stream Protocol
      6       TCP, Transmission Control Protocol
      7       UCL, CBT
      8       EGP, Exterior Gateway Protocol
      9       IGRP
      10      BBN RCC Monitoring
      11      NVP, Network Voice Protocol
      12      PUP
      13      ARGUS
      14      EMCON, Emission Control Protocol
      15      XNET, Cross Net Debugger
      16      Chaos
      17      UDP, User Datagram Protocol
      18      TMux, Transport Multiplexing Protocol
      19      DCN Measurement Subsystems
      ...     ...
      255     Reserved
    9. Header Checksum : This is the usual checksum field used to detect errors. Since the TTL field changes at every router, the checksum ( computed over the header, up to and including the options field ) is checked and recalculated at every router.
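    The recalculation is cheap: the Internet checksum is the 16-bit ones' complement of the ones' complement sum of the header's 16-bit words. A minimal sketch; the sample header bytes are a standard worked example, not from these notes.

```python
def header_checksum(header):
    """Internet checksum over an even-length header given as bytes."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]   # add the next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)    # fold any carry back in
    return ~total & 0xFFFF                          # ones' complement

# a 20-byte header with its checksum field (bytes 10-11) zeroed out
hdr = bytes.fromhex("4500003c1c4640004006" + "0000" + "ac100a63ac100a0c")
hex(header_checksum(hdr))   # '0xb1e6' - the value placed in the checksum field
```

    A receiver can verify by summing the header with the checksum in place: the result of the same computation is then 0.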
    10. Source : It is the IP address of the source node
    11. Destination : It is the IP address of the destination node.
    12. IP Options : The options field was created in order to allow features to be added into IP as time passes and requirements change. Currently 5 options are specified although not all routers support them. They are:
      • Security: It tells us how secret the information is. In theory a military router might use this field to decide not to route through certain routers. In practice no routers support this field.
      • Source Routing: It is used when we want the source to dictate how the packet traverses the network. It is of 2 types
        -> Loose Source Record Routing (LSRR): It requires that the packet traverse a list of specified routers, in the order specified, but the packet may pass through some other routers as well.
        -> Strict Source Record Routing (SSRR): It requires that the packet traverse only the set of specified routers and nothing else. If it is not possible, the packet is dropped with an error message sent to the host.

        The option code for SSRR is 137; for LSRR the code is 131.
      • Record Routing :
        In this option the intermediate routers put their IP addresses in the header, so that the destination knows the entire path of the packet. Space for storing the IP addresses is specified by the source itself. The pointer field points to the position where the next IP address has to be written, and the length field gives the number of bytes reserved by the source for writing the IP addresses. If the space provided for storing the IP addresses of the routers visited falls short, then the subsequent routers do not write their IP addresses.
      • Time Stamp Routing :
        It is similar to record route option except that nodes also add their timestamps to the packet. The new fields in this option are
        -> Flags: It can have the following values
        • 0- Enter only timestamp.
        • 1- The nodes should enter Timestamp as well as their IP.
        • 3 - The source specifies the IPs that should enter their timestamp. A node enters its time only if its own IP matches the one currently at the pointer. Thus if the source specifies IP1 and then IP2, but IP2 comes first on the path, the field for IP2 is left empty: the packet passes IP2 while the pointer is still waiting for IP1.
        -> Overflow: It stores the number of nodes that were unable to add their timestamps to the packet. The maximum value is 15.
      • Format of the type/code field
        | Copy Bit (1 bit) | Type of option (2 bits) | Option Number (5 bits) |
        • Copy bit: It says whether the option is to be copied to every fragment or not. A value of 1 stands for copying and 0 stands for not copying.
        • Type: It is a 2 bit field. Currently specified values are 0 and 2: 0 means the option is a control option, while 2 means the option is for measurement.
        • Option Number: It is a 5 bit field which specifies the option number.
        For all options a length field is included so that a router not familiar with the option will know how many bytes to skip. Thus every option is of the form
      • TLV: Type/Length/Value. This format is followed not only in IP but in nearly all major protocols.
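      Walking a TLV-encoded option list can be sketched in a few lines: the length byte lets a router skip options it does not understand. This simplified walker assumes every option is in full Type/Length/Value form (real IP also has single-byte options, such as End of Option List, which carry no length); the byte values are illustrative.

```python
def parse_tlv(data):
    """Parse a byte string of T/L/V options into (type, value) pairs.

    The length byte counts the whole option: type + length + value,
    so an unknown option is skipped by advancing `length` bytes.
    """
    options, i = [], 0
    while i < len(data):
        t, length = data[i], data[i + 1]
        options.append((t, data[i + 2:i + length]))
        i += length                                  # skip to the next option
    return options

parse_tlv(bytes([131, 3, 7,  68, 4, 1, 2]))
# [(131, b'\x07'), (68, b'\x01\x02')]
```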