Scrum – Agile Method for Project Management

25 01 2009

Scrum is an agile method for project management, used mainly on software development projects. I first heard about it last year, but I only saw a great success case last week at Campus Party 2009, where Yahoo! showed how they use Scrum in their development projects and the improvements it enabled.

Basically, the purpose of this method is to get everyone on the team involved with the project and aware of its status, which highlights the collaboration that characterizes this agile practice.

I found this interesting video that explains how Scrum works:





Science of Service – A new challenge

24 01 2009

Title in Portuguese: “Ciências de Serviços – Um novo desafio”.

I am currently studying the new business models that have emerged mainly from information technology. Service delivery is, of course, one of the most prominent practices.

We have been seeing changes in the business models of many companies around the world. These companies are moving from a product-centric model to a service delivery model. However, this is not an easy task for a traditional “Product Company”. That is why “best practice guides” and methodologies such as ITIL, eTOM and even Six Sigma have been released to improve the quality of the service delivered.

However, during the itSMF Conference 2008 in Brazil I heard the term “Science of Service” for the first time. A researcher from IBM showed us his research in the area.

He focuses on defining what a service really is and what the “best” way is to deliver that service to individual customers. It is true that many companies still deliver services the way they used to deliver products in the past.

Science of Service is a concept, or rather a science, that is becoming very relevant in academic research and in the corporate environment. According to itSMF, this should be a great opportunity for people interested in doing research on “Services”, because the trend points to a new career model emerging in this area.





White House is Copyleft

24 01 2009

The content available on the White House site is now Copyleft!

The content is available under a Creative Commons license unless, of course, another license is specified.

That is one more important impact of the so-called President 2.0: a president who knows how to exploit the power of digital networks. Nobody can disagree; he showed this throughout the election period!






ITIL V3 and SOA

29 10 2008

I attended a seminar about ITIL in London, UK, two weeks ago. We discussed the new version, ITIL v3, and its influence, especially on Storage Management and SOA.

I noticed that it will probably be very useful for Service Oriented Architecture (SOA), since ITIL v3 now focuses on service management across the organization, and that is exactly what SOA was designed for.

About 3 months ago I went to an SOA event at Oracle São Paulo, where I found people working on the best way to implement SOA based on ITIL v2. You can implement SOA using the ITIL v2 framework; however, that framework was not developed specifically to deal with services. I believe the launch of ITIL v3 will improve SOA management by pushing organizations to become more “Service Oriented”.

Let’s see how ITIL v3 will be handled in SOA and Storage Management projects in the near future. There is a good article about ITIL v3 and SOA at http://searchsoa.techtarget.com/news/article/0,,sid26_gci1277548,00.html.





CERN – Datacenter Cooling Technique

11 10 2008

As I mentioned in my previous post, I visited the CERN Computer Center during the LHC Grid research event in Geneva. One of the most interesting things I noticed there was the cooling technique used in the data center. They have a lot of physical hardware, including storage robots, disk arrays and many processors, that makes up Tier-0 of the Worldwide LHC Computing Grid.

After much research on cooling, they realized that the best way to optimize the cooling process and minimize power consumption was to use less than half of each rack's capacity and to place the machines as close to the floor as possible. To improve the cooling further, they place two racks facing each other and close off the space between them with glass panels. In this setup, the cold air comes from air-conditioning vents located in the floor.

I found it very interesting because data center analysts usually put as many servers in a rack as it can hold in order to optimize resources; however, that is not the best option for cooling. This is very useful information for anyone dealing with cooling problems in a data center. I added the photo below, taken at the CERN Computer Center:





Veritas Storage – How to create DG, Volumes and Filesystem

13 08 2008

Title in Portuguese: “Como Criar Volumes no Veritas Storage Foundation”.

In this article I will describe some RAID levels and show how to create Disk Groups, Volumes and Filesystems on Veritas Storage Foundation.

RAID is an acronym for Redundant Array of Independent Disks. It is a storage approach for managing an array of disks.

RAID configurations are classified by RAID level, which is defined by the number of disks, the way data is spread across the disks, and the redundancy method.

RAID LEVELS

- DISK SPANNING (RAID 0)

It is the technique of combining space from many physical disks into one logical unit. There are two methods for disk spanning:

– Concatenation: It is a linear data allocation across two or more disks.
– Example:
The volume is composed of 2 disks (A and B) using the concatenation layout. Data is written sequentially across the disks: the system starts writing to disk A and moves on to disk B as soon as disk A is full.
– Striping: It is an alternating, equal-size data allocation across multiple disks.
– Example:
The volume is composed of 2 disks (A and B) using the striping layout. The system spreads the data across both disks in alternating chunks of equal size.

- DATA REDUNDANCY (RAID 1)

RAID 1 focuses on protecting data against disk failure.

- Mirroring: It is the method of keeping two or more copies of the data on different physical disks.

- RESILIENCE

A resilient volume combines two layouts to build a volume.

- RAID 0 + 1: It is a combination of Level 0 (Striping) and Level 1 (Mirroring).

- RAID 1 + 0: It is a combination of Level 1 (Mirroring) and Level 0 (Striping).

*There are more RAID Levels that were not mentioned here.
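
The difference between the concatenation and striping layouts described above can be sketched as a mapping from a logical block number to a disk. This is only an illustration of the idea, not how Veritas implements it: the function names, the equal-size disks and the stripe unit given in blocks are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helpers: map a logical block number to a disk index
# (0 = disk A, 1 = disk B, ...). They are not part of any Veritas tool.

# Concatenation: fill disk 0 completely, then continue on disk 1.
# $1 = logical block number, $2 = blocks per disk (equal-size disks assumed)
concat_disk() {
  echo $(( $1 / $2 ))
}

# Striping: alternate equal-size chunks across the disks.
# $1 = logical block number, $2 = number of disks, $3 = blocks per stripe unit
stripe_disk() {
  echo $(( ($1 / $3) % $2 ))
}

# Blocks 0..5 on two disks of 4 blocks each, with a stripe unit of 1 block
for block in 0 1 2 3 4 5; do
  echo "block $block: concat -> disk $(concat_disk "$block" 4), stripe -> disk $(stripe_disk "$block" 2 1)"
done
```

With concatenation the first four blocks all land on disk 0, while with striping consecutive blocks alternate between the two disks, which is why striping can spread I/O load.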

I am going to show how to create a mirrored volume and a striped volume on Veritas Storage Foundation (I am using Veritas SF 5.0 running on Solaris 10).

The first step is to check how many disks are available on the server. A simple way to check this on Solaris is the format utility:

bash-3.00# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <DEFAULT cyl 4092 alt 2 hd 128 sec 32>
          /pci@0,0/pci15ad,1976@10/sd@0,0
       1. c1t1d0 <DEFAULT cyl 7 alt 2 hd 64 sec 32>
          /pci@0,0/pci15ad,1976@10/sd@1,0
       2. c1t2d0 <DEFAULT cyl 7 alt 2 hd 64 sec 32>
          /pci@0,0/pci15ad,1976@10/sd@2,0
       3. c1t3d0 <DEFAULT cyl 2 alt 2 hd 64 sec 32>
          /pci@0,0/pci15ad,1976@10/sd@3,0

Also, you can check disks available to Veritas Storage Foundation using vxdisk command:

bash-3.00# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
c1t0d0s2     auto:none       -            -            online invalid
c1t1d0s2     auto:none       -            -            online invalid
c1t2d0s2     auto:none       -            -            online invalid
c1t3d0s2     auto:none       -            -            online invalid

You can see above that there are 4 disks on the server that are visible to Veritas, but they have not yet been initialized by Veritas (invalid status). To use a disk with Veritas SF you need to initialize it using the Veritas utilities.
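
As a small aid, the disks that are still uninitialized can be filtered out of the vxdisk listing with awk. This is a sketch under assumptions: the helper name uninitialized_disks is made up for this example, and it relies on the status being the last two words of each line, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: read `vxdisk -o alldgs list` output on stdin and
# print the devices whose status is "online invalid" (not yet initialized).
uninitialized_disks() {
  awk 'NF >= 2 && $(NF-1) == "online" && $NF == "invalid" { print $1 }'
}

# Sample input copied from the listing above. On a real server you would
# run: vxdisk -o alldgs list | uninitialized_disks
uninitialized_disks <<'EOF'
DEVICE       TYPE            DISK         GROUP        STATUS
c1t0d0s2     auto:none       -            -            online invalid
c1t1d0s2     auto:none       -            -            online invalid
c1t2d0s2     auto:none       -            -            online invalid
c1t3d0s2     auto:none       -            -            online invalid
EOF
```

Note that the operating system disk also shows up as "online invalid", so this list tells you which disks Veritas has not initialized, not which disks are safe to hand over.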

NOTE: If you are going to use a disk with Veritas, keep in mind that you should give the whole disk to Veritas. The disk will be formatted, and you will lose all data on it when you allocate it to Veritas Storage.

In this example the only disk in use by the Solaris operating system is the first one (c1t0d0s2).

We can add the other 3 disks to Veritas Storage.

Caution: If we add the first disk (c1t0d0s2) to Veritas Storage by mistake, it will format the disk and erase the Solaris installation. We need to pay attention and pick the right disks.

Let’s start allocating (initializing) those 3 disks to Veritas:

bash-3.00# vxdisksetup -i c1t1d0
bash-3.00# vxdisksetup -i c1t2d0
bash-3.00# vxdisksetup -i c1t3d0

NOTE: online status means that the disk was initialized and can be used on Veritas Storage Foundation.
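
Because initializing the wrong disk is destructive, one option is to generate the commands first and review them before running anything. The following is a dry-run sketch only: OS_DISK and plan_init are hypothetical names chosen for this example, and the script prints the vxdisksetup commands instead of executing them.

```shell
#!/bin/sh
# Dry-run sketch: print the vxdisksetup commands for the given disks and
# refuse the disk that holds the operating system.
# OS_DISK and plan_init are hypothetical names used only in this example.
OS_DISK="c1t0d0"

plan_init() {
  for disk in "$@"; do
    if [ "$disk" = "$OS_DISK" ]; then
      # Warn on stderr so the printed command list stays clean
      echo "skipping $disk: operating system disk" >&2
      continue
    fi
    echo "vxdisksetup -i $disk"
  done
}

plan_init c1t0d0 c1t1d0 c1t2d0 c1t3d0
```

The output is the same three commands shown above, with the Solaris disk left out even though it was passed in by mistake.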

Now that those 3 disks are initialized on Veritas, the next step is to create a Disk Group.

Disk Group
A Disk Group is a collection of disks. Disk Groups are very useful for management and isolation purposes.

Let’s create a DG using only the first disk initialized on Veritas (c1t1d0). We will use DG1 as the name of the Disk Group.

bash-3.00# vxdg init DG1 c1t1d0

Check if DG1 was created successfully:

bash-3.00# vxdg list
NAME         STATE           ID
DG1          enabled,cds     1218633322.13.vrt2

Also, check if the disk is properly assigned to DG1:

bash-3.00# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
c1t0d0s2     auto:none       -            -            online invalid
c1t1d0s2     auto:cdsdisk    c1t1d0       DG1          online
c1t2d0s2     auto:cdsdisk    -            -            online
c1t3d0s2     auto:cdsdisk    -            -            online

Let’s add 2 more disks to DG1:

bash-3.00# vxdg -g DG1 adddisk c1t2d0s2 c1t3d0s2

Check if the disks are properly assigned to DG1:

bash-3.00# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
c1t0d0s2     auto:none       -            -            online invalid
c1t1d0s2     auto:cdsdisk    c1t1d0       DG1          online
c1t2d0s2     auto:cdsdisk    c1t2d0s2     DG1          online
c1t3d0s2     auto:cdsdisk    c1t3d0s2     DG1          online

At this point we have added 3 disks into Disk Group DG1.

In the next step we will create 2 different volumes in DG1.

Volumes

A volume is a virtual storage device that is used like a physical disk. A volume can be composed of many disks and can have different layouts.

In this example, we are going to create two volumes:

Volume VolS – Striping layout using the c1t1d0 and c1t2d0 disks (RAID 0).

Volume VolM – Mirroring layout using the c1t2d0 and c1t3d0 disks (RAID 1).

To create a striped volume VolS (Size=10m):

bash-3.00# vxassist -g DG1 make VolS 10m layout=stripe c1t1d0 c1t2d0s2

To check if volume VolS was created successfully:

bash-3.00# vxprint -g DG1
TY NAME         ASSOC        KSTATE   LENGTH   PLOFFS   STATE    TUTIL0  PUTIL0
dg DG1          DG1          -        -        -        -        -       -

dm c1t1d0       c1t1d0s2     -        159488   -        -        -       -
dm c1t2d0s2     c1t2d0s2     -        159488   -        -        -       -
dm c1t3d0s2     c1t3d0s2     -        159488   -        -        -       -

v  VolS         fsgen        ENABLED  20480    -        ACTIVE   -       -
pl VolS-01      VolS         ENABLED  20480    -        ACTIVE   -       -
sd c1t1d0-01    VolS-01      ENABLED  10240    0        -        -       -
sd c1t2d0s2-01  VolS-01      ENABLED  10240    0        -        -       -

To create a Mirroring Volume VolM (Size=10m):

bash-3.00# vxassist -g DG1 make VolM 10m layout=mirror c1t2d0s2 c1t3d0s2

To check if Volume VolM was created successfully:

bash-3.00# vxprint -g DG1
TY NAME         ASSOC        KSTATE   LENGTH   PLOFFS   STATE    TUTIL0  PUTIL0
dg DG1          DG1          -        -        -        -        -       -

dm c1t1d0       c1t1d0s2     -        159488   -        -        -       -
dm c1t2d0s2     c1t2d0s2     -        159488   -        -        -       -
dm c1t3d0s2     c1t3d0s2     -        159488   -        -        -       -

v  VolM         fsgen        ENABLED  20480    -        ACTIVE   -       -
pl VolM-01      VolM         ENABLED  20480    -        ACTIVE   -       -
sd c1t3d0s2-01  VolM-01      ENABLED  20480    0        -        -       -
pl VolM-02      VolM         ENABLED  20480    -        ACTIVE   -       -
sd c1t2d0s2-02  VolM-02      ENABLED  20480    0        -        -       -

v  VolS         fsgen        ENABLED  20480    -        ACTIVE   -       -
pl VolS-01      VolS         ENABLED  20480    -        ACTIVE   -       -
sd c1t1d0-01    VolS-01      ENABLED  10240    0        -        -       -
sd c1t2d0s2-01  VolS-01      ENABLED  10240    0        -        -       -

Note: You can see above that both volumes were created successfully. You can also see the difference between the striped and mirrored volume layouts.

VolM uses two different plexes on different disks. This means that if you lose one disk (plex) you still have the data on the other disk (the other plex). This is the typical configuration of a mirrored volume.

VolS uses only one plex divided across 2 disks. This means that the data is split between those 2 disks. If you lose one disk you lose the whole plex, and therefore the data. This is the typical configuration of a striped volume. It does not provide data protection, but it is very useful for performance purposes.

You can also combine these 2 layouts into a single layout that provides both data protection and better performance. This is the case for RAID 0 + 1 and RAID 1 + 0.
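
The plex layout described above can be read directly from the vxprint output: a mirrored volume has more than one plex, a striped volume has a single one. The sketch below counts the plexes per volume; plexes_per_volume is a hypothetical helper name, and it assumes that plex records start with "pl" and carry the volume name in the third column (ASSOC), as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: read `vxprint -g <diskgroup>` output on stdin and
# print "<volume> <number of plexes>" for each volume, sorted by name.
plexes_per_volume() {
  awk '$1 == "pl" { count[$3]++ } END { for (v in count) print v, count[v] }' | sort
}

# Sample records taken from the vxprint listing above. On a real server:
# vxprint -g DG1 | plexes_per_volume
plexes_per_volume <<'EOF'
v  VolM        fsgen   ENABLED 20480 - ACTIVE - -
pl VolM-01     VolM    ENABLED 20480 - ACTIVE - -
sd c1t3d0s2-01 VolM-01 ENABLED 20480 0 -      - -
pl VolM-02     VolM    ENABLED 20480 - ACTIVE - -
sd c1t2d0s2-02 VolM-02 ENABLED 20480 0 -      - -
v  VolS        fsgen   ENABLED 20480 - ACTIVE - -
pl VolS-01     VolS    ENABLED 20480 - ACTIVE - -
sd c1t1d0-01   VolS-01 ENABLED 10240 0 -      - -
sd c1t2d0s2-01 VolS-01 ENABLED 10240 0 -      - -
EOF
```

For the sample records this reports 2 plexes for VolM and 1 for VolS, matching the mirrored and striped layouts.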

In the next step we will create 2 different filesystems using those 2 volumes.

Filesystem

In this example we will create two filesystems:

- Filesystem fsS will use VolS. It will be mounted at /stripe mount point.

- Filesystem fsM will use VolM. It will be mounted at /mirror mount point.

To create a VxFS filesystem:

bash-3.00# mkfs -F vxfs /dev/vx/rdsk/DG1/VolS
    version 7 layout
    20480 sectors, 10240 blocks of size 1024, log size 1024 blocks
    largefiles supported

bash-3.00# mkfs -F vxfs /dev/vx/rdsk/DG1/VolM
    version 7 layout
    20480 sectors, 10240 blocks of size 1024, log size 1024 blocks
    largefiles supported

To mount a VxFS filesystem:

bash-3.00# mount -F vxfs /dev/vx/dsk/DG1/VolS /stripe/
bash-3.00# mount -F vxfs /dev/vx/dsk/DG1/VolM /mirror/

Now there are 2 filesystems configured, and you can use them through their Solaris mount points.

Any data written to the /stripe directory will be written to the striped volume VolS.

Any data written to the /mirror directory will be written to the mirrored volume VolM.
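
Note that the commands above use two different device paths for the same volume: mkfs works on the raw (character) device under /dev/vx/rdsk, while mount uses the block device under /dev/vx/dsk. The dry-run sketch below builds both paths from the disk group and volume names; fs_commands is a hypothetical helper, and it only prints the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: print the mkfs and mount commands for one volume.
# fs_commands is a hypothetical helper; it echoes the commands only.
# $1 = disk group, $2 = volume, $3 = mount point
fs_commands() {
  # mkfs takes the raw (character) device
  echo "mkfs -F vxfs /dev/vx/rdsk/$1/$2"
  # mount takes the block device
  echo "mount -F vxfs /dev/vx/dsk/$1/$2 $3"
}

fs_commands DG1 VolS /stripe
fs_commands DG1 VolM /mirror
```

The mount point directories (/stripe and /mirror here) are assumed to exist already.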

I hope this helps you to understand how to:

  • Initialize a disk on Veritas Storage Foundation.
  • Create a Disk Group.
  • Add disks to a Disk Group.
  • Create a striped volume.
  • Create a mirrored volume.
  • Create a VxFS filesystem.
  • Mount a VxFS filesystem.

Feel free to comment on this article to help improve its quality.

Reference: Veritas Storage Foundation 5.0 for UNIX: Fundamental – Training Book





CLOSE_WAIT Connections – Tuning Solaris

8 08 2008

This article describes a way to tune TCP parameters on Solaris to get better performance when running a web server. I will show at a high level how a TCP connection is initiated and terminated, and then focus on how to tune the TCP parameters on Solaris.

I have experienced problems with a large number of CLOSE_WAIT connections on Solaris that affected the application, causing delays in response time and refused connections.

I will start by explaining how a connection is initiated (the TCP three-way handshake).

Let’s use a scenario with two servers (A and B) in which server A initiates the connection.

1-) The first segment, SYN, is sent by Node A to Node B. This is a request for Node B to synchronize with Node A’s sequence numbers.

Node A ------ SYN -----> Node B

2-) Node B sends an acknowledgment (ACK) of Node A’s request. At the same time, Node B also sends its own request (SYN) to Node A to synchronize its sequence numbers.

Node A <----- SYN/ACK ---- Node B

3-) Node A then sends an acknowledgment to Node B.

Node A ------ ACK -------> Node B

At this point the connection is established.

Now let’s see how connections are terminated (this is where the CLOSE_WAIT issue appears):

In the termination process, it is important to remember that the application process on each side of the connection must close its half of the connection independently.

Let’s suppose Node A closes its half of the connection first. The termination process then consists of:

1-) Node A transmits a FIN packet to Node B.

(Established)         (Established)
Node A ---- FIN ----> Node B
(FIN_WAIT_1)

2-) Node B transmits an ACK packet to Node A:

Node A <---- ACK ---- Node B
(FIN_WAIT_2)          (CLOSE_WAIT)

Here is the CLOSE_WAIT issue: the app on Node B should invoke the close() method to close the connection on its end.
If the app does not invoke close(), the connection stays stuck in CLOSE_WAIT for the time specified in the TCP stack.

If the server has a lot of traffic and many connections in the CLOSE_WAIT state, this will cause issues such as:

- New connection requests being refused.

- Slow response times.

- High processing resource utilization.
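
To see whether a server is affected, you can count the connections in the CLOSE_WAIT state. This is a minimal sketch: count_close_wait is a hypothetical helper, it assumes that `netstat -an` prints the connection state as the last column of each TCP line, and the sample lines below are made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -an` output on stdin and print the
# number of connections whose last column (the state) is CLOSE_WAIT.
count_close_wait() {
  awk '$NF == "CLOSE_WAIT" { n++ } END { print n + 0 }'
}

# Made-up sample lines. On a real server: netstat -an | count_close_wait
count_close_wait <<'EOF'
192.168.0.10.80   192.168.0.20.51234  49640  0  49640  0  ESTABLISHED
192.168.0.10.80   192.168.0.21.51300  49640  0  49640  0  CLOSE_WAIT
192.168.0.10.80   192.168.0.22.51411  49640  0  49640  0  CLOSE_WAIT
EOF
```

Running the count a few times in a row shows whether the number of CLOSE_WAIT connections keeps growing, which is the symptom described above.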

Now I will describe some tips that helped me solve some problems on web servers.
They basically consist of changing some TCP parameters on Solaris to reduce the time that a connection stays in CLOSE_WAIT, so that this kind of connection is released quickly.

- TCP_TIME_WAIT_INTERVAL parameter:
Description: Tells TCP/IP how long to keep the control blocks
of closed connections. After the applications complete the
TCP/IP connection, the control blocks are kept for the specified time.
When high connection rates occur, a large backlog of the TCP/IP
connections accumulates and can slow server performance. The server
can stall during certain peak periods. If the server stalls, the
netstat command shows that many of the sockets that are opened to the
HTTP server are in the CLOSE_WAIT or FIN_WAIT_2 state. Visible delays
can occur for up to four minutes, during which time the server does
not send any responses, but CPU utilization stays high, with all of
the activities in system processes.

1-) Verify the current value:
ndd -get /dev/tcp tcp_time_wait_interval
2-) Set the new value:
ndd -set /dev/tcp tcp_time_wait_interval 60000

(Default value is 240000 milliseconds = 4 minutes. Recommended is 60000 milliseconds.)

- TCP_FIN_WAIT_2_FLUSH_INTERVAL
Specifies how long a connection is allowed to remain in the FIN_WAIT_2 state before it is flushed.

1-) Verify the current value:
ndd -get /dev/tcp tcp_fin_wait_2_flush_interval
2-) Set the new value:
ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500

(Default value is 675000 milliseconds. Recommended is 67500 milliseconds.)

- TCP_KEEPALIVE_INTERVAL
A keepalive packet ensures that a connection stays in an active and established state.

1-) Verify the current value:
ndd -get /dev/tcp tcp_keepalive_interval
2-) Set the new value:
ndd -set /dev/tcp tcp_keepalive_interval 300000

(Default value is 7200000 milliseconds. Recommended is 15000 milliseconds; note that the example command above sets 300000 milliseconds, so use the value you actually intend.)

- Connection backlog
If the backlog queue is too small, a high number of incoming connections results in connection failures.

1-) Verify the current value:
ndd -get /dev/tcp tcp_conn_req_max_q
2-) Set the new value:
ndd -set /dev/tcp tcp_conn_req_max_q 8000

(Default value is 128. Recommended is 8000.)
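
The four settings above can be collected in one place so they are reviewed together before being applied. The following is a dry-run sketch: tuning_commands is a hypothetical helper, the values are simply the ones used in the example commands of this article, and the script prints the ndd commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch: print the ndd commands for the four parameters above.
# tuning_commands is a hypothetical helper; nothing is applied to the system.
# The values are the ones used in the example commands of this article.
tuning_commands() {
  while read -r name value; do
    echo "ndd -set /dev/tcp $name $value"
  done <<'EOF'
tcp_time_wait_interval 60000
tcp_fin_wait_2_flush_interval 67500
tcp_keepalive_interval 300000
tcp_conn_req_max_q 8000
EOF
}

tuning_commands
```

Keeping the parameter names and values in a single list makes it easier to adjust one value (for example the keepalive interval) and to record what was changed on the server.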

This configuration change will help improve system performance and, better still, it will help reduce major impacts on the application.
I have experienced situations in which the application stopped responding because of a large number of connections in the CLOSE_WAIT state.
In my case, we identified a bug in the application and used this tuning as a workaround.
It is very useful and can help when you are experiencing problems caused by many connections in this state.

Reference: This article was inspired by the IBM “Tuning Solaris systems” topic in the WebSphere Application Server
Information Center.

Additional Info:

Local Server closes first:
ESTABLISHED -> FIN_WAIT_1 -> FIN_WAIT_2 -> TIME_WAIT -> CLOSED.

Remote Server closes first:
ESTABLISHED -> CLOSE_WAIT -> LAST_ACK -> CLOSED.

Local and Remote Server close at the same time:
ESTABLISHED -> FIN_WAIT_1 -> CLOSING -> TIME_WAIT -> CLOSED.

Note: This is an open article. If you have suggestions or comments on how to improve its quality, please let me know.