Wednesday, 3 July 2013

VLSI Project Titles, VLSI Project Abstracts, VLSI IEEE Project Abstracts, VLSI Projects abstracts for CSE EEE ECE, Download VLSI Titles, Download VLSI Project Abstracts, Download IEEE VLSI Abstracts

VLSI PROJECTS - ABSTRACTS
A 10-T SRAM cell with Inbuilt Charge Sharing for Dynamic Power Reduction
In this paper we present a novel 10T SRAM cell design with an inbuilt mechanism for charge recycling to cut down the dynamic power budget. The read discharge power of a single ended 8T cell is reused efficiently in the proposed cell architecture. 
The fundamental premise of our approach is that the read current in an 8T SRAM cell can be recycled in the write bitlines to reduce the read bitline and write bitline swings simultaneously. 
The proposed 10T SRAM cell limits the read bitline swing and reduces the read power consumption by 67%. The write '0' and write '1' powers are also reduced by 23.12% and 30.65%, respectively. 
The impact of the proposed cell on the delay has also been analyzed. Bitline leakage is also reduced by 65%. These improvements in the results of the proposed cell validate our approach. The design is simulated at 90 nm technology and frequency of 333 MHz.


A Built-In Repair Analyzer With Optimal Repair Rate for Word-Oriented Memories
This paper presents a built-in self repair analyzer with the optimal repair rate for memory arrays with redundancy. The proposed method requires only a single test, even in the worst case. By performing the must-repair analysis on the fly during the test, it selectively stores fault addresses, and the final analysis to find a solution is performed on the stored fault addresses. 
To enumerate all possible solutions, existing techniques use depth first search using a stack and a finite-state machine. Instead, we propose a new algorithm and its combinational circuit implementation. Since our formulation for the circuit allows us to use the parallel prefix algorithm, it can be configured in various ways to meet area and test time requirements. 
The total area of our infrastructure is dominated by the number of content addressable memory entries to store the fault addresses, and it only grows quadratically with respect to the number of repair elements. The infrastructure is also extended to support various types of word-oriented memories.
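The must-repair rule mentioned above can be sketched in a few lines. This is only an offline illustration of the rule, assuming the full fault list is already available; the function names are hypothetical and it is not the on-the-fly hardware analyzer described in the abstract.

```python
from collections import Counter

def must_repair(faults, spare_rows, spare_cols):
    """Return (rows, cols) that are forced onto spare rows / spare columns.

    faults is an iterable of (row, col) fault addresses. A row with more
    uncovered faults than remaining spare columns must take a spare row,
    and symmetrically for columns.
    """
    faults = set(faults)
    must_rows, must_cols = set(), set()
    while True:
        # Faults not yet covered by a forced spare.
        live = [(r, c) for r, c in faults
                if r not in must_rows and c not in must_cols]
        rows_left = spare_rows - len(must_rows)
        cols_left = spare_cols - len(must_cols)
        row_cnt = Counter(r for r, _ in live)
        col_cnt = Counter(c for _, c in live)
        row = next((r for r, n in sorted(row_cnt.items()) if n > cols_left), None)
        if row is not None:
            must_rows.add(row)
            continue
        col = next((c for c, n in sorted(col_cnt.items()) if n > rows_left), None)
        if col is not None:
            must_cols.add(col)
            continue
        return must_rows, must_cols

def within_budget(must_rows, must_cols, spare_rows, spare_cols):
    """The memory is unrepairable if the forced repairs exceed the spares."""
    return len(must_rows) <= spare_rows and len(must_cols) <= spare_cols
```

Faults that remain uncovered after this step are the ones a final analysis would still have to assign to the leftover spares.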


A Clock Control Strategy for Peak Power and RMS Current Reduction Using Path Clustering
Peak power reduction has been a critical challenge in the design of integrated circuits impacting the chip's performance and reliability. The reduction of peak power also reduces the power density of integrated circuits. Due to large IR-voltage drops in circuits, transistor switching slows down giving rise to timing violations and logic failures. In this paper, we present a new clock control strategy for peak-power reduction in VLSI circuits. 
In the proposed method, the simultaneous switching of combinational paths is minimized by taking advantage of the delay slacks among the paths and clustering the paths with similar slack values. Once the paths are identified based on the path delays and their slack values, the clustering algorithm determines the ideal number of clusters for the given circuit and for each cluster the maximum possible phase shift that can be applied to the clock. 
The paths are assigned to clusters in a load balanced manner based on the slack values and each cluster will have a phase shift possible on its clock depending on the slack. Thus, the proposed register-transfer level (RTL) method takes advantage of the logic-path timing slack to re-schedule circuit activities at optimal intervals within the unaltered clock period. When switching activities are redistributed more evenly across the clock period, the IC supply-current consumption is also spread across a wider range of time within the clock period. 
This has the beneficial effect of reducing peak-current draw in addition to reducing RMS power draw without having to change the operating frequency and without utilizing additional power supply voltages as in dual or multi VT approaches. The proposed method is implemented and tested through simulations using an experimental setup with Synopsys Tools Suite and Cadence Tools on the ISCAS'85 benchmark circuits, OpenCore circuits and LEON processor multiplier circuit. 
Experimental results indicate that peak power can be reduced significantly to at least 72% depending on the number of clusters and the phase-shifted clock identified as suitable for the given circuit by the proposed algorithms. Although the proposed method incurs some power overhead compared to the traditional clocking method, the overhead can be made negligible compared to the peak-power reduction as seen in the experimental results presented.


A Current-Starved Inverter-based Differential Amplifier Design for Ultra-Low Power Applications
As silicon feature sizes decrease, more complex circuitry arrays can now be contrived on a single die. This increase in the number of on-chip devices per unit area results in increased power dissipation per unit area. In order to meet certain power and operating temperature specifications, circuit design necessitates a focus on power efficiency, which is especially important in systems employing hundreds or thousands of instances of the same device. 
In large arrays, a slight increase in the power efficiency of a single component is heightened by the number of instances of the device in the system. This paper proposes a fully differential, low-power current-starving inverter-based amplifier topology designed in a commercial 0.18µm process. 
This design achieves 46 dB DC gain and a 464 kHz unity-gain frequency with a power consumption of only 145.32 nW at a 700 mV power supply voltage for ultra-low power, low bandwidth applications. Higher bandwidth designs are also proposed, including a 48 dB DC gain, 2.4 MHz unity-gain frequency amplifier operating at 900 mV with only 3.74 µW power consumption.


A Fast Low-Light Multi-Image Fusion with Online Image Restoration
This paper presents a new low-light multi-frame fusion algorithm to get a bright and clear shot even under dark conditions. 
To this end, using multiple short-exposure images and one proper-exposure blurry image as an input, a new hierarchical block-wise temporal noise filtering is done. 
Finally, an online image restoration of the denoising result is conducted along with the blurry image input. Test results on real low-light scenes show its effectiveness in terms of fast processing speed and satisfactory visual quality.


A High Performance D-Flip Flop Design with Low Power Clocking System using MTCMOS
Power consumption plays an important role in any integrated circuit and is listed as one of the top three challenges in the International Technology Roadmap for Semiconductors. In any integrated circuit, the clock distribution network and flip-flops consume a large amount of power, as they make the maximum number of internal transitions. 
In this paper, various techniques for implementing flip-flops with a low power clocking system are analyzed. Among those techniques, the clocked pair shared flip-flop (CPSFF) consumes less power than the conditional data mapping flip-flop (CDMFF), the conditional discharge flip-flop (CDFF) and the conventional double edge triggered flip-flop (DEFF). 
We propose a novel CPSFF using the Multi-Threshold voltage CMOS (MTCMOS) technique, which reduces power consumption by approximately 20% to 70% compared with the original CPSFF. In addition, to build a clocking system, double edge triggering and low swing clocking can be easily incorporated into the new flip-flop. 


A Linear Programming Based Tone Injection Algorithm for PAPR Reduction of OFDM and Linearly Precoded Systems
This work investigates the improvement of power amplifier efficiency through the reduction of peak-to-average power ratio (PAPR) of linearly precoded QAM data signals. 
In particular, we focus on the special cases of linear precoded modulation including the practical OFDM, OFDMA, and SC-FDMA signals that have been widely adopted in W-LAN and W-MAN. We apply the method of tone injection optimization for PAPR reduction. 
To reduce numerical complexity, we propose a linear programming algorithm which closely approximates the original tone injection optimization problem. Our comprehensive numerical results demonstrate substantial PAPR reduction and superior BER performance using several practical examples.
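As a rough illustration of the quantities involved, the sketch below computes the PAPR of one OFDM symbol and tries a brute-force single-tone injection. It assumes a plain inverse DFT without oversampling and a hypothetical constellation shift `D`; the exhaustive search stands in for the linear programming algorithm of the paper and is not that algorithm.

```python
import cmath
import math

def idft(X):
    """Inverse DFT of the frequency-domain symbol X (no oversampling)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def papr_db(x):
    """Peak-to-average power ratio of the time-domain samples x, in dB."""
    power = [abs(v) ** 2 for v in x]
    return 10 * math.log10(max(power) / (sum(power) / len(power)))

def best_single_injection(X, D):
    """Shift one tone by +/-D or +/-jD and keep the lowest-PAPR candidate.

    The unmodified symbol is the starting candidate, so the returned PAPR
    is never worse than the original.
    """
    best, best_papr = list(X), papr_db(idft(X))
    for k in range(len(X)):
        for shift in (D, -D, 1j * D, -1j * D):
            cand = list(X)
            cand[k] = X[k] + shift
            p = papr_db(idft(cand))
            if p < best_papr:
                best, best_papr = cand, p
    return best, best_papr
```

A symbol whose subcarriers all carry the same value is the worst case here, because its time-domain signal is a single impulse.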


A Low Power Fault Tolerant Reversible Decoder using MOS Transistor
This paper demonstrates the reversible logic synthesis for the n-to-2^n decoder, where n is the number of data bits. The circuits are designed using only reversible fault tolerant Fredkin and Feynman double gates. Thus, the entire scheme inherently becomes fault tolerant. 
An algorithm for designing the generalized decoder is presented. In addition, several lower bounds on the number of constant inputs, garbage outputs and quantum cost of the reversible fault tolerant decoder have been proposed. 
Transistor simulations of the proposed decoder are shown using standard p-MOS 901 and n-MOS 902 models with a delay of 0.030 ns and 0.12 µm channel length, which prove the functional correctness of the proposed circuits. 
The comparative results show that the proposed design is much better in terms of quantum cost, delay, hardware complexity and has significantly better scalability than the existing approach.
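The two gates named above are small enough to check exhaustively. The sketch below models them as functions on bits and verifies that both are reversible and parity preserving, which is the property usually meant by "fault tolerant" for reversible gates. The `decoder_1to2` helper is a hypothetical illustration of how constant inputs yield a decoder, not the circuit from the paper.

```python
from itertools import product

def feynman_double(a, b, c):
    """Feynman double gate: (a, b, c) -> (a, a^b, a^c)."""
    return a, a ^ b, a ^ c

def fredkin(a, b, c):
    """Fredkin gate: swaps b and c when the control a is 1."""
    return (a, c, b) if a else (a, b, c)

def is_reversible(gate):
    """A 3-bit gate is reversible if its 8 inputs map to 8 distinct outputs."""
    return len({gate(*bits) for bits in product((0, 1), repeat=3)}) == 8

def is_parity_preserving(gate):
    """Input parity equals output parity for every input combination."""
    for a, b, c in product((0, 1), repeat=3):
        p, q, r = gate(a, b, c)
        if (a ^ b ^ c) != (p ^ q ^ r):
            return False
    return True

def decoder_1to2(s):
    """1-to-2 decoder from one Feynman double gate with constants 1 and 0."""
    _, d0, d1 = feynman_double(s, 1, 0)   # d0 = NOT s, d1 = s
    return d0, d1
```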


A Low Power Single Phase Clock Distribution using VLSI technology
The clock distribution network consumes nearly 70% of the total power consumed by the IC, since the clock is the signal with the highest switching activity. 
Normally, a multi-clock-domain network is served by multiple PLLs. This project aims to develop a low power single-clock multiband network that supplies the multi-clock-domain network. 
This project is useful for communication applications such as Bluetooth and Zigbee. WLAN frequency synthesizers are proposed based on a pulse-swallow topology, and the design is modeled in Verilog, simulated using Modelsim and implemented in Xilinx. 


A Low-Complexity Turbo Decoder Architecture for Energy-Efficient Wireless Sensor Networks
Turbo codes have recently been considered for energy-constrained wireless communication applications, since they facilitate a low transmission energy consumption. However, in order to reduce the overall energy consumption, lookup table-log-BCJR (LUT-Log-BCJR) architectures having a low processing energy consumption are required. 
In this paper, we decompose the LUT-Log-BCJR architecture into its most fundamental add compare select (ACS) operations and perform them using a novel low-complexity ACS unit. 
We demonstrate that our architecture employs an order of magnitude fewer gates than the most recent LUT-Log-BCJR architectures, facilitating a 71% energy consumption reduction. Compared to state-of-the-art maximum logarithmic Bahl-Cocke-Jelinek-Raviv implementations, our approach facilitates a 10% reduction in the overall energy consumption at ranges above 58 m.
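The add compare select step at the heart of a Log-BCJR decoder is the Jacobian logarithm, max*(a, b) = max(a, b) + ln(1 + e^-|a-b|), where the correction term is read from a small lookup table in the LUT variant. The sketch below compares the exact, table-based and max-log forms. The table size and step are hypothetical choices for illustration and are not taken from the paper.

```python
import math

# Hypothetical 8-entry table of ln(1 + e^-d) sampled at d = 0, 0.5, ..., 3.5.
STEP = 0.5
LUT = [math.log1p(math.exp(-i * STEP)) for i in range(8)]

def max_star_exact(a, b):
    """Exact Jacobian logarithm ln(e^a + e^b)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_lut(a, b):
    """Compare, select the larger input, then add a table-based correction."""
    d = abs(a - b)
    idx = int(d / STEP)
    corr = LUT[idx] if idx < len(LUT) else 0.0   # correction vanishes for large d
    return max(a, b) + corr

def max_log(a, b):
    """Max-log approximation: the correction term is dropped entirely."""
    return max(a, b)
```

With this table the approximation error stays below the gap between the first two table entries, roughly 0.22.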


A Low-Cost, Systematic Methodology for Soft Error Robustness of Logic Circuits
Due to current technology scaling trends such as shrinking feature sizes and decreasing supply voltages, circuit reliability is becoming more susceptible to radiation-induced transient faults (soft errors). 
Soft errors, which have been a great concern in memories, are now a main factor in reliability degradation of logic circuits as well. In this paper, we present a systematic and integrated methodology for circuit robustness to soft errors. The proposed soft error rate (SER) reduction framework, based on redundancy addition and removal (RAR), aims at eliminating those gates with large contribution to the overall SER. Several metrics and constraints are introduced to guide the RAR-based approach toward SER reduction. 
Furthermore, we integrate a resizing strategy into our framework, as post-RAR additive SER optimization. The strategy can identify most critical gates to be upsized and thereby, minimize area and power overheads while maintaining a high level of soft error robustness. 
Experimental results show that the proposed RAR-based framework can achieve up to 70% reduction in output failure probability. On average, about 23% SER reduction is obtained with less than 4% area overhead.


A Novel Modulo Adder for 2^n-2^k-1 Residue Number System
The modular adder is one of the key components in applications of the residue number system (RNS). Moduli sets of the form $2^{n}-2^{k}-1$ $(1\leq k\leq n-2)$ can offer excellent balance among the RNS channels for multi-channel RNS processing. In this paper, a novel algorithm and its VLSI implementation structure are proposed for the modulo $2^{n}-2^{k}-1$ adder. 
In the proposed algorithm, parallel prefix operation and carry correction techniques are adopted to eliminate the re-computation of carries. Any existing parallel prefix structure can be used in the proposed structure. Thus, we can get flexible tradeoff between area and delay with the proposed structure. 
Compared with modular adders of the same type using traditional structures, the proposed modulo $2^{n}-2^{k}-1$ adder offers better performance in delay and area. 
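A behavioural reference for this adder helps to see why a carry correction is enough. For operands below m = 2^n - 2^k - 1, subtracting m is the same as adding 2^k + 1 and dropping the carry out of bit n. The sketch below is only a functional model under that observation; it does not reproduce the parallel prefix structure of the paper.

```python
def mod_add(a, b, n, k):
    """Return (a + b) mod (2^n - 2^k - 1) using an add-and-correct scheme."""
    m = (1 << n) - (1 << k) - 1
    assert 1 <= k <= n - 2 and 0 <= a < m and 0 <= b < m
    plain = a + b
    corrected = plain + (1 << k) + 1      # equals a + b - m + 2^n
    if corrected >> n:                    # carry out of n bits means a + b >= m
        return corrected & ((1 << n) - 1)
    return plain
```

For example, with n = 5 and k = 2 the modulus is 27, and `mod_add(20, 10, 5, 2)` returns 3.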


A Topology-Based Model for Railway Train Control Systems
An innovative topology-based method for modeling railway train control systems is proposed in this paper. The method addresses the problems of having to rely too much on designers' experience and of incurring excessive cost of validation and verification in the development of railway train control systems. 
Four topics are discussed in the paper: 1) the definition of basic topological units for modeling railway networks, based on the essential characteristics of these units; 2) the concept of a train movement authority topological space; 3) the interpretation of the train control logic as a topological space construct; and 4) topological space theorems for train control system verification. 
A case study is also presented, where the approach was applied in the simulation model of a typical railway network, and the results show good performance, which meets the system requirements.


Achieving Reduced Area by Multi-Bit Flip Flop Design
Reducing clock network power is an efficient way to reduce power consumption of the high-frequency ASICs since it accounts for a considerable amount of the dynamic chip power. 
Recently, the use of multi-bit flip-flops (MBFFs) has been shown to be an effective design technique to improve clock tree synthesis and can be used either as an alternative or in conjunction with the well-known clock gating approach targeting clock power reduction. 
The idea behind this technique is that clock tree power savings can be achieved by using flip-flop cells with optimized design and also through a reduced clock tree, since the number of clock sinks is smaller in a design with MBFF cells. Some recent works have proposed methods to take advantage of MBFFs in standard cell based designs, where single-bit flip-flops are replaced by MBFF cells during logic and/or physical synthesis. 
However, a more complete analysis is still needed for different steps of a design flow to help understand the impact of MBFFs on the physical design. We present in this work a comprehensive comparison between traditional flip-flop and MBFF implementations of an industrial 55 nm design. 
Our results consider area, power and timing as well as some side effects like clock skew, routing congestion and voltage drop distribution. Finally, this study points to some potential drawbacks of using MBFFs which may be helpful for designers to make trade-off decisions in high performance SoC designs.


Aliasing-Free Digital Pulse-Width Modulation for Burst-Mode RF Transmitters
Burst-mode operation of power amplifiers (PAs) is a promising concept towards higher power efficiency in radio frequency (RF) transmitters. 
Such transmitters use pulse-width modulation (PWM) to create the driving signal for the PA, and a reconstruction filter after amplification to obtain the transmission signal. 
However, conventional digital pulse-width modulated signals contain a large amount of distortion that cannot be removed by the reconstruction filter in a satisfactory manner. 


An Analysis of SOBEL and GABOR Image Filters for Identifying Fish
This paper deals with classifying sharks using edge detection. Edges characterize boundaries, and edge detection is a problem of fundamental importance in detecting the type of shark in the deep sea. 
Edge detection is at the forefront of computer vision systems for object recognition, so it is critical to have a good understanding of edge detection techniques. 
In this paper, a comparative analysis of various image edge detection techniques is presented. The proposed work was tested in MATLAB. It has been shown that the Gabor filter performs better than the Sobel filter.
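For readers unfamiliar with the Sobel operator, the sketch below computes the gradient magnitude of a small grayscale image with the two standard 3x3 Sobel kernels. It is written in plain Python rather than MATLAB, leaves the border pixels at zero as a simplifying assumption, and does not cover the Gabor filter.

```python
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]     # horizontal gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]     # vertical gradient kernel

def sobel_magnitude(img):
    """Gradient magnitude of a 2-D list of gray levels; borders stay 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out
```

On an image with a vertical step from 0 to 10, the response is 40 along the step and 0 in the flat regions.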


An Efficient Denoising Architecture for Removal of Impulse Noise in Images
Images are often corrupted by impulse noise in the procedures of image acquisition and transmission. In this paper, we propose an efficient denoising scheme and its VLSI architecture for the removal of random-valued impulse noise. To achieve the goal of low cost, a low-complexity VLSI architecture is proposed. 
We employ a decision-tree-based impulse noise detector to detect the noisy pixels, and an edge-preserving filter to reconstruct the intensity values of noisy pixels. Furthermore, an adaptive technology is used to enhance the effects of removal of impulse noise. 
Our extensive experimental results demonstrate that the proposed technique can obtain better performance in terms of both quantitative evaluation and visual quality than the previous lower-complexity methods. Moreover, the performance can be comparable to the higher-complexity methods. 
The VLSI architecture of our design yields a processing rate of about 200 MHz by using TSMC 0.18 µm technology. Compared with the state-of-the-art techniques, this work can reduce memory storage by more than 99 percent. The design requires only low computational complexity and two line memory buffers. Its hardware cost is low and suitable to be applied to many real-time applications. 


An Efficient High Speed Wallace Tree Multiplier
Power dissipation of integrated circuits is a major concern for VLSI circuit designers. A Wallace tree multiplier is an improved version of tree based multiplier architecture. It uses carry save addition algorithm to reduce the latency.
This paper aims at further reduction of the latency and power consumption of the Wallace tree multiplier. This is accomplished by the use of 4:2 and 5:2 compressors and a proposed carry select adder. 
The result shows that the proposed Wallace tree multiplier is 44.4% faster than the conventional Wallace tree multiplier, along with an 11% reduction in power consumption. The simulations have been carried out using the Modelsim and Xilinx tools.
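A 4:2 compressor takes four partial-product bits plus a carry-in and produces a sum bit and two carry bits of double weight. A minimal bit-level sketch, assuming the textbook construction from two cascaded full adders (the compressor circuit used in the paper may differ), is:

```python
def full_adder(a, b, c):
    """Return (sum, carry) of three input bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)

def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor: x1+x2+x3+x4+cin == s + 2*(carry + cout)."""
    s1, cout = full_adder(x1, x2, x3)     # cout does not depend on cin
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout
```

Because `cout` is independent of `cin`, a row of these cells has no ripple between neighbouring columns, which is what makes them attractive in a reduction tree.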


An Efficient SQRT Architecture of Carry Select Adder Design by Common Boolean Logic
The carry select adder (CSLA) is known to be the fastest adder among the conventional adder structures. This work uses an efficient carry select adder that shares the common Boolean logic (CBL) term. 
After logic simplification, only one OR gate and one inverter gate are needed for the carry and summation operations. Through the multiplexer, the correct output is selected according to the logic state of the carry-in signal. Based on this modification, a square root CSLA (SQRT CSLA) architecture has been developed and compared with the regular and modified SQRT CSLA architectures. 
The modified CSLA architecture has been developed using a Binary to Excess-1 Converter (BEC). This paper proposes an efficient method which replaces the BEC using common Boolean logic. The result analysis shows that the proposed architecture achieves a threefold advantage in terms of area, delay and power.
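The select mechanism can be illustrated with a behavioural model of the BEC-based variant: each block computes its sum for a carry-in of 0, a BEC adds 1 to obtain the result for a carry-in of 1, and a multiplexer picks between them. The group widths below are an assumed 16-bit square-root split, and the model does not include the common Boolean logic simplification that the paper proposes.

```python
def csla_block(a, b, cin, width):
    """One carry select block; returns (sum bits, carry out)."""
    mask = (1 << width) - 1
    s0 = (a & mask) + (b & mask)      # result for cin = 0, including the carry bit
    s1 = s0 + 1                       # BEC output: result for cin = 1
    chosen = s1 if cin else s0        # multiplexer driven by the incoming carry
    return chosen & mask, chosen >> width

def csla_add(a, b, widths=(2, 2, 3, 4, 5)):
    """Add two unsigned numbers block by block (16 bits with these widths)."""
    result, carry, shift = 0, 0, 0
    for w in widths:
        s, carry = csla_block(a >> shift, b >> shift, carry, w)
        result |= s << shift
        shift += w
    return result | (carry << shift)
```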


An Interactive RFID-based Bracelet for Airport Luggage Tracking System
Radio Frequency Identification (RFID) is a promising technology that has been implemented lately in airports. RFID tags are used to identify and track the location of passengers' luggage. 
This paper investigates the use of an interactive bracelet that communicates with the RFID system by means of a database application. The database system interacts with the bracelet using messages that inform the passenger about the status of their luggage. 
The proposed database design and implementation are also discussed to describe the different functionalities of the application.


Analysis and Design of a Low-Voltage Low-Power Double-Tail Comparator
The need for ultra low-power, area efficient, and high speed analog-to-digital converters is pushing toward the use of dynamic regenerative comparators to maximize speed and power efficiency. In this paper, an analysis on the delay of the dynamic comparators will be presented and analytical expressions are derived. 
From the analytical expressions, designers can obtain an intuition about the main contributors to the comparator delay and fully explore the tradeoffs in dynamic comparator design. Based on the presented analysis, a new dynamic comparator is proposed, where the circuit of a conventional double-tail comparator is modified for low-power and fast operation even at low supply voltages. 
Without complicating the design and by adding few transistors, the positive feedback during the regeneration is strengthened, which results in remarkably reduced delay time. Post-layout simulation results in a 0.18-µm CMOS technology confirm the analysis results. It is shown that in the proposed dynamic comparator both the power consumption and delay time are significantly reduced. 
The maximum clock frequency of the proposed comparator can be increased to 2.5 and 1.1 GHz at supply voltages of 1.2 and 0.6 V, while consuming 1.4 mW and 153 µW, respectively. The standard deviation of the input-referred offset is 7.8 mV at 1.2 V supply. 


Area-Delay Efficient Binary Adders in QCA
As transistors decrease in size more and more of them can be accommodated in a single die, thus increasing chip computational capabilities. However, transistors cannot get much smaller than their current size. 
The quantum-dot cellular automata (QCA) approach represents one of the possible solutions in overcoming this physical limit, even though the design of logic modules in QCA is not always straightforward. In this brief, we propose a new adder that outperforms all state-of-the-art competitors and achieves the best area-delay tradeoff. 
The above advantages are obtained by using an overall area similar to the cheaper designs known in literature. The 64-bit version of the novel adder spans over 18.72 µm² of active area and shows a delay of only nine clock cycles, that is just 36 clock phases.
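QCA logic is built from majority gates and inverters, so an adder in QCA is essentially a majority-gate network. The sketch below shows a well-known three-majority-gate full adder as background. It is not the adder proposed in this brief, whose structure the abstract does not describe.

```python
def maj(a, b, c):
    """Three-input majority gate, the basic QCA logic primitive."""
    return (a & b) | (b & c) | (a & c)

def qca_full_adder(a, b, cin):
    """Full adder from three majority gates and two inverters."""
    cout = maj(a, b, cin)
    s = maj(1 - cout, cin, maj(a, b, 1 - cin))
    return s, cout

def ripple_add(a, b, width):
    """Chain the full adder cell to add two unsigned width-bit numbers."""
    carry, total = 0, 0
    for i in range(width):
        s, carry = qca_full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total | (carry << width)
```

The quoted delay figures are consistent with four clock phases per QCA clock cycle: nine cycles correspond to 36 phases.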


Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay
In this paper, we present an efficient architecture for the implementation of a delayed least mean square adaptive filter. For achieving lower adaptation-delay and area-delay-power efficient implementation, we use a novel partial product generator and propose a strategy for optimized balanced pipelining across the time-consuming combinational blocks of the structure. 
From synthesis results, we find that the proposed design offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of the existing systolic structures, on average, for filter lengths N = 8, 16, and 32. We propose an efficient fixed-point implementation scheme of the proposed architecture, and derive the expression for steady-state error. 
We show that the steady-state mean squared error obtained from the analytical result matches with the simulation result. Moreover, we have proposed a bit-level pruning of the proposed architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning without noticeable degradation of steady-state-error performance. 
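The delayed LMS update differs from standard LMS only in that the weights are corrected with an error and input vector that are a few samples old. A floating-point sketch of that update, assuming a hypothetical two-tap system identification setup and ignoring the fixed-point and pipelining aspects of the paper, is:

```python
import random

def dlms(x, d, taps, mu, delay):
    """Delayed LMS: w(n+1) = w(n) + mu * e(n - delay) * x(n - delay)."""
    w = [0.0] * taps
    buf = [0.0] * taps
    pending = []                       # (error, input vector) pairs awaiting use
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]          # newest sample first
        e = dn - sum(wi * xi for wi, xi in zip(w, buf))
        errors.append(e)
        pending.append((e, buf))
        if len(pending) > delay:
            e_old, x_old = pending.pop(0)
            w = [wi + mu * e_old * xi for wi, xi in zip(w, x_old)]
    return w, errors

def identify_demo(delay=2, n=2000):
    """Identify the hypothetical unknown filter h = [0.5, -0.3] from noise-free data."""
    rng = random.Random(0)
    h = [0.5, -0.3]
    x = [rng.uniform(-1, 1) for _ in range(n)]
    d = [h[0] * x[i] + (h[1] * x[i - 1] if i else 0.0) for i in range(n)]
    w, _ = dlms(x, d, taps=2, mu=0.05, delay=delay)
    return w
```

Setting `delay=0` reduces the routine to ordinary LMS, which makes it easy to compare convergence with and without adaptation delay.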


Asynchronous Design of Energy Efficient Full Adder
Asynchronous adiabatic logic (AAL) is a novel low-power design technique which combines the energy saving benefits of asynchronous systems with adiabatic benefits. In this paper, energy efficient full adder cell using double pass transistor with asynchronous adiabatic logic (DPTAAL) is investigated. 
Asynchronous adiabatic circuits are very low-power circuits that preserve energy for reuse, which reduces the amount of energy drawn directly from the power supply. In this work, a full adder cell using DPTAAL is designed and simulated, which exhibits lower energy consumption and reliable logical operation. 
To improve the circuit performance at reduced voltage level, double pass transistor logic (DPL) is introduced. The energy performance of the proposed design is compared with the conventional CMOS full adder and the quasi-adiabatic families of full adder cell designs namely, 2N2P, 2N2N2P, PFAL, ADSL, IPGL. 
Simulation results show significant energy savings from 15 to 75% for clock rates ranging from 100MHz to 200MHz.


Background Subtraction Based on Threshold detection using Modified K-Means Algorithm
In video surveillance systems, background subtraction is the first processing stage and it is used to determine the objects in a particular scene. It is a general term for a process which aims to separate foreground objects from a relatively stationary background. 
It should be processed in real time. In a human detection system, it is obtained by computing the pixel-by-pixel variation between the current frame and the background image, followed by automatic thresholding. 
This paper proposes a K-means based background subtraction for real-time video processing in video surveillance. We have analyzed and evaluated the performance of the proposed method against standard K-means and other background subtraction algorithms. The experimental results show that the proposed method provides better output.
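The pipeline described above, frame differencing followed by an automatic threshold, can be sketched as follows. The threshold here comes from a plain two-cluster K-means on the difference values, which is an assumption made for illustration; the modified K-means of the paper is not specified in the abstract.

```python
def two_means_threshold(values, iters=20):
    """1-D K-means with two clusters; returns the midpoint of the centroids."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iters):
        t = (lo + hi) / 2
        low = [v for v in values if v <= t]
        high = [v for v in values if v > t]
        if not low or not high:        # all values fell into a single cluster
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return (lo + hi) / 2

def foreground_mask(frame, background):
    """Mark pixels whose absolute difference from the background is large."""
    diff = [[abs(f - b) for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]
    t = two_means_threshold([v for row in diff for v in row])
    return [[1 if v > t else 0 for v in row] for row in diff]
```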


Broadside and Skewed-Load Tests under Primary Input Constraints
Tester limitations may impose certain constraints on the primary input vectors applicable as part of a two-pattern test for delay faults. Under these constraints, the primary input vectors may be held constant, or the second primary input vector of a test may be obtained by a single shift of a scan chain relative to the first. 
The goal of this brief is to study the differences in achievable transition fault coverage between various primary input constraints that are similar to the commonly used ones of holding or shifting primary input vectors. This brief also studies the possibility of combining the constraints in order to increase the transition fault coverage. 
The combination requires a fixed and circuit-independent hardware structure similar to the case where shifting of primary input vectors is used. This study is done using test sets that consist of both broadside and skewed-load tests in order to maximize the transition fault coverage.


Built-In Generation of Functional Broadside Tests using a Fixed Hardware Structure
Functional broadside tests are two-pattern scan-based tests that avoid overtesting by ensuring that a circuit traverses only reachable states during the functional clock cycles of a test. 
In addition, the power dissipation during the fast functional clock cycles of functional broadside tests does not exceed that possible during functional operation. On-chip test generation has the added advantage that it reduces test data volume and facilitates at-speed test application. 
This paper shows that on-chip generation of functional broadside tests can be done using a simple and fixed hardware structure, with a small number of parameters that need to be tailored to a given circuit, and can achieve high transition fault coverage for testable circuits. 
With the proposed on-chip test generation method, the circuit is used for generating reachable states during test application. This alleviates the need to compute reachable states offline.


Comparison of Static and Dynamic Printed Organic Shift Registers
Dynamic and static shift-register circuits are fabricated with an inkjet process for printing complementary organic semiconductors. 
The static design is based on edge-triggered master-slave flip-flops, and the dynamic design is based on a true-single-phase-clock architecture. The merits and drawbacks of the two approaches are considered and compared.


CORDIC based Fast Radix-2 DCT Algorithm
This letter proposes a novel coordinate rotation digital computer (CORDIC)-based fast radix-2 algorithm for computation of the discrete cosine transform (DCT). The proposed algorithm has several distinguishing advantages, such as Cooley-Tukey fast Fourier transform (FFT)-like regular data flow, a uniform post-scaling factor, in-place computation and arithmetic-sequence rotation angles. 
Compared to existing DCT algorithms, this proposed algorithm has lower computational complexity. Furthermore, the proposed algorithm is highly scalable, modular, regular, and suitable for pipelined VLSI implementation. 
In addition, this letter also provides an easy way to implement the reconfigurable or unified architecture for DCTs and inverse DCTs.
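The building block behind such algorithms is the CORDIC rotation, which rotates a vector using only shift-and-add style micro-rotations followed by one scaling. The sketch below is a generic floating-point rotation-mode CORDIC given as background, with an assumed iteration count; it is not the radix-2 DCT algorithm of the letter.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by angle (radians, roughly |angle| < 1.74)."""
    z = angle
    gain = 1.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0            # rotate towards zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Every micro-rotation stretches the vector; undo the accumulated gain once.
    return x / gain, y / gain
```

Because the same micro-rotations are applied whatever the direction, the gain is a constant that can be folded into a single post-scaling step.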


Design and Implementation of 32 Bit Unsigned Multiplier Using CLAA and CSLA
This project deals with the comparison of the VLSI design of the carry look-ahead adder (CLAA) based 32-bit unsigned integer multiplier and the VLSI design of the carry select adder (CSLA) based 32-bit unsigned integer multiplier. 
Both multiplier designs multiply two 32-bit unsigned integer values and give a 64-bit product. The CLAA based multiplier has a delay of 99 ns for the multiplication operation, whereas the CSLA based multiplier has nearly the same delay. 
However, the area needed for the CLAA multiplier is reduced to 31% by the CSLA based multiplier. These multipliers are implemented using Altera Quartus II, and timing diagrams are viewed through AvanWaves.


Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip
This paper presents the silicon-proven design of a novel on-chip network to support guaranteed traffic permutation in multiprocessor system-on-chip applications. The proposed network employs a pipelined circuit-switching approach combined with a dynamic path-setup scheme under a multistage network topology. 
The dynamic path-setup scheme enables runtime path arrangement for arbitrary traffic permutations. The circuit-switching approach offers a guarantee of permuted data and its compact overhead enables the benefit of stacking multiple networks. 
A 0.13-µm CMOS test-chip validates the feasibility and efficiency of the proposed design. Experimental results show that the proposed on-chip network achieves 1.9× to 8.2× reduction of silicon overhead compared to other design approaches.


Design Flow for Flip-Flop Grouping in Data-Driven Clock Gating
Clock gating is a predominant technique used for power saving. It is observed that the commonly used synthesis-based gating still leaves a large amount of redundant clock pulses. Data-driven gating aims to disable these. 
To reduce the hardware overhead involved, flip-flops (FFs) are grouped so that they share a common clock enabling signal. The question of what is the group size maximizing the power savings is answered in a previous paper. Here we answer the question of which FFs should be placed in a group to maximize the power reduction. We propose a practical solution based on the toggling activity correlations of FFs and their physical position proximity constraints in the layout. 
Our data-driven clock gating is integrated into a commercial Electronic Design Automation (EDA) backend design flow, achieving a total power reduction of 15% to 20% for various types of large-scale state-of-the-art industrial and academic designs in 40 and 65 nanometer process technologies. 
These savings are achieved on top of the savings obtained by clock gating synthesis performed by commercial EDA tools, and gating manually inserted into the register transfer level design.
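The effect of grouping on data-driven gating can be illustrated with a small model. The sketch below is not the paper's flow: it assumes each flip-flop has a per-cycle flag saying whether its input differs from its output, and that a group's shared enable is the OR of those flags. The function name, the traces, and the groupings are hypothetical.

```python
def gated_clock_pulses(traces, groups):
    """Count clock pulses delivered to flip-flops under data-driven gating.

    traces: dict mapping FF name -> list of 0/1 flags, one per cycle, where 1
            means the FF input differs from its output (it must be clocked).
    groups: list of lists of FF names; each group shares one enable signal.
    """
    total = 0
    for group in groups:
        cycles = len(traces[group[0]])
        for t in range(cycles):
            # Shared enable: OR of the per-FF "input differs from output" flags.
            if any(traces[ff][t] for ff in group):
                total += len(group)  # every FF in the group receives the pulse
    return total


# Hypothetical toggling traces: a and b toggle together, as do c and d.
traces = {
    "a": [1, 0, 0, 1],
    "b": [1, 0, 0, 1],
    "c": [0, 1, 0, 0],
    "d": [0, 1, 1, 0],
}
correlated = gated_clock_pulses(traces, [["a", "b"], ["c", "d"]])  # 8 pulses
mixed = gated_clock_pulses(traces, [["a", "c"], ["b", "d"]])       # 14 pulses
```

Without gating, the four flip-flops would receive 16 pulses over the four cycles. Grouping flip-flops with correlated toggling leaves fewer redundant pulses than the mixed grouping, which is the intuition behind choosing groups by activity correlation.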


Design of Digit-Serial FIR Filters: Algorithms, Architectures and a CAD Tool
In the last two decades, many efficient algorithms and architectures have been introduced for the design of low-complexity bit-parallel multiple constant multiplications (MCM) operation which dominates the complexity of many digital signal processing systems. 
On the other hand, little attention has been given to the digit-serial MCM design that offers alternative low-complexity MCM operations albeit at the cost of an increased delay. In this paper, we address the problem of optimizing the gate-level area in digit-serial MCM designs and introduce high-level synthesis algorithms, design architectures, and a computer-aided design tool. 
Experimental results show the efficiency of the proposed optimization algorithms and of the digit-serial MCM architectures in the design of digit-serial MCM operations and finite impulse response filters. 


Design of Hardware Function Evaluators using Low-Overhead Nonuniform Segmentation with Address Remapping
In the piecewise function evaluation with polynomial approximation, nonuniform segmentation can effectively reduce the size of lookup tables for some arithmetic functions compared to uniform segmentation approaches, at the cost of the extra segment address (index) encoder that results in area and delay overhead. 
Also, it is observed that the nonuniform segmentation reflects a design tradeoff between the ROM size and the area cost of the subsequent arithmetic computation hardware. In this paper, we propose a new nonuniform segmentation method that searches for the optimal segmentation scheme with the goal of minimized ROM, total area, or delay. 
For some high-variation arithmetic functions, the proposed segmentation method achieves significant area reduction compared to the uniform segmentation method. 
We also demonstrate the design tradeoff among uniform and nonuniform segmentation, and degree-one and degree-two polynomial approximations, with respect to precision ranging from 12 to 32 bits for the elementary function of reciprocal.
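As a point of reference for the tradeoff described above, the following sketch builds the uniform-segmentation baseline for the reciprocal on [1, 2) with a degree-one approximation per segment. It only illustrates table size versus error; the nonuniform search and address remapping proposed in the paper are not reproduced, and all function names are hypothetical.

```python
def build_table(f, lo, hi, segments):
    """Degree-one coefficients (value, slope) per uniform segment."""
    width = (hi - lo) / segments
    table = []
    for i in range(segments):
        x0 = lo + i * width
        x1 = x0 + width
        slope = (f(x1) - f(x0)) / width  # interpolate between segment endpoints
        table.append((f(x0), slope))
    return table


def evaluate(table, lo, hi, x):
    """Look up the segment for x and evaluate its linear polynomial."""
    segments = len(table)
    width = (hi - lo) / segments
    index = min(int((x - lo) / width), segments - 1)  # segment address
    c0, c1 = table[index]
    return c0 + c1 * (x - (lo + index * width))


def max_error(f, table, lo, hi, samples=4096):
    """Worst absolute error over an evenly spaced sample grid."""
    worst = 0.0
    for k in range(samples):
        x = lo + (hi - lo) * k / samples
        worst = max(worst, abs(f(x) - evaluate(table, lo, hi, x)))
    return worst


def reciprocal(x):
    return 1.0 / x


err16 = max_error(reciprocal, build_table(reciprocal, 1.0, 2.0, 16), 1.0, 2.0)
err64 = max_error(reciprocal, build_table(reciprocal, 1.0, 2.0, 64), 1.0, 2.0)
```

Quadrupling the number of segments shrinks the worst-case error by roughly a factor of sixteen, at the cost of a table four times as large. A nonuniform scheme tries to spend segments only where the function varies quickly.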


Design of High Speed Low Power Viterbi Decoder for TCM System
The Viterbi decoder is the most power-hungry module in a trellis coded modulation (TCM) system. In VLSI implementation, reduced chip area, low power consumption, and improved speed are the main goals. In this brief, a high-speed, low-power Viterbi decoder architecture is proposed to decode high-rate convolutional codes. The constraint length of a high-rate convolutional code should be high to maintain a low error probability. 
However, the computational complexity of the Viterbi algorithm for high-rate convolutional codes increases exponentially with the constraint length. This computational problem can be addressed by trimming the least likely paths at each trellis stage, as in the T-algorithm; as a result, significant power reduction can also be achieved. Furthermore, the pre-computation technique is used to speed up the search for the optimal path metric in the ACSU loop. 
The architecture of the add-compare-select unit (ACSU) loop is modified using the pre-computation architecture. This shortens the long critical path introduced by the conventional T-algorithm. The register exchange algorithm is used for the survivor memory unit design, since it is faster and requires less memory. Conceptually, the register exchange algorithm has a pre-defined end state. 
Since the optimized T-algorithm is used, pre-defining the end state is not possible. This issue is addressed and an appropriate solution is provided in this paper. Simulation results show that the proposed Viterbi decoder architecture significantly reduces the amount of computation and the power consumption with negligible performance reduction.
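The path-trimming step of the T-algorithm can be sketched in a few lines. This is a minimal illustration, assuming lower path metrics are better and that a state survives only if its metric lies within a threshold of the best one; the function name and the sample metrics are hypothetical, and the pre-computation architecture of the paper is not modeled.

```python
def t_algorithm_prune(path_metrics, threshold):
    """Keep only states whose path metric is within `threshold` of the best.

    path_metrics: dict mapping state -> accumulated path metric (lower is better).
    """
    best = min(path_metrics.values())  # this search sits in the ACS critical path
    return {s: pm for s, pm in path_metrics.items() if pm <= best + threshold}


survivors = t_algorithm_prune({0: 3, 1: 7, 2: 4, 3: 12}, threshold=2)
# survivors == {0: 3, 2: 4}
```

Finding the minimum metric before comparing is what lengthens the critical path in a conventional implementation, which is why the abstract focuses on speeding up that search.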


Design of Low Energy, High Performance Synchronous and Asynchronous 64-Point FFT
A case study exploring multi-frequency design is presented for a low energy and high performance FFT circuit implementation. An FFT architecture with concurrent data stream computation is selected. 
Asynchronous and synchronous implementations of a 16-point and a 64-point FFT circuit were designed and compared for energy, performance and area. Both versions are structurally similar and are generated using similar ASIC CAD tools and flows. 
The asynchronous design shows a benefit of 2.4×, 2.4× and 3.2× in terms of area, energy and performance respectively over its synchronous counterpart. The circuit is further compared with other published designs and shows 0.4×, 4.8× and 32.4× benefit with respect to area, energy and performance. 


Design of Low Power Sequential Circuit Using Clocked Pair Shared Flip flop
The clock system, consisting of clock distribution networks and sequential elements, is among the most power-consuming components in VLSI. Reducing flip-flop power consumption therefore has a deep impact on the total power consumption. Since power consumption is a major bottleneck of system performance, the clock load should be reduced to reduce the power consumption. 
The clock distribution network distributes the clock signal from a common point to all the elements that need it. Since this function is vital to a synchronous system, much attention has been given to the characteristics of these clock signals and the electrical networks used in their distribution. 
In a synchronous system, the clock distribution network consumes a large share of total power because it operates at the highest frequency and drives the highest capacitance. An effective way to reduce the capacitance of the clock load is to minimize the number of clocked transistors. 
In a low-swing differential capturing flip-flop system, the clock distribution network consumes a large amount of chip power and there is a large number of clocked transistors. Hence, as a novel approach, a clocked pair shared flip-flop is used to reduce the number of local clocked transistors.


Effective and Efficient Approach for Power Reduction by Using Multi-Bit Flip-Flops
Power has become a burning issue in modern VLSI design. In modern integrated circuits, the power consumed by clocking gradually takes a dominant part. Given a design, we can reduce its power consumption by replacing some flip-flops with fewer multi-bit flip-flops. 
However, this procedure may affect the performance of the original circuit. Hence, the flip-flop replacement without timing and placement capacity constraints violation becomes a quite complex problem. To deal with the difficulty efficiently, we have proposed several techniques. 
First, we perform a co-ordinate transformation to identify those flip-flops that can be merged and their legal regions. Besides, we show how to build a combination table to enumerate possible combinations of flip-flops provided by a library. 
Finally, we use a hierarchical way to merge flip-flops. Besides power reduction, the objective of minimizing the total wirelength is also considered. The time complexity of our algorithm is $\Theta(n^{1.12})$, less than the empirical complexity of $\Theta(n^{2})$. 
According to the experimental results, our algorithm significantly reduces clock power by 20–30% and the running time is very short. In the largest test case, which contains 1 700 000 flip-flops, our algorithm takes only about 5 min to replace flip-flops and the power reduction reaches 21%.


Efficiency Optimization for Burst-Mode Multilevel Radio Frequency Transmitters
The utilization of a burst-mode power amplifier (PA) together with pulse-width modulation (PWM) is a promising concept for achieving high efficiency in radio frequency (RF) transmitters. Nevertheless, such a transmitter architecture requires bandpass filtering to suppress side-band spectral components to retrieve the wanted signal, which reduces the transmit power and the transmitter efficiency. 
High efficiency can only be expected with the maximum transmit power and signals with low peak-to-average-power ratios (PAPRs). To boost efficiency for signals with high PAPRs and signals at variable transmit power levels, the burst-mode multilevel transmitter architecture has been widely discussed as a potential solution. 
This paper presents an efficiency optimization procedure of burst-mode multilevel transmitters for signals with high PAPRs and signals at variable transmit power levels. The impact of the threshold value on the transmitter efficiency is studied, where the optimum threshold value and the maximum transmitter efficiency can be obtained according to input magnitude statistics. 
In addition, the relation between the threshold value and the efficiency expression of burst-mode multilevel transmitters and those of Doherty PAs is investigated. It is shown that the obtained optimum threshold value, although originally designed for burst-mode transmitters, can also be applied to Doherty and multistage Doherty PAs to achieve maximum transmitter efficiency. Simulations are used to validate the efficiency improvement of the optimized burst-mode multilevel transmitters compared to two-level and non-optimized multilevel transmitters.


Efficient Implementation of Reconfigurable Warped Digital Filters With Variable Low-Pass, High-Pass, Bandpass, and Bandstop Responses
In this brief, an efficient implementation of reconfigurable warped digital filter with variable low-pass, high-pass, bandpass, and bandstop responses is presented. The warped filters, obtained by replacing each unit delay of a digital filter with an all-pass filter, are widely used for various audio processing applications. 
However, warped filters require first-order all-pass transformation to obtain variable low-pass or high-pass responses, and second-order all-pass transformation to obtain variable bandpass or bandstop responses. To overcome this drawback, the proposed method combines the warped filters with the coefficient decimation technique. 
The proposed architecture provides variable low-pass or high-pass responses with fine control over cut-off frequency and variable bandwidth bandpass or bandstop responses at an arbitrary center frequency without updating the filter coefficients or filter structure. 
The design example shows that the proposed variable digital filter is simple to design and offers substantial savings in gate counts and power consumption over other approaches.


Efficient Power-Analysis-Resistant Dual-Field Elliptic Curve Cryptographic Processor Using Heterogeneous
Elliptic curve cryptography (ECC) for portable applications is in high demand to ensure secure information exchange over wireless channels. Because of the high computational complexity of ECC functions, dedicated hardware architecture is essential to provide sufficient ECC performance. 
Besides, crypto-ICs are vulnerable to side-channel information leakage because the private key can be revealed via power-analysis attacks. In this paper, a new heterogeneous dual-processing-element (dual-PE) architecture and a priority-oriented scheduling of right-to-left double-and-add-always EC scalar multiplication (ECSM) with randomized processing technique are proposed to achieve a power-analysis-resistant dual-field ECC (DF-ECC) processor. 
For this dual-PE design, a memory hierarchy with local memory synchronization scheme is also exploited to improve data bandwidth. Fabricated in a 90-nm CMOS technology, a 0.4-mm² 160-b DF-ECC chip can achieve 0.34/0.29 ms and 11.7/9.3 µJ for one GF(p)/GF(2^m) ECSM. 
Compared to other related works, our approach is advantageous not only in hardware efficiency but also in protection against power-analysis attacks. 


Efficient VLSI Implementation of Neural Networks With Hyperbolic Tangent Activation Function
Nonlinear activation function is one of the main building blocks of artificial neural networks. Hyperbolic tangent and sigmoid are the most used nonlinear activation functions. Accurate implementation of these transfer functions in digital networks faces certain challenges. 
In this paper, an efficient approximation scheme for hyperbolic tangent function is proposed. The approximation is based on a mathematical analysis considering the maximum allowable error as design parameter. Hardware implementation of the proposed approximation scheme is presented, which shows that the proposed structure compares favorably with previous architectures in terms of area and delay. 
The proposed structure requires fewer output bits for the same maximum allowable error when compared to the state of the art. The number of output bits of the activation function determines the bit width of multipliers and adders in the network. 
Therefore, the proposed activation function results in reduction in area, delay, and power in VLSI implementation of artificial neural networks with hyperbolic tangent activation function. 
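To make the idea of designing for a maximum allowable error concrete, here is a generic piecewise-linear sketch. It is not the approximation scheme of the paper: it assumes uniform segments, saturation to ±1 beyond a cutoff, and a doubling search on the segment count. All names and the sampling grid are hypothetical.

```python
import math


def pwl_tanh(x, knots, width, x_sat):
    """Piecewise-linear tanh using odd symmetry and saturation beyond x_sat."""
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    if x >= x_sat:
        return sign
    i = min(int(x / width), len(knots) - 2)  # segment index
    frac = x / width - i
    return sign * (knots[i] + frac * (knots[i + 1] - knots[i]))


def design_tanh(max_error, step=0.001, x_max=8.0):
    """Double the segment count until the sampled error meets max_error."""
    # Past atanh(1 - e), tanh is within e of 1, so the output can saturate.
    x_sat = math.atanh(1.0 - max_error)
    segments = 1
    samples = int(x_max / step)
    while True:
        width = x_sat / segments
        knots = [math.tanh(i * width) for i in range(segments + 1)]
        worst = max(
            abs(math.tanh(k * step) - pwl_tanh(k * step, knots, width, x_sat))
            for k in range(samples)
        )
        if worst <= max_error:
            return knots, width, x_sat
        segments *= 2


knots, width, x_sat = design_tanh(0.01)
y = pwl_tanh(0.5, knots, width, x_sat)
```

The number of knots that comes out of the search sets the lookup-table depth, while the error target bounds how many output bits are meaningful.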


Eliminating Synchronization Latency Using Sequenced Latching
Modern multicore systems have a large number of components operating in different clock domains and communicating through asynchronous interfaces. These interfaces use synchronizer circuits, which guard against metastability failures but introduce latency in processing the asynchronous input. 
We propose a speculative method that hides synchronization latency by overlapping it with computation cycles. We verify the correctness of our approach through a field programmable gate array implementation and apply it to a number of synthesized benchmarks. 
Synthesis results reveal that our approach achieves average savings of 135% and 204% in area costs and nearly 100% in power costs compared to two similar speculative techniques. 


Error Detection in Majority Logic Decoding of Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes
In a recent paper, a method was proposed to accelerate the majority logic decoding of difference set low density parity check codes. This is useful as majority logic decoding can be implemented serially with simple hardware but requires a large decoding time. For memory applications, this increases the memory access time. 
The method detects whether a word has errors in the first iterations of majority logic decoding, and when there are no errors the decoding ends without completing the rest of the iterations. Since most words in a memory will be error-free, the average decoding time is greatly reduced. 
In this brief, we study the application of a similar technique to a class of Euclidean geometry low density parity check (EG-LDPC) codes that are one step majority logic decodable. 
The results obtained show that the method is also effective for EG-LDPC codes. Extensive simulation results are given to accurately estimate the probability of error detection for different code sizes and numbers of errors. 


FFT Architectures for Real-Valued Signals Based on Radix-2by3 & Radix-2by4 Algorithms
This brief presents novel parallel pipelined architectures for the computation of the fast Fourier transform (FFT) of real signals and inverse FFT of Hermitian-symmetric signals using only real datapaths. 
The real FFT structure is transformed by transferring twiddle factors to subsequent stages, such that each stage in the proposed flow graph contains one column of butterfly units and one column of twiddle factor blocks, and each column of the flow graph contains only N samples. 
This is a key requirement for the design of architectures that are based on only real datapaths. This structure is then mapped to pipelined architectures. The proposed architectures can be used with any FFT size or level of parallelism, which is a power of two. 
A systematic method to design architectures for FFTs with different levels of parallelism and radix values is presented. By modifying the FFT flow graph for real-valued samples, this methodology leads to architectures with fewer adders, delays, and interconnections. 


Fixed-Width Multipliers and Multipliers-Accumulators with Min-Max Approximation Error
Fixed-width multipliers have two n-bit operands and produce an approximate n-bit result for their product. These multipliers discard part of the partial products matrix, to reduce hardware cost, and employ extra correction functions to reduce approximation error.
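A bit-level model shows what discarding part of the partial products matrix means. The sketch below assumes unsigned operands, drops every partial-product bit in the n least significant columns, and accepts an optional constant added before truncation. That constant is a hypothetical stand-in for a correction function; the min-max correction of the paper is not reproduced.

```python
def exact_high(a, b, n):
    """Upper n bits of the exact 2n-bit product."""
    return (a * b) >> n


def fixed_width_multiply(a, b, n, correction=0):
    """Approximate upper n bits of a*b for unsigned n-bit operands.

    Partial-product bits a_i*b_j with i + j < n are discarded; `correction`
    is an illustrative constant added before the final truncation.
    """
    total = correction
    for i in range(n):
        for j in range(n):
            if i + j >= n and (a >> i) & 1 and (b >> j) & 1:
                total += 1 << (i + j)
    return total >> n


approx = fixed_width_multiply(15, 15, 4)  # 11
exact = exact_high(15, 15, 4)             # 14
```

Without correction, the truncated result never exceeds the exact one, so the error is one-sided. A correction term aims to re-center that error at low hardware cost.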


FPGA Implementation of Pipelined Architecture For SPIHT Algorithm
This paper presents a throughput-efficient FPGA implementation of the `Set Partitioning in Hierarchical Trees' (SPIHT) algorithm for compression of images. SPIHT uses the inherent redundancy among wavelet coefficients and is suited for both gray and color images. The SPIHT algorithm uses dynamic data structures, which hinder hardware realization. 
In this FPGA implementation, we have modified basic SPIHT in two ways: by using static (fixed) mappings which represent significant information, and by interchanging the sorting and refinement passes. 
A hardware realization is done in a Xilinx XC3S200 device. The SPIHT algorithm can be applied to both grey-scale and colored images. SPIHT displays exceptional characteristics over several properties like good image quality, fast coding and decoding, a fully progressive bit stream, application in lossless compression, error protection and ability to code for exact bit rate. 


Gate Mapping Automation for Asynchronous NULL Convention Logic Circuits
Design automation techniques are a key challenge in the widespread application of timing-robust asynchronous circuit styles. In this paper, a new methodology for mapping multi-rail logic expressions to a NULL convention logic (NCL) gate library is proposed. 
The new methodology is then compared to another recently proposed mapping approach, demonstrating that the new methodology can further reduce the area and improve the delay of NCL circuits. Also, in contrast to the original approach, which only targets area reduction, the new methodology can target any arbitrary cost function or use any subset of the NCL gate library for mapping. 
In order to automate the new methodology and compare it with the original one, both methodologies were implemented in the Perl programming language and compared in terms of mapping performance and runtime. The results show that, depending on the test circuit, the new methodology can offer up to 10% improvement in area, and 39% improvement in delay.


Glitch-Free NAND-Based Digitally Controlled Delay-Lines
The recently proposed NAND-based digitally controlled delay-lines (DCDL) present a glitching problem which may limit their use in many applications. This paper presents a glitch-free NAND-based DCDL which overcomes this limitation, opening the use of NAND-based DCDLs in a wide range of applications. 
The proposed NAND-based DCDL maintains the same resolution and minimum delay as the previously proposed NAND-based DCDL. A theoretical demonstration of the glitch-free operation of the proposed DCDL is also derived in the paper. 
Following this analysis, three driving circuits for the delay control bits are also proposed. The proposed DCDLs have been designed in a 90-nm CMOS technology and compared, in this technology, to the state of the art. Simulation results show that the novel circuits achieve the lowest resolution, with a slight worsening of the minimum delay with respect to the previously proposed DCDL with the lowest delay. 
Simulations also confirm the correctness of the developed glitching model and sizing strategy. As an example application, the proposed DCDL is used to realize an all-digital spread-spectrum clock generator (SSCG). Using the proposed DCDL in this circuit reduces the peak-to-peak absolute output jitter by more than 40% with respect to an SSCG using three-state inverter based DCDLs.


Hardware Implementation of a Digital Watermarking System for Video Authentication
This paper presents a hardware implementation of a digital watermarking system that can insert invisible, semifragile watermark information into compressed video streams in real time. The watermark embedding is processed in the discrete cosine transform domain. 
To achieve high performance, the proposed system architecture employs pipeline structure and uses parallelism. Hardware implementation using field programmable gate array has been done, and an experiment was carried out using a custom versatile breadboard for overall performance evaluation. 
Experimental results show that a hardware-based video authentication system using this watermarking technique features minimum video quality degradation and can withstand certain potential attacks, i.e., cover-up attacks, cropping, and segment removal on video sequences. 
Furthermore, the proposed hardware-based watermarking system features low power consumption, low cost implementation, high processing speed, and reliability.


High-Throughput Compact Delay-Insensitive Asynchronous NOC Router
A new asynchronous delay-insensitive data-transmission method based on level-encoded dual-rail (LEDR) encoding with novel packet-structure restriction is proposed to realize a high-throughput Network-on-Chip (NoC) router together with a compact hardware. 
The use of LEDR encoding halves the number of communication steps and registers used in comparison with four-phase dual-rail encoding, because the spacer information of the four-phase encoding is eliminated, which significantly improves the network throughput. By using the proposed packet structure, the phase information of header and tail flits is uniquely determined. 
Since the router can be asynchronously controlled by ignoring the phase information, the circuit is compactly implemented. As a result, the proposed asynchronous NoC router on a 0.13μm CMOS technology, has a 90% increase in throughput and a 34% decrease in energy dissipation with 25% area overhead in comparison with a conventional four-phase asynchronous NoC router under a post-layout simulation. 
In a 4x4 2-D mesh topology, the proposed asynchronous NoC has a 140% increase in throughput and half packet latency compared with the conventional one. We also fabricate the asynchronous NoC based on the proposed router on a 0.13μm CMOS technology and demonstrate the chip correctly operates under a supply voltage of 0.6V to 1.8V.


High-Throughput Multistandard Transform Core Supporting MPEG/H.264/VC-1 using Common Sharing Distributed Arithmetic
This paper proposes a low-cost high-throughput multistandard transform (MST) core, which can support MPEG-1/2/4 (8 x 8), H.264 (8 x 8, 4 x 4), and VC-1 (8 x 8, 8 x 4, 4 x 8, 4 x 4) transforms. Common sharing distributed arithmetic (CSDA) combines factor sharing and distributed arithmetic sharing techniques, efficiently reducing the number of adders for high hardware-sharing capability. 
This achieves a 44.5% reduction in adders in the proposed MST, compared with the direct implementation method. With eight parallel computation paths, the proposed MST core has an eightfold operation frequency throughput rate. Measurements show that the proposed CSDA-MST core achieves a high-throughput rate of 1.28 G-pels/s, supporting the (4928 x 2048@24 Hz) digital cinema or ultrahigh resolution format. 
This is possible only with 30k gate counts when implemented in a TSMC 0.18-µm CMOS process. The CSDA-MST core thus achieves a high-throughput rate supporting multistandard transformations at low cost.


Improvement of the Security of Zigbee by a New Chaotic Algorithm
The security protocols used in ZigBee rely on an advanced encryption standard-counter mode (AES-CTR) algorithm to encrypt data before transmission. This algorithm is very robust, but it is time consuming. 
For some industrial and medical applications, it does not meet the real-time requirement. When the AES is used in counter mode CTR, it becomes like a stream cipher that aims to generate pseudorandom bits. Also, to encrypt data, the latter are combined with the plaintext using the XOR operation. New fast stream ciphers were proposed for the eStream project, but these ciphers have shown some weakness. 
On the other hand, ciphers based on chaotic functions seem to be more promising. Detailed analyses have shown that chaotic functions have very good cryptographic properties and can be used to construct high speed and strong stream ciphers. In this paper, a new robust and fast chaotic encryption algorithm RFCA is presented. 
It consists of a chaotic cipher composed of two perturbed piecewise linear chaotic maps. This algorithm is particularly suited to data encryption in ZigBee networks, where robustness and real-time operation are both essential. 
A comparison between our algorithm (RFCA) and the AES-CTR, the simplified AES, and the eStream finalist candidates, is presented with regard to speed and robustness. This is done using correlation coefficients, unified average changing intensity, number of pixels change rate, and test of randomness for the generated bit sequences using the National Institute of Standards and Technology statistical test suite.
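The stream-cipher structure described above, a keystream XORed with the plaintext, can be sketched with a single piecewise linear chaotic map. This is not RFCA: it uses one unperturbed map in floating point, takes one byte per iteration, and is not secure. The seed, the control parameter, and the function names are hypothetical.

```python
def pwlcm(x, p):
    """Piecewise linear chaotic map on [0, 1] with control parameter 0 < p < 0.5."""
    if x >= 0.5:
        x = 1.0 - x  # the map is symmetric about 0.5
    if x < p:
        return x / p
    return (x - p) / (0.5 - p)


def keystream(seed, p, length):
    """Iterate the map and quantize each state to one keystream byte."""
    x = seed
    out = bytearray()
    for _ in range(length):
        x = pwlcm(x, p)
        out.append(int(x * 256) & 0xFF)
    return bytes(out)


def xor_cipher(data, seed, p):
    """Encrypt or decrypt by XORing data with the keystream (same operation)."""
    ks = keystream(seed, p, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))


message = b"ZigBee frame"
ciphertext = xor_cipher(message, seed=0.3141, p=0.27)
recovered = xor_cipher(ciphertext, seed=0.3141, p=0.27)
```

Because XOR is its own inverse, the receiver regenerates the same keystream from the shared seed and parameter and applies the same function to decrypt.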


IsoNet: Hardware-Based Job Queue Management for Many-Core Architectures
Imbalanced distribution of workloads across a chip multiprocessor (CMP) constitutes wasteful use of resources. Most existing load distribution and balancing techniques employ very limited hardware support and rely predominantly on software for their operation. 
This paper introduces IsoNet, a hardware-based conflict-free dynamic load distribution and balancing engine. IsoNet is a lightweight job queue manager responsible for administering the list of jobs to be executed, and maintaining load balance among all CMP cores. By exploiting a micro-network of load-balancing modules, the proposed mechanism is shown to effectively reinforce concurrent computation in many-core environments. 
Detailed evaluation using a full-system simulation framework indicates that IsoNet significantly outperforms existing techniques and scales efficiently to as many as 1024 cores. Furthermore, to assess its feasibility, the IsoNet design is synthesized, placed, and routed in 45-nm VLSI technology. Analysis of the resulting low-level implementation shows that IsoNet's area and power overhead are almost negligible. 


Least Significant Bit Matching Steganalysis based on Feature Analysis
Steganography is a science of hiding messages into multimedia documents. In steganography, there is a technique in which the least significant bit is modified to hide the secret message, known as the least significant bit (LSB) steganography. 
Several steganalyzers are developed to detect least significant bit (LSB) matching steganography. Least significant bit matching images are still not well detected, especially, at low embedding rate. 
In this paper, we have improved the least significant bit steganalyzers by analyzing and manipulating the features of some existing least significant bit matching steganalysis techniques. 
A comprehensive set of experiments is carried out to justify the proposed method's applicability and evaluate its performance against the existing least significant bit matching steganalysis techniques. 
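For readers unfamiliar with the embedding being detected, the sketch below shows LSB matching on a list of 8-bit pixel values: when a pixel's least significant bit disagrees with the message bit, the pixel is randomly incremented or decremented rather than having its bit overwritten. It illustrates the embedding side only, not the steganalyzer; names and sample data are hypothetical.

```python
import random


def lsb_matching_embed(pixels, bits, rng):
    """Embed one message bit per pixel using +/-1 LSB matching."""
    stego = list(pixels)
    for i, bit in enumerate(bits):
        if stego[i] & 1 != bit:
            if stego[i] == 0:
                stego[i] += 1          # cannot go below 0
            elif stego[i] == 255:
                stego[i] -= 1          # cannot go above 255
            else:
                stego[i] += rng.choice((-1, 1))
    return stego


def lsb_extract(stego, count):
    """Read the message back from the least significant bits."""
    return [p & 1 for p in stego[:count]]


cover = [0, 255, 100, 101, 7, 8]
message_bits = [1, 0, 1, 1, 0, 0]
stego = lsb_matching_embed(cover, message_bits, random.Random(1))
extracted = lsb_extract(stego, len(message_bits))
```

Each pixel changes by at most one gray level, and the random sign avoids the pairwise value asymmetry that simple LSB replacement introduces, which is one reason LSB matching is harder to detect at low embedding rates.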


Location-Aware and Safer Cards: Enhancing RFID Security and Privacy via Location Sensing
In this paper, we report on a new approach for enhancing security and privacy in certain RFID applications whereby location or location-related information (such as speed) can serve as a legitimate access context. 
Examples of these applications include access cards, toll cards, credit cards, and other payment tokens. We show that location awareness can be used by both tags and back-end servers for defending against unauthorized reading and relay attacks on RFID systems. On the tag side, we design a location-aware selective unlocking mechanism using which tags can selectively respond to reader interrogations rather than doing so promiscuously. 
On the server side, we design a location-aware secure transaction verification scheme that allows a bank server to decide whether to approve or deny a payment transaction and detect a specific type of relay attack involving malicious readers. 
The premise of our work is a current technological advancement that can enable RFID tags with low-cost location (GPS) sensing capabilities. Unlike prior research on this subject, our defenses do not rely on auxiliary devices or require any explicit user involvement.


Low Latency Systolic Montgomery Multiplier for Finite Field Based on Pentanomials
In this paper, we present a low latency systolic Montgomery multiplier over GF(2^m) based on irreducible pentanomials. An efficient algorithm is presented to decompose the multiplication into a number of independent units to facilitate parallel processing. Besides, a novel so-called “pre-computed addition” technique is introduced to further reduce the latency. 
The proposed design involves significantly less area-delay and power-delay complexities compared with the best of the existing designs. It has the same or shorter critical-path and involves nearly one-fourth of the latency of the other in case of the National Institute of Standards and Technology recommended irreducible pentanomials.


Low-Complexity Multiplier for GF(2^m) based on All-One Polynomials
We present a new low-complexity bit-parallel canonical basis multiplier for the field GF(2^m) generated by an all-one-polynomial. The proposed canonical basis multiplier requires m^2-1 XOR gates and m^2 AND gates. 
We also extend this canonical basis multiplier to obtain a new bit-parallel normal basis multiplier.
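The arithmetic behind an all-one-polynomial field can be sketched in software. The model below is not the paper's gate-level multiplier: it assumes the degree-m all-one polynomial is irreducible (for example m = 2 or m = 4) and uses the fact that x^(m+1) = 1 modulo that polynomial, so multiplying by x^i becomes a cyclic rotation over m+1 bits. The function name is hypothetical.

```python
def aop_multiply(a, b, m):
    """Multiply a and b in the field defined by the degree-m all-one polynomial.

    Elements are ints whose bit i is the coefficient of x^i (canonical basis).
    """
    n = m + 1
    mask = (1 << n) - 1
    prod = 0
    for i in range(m):
        if (a >> i) & 1:
            # x^i * b modulo x^(m+1) + 1 is a cyclic rotation of b by i bits.
            prod ^= ((b << i) | (b >> (n - i))) & mask
    if (prod >> m) & 1:
        # x^m = 1 + x + ... + x^(m-1) modulo the all-one polynomial.
        prod ^= mask
    return prod


# With m = 2 the polynomial is x^2 + x + 1, so x * x = x + 1.
square_of_x = aop_multiply(0b10, 0b10, 2)  # 0b11
```

The final reduction is a single conditional XOR with an all-ones word, which hints at why all-one polynomials lead to low-complexity hardware.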


Low-Overhead Fault-Tolerance Technique for a Dynamically Reconfigurable Soft-core Processor
In this paper, we propose a new approach to implement a reliable softcore processor on SRAM-based FPGAs, which can mitigate radiation-induced temporary faults (single-event upsets (SEUs)) at moderate cost. 
A new Enhanced Lockstep scheme built using a pair of MicroBlaze cores is proposed and implemented on a Xilinx Virtex-5 FPGA. Unlike the basic lockstep scheme, ours can detect and eliminate its internal temporary configuration upsets without interrupting normal functioning. 
Faults are detected and eliminated using a Configuration Engine built on the basis of the PicoBlaze core which, to avoid a single point of failure, is implemented as fault-tolerant using triple modular redundancy (TMR). 
A softcore processor can recover from configuration upsets through partial reconfiguration combined with roll-forward recovery. SEUs affecting logic, which are significantly less likely than those affecting configuration, are handled by checkpointing and rollback. 
Finally, to handle permanent faults, the tiling technique is also proposed. The new Enhanced Lockstep scheme requires significantly shorter error recovery time than the conventional lockstep scheme and uses a significantly smaller number of slices than a known TMR-based design (although at the cost of longer error recovery time). The efficiency of the proposed approach was validated through fault injection experiments.
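The two redundancy styles compared above differ in what they can do with a disagreement, which a short model makes visible. This is a behavioral sketch only, with hypothetical function names: a lockstep pair can detect a mismatch, whereas a TMR voter masks a single faulty copy by bitwise majority.

```python
def lockstep_mismatch(out_a, out_b):
    """A lockstep pair can only detect that the two outputs differ."""
    return out_a != out_b


def tmr_vote(a, b, c):
    """Bitwise majority of three redundant outputs (masks one faulty copy)."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010
upset = 0b0110  # copy hit by a hypothetical single-event upset
detected = lockstep_mismatch(good, upset)  # True
voted = tmr_vote(good, good, upset)        # 0b1010
```

Detection alone needs a recovery mechanism to decide which copy is correct, which is the role of the reconfiguration and recovery steps in a lockstep scheme. Voting needs a third copy of the logic instead.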


Low-Power Area-Efficient High-Speed I/O Circuit Techniques
We present a 4-Gb/s I/O circuit that fits in 0.1 mm² of die area, dissipates 90 mW of power, and operates over 1 m of 7-mil 0.5-oz PCB trace in a 0.25-µm CMOS technology. Swing reduction is used in an input-multiplexed transmitter to provide most of the speed advantage of an output-multiplexed architecture with significantly lower power and area. 
A delay-locked loop (DLL) using a supply-regulated inverter delay line gives very low jitter at a fraction of the power of a source-coupled delay line-based DLL. Receiver capacitive offset trimming decreases the minimum resolvable swing to 8 mV, greatly reducing the transmission energy without affecting the performance of the receive amplifier. These circuit techniques enable a high level of I/O integration to relieve the pin bandwidth bottleneck of modern VLSI chips.


Low-Power Digital Signal Processing Using Approximate Adders
Low power is an imperative requirement for portable multimedia devices employing various signal processing algorithms and architectures. In most multimedia applications, human beings can gather useful information from slightly erroneous outputs. Therefore, we do not need to produce exactly correct numerical outputs. 
Previous research in this context exploits error resiliency primarily through voltage overscaling, utilizing algorithmic and architectural techniques to mitigate the resulting errors. In this paper, we propose logic complexity reduction at the transistor level as an alternative approach to take advantage of the relaxation of numerical accuracy. 
We demonstrate this concept by proposing various imprecise or approximate full adder cells with reduced complexity at the transistor level, and utilize them to design approximate multi-bit adders. In addition to the inherent reduction in switched capacitance, our techniques result in significantly shorter critical paths, enabling voltage scaling. 
We design architectures for video and image compression algorithms using the proposed approximate arithmetic units and evaluate them to demonstrate the efficacy of our approach. We also derive simple mathematical models for error and power consumption of these approximate adders. 
Furthermore, we demonstrate the utility of these approximate adders in two digital signal processing architectures (discrete cosine transform and finite impulse response filter) with specific quality constraints. Simulation results indicate up to 69% power savings using the proposed approximate adders, when compared to existing implementations using accurate adders.
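To make the accuracy-for-complexity trade concrete, here is a minimal Python sketch of a different, well-known approximate adder, the lower-part-OR adder. It is not one of the transistor-level cells proposed in the paper; the function name and the `approx_bits` parameter are assumptions chosen for illustration.

```python
def lower_part_or_add(a: int, b: int, width: int = 8, approx_bits: int = 3) -> int:
    """Add two unsigned `width`-bit values, approximating the low bits.

    The lowest `approx_bits` positions use a bitwise OR instead of full
    adders; the remaining upper bits are added exactly.
    """
    low_mask = (1 << approx_bits) - 1
    low = (a | b) & low_mask
    if approx_bits:
        # Carry into the exact part: AND of the top approximated bit pair.
        top = approx_bits - 1
        carry = (a >> top) & (b >> top) & 1
    else:
        carry = 0
    high = (a >> approx_bits) + (b >> approx_bits) + carry
    return ((high << approx_bits) | low) & ((1 << (width + 1)) - 1)
```

With `approx_bits = k` the absolute error of this particular scheme never exceeds 2^(k-1), so the number of approximated bits acts as a direct quality knob.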


Low-Power Digital Signal Processor Architecture for Wireless Sensor Nodes
Radio communication exhibits the highest energy consumption in wireless sensor nodes. Given their limited energy supply from batteries or scavenging, these nodes must trade data communication for on-the-node computation. Currently, they are designed around off-the-shelf low-power microcontrollers. 
But by employing a more appropriate processing element, the energy consumption can be significantly reduced. This paper describes the design and implementation of the newly proposed folded-tree architecture for on-the-node data processing in wireless sensor networks, using parallel prefix operations and data locality in hardware. 
Measurements of the silicon implementation show an improvement of 10-20× in terms of energy as compared to traditional modern micro-controllers found in sensor nodes. 
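The parallel prefix operations mentioned above can be sketched in software as a tree-structured scan. The version below is the common up-sweep/down-sweep formulation that yields exclusive prefix results; whether the folded-tree hardware uses exactly this variant is an assumption, and the function name is hypothetical.

```python
def tree_prefix_scan(values, op=lambda x, y: x + y, identity=0):
    """Exclusive prefix scan using an up-sweep and a down-sweep over a tree."""
    n = len(values)
    assert n and n & (n - 1) == 0, "sketch assumes a power-of-two length"
    tree = list(values)
    # Up-sweep: each internal node accumulates the reduction of its subtree.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            tree[i] = op(tree[i - stride], tree[i])
        stride *= 2
    # Down-sweep: push prefixes from the root back towards the leaves.
    tree[n - 1] = identity
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = tree[i - stride]
            tree[i - stride] = tree[i]
            tree[i] = op(tree[i], left)
        stride //= 2
    return tree
```

All updates within one stride level are independent of each other, which is what lets a tree of processing elements execute them concurrently.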


Low-Power, High-Throughput, and Low-Area Adaptive FIR Filter Based on Distributed Arithmetic
This brief presents a novel pipelined architecture for low-power, high-throughput, and low-area implementation of adaptive filter based on distributed arithmetic (DA). The throughput rate of the proposed design is significantly increased by parallel lookup table (LUT) update and concurrent implementation of filtering and weight-update operations. 
The conventional adder-based shift accumulation for DA-based inner-product computation is replaced by conditional signed carry-save accumulation in order to reduce the sampling period and area complexity. Reduction of power consumption is achieved in the proposed design by using a fast bit clock for carry-save accumulation but a much slower clock for all other operations. 
It involves the same number of multiplexors, a smaller LUT, and nearly half the number of adders compared to the existing DA-based design. From synthesis results, it is found that the proposed design consumes 13% less power and 29% less area-delay product (ADP) than our previous DA-based adaptive filter, on average, for filter lengths N = 16 and 32. Compared to the best of other existing designs, our proposed architecture provides 9.5 times less power and 4.6 times less ADP.
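For readers new to distributed arithmetic, the Python sketch below shows the basic LUT-based inner product with plain shift accumulation and fixed weights. It omits the paper's LUT update and carry-save accumulation; the function names and the 8-bit two's-complement sample format are assumptions.

```python
def build_lut(weights):
    """LUT entry `addr` holds the sum of the weights whose address bit is set."""
    n = len(weights)
    return [sum(w for k, w in enumerate(weights) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(weights, samples, bits=8):
    """Inner product of integer weights with `bits`-bit two's-complement samples."""
    lut = build_lut(weights)
    acc = 0
    for b in range(bits):
        # Bit b of every sample forms the LUT address for this bit plane.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        # The most significant bit plane carries negative weight.
        acc += -term if b == bits - 1 else term
    return acc
```

No multiplier appears anywhere: each bit plane costs one LUT read and one shifted accumulation, which is the property the architecture above builds on.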


Low-Resolution DAC-Driven Linearity Testing of Higher Resolution ADCs Using Polynomial Fitting Measurements
A low-cost linearity test methodology for high-resolution analog-to-digital converters (ADCs) is presented in this paper. Linearity testing of ADCs requires high-precision digital-to-analog conversion (DAC) capability, commonly 3-bit higher resolution than the ADC under test. 
Further, a large number of ADC output data samples must be collected making conventional histogram testing impractical for high-resolution ADCs with 18-24 bit precision. In the proposed test methodology, two low-precision and low-cost DACs are used to generate a high-resolution ADC test stimulus. 
Significant reductions in test cost and test time are achieved by using low-cost instrumentation and by making fewer measurements than required for conventional histogram test. A least-squares-based polynomial fitting approach is used to determine the transfer function of the ADC under test. 
The generated transfer function is used to compute the non-linearity of the ADC accurately. No assumption is made regarding the linearity of the lower precision signal generators (DACs) used in the testing procedure. Software simulations and hardware experiments are performed to validate the proposed test methodology.
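The least-squares fitting step can be illustrated with the simplified Python sketch below. Unlike the proposed method, it assumes the stimulus values are known exactly and normalized to [-1, 1]; the function names, the end-point INL definition, and the normal-equation solver are illustrative choices, not the paper's procedure.

```python
def polyfit_least_squares(xs, ys, degree):
    """Return c[0..degree] minimising sum((y - sum(c[k] * x**k))**2)."""
    m = degree + 1
    # Normal equations A c = b with A[i][j] = sum x^(i+j), b[i] = sum y x^i.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            f = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        s = sum(a[i][k] * coeffs[k] for k in range(i + 1, m))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def inl_from_fit(coeffs, lsb=1.0, points=65):
    """INL in LSBs: deviation of the fitted curve from its end-point line."""
    def f(x):
        return sum(c * x ** k for k, c in enumerate(coeffs))
    lo, hi = f(-1.0), f(1.0)
    inl = []
    for n in range(points):
        x = -1.0 + 2.0 * n / (points - 1)
        ideal = lo + (hi - lo) * (x + 1.0) / 2.0
        inl.append((f(x) - ideal) / lsb)
    return inl
```

Fitting a low-order polynomial needs far fewer samples than a code-density histogram, which is where the test-time saving described above comes from.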


MDC FFT/IFFT Processor With Variable Length for MIMO-OFDM Systems
This paper presents a multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. 
Based on the MDC architecture, we propose to use radix-N_s butterflies at each stage, where N_s is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. 
Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let N_s = 4 and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. 
This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented in a UMC 90-nm CMOS technology with a core area of 3.1 mm². The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively, in the post-layout simulation. 
Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption. 


Memory efficient high-Speed convolution-based generic structure for multilevel 2D DWT
In this paper, we have proposed a design strategy for the derivation of memory-efficient architecture for multilevel 2-D DWT. Using the proposed design scheme, we have derived a convolution-based generic architecture for the computation of three-level 2-D DWT based on Daubechies (Daub) as well as biorthogonal filters. The proposed structure does not involve frame-buffer. 
It involves line-buffers of size 3(K-2)M/4 which is independent of throughput-rate, where K is the order of Daubechies/biorthogonal wavelet filter and M is the image height. This is a major advantage when the structure is implemented for higher throughput. The structure has regular data-flow, small cycle period TM and 100% hardware utilization efficiency. 
As per theoretical estimate, for image size 512 × 512, the proposed structure for Daub-4 filter requires 152 more multipliers and 114 more adders, but involves 82 412 fewer memory words and takes 10.5 times less time to compute three-level 2-D DWT than the best of the existing convolution-based folded structures. 
Similarly, compared with the best of the existing lifting-based folded structures, the proposed structure for the 9/7 filter involves 93 more multipliers and 166 more adders, but uses 85 317 fewer memory words and requires 2.625 times less computation time for the same image size. 
It involves 90 (nearly 47.6%) more multipliers and 118 (nearly 40.1%) more adders, but requires 2723 fewer memory words than the recently proposed parallel structure and performs the computation in nearly half the time. In spite of having more arithmetic components than the lifting-based structures, the proposed structure offers significant savings in area and power due to the substantial reduction in memory size and the smaller clock period. 
ASIC synthesis results show that the proposed structure for Daub-4 involves 1.7 times less area-delay product (ADP) and consumes 1.21 times less energy per image (EPI) than the corresponding best available convolution-based structure. It involves 2.6 times less ADP and consumes 1.48 times less EPI than the parallel lifting-based structure. 


Modified Gradient Search for Level Set Based Image Segmentation
Level set methods are a popular way to solve the image segmentation problem. The solution contour is found by solving an optimization problem where a cost functional is minimized. Gradient descent methods are often used to solve this optimization problem since they are very easy to implement and applicable to general nonconvex functionals. They are, however, sensitive to local minima and often display slow convergence. 
Traditionally, cost functionals have been modified to avoid these problems. In this paper, we instead propose using two modified gradient descent methods, one using a momentum term and one based on resilient propagation. These methods are commonly used in the machine learning community. 
In a series of 2-D/3-D experiments using real and synthetic data with ground truth, the modifications are shown to reduce the sensitivity to local optima and to increase the convergence rate. 
The parameter sensitivity is also investigated. The proposed methods are very simple modifications of the basic method, and are directly compatible with any type of level set implementation. Downloadable reference code with examples is available online.
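The two modified searches can be sketched on a scalar toy problem in Python. This is not the authors' reference code: the function names, default parameters, and the simplified Rprop variant (no weight backtracking) are assumptions, and a real level set implementation would apply the same updates to every grid point of the level set function.

```python
def gd_momentum(grad, x0, step=0.1, momentum=0.8, iters=200):
    """Gradient descent where each update keeps part of the previous one."""
    x, v = x0, 0.0
    for _ in range(iters):
        v = momentum * v - step * grad(x)
        x += v
    return x


def rprop(grad, x0, step0=0.1, eta_plus=1.2, eta_minus=0.5,
          step_min=1e-6, step_max=1.0, iters=200):
    """Resilient propagation: only the sign of the gradient is used."""
    x, step, prev_g = x0, step0, 0.0
    for _ in range(iters):
        g = grad(x)
        if g * prev_g > 0:        # same direction as before: accelerate
            step = min(step * eta_plus, step_max)
        elif g * prev_g < 0:      # sign flipped: a minimum was overshot
            step = max(step * eta_minus, step_min)
        if g > 0:
            x -= step
        elif g < 0:
            x += step
        prev_g = g
    return x
```

Because Rprop ignores the gradient magnitude, its step size is unaffected by flat regions where plain gradient descent slows down.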


Multicarrier Systems based on Multistage Layered IFFT Structure
This letter extends our previous work on layered inverse Fast Fourier Transform (IFFT) structure to a multistage layered IFFT structure where data symbols can be input at different stages of the IFFT. 
We first show that part of the IFFT in the transmitter of an OFDM system can be shifted to the receiver, while a conventional one-tap frequency-domain equalizer is still applicable. 
We then propose two IFFT split schemes based on decimation-in-time and decimation-in-frequency IFFT algorithms to enable interference-free symbol recovery with simple linear equalizers. Applications of the proposed schemes in multiple access communications are investigated. 
Simulation results demonstrate the effectiveness of the proposed schemes in improving bit-error-rate performance. 


Multivoltage Aware Resistive Open Fault Model
Resistive open faults (ROFs) represent common interconnect manufacturing defects in VLSI designs causing delay failures and reliability-related concerns. The widespread utilization of multiple supply voltages in contemporary VLSI designs and emerging test methods poses a critical concern as to whether conventional models for resistive opens will still be effective. 
Conventional models do not explicitly model the VDD effect on fault behavior and detectability. We have empirically observed that a sensitized ROF could exhibit multiple behaviors across its resistance continuum. We also observe that the detectable resistance range versus VDD varies with test speed. 
We consequently propose a voltage-aware model that divides the full range of open resistances into continuous behavioral intervals and three detectability ranges. The presented model is expected to substantially enhance multivoltage test generation and fault distinction.


Optical Flow Estimation for Flame Detection in Videos
Computational vision-based flame detection has drawn significant attention in the past decade with camera surveillance systems becoming ubiquitous. Whereas many discriminating features, such as color, shape, texture, etc., have been employed in the literature, this paper proposes a set of motion features based on motion estimators. 
The key idea consists of exploiting the difference between the turbulent, fast, fire motion, and the structured, rigid motion of other objects. Since classical optical flow methods do not model the characteristics of fire motion (e.g., non-smoothness of motion, non-constancy of intensity), two optical flow methods are specifically designed for the fire detection task: optimal mass transport models fire with dynamic texture, while a data-driven optical flow scheme models saturated flames. Then, characteristic features related to the flow magnitudes and directions are computed from the flow fields to discriminate between fire and non-fire motion. 
The proposed features are tested on a large video database to demonstrate their practical usefulness. Moreover, a novel evaluation method is proposed by fire simulations that allow for a controlled environment to analyze parameter influences, such as flame saturation, spatial resolution, frame rate, and random noise.


Oscillation and Transition Tests for Synchronous Sequential Circuits
In this brief, we propose an oscillation-ring test methodology for synchronous sequential circuits under the scan test environment. 
This approach provides the following features: 1) it is at-speed testing, which makes delay defects detectable; 2) the automatic test pattern generation is much easier, and the test set is usually smaller; and 3) test responses are directly observable at primary outputs and, thus, it greatly reduces the communication bandwidth between the automatic test equipment and the circuit under test. 
A modified scan register design supporting the oscillation-ring test is presented and an effective oscillation test generation algorithm for the proposed test scheme is given. Experimental results on LGSyn91 benchmarks show that the proposed test method achieves high fault coverage with a smaller number of test vectors.


Parallel AES Encryption Engines for Many-Core Processor Arrays
By exploring different granularities of data-level and task-level parallelism, we map 16 implementations of an Advanced Encryption Standard (AES) cipher with both online and offline key expansion on a fine-grained many-core system. 
The smallest design utilizes only six cores for offline key expansion and eight cores for online key expansion, while the largest requires 107 and 137 cores, respectively. 
In comparison with published AES cipher implementations on general purpose processors, our design has 3.5-15.6 times higher throughput per unit of chip area and 8.2-18.1 times higher energy efficiency. 
Moreover, the design shows 2.0 times higher throughput than the TI DSP C6201, and 3.3 times higher throughput per unit of chip area and 2.9 times higher energy efficiency than the GeForce 8800 GTX.


Performance Analysis of a New CMOS Output Buffer
A new CMOS output buffer with low switching noise and load adaptability is presented in this paper. The proposed circuit consists of two stages; the first stage is designed to reduce switching noise, static power dissipation, and output ringing. 
The second stage provides sufficient speed and full dynamic range. The performance of the proposed circuit is examined using Cadence and the model parameters of a 180 nm CMOS process. 
The simulation results confirm that the proposed output buffer reduces propagation delay compared with previous designs. The topology exhibits low sensitivities and has features suitable for VLSI implementation. 


Performance Evaluation of FFT Processor Using Conventional and Vedic Algorithm
Recently, digital signal processing has received considerable attention due to advances in multimedia and wireless communication. Accordingly, the Orthogonal Frequency Division Multiplexing (OFDM) technique based on Time Division Duplex (TDD) is an attractive technology for high-data-rate wireless access in multichannel communication. 
The modulation and demodulation of OFDM are done by the Inverse Fast Fourier Transform (IFFT) and the Fast Fourier Transform (FFT), respectively. In this paper we propose a Vedic algorithm for implementing the multiplier to be used in a radix-2^5 512-point FFT processor. 
Multipliers based on Vedic mathematics are among the fastest and lowest-power multipliers. The approach enables parallel generation of partial products and eliminates unwanted multiplication steps. Thus Vedic multipliers ensure a substantial reduction of propagation delay in the FFT processor. 
The FFT processor employing the Vedic multiplier reduces hardware complexity in terms of area and power in FPGA implementation. The proposed processor has been designed in Xilinx and implemented using a Spartan 3E FPGA kit with a supply voltage of 1.2 V. The delay and power obtained using the Vedic multiplier are 173.60 ns and 11×10^-2 W, respectively. 
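The parallel partial-product generation can be sketched in Python using the Urdhva Tiryagbhyam ("vertically and crosswise") scheme, which is the sutra usually meant by a Vedic multiplier. The abstract does not name the sutra, so that choice, the function name, and the 8-bit default width are assumptions.

```python
def urdhva_multiply(a: int, b: int, width: int = 8) -> int:
    """Multiply two unsigned `width`-bit integers column by column."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    # Column k collects every cross product a_i * b_j with i + j == k.
    # The columns are mutually independent, so hardware can form them in parallel.
    columns = [sum(a_bits[i] & b_bits[k - i]
                   for i in range(width) if 0 <= k - i < width)
               for k in range(2 * width - 1)]
    # Resolve the column sums into result bits, propagating carries upward.
    result, carry = 0, 0
    for k, column in enumerate(columns):
        total = column + carry
        result |= (total & 1) << k
        carry = total >> 1
    return result | (carry << (2 * width - 1))
```

In software the final loop is sequential, but in hardware only this carry resolution lies on the critical path.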


Pipelined Radix-2^k Feedforward FFT Architectures
The focus of this study is on a family of hybrid architectures for feed-forward multi-layer neural networks and issues that arise in their design. The main objective in the design of this family has been to reduce the complexity of hardware, and hence make possible the implementation of larger networks for practical applications, by two main ideas: trading time for circuit complexity by a multiplexing scheme and a modular characteristic that allows multi-chip realizations without a prohibitive number of interconnections. 
In this paper, we propose to bring the various forms of this architecture together, which are at this time scattered in the literature. After presenting the main points in its operation, we will proceed to permutations and trade-offs, some of which have not been published in accessible literature so far. 
We start with the introduction of the basic architecture. We then present modifications and discuss some I/O issues. Matching neural transfer characteristics is important to the performance of the system and we address this problem with a set of second order improvements. 
Another version of the architecture, with external weight memory, is introduced which allows interaction with a host computer, and finally, a pipelined version of the architecture is presented that improves system speed with a small increment in overall complexity.


Power-Planning-Aware Soft Error Hardening via Selective Voltage Assignment
Soft errors, which have been a significant concern in memories, are now a main factor in reliability degradation of logic circuits. This paper presents a power-planning-aware methodology using dual supply voltages for soft error hardening. 
Given a constraint on power overhead, our proposed framework can minimize the soft error rate (SER) of a circuit via selective voltage assignment. In the 70-nm predictive technology model, circuit SER can be reduced by 23% on top of SER-aware gate resizing. For power-planning awareness, a bi-partitioning technique based on a simplified version of the Fiduccia-Mattheyses (FM) algorithm is presented. 
The simplified FM-based partitioning refines the result of selective voltage assignment by decreasing the number of connections across voltage islands, while maintaining the SER reduction that has been accomplished.


Prototype of a Fingerprint Based Licensing System For Driving
To prevent non-licensees from driving and therefore causing accidents, a new system is proposed. Fingerprint identification is one of the most popular and reliable personal biometric identification methods. 
The proposed system consists of a smart card capable of storing the fingerprint of a particular person. When the license is issued, that person's fingerprint is stored in the card. Vehicles such as cars and bikes should have a card reader capable of reading the license. 
The same vehicle should also have a fingerprint reader. A person who wishes to drive the vehicle should insert the card (license) in the vehicle and then swipe his/her finger. If the fingerprint stored in the card and the fingerprint swiped on the device match, he/she can proceed to ignition; otherwise the ignition will not work. 
Moreover, a seat belt detector checks and then prompts the user to wear the seat belt before driving. This increases the security of vehicles and also ensures safe driving by preventing accidents. 


RATS: Restoration-Aware Trace Signal Selection for Post-Silicon Validation
Post-silicon validation is one of the most important and expensive tasks in modern integrated circuit design methodology. The primary problem governing post-silicon validation is the limited observability due to storage of a small number of signals in a trace buffer. The signals to be traced should be carefully selected in order to maximize restoration of the remaining signals. 
Existing approaches have two major drawbacks. They depend on partial restorability computations that are not effective in restoring maximum signal states. They also require long signal selection time due to inefficient computation as well as operating on gate-level netlist. 
We have proposed a signal selection approach based on total restorability at gate-level, which is computationally more efficient (10 times faster) and can restore up to three times more signals compared to existing methods. We have also developed a register transfer level signal selection approach, which reduces both memory requirements and signal selection time by several orders-of-magnitude. 


Real Time Communication between Multiple FPGA Systems in Multitasking Environment Using RTOS
The recent development of Field-Programmable Gate Array (FPGA) architectures, with soft core (MicroBlaze) and hard core (PowerPC) processors, embedded memories and IP cores, offers the potential for high computing power. 
Presently, FPGAs are considered a major platform for high-performance embedded applications, as they provide the opportunity for reconfiguration as well as good clock speed and design resources. As the complexity of embedded applications increases, the use of an operating system brings many advantages. In present-day application scenarios, most embedded systems have real-time requirements that demand the use of real-time operating systems (RTOS), which create a suitable environment for real-time applications to be designed and expanded easily. 
In an RTOS the design process is simplified by splitting the application code into separate tasks and then the scheduler executes them according to a specific schedule, meeting the real-time deadline. In this research work, we propose the design and implementation of a real-time FPGA based application, which demonstrates the creation of real-time process tasks in FPGA systems for successful real-time communication between multiple FPGA systems. 
We have chosen the RSA-based encryption and decryption algorithm for this implementation, as security is one of the most important needs for data communication. At first we demonstrate the real-time execution of multiple process tasks in a single FPGA system for the encryption and decryption of data. 
Next we describe the most challenging part of our work, where we establish real-time communication between two FPGA systems, one running the encryption engine and the other the decryption engine, communicating with one another via an RS232 communication link. The results show that our design is better in terms of execution speed in comparison with the existing research works.


Reconfigurable Processor for Binary Image Processing
Binary image processing is a powerful tool in many image and video applications. A reconfigurable processor is presented for binary image processing in this paper. The processor's architecture is a combination of a reconfigurable binary processing module, input and output image control units, and peripheral circuits. 
The reconfigurable binary processing module, which consists of mixed-grained reconfigurable binary compute units and output control logic, performs binary image processing operations, especially mathematical morphology operations, and implements related algorithms at more than 200 f/s for a 1024 × 1024 image. 
The peripheral circuits control the whole image processing and dynamic reconfiguration process. The processor is implemented on an EP2S180 field-programmable gate array. 
Synthesis results show that the presented processor can deliver 60.72 GOPS and 23.72 GOPS/mm² at a 220-MHz system clock in the SMIC 0.18-µm CMOS process. The simulation and experimental results demonstrate that the processor is suitable for real-time binary image processing applications.


Reduced-Complexity LCC Reed–Solomon Decoder Based on Unified Syndrome Computation
Reed-Solomon (RS) codes are widely used in digital communication and storage systems. Algebraic soft-decision decoding (ASD) of RS codes can obtain significant coding gain over the hard-decision decoding (HDD). Compared with other ASD algorithms, the low-complexity Chase (LCC) decoding algorithm needs less computation complexity with similar or higher coding gain. 
Besides employing a complicated interpolation algorithm, the LCC decoding can also be implemented based on the HDD. However, the previous syndrome computation for 2^η test vectors and the key equation solver (KES) in the HDD require long latency and considerable hardware. 
In this brief, a unified syndrome computation algorithm and the corresponding architecture are proposed. Cooperating with the KES in the reduced inversion-free Berlekamp-Massey algorithm, the reduced-complexity LCC RS decoder can be sped up by 57% and its area reduced to 62% compared with the original design for η = 3. 


Reducing the Cost of Implementing Error Correction Codes in Content Addressable Memories
Reliability is a major concern for memories. To ensure that errors do not affect the data stored in a memory, error correction codes (ECCs) are widely used in memories. ECCs introduce an overhead as some bits are added to each word to detect and correct errors. This increases the cost of the memory. 
Content addressable memories (CAMs) are a special type of memories in which the input is compared with the data stored, and if a match is found, the output is the address of that word. CAMs are used in many computing and networking applications. In this brief, the specific features of CAMs are used to reduce the cost of implementing ECCs. More precisely, the proposed technique eliminates the need to store the ECC bits for each word in the memory. 
This is done by embedding those bits into the address of the key. The main potential issue of the new scheme is that it restricts the addresses in which a new key can be stored. Therefore, it can occur that a new key cannot be added into the CAM when there are addresses that are not used. 
This issue is analyzed and evaluated showing that, for large CAMs, it would only occur when the CAM occupancy is close to 100%. Therefore, the proposed scheme can be used to effectively reduce the cost of implementing ECCs in CAMs.
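The address restriction can be sketched in Python as follows. This is a toy model under stated assumptions: `check_bits` is a placeholder parity function rather than a real ECC, the class and method names are hypothetical, and the rule that the low address bits must equal the check bits is one possible embedding, not necessarily the paper's.

```python
def check_bits(key: int, p: int = 2) -> int:
    """Placeholder p-bit check value: parity over key bits j, j+p, j+2p, ..."""
    value = 0
    for j in range(p):
        parity, k = 0, key >> j
        while k:
            parity ^= k & 1
            k >>= p
        value |= parity << j
    return value


class EccAddressCam:
    """Toy CAM that stores no check bits: the address itself carries them."""

    def __init__(self, size: int = 16, p: int = 2):
        self.p = p
        self.slots = [None] * size

    def insert(self, key: int):
        """Store `key` at a free address whose low p bits equal its check bits."""
        c = check_bits(key, self.p)
        for addr in range(c, len(self.slots), 1 << self.p):
            if self.slots[addr] is None:
                self.slots[addr] = key
                return addr
        return None  # rejected, even if other addresses are still free

    def is_consistent(self, addr: int) -> bool:
        """Check the stored word against the check bits implied by its address."""
        stored = self.slots[addr]
        low = addr & ((1 << self.p) - 1)
        return stored is not None and low == check_bits(stored, self.p)
```

The `None` return from `insert` is the occupancy issue discussed above: a key can be refused while unrelated addresses remain empty, which becomes likely only when few slots are left.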


Reduction of Leakage Current and Power in Full Subtractor Using MTCMOS Technique
In this paper, a full subtractor design using the MTCMOS technique is proposed. Combinational logic has extensive applications in quantum computing, low power VLSI design and optical computing. Reducing power dissipation is one of the principal concerns in VLSI design today. 
However, scaling causes subthreshold leakage currents to become a large component of total power dissipation. Low-power design techniques have been proposed to minimize the active leakage power in nanoscale CMOS very large scale integration (VLSI) systems. 
Using the MTCMOS approach, we compare the leakage current and leakage power of the full subtractor in active mode. The leakage current is 271.1 fA in the conventional full subtractor and 228.7 fA in the proposed full subtractor, a reduction of 15.63%. Simulation is performed at 0.7 V using the Cadence Virtuoso tool in 45-nm technology.


Reverse Circle Cipher for Personal and Network Security
Many data encryption techniques have been employed to ensure both personal data security and network security. But few have been successful in merging both under one roof. The block cipher techniques commonly used for personal security such as DES and AES run multiple passes over each block making them ineffective for real time data transfer. 
Also, ciphers for network security such as Diffie-Hellman and RSA require a large number of bits. This paper suggests a simple block cipher scheme to effectively reduce both time and space complexities and still provide adequate security for both security domains. 
The proposed Reverse Circle Cipher uses `circular substitution' and `reversal transposition' to exploit the benefits of both confusion and diffusion. This scheme uses an arbitrarily variable key length which may even be equal to the length of the plaintext or as small as a few bits coupled with an arbitrary reversal factor. 
This method of encryption can be utilized within stand alone systems for personal data security or even streamed into real time packet transfer for network security. This paper also analyses the effectiveness of the algorithm with respect to the size of the plaintext and frequency distribution within the ciphertext.


RFID-based Location System for Forest Search and Rescue Missions
This paper presents the framework of an RFID-based rescue robot for locating missing people in forest environments. The three main design considerations are reliability, cost, and environmental sustainability. 
To that end, the paper analyzes different outdoor location technologies that can be used for this task, namely GPS, WiMAX and RFID. Former research in mobile robots based on GPS and WiMAX has resulted in high-cost systems, while RFID offers a lower-cost alternative. 
The aim of this work is to provide an already existing mobile robot with RFID technology. Moreover, to overcome the inability of current rescue robots to detect human presence, the addition of an Infrared camera with thermal sensors is discussed. 
Finally, in order to optimize the energy management and to increase the autonomy of the rescue robot, the paper presents a power supply solution using solar energy.


Split-SAR ADCs: Improved Linearity With Power and Speed Optimization
This paper presents the linearity analysis of a successive approximation register (SAR) analog-to-digital converter (ADC) with split DAC structure based on two switching methods: conventional charge-redistribution and Vcm-based switching. 
The static linearity performance, namely the integral nonlinearity and differential nonlinearity, as well as the parasitic effects of the split DAC, are analyzed hereunder. 
In addition, a code-randomized calibration technique is proposed to correct the conversion nonlinearity in the conventional SAR ADC, which is verified by behavioral simulations as well as measured results. Performances of both switching methods are demonstrated in 90 nm CMOS. Measurement results of power, speed, and linearity clearly show the benefits of using Vcm-based switching.


Spur-Reduction Frequency Synthesizer Exploiting Randomly Selected PFD
This brief presents a low-spur phase-locked loop (PLL) system for wireless applications. The low-spur frequency synthesizer randomizes the periodic ripples on the control voltage of the voltage-controlled oscillator to reduce the reference spur at the output of the PLL. 
A novel random clock generator is presented to perform the random selection of the phase frequency detector control for the charge pump in locked state. The proposed frequency synthesizer was fabricated in a TSMC 0.18-µm CMOS process. The proposed PLL achieved phase noise of -93 dBc/Hz with a 600-kHz offset frequency and reference spurs below -72 dBc.


Static Power Reduction Using Variation-Tolerant and Reconfigurable Multi-Mode Power Switches
Multithreshold CMOS is very effective for reducing standby leakage power during long periods of inactivity. Recently, a power-gating scheme was presented to support multiple power-off modes and reduce the leakage power during short periods of inactivity. However, this scheme can suffer from high sensitivity to process variations, which impedes manufacturability. 
We propose a new power-gating technique that is tolerant to process variations and scalable to more than two intermediate power-off modes. The proposed design requires less design effort and offers greater power reduction and smaller area cost than the previous method. 
In addition, it can be combined with existing techniques to offer further static power reduction benefits. Analysis and extensive simulation results demonstrate the effectiveness of the proposed design.


Teaching HW/SW Co-Design with a Public Key Cryptography Application
This paper describes a lab session-based course on hardware/software (HW/SW) co-design. Real problems often need to combine the speed of an HW solution with the flexibility of an SW solution. 
The goals of this course are to show that there are many alternative solutions in the design space and to teach the fundamental concepts of HW/SW co-design. The sample application for the course project is a basic public key (RSA) application. This application is attractive for pedagogic purposes because its complex arithmetic and large word lengths make it difficult to realize in SW on an embedded microcontroller. 
However, the alternative of a pure application-specific integrated circuit (ASIC) application is also not a satisfactory solution, as this lacks the flexibility to support multiple public key applications. 
The project follows a stepwise approach, with assignments that build on each other. Students are required to make their own decisions as to the partitioning between HW and SW, the interface design, and the optimization goals. Besides imparting hard skills in HW design and embedded SW design, the course inculcates several soft skills (in particular, decision making, presentation skills, teamwork, and design creativity) that are generally overlooked in engineering.
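The core arithmetic of an RSA application is modular exponentiation, which is the operation that becomes costly at large word lengths. The sketch below shows the classic left-to-right square-and-multiply method in Python. It is an illustration only: the function name and the toy key in the usage note are assumptions, and the abstract does not say how the course partitions this loop between HW and SW.

```python
def mod_exp(base, exponent, modulus):
    """Left-to-right square-and-multiply: base**exponent mod modulus."""
    if modulus <= 0 or exponent < 0:
        raise ValueError("modulus must be positive, exponent non-negative")
    result = 1 % modulus
    base %= modulus
    # Scan the exponent bits from most significant to least significant.
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus    # always square
        if bit == "1":
            result = (result * base) % modulus  # multiply on a 1 bit
    return result
```

With the toy key n = 3233, e = 17, d = 2753 (far too small for real use), `mod_exp(mod_exp(65, 17, 3233), 2753, 3233)` encrypts and then decrypts the message 65. In a co-design setting, the modular multiplication inside the loop is the natural candidate for a hardware accelerator, while the bit-scanning control can stay in software.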


Test Patterns of Multiple SIC Vectors: Theory and Application in BIST Schemes
This paper proposes a novel test pattern generator (TPG) for built-in self-test. Our method generates multiple single-input change (MSIC) vectors in a pattern, i.e., each vector applied to a scan chain is an SIC vector. A reconfigurable Johnson counter and a scalable SIC counter are developed to generate a class of minimum transition sequences. 
The proposed TPG is flexible to both the test-per-clock and the test-per-scan schemes. A theory is also developed to represent and analyze the sequences and to extract a class of MSIC sequences. Analysis results show that the produced MSIC sequences have the favorable features of uniform distribution and low input transition density. 
The performances of the designed TPGs and the circuits under test are evaluated in 45 nm technology. Simulation results with ISCAS benchmarks demonstrate that MSIC can save test power and impose no more than 7.5% overhead for a scan design. It also achieves the target fault coverage without increasing the test length. 
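To illustrate why a Johnson counter is a natural source of single-input-change vectors, the sketch below generates the states of a plain Johnson (twisted-ring) counter in Python and measures the Hamming distance between states. This is not the paper's reconfigurable Johnson counter or its MSIC construction; the function names are hypothetical.

```python
def johnson_sequence(width):
    """Yield the 2*width states of a Johnson counter as bit tuples."""
    if width < 1:
        raise ValueError("width must be at least 1")
    state = [0] * width
    for _ in range(2 * width):
        yield tuple(state)
        # Shift by one position and feed back the inverted last bit.
        state = [1 - state[-1]] + state[:-1]


def hamming(a, b):
    """Number of bit positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))
```

Every pair of consecutive states differs in exactly one bit, including the wrap-around from the last state to the first, which is the property that keeps input transition density low.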


The LUT-SR Family of Uniform Random Number Generators for FPGA Architectures
Field-programmable gate array (FPGA) optimized random number generators (RNGs) are more resource-efficient than software-optimized RNGs because they can take advantage of bitwise operations and FPGA-specific features. However, it is difficult to concisely describe FPGA-optimized RNGs, so they are not commonly used in real-world designs. 
This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes advantage of bitwise xor operations and the ability to turn lookup tables (LUTs) into shift registers of varying lengths. This provides a good resource–quality balance compared to previous FPGA-optimized generators, between the previous high-resource high-period LUT-FIFO RNGs and low-resource low-quality LUT-OPT RNGs, with quality comparable to the best software generators. 
The LUT-SR generators can also be expressed using a simple C++ algorithm contained within this paper, allowing 60 fully-specified LUT-SR RNGs with different characteristics to be embedded in this paper, backed up by an online set of very high speed integrated circuit hardware description language (VHDL) generators and test benches.
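The paper's own C++ algorithm is not reproduced here. As a rough illustration of the broader family of XOR-feedback shift-register generators that such designs belong to, the sketch below implements a textbook 16-bit Fibonacci LFSR in Python. The tap positions (16, 14, 13, 11) and the function names are assumptions chosen for the example and have no connection to the LUT-SR parameter sets.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    # XOR the tapped bits to form the feedback bit.
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    # Shift right and insert the feedback bit at the top.
    return (state >> 1) | (bit << 15)


def lfsr16_bits(seed, count):
    """Return `count` output bits starting from a nonzero 16-bit seed."""
    if not 0 < seed < (1 << 16):
        raise ValueError("seed must be a nonzero 16-bit value")
    state = seed
    out = []
    for _ in range(count):
        out.append(state & 1)
        state = lfsr16_step(state)
    return out
```

A generator this small has far weaker statistical quality than the RNGs discussed in the abstract; it only shows how XOR feedback and shifting combine into a linear recurrence.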


The Security Technology and Tendency of New Energy Vehicle in Future
Enhancing vehicle security is one of the themes for future new energy vehicles. This article introduces the active and passive safety technologies of future vehicles, analyzes the development of new vehicle technologies, and explains the development direction of future vehicle security technology. 
Integrated, intelligent, systematic vehicles with a high safety index are an irresistible trend for new energy vehicles.


Theoretical Modeling of Elliptic Curve Scalar Multiplier on LUT-Based FPGAs for Area and Speed
This paper uses a theoretical model to approximate the delay of different characteristic-two primitives used in an elliptic curve scalar multiplier architecture (ECSMA) implemented on k-input lookup table (LUT)-based field-programmable gate arrays. Approximations are used to determine the delay of the critical paths in the ECSMA. 
This is then used to theoretically estimate the optimal number of pipeline stages and the ideal placement of each stage in the ECSMA. This paper illustrates suitable scheduling for performing point addition and doubling in a pipelined data path of the ECSMA. 
Finally, detailed analyses, supported with experimental results, are provided to design the fastest scalar multiplier over generic curves. Experimental results for GF(2^163) show that, when the ECSMA is suitably pipelined, the scalar multiplication can be performed in only 9.5 µs on a Xilinx Virtex V. 
Notably, the design has an area that is significantly smaller than other reported high-speed designs, owing to the better LUT utilization of the underlying field primitives.
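The field multiplier is one of the characteristic-two primitives that such an architecture is built from. The Python sketch below shows a bit-serial shift-and-add multiplication in GF(2^163), with elements stored as integers in polynomial basis. The reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 is the one used by the NIST 163-bit binary curves and is an assumption here, since the abstract does not name one. The function name is hypothetical, and the loop is a software reference model, not the pipelined hardware design.

```python
M = 163
# Assumed reduction polynomial: x^163 + x^7 + x^6 + x^3 + 1.
REDUCTION = (1 << M) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2m_mul(a, b):
    """Multiply two GF(2^163) elements given as integers below 2**163."""
    if not (0 <= a < (1 << M) and 0 <= b < (1 << M)):
        raise ValueError("operands must be reduced field elements")
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> M:           # degree reached M, so reduce
            a ^= REDUCTION
    return result
```

For instance, multiplying x^162 by x wraps around through the reduction polynomial and gives x^7 + x^6 + x^3 + 1, which is `0xC9`.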


Time-Based All-Digital Technique for Analog Built-in Self-Test
A scheme for built-in self-test of analog signals with minimal area overhead for measuring on-chip voltages in an all-digital manner is presented. The method is well suited for a distributed architecture, where the routing of analog signals over long paths is minimized. A clock is routed serially to the sampling heads placed at the nodes of analog test voltages. 
This sampling head present at each test node, which consists of a pair of delay cells and a pair of flip-flops, locally converts the test voltage to a skew between a pair of subsampled signals, thus giving rise to as many subsampled signal pairs as the number of nodes. 
To measure a certain analog voltage, the corresponding subsampled signal pair is fed to a delay measurement unit to measure the skew between this pair. The concept is validated by designing a test chip in a UMC 130-nm CMOS process. Sub-millivolt accuracy for static signals is demonstrated for a measurement time of a few seconds, and an effective number of bits of 5.29 is demonstrated for low-bandwidth signals in the absence of sample-and-hold circuitry.


Two-Tone Phase Delay Control of Center Frequency and Bandwidth in Low-Noise-Amplifier RF Front Ends
This brief presents a two-tone system for controlling the center frequency and bandwidth of an RLC tank with an application to center frequency and bandwidth control in low noise amplifier (LNA) RF front ends. 
The circuit operates based on the fact that an RLC tank induces a phase difference with special properties between two frequencies. The system is demonstrated in hardware in the TSMC CMOS 0.18-µm process for a center frequency of 2.45 GHz and a bandwidth of 60 MHz. 
The LNA center frequency can be controlled with a precision of ~0.2% while the bandwidth can be controlled with a precision of ~8%. The tuning time is 3 µs multiplied by the number of tuning states. The tuning states are the circuit states set digitally and analyzed until the desired operating point is achieved. 


Unique Measurement and Modeling of Total Phase Noise in RF Receiver
Radio frequency (RF) receivers are common in many modern communications and radar systems, and they suffer from many performance degradation factors due to hardware limitations. Among all performance degradation contributors, phase noise and time jitter are particularly troublesome since they cause random errors which are difficult to compensate. 
The local oscillator in the receiver front end is a major contributor of phase noise, while the analog-to-digital converter (ADC) introduces time jitter. It is desired to know the accumulated effect of individual phase noise sources and time jitter. The total effect of all phase noise and jitter can be represented by an accumulated phase noise term at the ADC's output, called total phase noise (TPN) in this brief. 
The focus of this work is on measuring and modeling TPN in the RF receiver by applying optimization techniques. In contrast to traditional phase noise measurement that typically requires a high-quality tunable downconverter, a digital approach using the data captured directly by the RF receiver is proposed. 
In addition, iterative optimization-based TPN spectral model fitting and statistical modeling are introduced. The model is examined using the measured TPN. It is confirmed that the RF receiver TPN can be viewed as a wide-sense stationary zero-mean Gaussian process with a certain spectral profile.


VLSI Implementation of a Multi-Mode Turbo/LDPC Decoder Architecture
Flexible and reconfigurable architectures have gained wide popularity in the communications field. In particular, reconfigurable architectures for the physical layer are an attractive solution not only to switch among different coding modes but also to achieve interoperability. 
This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC code decoding. The novel contributions of this paper are: i) tackling the reconfiguration issue by introducing a formal and systematic treatment that, to the best of our knowledge, was not previously addressed and ii) proposing a reconfigurable NoC-based turbo/LDPC decoder architecture and showing that wide flexibility can be achieved with a small complexity overhead. 
The results show that dynamic switching between most of the considered communication standards is possible without pausing the decoding activity. Moreover, post-layout results show that tailoring the proposed architecture to the WiMAX standard leads to an area occupation of 2.75 mm² and a power consumption of 101.5 mW in the worst case.


WLS Design of Sparse FIR Digital Filters
In this paper, we propose a novel algorithm for sparse finite impulse response (FIR) filter designs. The objective of the sparse digital filter design problem considered in this paper is to reduce the number of nonzero-valued filter coefficients, subject to a weighted least-squares (WLS) approximation error constraint imposed on the frequency domain. 
The proposed design method is inspired by the iterative shrinkage/thresholding (IST) algorithms, which are used in sparse and redundant representation for signals. The basic idea of the proposed design algorithm is to successively transform the original nonconvex problem to a series of constrained subproblems in a simpler form. 
Despite their nonconvexity, these subproblems can be efficiently and reliably solved in each iterative step by a numerical approach developed in this paper. 
Furthermore, it can be demonstrated that the obtained solutions are essentially optimal to their respective subproblems. Since its major part only involves scalar operations, the proposed algorithm is computationally efficient. Three sets of numerical examples are presented in this paper to illustrate the effectiveness of the proposed design algorithm.
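The scalar shrinkage step at the heart of iterative shrinkage/thresholding methods can be sketched in a few lines of Python. This is the generic soft-thresholding operator, not the subproblem solver developed in the paper; the function names, the threshold, and the coefficient values in the usage note are assumptions for illustration.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def shrink_coefficients(h, t):
    """Apply soft thresholding to every filter coefficient in h."""
    return [soft_threshold(c, t) for c in h]
```

Applied to the coefficients `[0.5, -0.02, 0.01, -0.3]` with a threshold of 0.05, the two small taps become exactly zero while the large taps shrink slightly. Only scalar operations are involved, which matches the abstract's point about computational efficiency.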




FOR MORE ABSTRACTS, IEEE BASE PAPER / REFERENCE PAPERS AND NON IEEE PROJECT ABSTRACTS

CONTACT US
No.109, 2nd Floor, Bombay Flats, Nungambakkam High Road, Nungambakkam, Chennai - 600 034
Near Ganpat Hotel, Above IOB, Next to ICICI Bank, Opp to Cakes'n'Bakes
044-2823 5816, 98411 93224, 89393 63501
ncctchennai@gmail.com, ncctprojects@gmail.com 


EMBEDDED SYSTEM PROJECTS IN
Embedded Systems using Microcontrollers, VLSI, DSP, Matlab, Power Electronics, Power Systems, Electrical
For Embedded Projects - 044-45000083, 7418497098 
ncctchennai@gmail.com, www.ncct.in


Project Support Services
Complete Guidance | 100% Result for all Projects | On time Completion | Excellent Support | Project Completion Experience Certificate | Free Placements Services | Multi Platform Training | Real Time Experience


TO GET ABSTRACTS / PDF Base Paper / Review PPT / Other Details
Mail your requirements / SMS your requirements / Call and get the same / Directly visit our Office


WANT TO RECEIVE FREE PROJECT DVD...
Want to Receive FREE Projects Titles, List / Abstracts  / IEEE Base Papers DVD… Walk in to our Office and Collect the same Or

Send a scanned copy of your College ID, your Mobile No & Complete Postal Address, mentioning that you are interested in receiving the DVD by Courier Free of Cost


Own Projects
Own Projects ! or New IEEE Paper… Any Projects…
Mail your Requirements to us and Get it Done with us… or Call us / Email us / SMS us or Visit us Directly

We will do any Projects…


