Tuesday, April 2, 2019
Algorithm For Segmentation Of Urdu Script English Language Essay
Segmentation of script plays a vital role in text recognition. It is important to examine the script used in writing a document before developing or using a model to recognize it. ... chain codes etc. In the ligature model, a word model is used at document, page and word level for segmentation. Our algorithm for segmentation of Urdu script uses a character model and a Hidden Markov Model (HMM) to improve on previous work. We have extracted features from images and calculated the maximum likelihood to match characters in the forward algorithm with a feature extracted from a text sample. The main features used in the system will be pre-processing, connected component analysis, recognition and segmentation of text up to character level. The algorithm will also show a way to implement an Urdu OCR system on the basis of the character model.

Key terms: Preprocessing, Segmentation of characters, character model, Optical character recognition (OCR), max and argmax.

Introduction

We use an OCR system / scanner to get images of text [1]. In preprocessing, the image will be converted to a noiseless B/W image.

1.1 Segmentation

Segmentation is dividing an image into smaller segments or pieces [2]. Segmentation occurs on two levels. At the first level, text and graphics are separated for further processing. At the second level, segmentation is performed on the text to separate paragraphs, words, characters etc. Segmentation of text can be performed at document, page, paragraph and character levels [3]. Various segmentation approaches have been suggested [4], namely:

1. Holistic method
2. Segmentation based approach
3. Segmentation free approach

In the holistic method, the whole word is classified using a lexicon: the features of the test input are matched against trained prototypes [5].
The limitation is that the method is not good for larger classes and it can only be used together with the other two methods. The segmentation based approach divides a word into smaller segments. The image of the word is broken up into several entities called graphemes [4]. Segmentation depends on human intuition. In the segmentation free approach, a character model can be used to concatenate characters and form words. For instance, the segmentation free approach can be based on a Hidden Markov Model (HMM), which is a stochastic model.

1.2. Urdu Language and Text Segmentation

Urdu is a cursive (written with the characters joined) writing language. Urdu characters are similar in shape and have curves, and a character can be represented by more than one symbol. Because of this cursive nature, characters / scripts in Urdu are hard for a computer program to recognize, and a very accurate technique is needed to recognize / find Urdu characters. Urdu characters have four basic shapes:

1. Basic Symbols (38 Symbols). Table 1 shows the basic symbols / shapes for Urdu Language.
2. Beginning Symbols (26 Symbols). Table 2 shows the beginning symbols / shapes for Urdu Language.
3. Mid Symbols (40 Symbols). Table 3 shows the mid symbols / shapes for Urdu Language.
4. Other Symbols. This includes symbols for numbers and special symbols like zabar, zair, paish etc.

The symbol tables for Urdu language, Table 1, Table 2, Table 3 and Table 4, are given below.

Table 1. Basic Symbols
Table 2. Beginning Symbols
Table 3. Mid Symbols
Table 4. Other Symbols

We used the Urdu script Nastaliq for our work. We extracted images for the Urdu character set (basic, beginning, mid and other symbols) using an available Nastaliq font.

Literature Review

In a morphological approach to script identification, stroke geometry has been utilized for script characterization and identification [6].
Individual character images in a document are classified either by applying a prototype classification or by using a support vector machine. Ligatures are used for segmentation / recognition of Urdu characters. A ligature is a sequence of characters in a word separated by non-joiner characters like space.

The approach in [1] used a ligature model and is divided into two stages.

Line Segmentation

Line segmentation deals with the detection of text lines in the image. The image is scanned horizontally from right to left, top to bottom, in search of a text pixel. Afterwards, it is determined whether this pixel belongs to a primary ligature or a secondary ligature, as shown in Fig 1. The Freeman chain codes (FCC) of the ligature are compared with the already calculated FCC of the secondary ligatures.

Character Segmentation

The text is skeletonized and a pit matrix is constructed which contains the identifiers of all ligatures in the image. The position of individual characters in a word is determined. Segmentation is done using primary ligatures only.

Fig 1. (a) Urdu word (b) Seven ligatures (c) Three Primary ligatures (d) Four Secondary ligatures [7].

The method has three limitations. Firstly, segmentation is performed on the basis of primary ligatures only; therefore it will not distinguish between seen and sheen, because it ignores secondary ligatures, i.e. dots. Secondly, the dictionary of images stored for training will be huge. Thirdly, there are problems of over-segmentation and under-segmentation.

In [8], a ligature and word model is proposed for Urdu word segmentation. It works in three phases.

In the 1st phase, data is collected. Ligatures are identified and word probabilities are calculated using a probabilistic measure. From the input set of ligatures, all sequences of words are generated and ranked using lexicon lookup.

In the 2nd phase, the top k sequences are selected using a selected beam value for further processing.
It uses a valid-words heuristic for the selection process.

In the third phase, the most probable sequence is selected from these k word sequences. The method used a dictionary of ligatures / words and chain codes, and to get the best probable sequences it used the HMM toolkit HTK to recognize a word / ligature. The authors recommended that their work could be further improved by using a character model for Urdu text segmentation [9].

A poor segmentation will lead to poor recognition [10]. In [11], the image is divided into smaller blocks, the blocks are checked for uniformity, uniform blocks are grouped using color similarity, and text is identified within each group. In [12], edge density based noise detection is used to segment out text areas in video / images. Segmentation of an image into text and non-text regions affects performance in OCR development [13]. In [14], a line segmentation method using histogram equalization is proposed, various problems are indicated, and text lines are segmented into ligatures using chain codes. In [15], a bounding box based approach is presented for segmentation of tables of contents in Urdu script. In [16], horizontal and vertical projection profiles are analyzed for line and character segmentation; misclassification occurs at character level. In [17], text lines are extracted using vertical projection, marking all points where pixel values are not set, and text lines are segmented into ligatures using stroke geometry. In [18], partial words (i.e. connected components) are identified in a text line, and horizontal / vertical projections are used to identify words by relative distance matching. In [19], a dictionary is used for text line and ligature segmentation in online text.

Problem Statement

Previous work is limited in that it cannot perform segmentation correctly in some cases, which leads to misclassification problems.
Moreover, it can recognize only a limited set of connected components or ligatures.

Proposed Segmentation Algorithm

We enhance previous work by proposing an improved algorithm for Urdu script segmentation that uses a character model. For this purpose we have created a set of characters. There are nearly 114 characters, excluding some special characters like zabar, zair, paish etc. We have used characters of fixed size and style in this work. We use all the variations of each character in a writing style, e.g. bay has three shapes: a basic, a beginning and a mid shape. Our algorithm uses a character model with Hidden Markov Models (HMMs) for segmentation of Urdu text. To the best of our knowledge, this work has not been done previously. We work with offline text, i.e. scanned, pre-processed B/W Urdu characters, and we use Matlab ver. 7.12 as the programming tool.

4.1 Our Method

Our method is divided into three broad steps.

Step 1: Data Acquisition / Feature Extraction

In the first step, the algorithm transforms images of symbols into binary form as a matrix. It then extracts features from the images using our feature extraction program and stores them on disk. These features are represented as hidden states X(i) = x(0), x(1), ..., x(k), where each X(i) represents a feature (in matrix form) for each shape in the Urdu character set and x(k) is a position vector in the matrix X(i).

Step 2: Get Observed Data

The observed data contain sequences of Urdu characters. In our study we have used a line of Urdu text. After acquiring this filtered image, we transformed it into binary form and then extracted features from it using our feature extraction program. This feature contains several Urdu characters. The algorithm scans it and performs segmentation by calculating maximum probabilities against the hidden states and locating observations in the feature using HMMs. These observations form observable states O(i) = o(0), o(1), ..., o(k), where each O(i) represents a feature (in matrix form) for each shape in the observed states and o(k) is a positional vector in the matrix O(i).

Step 3: Apply HMMs

We are given:

Hidden states X(i) = x(1), x(2), ..., x(k), where i = 1, 2, ..., m (for m characters).
Observable states O(i) = o(1), o(2), ..., o(k), where i = 1, 2, ..., n.
Initial distribution X(0).

In a hidden Markov model the state variable x(i) is observable only through its measurements o(i). Now, suppose that a sequence O(i) of emissions has been observed. Fig 2 shows a character and an observed sequence, captured as MATLAB matrices.

Fig 2. (a) An m x n matrix showing the Urdu character Alif. (b) Sample observation showing a connected component of two characters, bay and alif, spelled out ba.

Instead of using the characters themselves, our algorithm extracts features from all the characters to reduce computational complexity. These features are used as the hidden states x(i) of the HMM and are stored on disk. For example, features for the characters alif and bay, captured using MATLAB, are shown in Fig 3.

Fig 3. (a) Feature for character Alif, (b) Feature for character Bay and (c) Feature for sample S(i) taken from the word ba, i.e. bay-alif.

The algorithm extracts a feature from the line of sample text S(i). In the forward algorithm, the feature s(1), ..., s(k) is matched against each of the hidden states x(i) by matching rows of x(i) with rows of S(i). The process continues for all characters and stops after calculating probabilities for all the characters, i.e. P(X(i) | Z(i)). Afterwards it takes the maximum of these probabilities, and in this way it finds observation O(1) from S(1). The forward algorithm then continues from s(k+1), ..., s(L) to find observations O(2), ..., O(n). If there is more than one probable character and the result is not close to the actual text, the Viterbi algorithm can be used to find the argmax and give the optimal probable sequence.
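The matching step described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' MATLAB implementation: the two templates are invented 3-row shapes, pixel agreement stands in for the probability pr(s(j) | X(i)), and the loop is a greedy argmax matcher rather than a full HMM forward or Viterbi pass. The names `TEMPLATES`, `window`, `match_score`, `segment_line` and `hstack` are hypothetical. Columns are scanned in array order, whereas a real Urdu line would be read from right to left.

```python
# Toy binary templates (1 = ink, 0 = background).
# These shapes are invented for illustration, not real Nastaliq features.
TEMPLATES = {
    "alif": [[1],
             [1],
             [1]],
    "bay":  [[0, 0, 0],
             [1, 1, 1],
             [0, 1, 0]],
}


def window(line, start, width):
    """Cut columns start..start+width-1 out of a row-major binary matrix."""
    return [row[start:start + width] for row in line]


def match_score(patch, template):
    """Fraction of pixels that agree; an assumed stand-in for pr(s(j) | X(i))."""
    total = agree = 0
    for patch_row, template_row in zip(patch, template):
        for a, b in zip(patch_row, template_row):
            total += 1
            agree += (a == b)
    return agree / total


def segment_line(line, templates):
    """Greedy argmax matching: label the best template at column j, then skip past it."""
    labels = []
    j = 0
    line_width = len(line[0])
    while j < line_width:
        best = None  # (score, width, name)
        for name, template in templates.items():
            width = len(template[0])
            if j + width > line_width:
                continue  # template would run past the end of the line
            score = match_score(window(line, j, width), template)
            candidate = (score, width, name)
            if best is None or candidate > best:
                best = candidate  # on equal scores the wider template wins
        if best is None:
            break  # nothing fits in the remaining columns
        labels.append(best[2])
        j += best[1]
    return labels


def hstack(*shapes):
    """Join shapes of equal height side by side."""
    return [sum(rows, []) for rows in zip(*shapes)]


ba = hstack(TEMPLATES["bay"], TEMPLATES["alif"])
print(segment_line(ba, TEMPLATES))  # prints ['bay', 'alif']
```

Advancing j by the width of the winning template mirrors the way the forward pass in the text moves on from s(k+1) after finding O(1). A greedy choice like this cannot recover from an early mistake, which is where a Viterbi search over whole sequences would help.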
The algorithm for the HMMs is as under:

Algorithm Segsha (S, L)
j = 1
while ( j <= L )
    for i = 1 to n
        Sample s(j)
        wi = pr(s(j) | X(i))
    end-for
    O(i) = O(i) U max( wi )
    s(j) = s(j) + 1
end-while

Here S is a sample feature of vectors obtained from an observed sequence O(i), i.e. a line of Urdu text; L is the dimension of S (the length of S); s(j) is a sample taken from S each time to match against the character feature X(i); the probability of matching gives a weight wi for each character; and max(wi) is the maximization of that probability. max(wi) can be calculated by examining each wi in w, using eq. 1 [20].

Result

A total of 1200 words were used, covering all the characters in our character set. Sample scanned text was taken from a Nastaliq font with point size 36. We found that 1176 out of 1200 words were completely recognized. In the remaining words, only one or two characters were misclassified rather than the whole word. The accuracy of 97% was very encouraging for us and we look forward to working further in this area.

Conclusion

We tested our approach on images of text taken from a Nastaliq font scanned at 300 dpi and found that better results can be achieved by using HMMs with the character model. These results were checked on a prototype using a set of characters. We have achieved 97% accuracy.

Future Work and Enhancements

In future we are planning two things:

1. To eliminate the restriction of fixed font size and style.
2. To work with written Urdu text.

We will pursue both options using the same method, but that is another story.