Research


Research Areas:

Abstracts:

TitlePerformance Prediction for Graph Queries 
Authors: M.H. Namaki, K. Sasani, Y. Wu and A.H. Gebremedhin 
Status: ACM SIGMOD International Conference on Management of Data Workshop on Network Data Analytics (NDA 2017) 

Abstract: Query performance prediction has shown benefits to query optimization and resource allocation for relational databases. Emerging applications are leading to search scenarios where workloads with heterogeneous, structure-less analytical queries are processed over large-scale graph and network data. This calls for effective models to predict the performance of graph analytical queries, which are often more involved than their relational counterparts. In this paper, we study and evaluate predictive techniques for graph query performance prediction. We make several contributions. (1) We propose a general learning framework that makes use of practical and computationally efficient statistics from query scenarios and employs regression models. (2) We instantiate the framework with two routinely issued query classes, namely, reachability and graph pattern matching, that exhibit different query complexity. We develop modeling and learning algorithms for both query classes. (3) We show that our prediction models readily apply to resource-bounded querying, by providing a learning-based workload optimization strategy. Given a query workload and a time bound, the models select queries to be processed with a maximized query profit and a total cost within the bound. Using real-world graphs, we experimentally demonstrate the efficacy of our framework in terms of accuracy and the effectiveness of workload optimization. 

TitleParallel Maximum Clique Algorithms with Applications to Network Analysis 
Authors: R.A. Rossi, D.F. Gleich, A.H. Gebremedhin 
Status: SIAM Journal on Scientific Computing, Vol 37, Issue 5, pages C589-C618, 2015. 

Abstract: We present a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from a thousand to a hundred million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. At its heart the algorithm employs a branch-and-bound strategy with novel and aggressive pruning techniques. The pruning techniques include the combined use of core numbers of vertices along with a good initial heuristic solution to remove the vast majority of the search space. In addition, the exploration of the search tree is parallelized. During the search, processes immediately communicate changes to upper and lower bounds on the size of maximum clique. This occasionally results in a super-linear speedup because tasks with large search spaces can be pruned by other processes. We demonstrate the impact of the algorithm on applications using two different network analysis problems: computation of temporal strong components in dynamic networks and determination of compress-friendly ordering of nodes of massive networks. 

TitleFast Algorithms for the Maximum Clique Problem on Massive Graphs with Applications to Overlapping Community Detection 
Authors: B. Pattabiraman, M.M.A Patwary, A.H. Gebremedhin, W.K. Liao, A. Choudhary 
Status: Internet Mathematics, Vol 11, No 4-5, pp 421-448, 2015. 

Abstract: The maximum clique problem is a well known NP-Hard problem with applications in data mining, network analysis, information retrieval and many other areas related to the World Wide Web. There exist several algorithms for the problem with acceptable runtimes for certain classes of graphs, but many of them are infeasible for massive graphs. We present a new exact algorithm that employs novel pruning techniques and is able to find maximum cliques in very large, sparse graphs quickly. Extensive experiments on different kinds of synthetic and real-world graphs show that our new algorithm can be orders of magnitude faster than existing algorithms. We also present a heuristic that runs orders of magnitude faster than the exact algorithm while providing optimal or near-optimal solutions. We illustrate a simple application of the algorithms in developing methods for detection of overlapping communities in networks. 

TitleFast Maximum Clique Algorithms for Large Graphs 
Authors: R.A. Rossi, D.F. Gleich, A.H. Gebremedhin, M.M.A. Patwary 
Status: Proceedings of WWW2014. 

Abstract: We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. Despite clique’s status as an NP-hard problem with poor approximation guarantees, our method exhibits nearly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Key to the efficiency of our algorithm are an initial heuristic procedure that finds a large clique quickly and a parallelized branch and bound strategy with aggressive pruning tnd ordering echniques. We use the algorithm to compute the largest temporal strong components of temporal contact networks. 

TitleFast Algorithms for the Maximum Clique Problem on Massive Sparse Graphs 
Authors: B. Pattabiraman, M.M.A Patwary, A.H. Gebremedhin, W.K. Liao, A. Choudhary 
Status: WAW 2013: 10th Workshop on Algorithms and Models for the Web Graph, LNCS 8305, pp 156-169, 2013. 

Abstract: The maximum clique problem is a well known NP-Hard problem with applications in data mining, network analysis, information retrieval and many other areas related to the World Wide Web. There exist several algorithms for the problem with acceptable runtimes for certain classes of graphs, but many of them are infeasible for massive graphs. We present a new exact algorithm that employs novel pruning techniques and is able to quickly find maximum cliques in large sparse graphs. Extensive experiments on different kinds of synthetic and real-world graphs show that our new algorithm can be orders of magnitude faster than existing algorithms. We also present a heuristic that runs orders of magnitude faster than the exact algorithm while providing optimal or near-optimal solutions. 

Title : A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen a tool for analyzing and managing short-sequence repeat data 
Authors: H.N. Catanese, K.A. Brayton, A.H. Gebremedhin 
Status: BMC Bioinformatics 2018. 

AbstractBackground. Sequence similarity networks are useful for classifying and characterizing biologically important proteins. Threshold-based approaches to similarity network construction using exact distance measures are prohibitively slow to compute and rely on the difficult task of selecting an appropriate threshold, while similarity networks based on approximate distance calculations compromise useful structural information. 

Results. We present an alternative network representation for a set of sequence data that overcomes these drawbacks. In our model, called the Directed Weighted All Nearest Neighbors (DiWANN) network, each sequence is represented by a node and is connected via a directed edge to only the closest sequence, or sequences in the case of ties, in the dataset. Our contributions span several aspects. Specifically, we: (i) Apply an all nearest neighbors network model to protein sequence data from three different applications and examine the structural properties of the networks; (ii) Compare the model against threshold-based networks to validate their semantic equivalence, and demonstrate the relative advantages the model offers; (iii) Demonstrate the model’s resilience to missing sequences; and (iv) Develop an efficient algorithm for constructing a DiWANN network from a set of sequences. We find that the DiWANN network representation attains similar semantic properties to threshold-based graphs, while avoiding weaknesses of both high and low threshold graphs. Additionally, we find that approximate distance networks, using BLAST bitscores in place of exact edit distances, can cause significant loss of structural information. We show that the proposed DiWANN network construction algorithm provides a fourfold speedup over a standard threshold based approach to network construction. We also identify a relationship between the centrality of a sequence in a similarity network of an Anaplasma marginale short sequence repeat dataset and how broadly that sequence is dispersed geographically. 

Conclusion. We demonstrate that using approximate distance measures to rapidly construct similarity networks may lead to significant deficiencies in the structure of that network in terms centrality and clustering analyses. We present a new network representation that maintains the structural semantics of threshold-based networks while increasing connectedness, and an algorithm for constructing the network using exact distance measures in a fraction of the time it would take to build a threshold-based equivalent. 

TitleRepeatAnalyzer: a tool for analysing and managing short-sequence repeat data 
Authors: H.N. Catanese, K.A. Brayton, A.H. Gebremedhin 
Status: BMC Genomics 2016 17:422. 

AbstractBackground. Short-sequence repeats (SSRs) occur in both prokaryotic and eukaryotic DNA, inter- and intragenically, and may be exact or inexact copies. When heterogeneous SSRs are present in a given locus, we can take advantage of the pattern of different repeats to genotype strains based on the SSRs. Cataloguing and tracking these repeats can be difficult as diverse groups of researchers are involved in the identification of the repeats. Additionally, the task is error-prone when done manually. 

Results. We developed RepeatAnalyzer, a new software tool capable of tracking, managing, analysing and cataloguing SSRs and genotypes using Anaplasma marginale as a model species. RepeatAnalyzer’s analysis capability includes novel metrics for measuring regional genetic diversity (corresponding to variety and regularity of SSR occurrence). As a part of its visualization capabilities, RepeatAnalyzer produces high quality maps of the geographic distribution of genotypes or SSRs over a region of interest. RepeatAnalyzer’s repeat identification functionality was validated for all SSRs and genotypes reported in 21 publications, using 380 A. marginale isolates gathered from the five publications within that list that provided access to their isolates. The tool produced accurate genotyping results in every case. In addition, it uncovered a number of errors in the published literature: 11 cases where SSRs were misreported, 5 cases where two different SSRs had been given the same name, and 16 cases where two or more names had been given to a single SSR. The analysis and visualization functionalities of the tool are demonstrated using several examples. 

Conclusions. RepeatAnalyzer is a robust software tool that can be used for storing, managing, and analysing short-sequence repeats for the purpose of strain identification. The tool can be used for any set of SSRs regardless of species. When applied to A. marginale, our test case, we show that genotype lengths for a given region follow a normal distribution, while SSR frequencies follow a power-law-like distribution. Further, we find that over 90 % of repeats are 28 to 29 amino acids long, which is in agreement with conventional wisdom. Lastly, our analysis reveals that the most common edit distance is five or six, which is counter-intuitive since we expected that result to be closer to one, resulting from the simplest change from one repeat to another. 

TitleCharacterization of Anaplasma marginale subspecies centrale using msp1aS genotyping reveals wildfire reservoir 
Authors: Z.T.H. Khumalo, H.N. Catanese, N. Leisching, P. Hove, N.E. Collins, M.E. Chaisit, A.H. Gebremedhin, M.C. Oosthuizen and K.A. Brayton 
Status: Journal of Clinical Microbiology, 2016 54:10, 2503-2512. 

Abstract: Bovine anaplasmosis caused by the intraerythrocytic rickettsial pathogen Anaplasma marginale is endemic in South Africa. Anaplasma marginale subspecies centrale (A. centrale) also infects cattle, however, it causes a milder form of anaplasmosis and is used as a live vaccine against A. marginale. There has been less interest in the epidemiology of A. centrale, and, as a result, there are few reports detecting natural infections of this organism. When detected in cattle, it is often assumed that it is due to vaccination, and in most cases it is reported as co-infection with A. marginale without characterization of the strain. In this study a total of 380 blood samples from wild ruminant species and cattle collected from Biobanks, National Parks, and other regions of South Africa were used in duplex real-time PCR assays to simultaneously detect A. marginale and A. centrale. PCR results indicated high occurrence of A. centrale infections ranging from 25-100% in National Parks. Samples positive for A. centrale were further characterized using the msp1aS gene, a homolog of msp1a of A. marginale which contains repeats at the 5′ end that are useful for genotyping strains. A total of 47 Msp1aS repeats were identified which corresponded to 32 A. centrale genotypes detected in cattle, buffalo and wildebeest. RepeatAnalyzer was used to examine strain diversity. Our results demonstrate a diversity of A. centrale strains from cattle and wildlife hosts from South Africa and indicate the utility of msp1aS as a genotypic marker for A. centrale strain diversity. 

Title: Co-infections with multiple genotypes of Anaplasmamarginale in cattle indicate pathogen diversity 
Authors: P. Hove, M.E. Chaisi, K.A. Brayton, H. Ganesan, H.N. Catanese, M.S. Mtshali, A.M. Mutshembele, M.C. Oosthuizen, and N.E. Collins. 
Status: Journal of Clinical Microbiology, 2016 54:10, 2503-2512. 

AbstractBackground. Only a few studies have examined the presence of Anaplasma marginale and Anaplasma centrale in South Africa, and no studies have comprehensively examined these species across the whole country. To undertake this country-wide study we adapted a duplex quantitative PCR assay for use in South Africa, but found that one of the genes on which the assay was based was variable. Therefore, we sequenced a variety of field samples and tested the assay on the variants detected. We used the assay to screen 517 cattle samples sourced from all nine provinces of South Africa, and subsequently examined A. marginale positive samples for msp1 genotype to gauge strain diversity. 

Results.  Although the A. marginale msp1β gene is variable, the qPCR functions at an acceptable efficiency. The A. centralegroEL gene was not variable within the qPCR assay region. Of the cattle samples screened using the assay, 57% and 17% were found to be positive for A. marginale and A. centrale, respectively. Approximately 15% of the cattle were co-infected. Msp1 genotyping revealed 36 novel repeat sequences. Together with data from previous studies, we analyzed the Msp1a repeats from South Africa where a total of 99 repeats have been described that can be attributed to 190 msp1α genotypes. While 22% of these repeats are also found in other countries, only two South African genotypes are also found in other countries; otherwise the genotypes are unique to South Africa. 

Conclusions.  Anaplasma marginale was prevalent in the Western Cape, KwaZulu-Natal and Mpumalanga, and absent in the Northern Cape. Anaplasma centrale was prevalent in the Western Cape and KwaZulu-Natal and absent in the Northern Cape and Eastern Cape. None of the cattle in the study were known to be vaccinated with A. centrale, so finding positive cattle indicates that this organism appears to be naturally circulating in cattle. A diverse population of A. marginale strains is found in South Africa, with some msp1α genotypes widely distributed across the country, and others appearing only once in one province. This diversity should be taken into account in future vaccine development studies. 

TitleA Closed-loop Deep Learning Architecture for Robust Activity Recognition using Wearable Sensors 
Authors: R. Saeedi, S. Norgaard, and A.H. Gebremedhin 
Status: IEEE Big Data 2017 

Abstract: Human activity recognition (HAR) plays a central role in health-care, fitness and sport applications because of its potential to enable context-aware human monitoring. With the increase in popularity of wearable devices, we are witnessing a large influx in availability of human activity data. For effective analysis and interpretation of these heterogeneous and high-volume streaming data, we need powerful algorithms. In particular, there is a strong need for developing algorithms for robust classification of human activity data that specifically address challenges associated with dynamic environments (e.g. different users, signal heterogeneity). We use the term robust here in two, orthogonal senses: 1) leveraging related data in such a way that knowledge is transferred to a new context; and 2) actively re-configuring machine learning algorithms such that they can be applied in a new context. In this paper, we propose an architecture that combines an active learning approach with a novel deep network. Our deep neural network exploits both Convolutional and Long Short-Term Memory (LSTM) layers in order to learn hierarchical representation of features and capture time dependencies from raw-data. The active learning process allows us to choose the best instances for fine-tuning the deep network to the new setting in which the system operates (i.e. a new subject). We demonstrate the efficacy of the architecture using real data of human activity. We show that the accuracy of activity recognition reaches over 90% by annotating less than 20% of unlabeled data. 

TitleCo-MEAL: Cost-Optimal Multi-Expert Active Learning Architecture for Mobile Health Monitoring 
Authors: R. Saeedi, K. Sasani and A.H. Gebremedhin 
Status: ACM-BCB17 

Abstract: Mobile health monitoring plays a central role in a variety of health-care applications. Using mobile technology, health-care providers can access clinical information and communicate with subjects in real-time. Due to the sensitive nature of health-care applications, these systems need to process physiological signals highly accurately. However, as mobile devices are employed in dynamic environments, the accuracy of a machine learning model drops whenever a change in configuration of the system occurs. Therefore, data mining and machine learning techniques that specifically address challenges associated with dynamic environments (e.g. different users, signal heterogeneity) are needed. In this paper, using active learning as an organizing principle, we propose a cost-optimal multiple-expert architecture to adapt a machine learning model (e.g. classifier) developed in a given context to a new context or configuration. More specifically, in our architecture, a system’s machine learning model learns from experts available to the system, (e.g. another mobile device, human annotator) while minimizing the cost of data labeling. Our architecture also exploits collaboration between experts to enrich their knowledge which in turn decreases both cost and uncertainty of data labeling in future steps. We demonstrate the efficacy of the architecture using a publicly available dataset on human activity. We show that the accuracy of activity recognition reaches over 85% by labeling only 15% of unlabeled data. At the same time, the number of queries from human expert is reduced by up to 82%. 

TitleTransfer Learning Algorithms for Autonomous Configuration of Wearable Systems
Authors: R. Saeedi, H. Ghasemzadeh and A.H. Gebremedhin
Status: IEEE BigData 2016. 

Abstract: Wearables have emerged as a revolutionary technology in many application domains including healthcare and fitness. Machine learning algorithms, which form the core intelligence of wearables, traditionally deduce a computational model from a set of training examples to detect events of interest (e.g. activity type). However, in the dynamic environment in which wearables typically operate in, the accuracy of a computational model drops whenever changes in configuration of the system occur. Therefore, there is a need to develop systems which can adapt to the new configuration autonomously. In this paper, using transfer learning as an organizing principle, we develop several algorithms for data mapping. The data mapping algorithms employ effective signal similarity methods and are used to adapt the system to the new configuration. We demonstrate the efficacy of the data mapping algorithms using a publicly available dataset on human activity recognition.

Title : Algorithms for Balanced Graph Colorings with Applications in Parallel Computing 
Authors: H.Lu, M. Halappanavar, D. Chavarri-a-Miranda, A.H. Gebremedhin, A. Panyala and A. Kalyanaraman 
Status: IEEE Transactions on Parallel and Distributed Systems 28(5), 1240–1256, 2017. 

Abstract: Graph coloring–in a generic sense–is used to identify subsets of independent tasks in parallel scientific computing applications. Traditional coloring heuristics aim to reduce the number of colors used as that number also corresponds to the number of parallel steps in the application. However, if the color classes produced have a skew in their sizes, utilization of hardware resources becomes inefficient, especially for the smaller color classes. Equitable coloring is a theoretical formulation of coloring that guarantees a perfect balance among color classes, and its practical relaxation is referred to here as balanced coloring. In this paper, we consider balanced coloring models in the context of parallel computing. The goal is to achieve a balanced coloring of an input graph without increasing the number of colors that an algorithm oblivious to balance would have used. We propose and study multiple heuristics that aim to achieve such a balanced coloring for two variants of coloring problem, distance-1 coloring (the standard coloring problem) and partial distance-2 coloring (defined on a bipartite graph). We present parallelization approaches for multi-core and manycore architectures and cross-evaluate their effectiveness with respect to the quality of balance achieved and performance. Furthermore, we study the impact of the proposed balanced coloring heuristics on a concrete application – viz. parallel community detection, which is an example of an irregular application. In addition, we propose several extensions to our basic balancing schemes and evaluate their balancing efficacy and performance characteristics. The thorough treatment of balanced coloring presented in this paper from algorithms to application is expected to serve as a valuable resource to parallel application developers who seek to improve parallel performance of their applications using coloring. 

Title : Parallelization of Bin Packing on Multicore Systems 
Authors: S. Ghosh and A.H. Gebremedhin 
Status: 2016 IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC’16). 

Abstract: We study effective parallelization of approximation algorithms for the one-dimensional bin packing problem on a multicore platform. Bin packing is a classic combinatorial optimization problem that aims to pack a given sequence of items into a minimum number of equal-sized bins. The problem potentially serves as a model for a wide variety of applications. Examples include: packing data into chunks in a memory hierarchy in a given system to increase application performance, loading vehicles subject to weight limitations, and packing TV commercials into station breaks. Bin packing has long served as a proving ground for the analysis of approximation algorithms and played a crucial role in the development of much of the theory of approximation algorithms. Its parallelization, however, has received comparatively much less attention. In this work, we develop multipleparallel versions of an effective approximation algorithm (First Fit Decreasing) for the problem and investigate the trade-off between solution quality and execution time. We use OpenMP and Cilk Plus as mechanisms for achieving the parallelization. The new parallel algorithms obtain a speedup of more than 10x (on 32 cores) for moderate to large input sequences without sacrificing much on the quality of solution produced by the sequential algorithm — in particular, we see only about 3 to 30% increase in the number of bins compared to the sequential version. In turn, the solution obtained by the sequential First Fit Decreasing algorithm is provably almost optimal (the approximation ratio is less than 1.3). 

Title : One-sided Interface for Matrix Operations using MPI-3 RMA: A Case Study with Elemental 
Authors: S. Ghosh, J.R. Hammond, A.J. Pena, P. Balaji, A.H. Gebremedhin and B. Chapman 
Status: Proceedings of ICPP 2016. 

Abstract: A one-sided programming model separates communication from synchronization, and is the driving principle behind partitioned global address space (PGAS) libraries such as Global Arrays (GA) and SHMEM. PGAS models expose a rich set of functionality that a developer needs in order to implement mathematical algorithms that require frequent multidimensional array accesses. However, use of existing PGAS libraries in application codes often requires significant development effort in order to fully exploit these programming models. On the other hand, a vast majority of scientific codes use MPI either directly or indirectly via third-party scientific computation libraries, and need features to support application-specific communication requirements (e.g., asynchronous update of distributed sparse matrices, commonly arising in machine learning workloads). For such codes it is often impractical to completely shift programming models in favor of special one-sided communication middleware. Instead, an elegant and productive solution is to exploit the one-sided functionality already offered by MPI-3 RMA (Remote Memory Access). We designed a general one-sided interface using the MPI-3 passive RMA model for remote matrix operations in the linear algebra library Elemental; we call the interface we designed RMAInterface. Elemental is an open source library for distributed-memory dense and sparse linear algebra and optimization. 

Title : Balanced Coloring for Parallel Computing Applications 
Authors: H. Lu, M. Halappanavar, D. Charri-a-Miranda, A.H. Gebremedhin and A. Kalyanaraman 
Status: Proceedings of IPDPS 2015. 

Abstract: Graph coloring is used to identify subsets of independent tasks in parallel scientific computing applications. Traditional coloring heuristics aim to reduce the number of colors used as that number also corresponds to the number of parallel steps in the application. However, if the color classes produced have a skew in their sizes, utilization of hardware resources becomes inefficient, especially for the smaller color classes. Equitable coloring is a theoretical formulation of coloring that guarantees a perfect balance among color classes, and its practical relaxation is referred to as balanced coloring. In this paper, we revisit the problem of balanced coloring in the context of parallel computing. The goal is to achieve a balanced coloring of an input graph without increasing the number of colors that an algorithm oblivious to balance would have used. We propose and study multiple heuristics that aim to achieve such a balanced coloring, present parallelization approaches for multi-core and manycore architectures, and cross-evaluate their effectiveness with respect to the quality of balance achieved and performance. Furthermore, we study the impact of the proposed balanced coloring heuristics on a concrete application – viz. parallel community detection, which is an example of an irregular application. The thorough treatment of balanced coloring presented in this paper from algorithms to application is expected to serve as a valuable resource to parallel application developers who seek to improve parallel performance of their applications using coloring. 

TitleParallel Maximum Clique Algorithms with Applications to Network Analysis 
Authors: R.A. Rossi, D.F. Gleich, A.H. Gebremedhin 
Status: SIAM Journal on Scientific Computing, Vol 37, Issue 5, pages C589-C618, 2015. 

Abstract: We present a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from a thousand to a hundred million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. At its heart the algorithm employs a branch-and-bound strategy with novel and aggressive pruning techniques. The pruning techniques include the combined use of core numbers of vertices along with a good initial heuristic solution to remove the vast majority of the search space. In addition, the exploration of the search tree is parallelized. During the search, processes immediately communicate changes to upper and lower bounds on the size of maximum clique. This occasionally results in a super-linear speedup because tasks with large search spaces can be pruned by other processes. We demonstrate the impact of the algorithm on applications using two different network analysis problems: computation of temporal strong components in dynamic networks and determination of compress-friendly ordering of nodes of massive networks. 

TitleFast Maximum Clique Algorithms for Large Graphs
Authors: R.A. Rossi, D.F. Gleich, A.H. Gebremedhin, M.M.A. Patwary 
Status: Proceedings of WWW2014. 

Abstract: We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. Despite clique’s status as an NP-hard problem with poor approximation guarantees, our method exhibits nearly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Key to the efficiency of our algorithm are an initial heuristic procedure that finds a large clique quickly and a parallelized branch and bound strategy with aggressive pruning tnd ordering echniques. We use the algorithm to compute the largest temporal strong components of temporal contact networks. 

TitleGraph Coloring Algorithms for Multi-core and Massively Multithreaded Architectures 
Authors: U. Catalyurek, J. Feo, A.H. Gebremedhin, M. Halappanavar, A. Pothen 
Status: Parallel Computing 38 (2012), 576-594. 

Abstract: We explore the interplay between architectures and algorithm design in the context of shared-memory platforms and a specific graph problem of central importance in scientific and high-performance computing, distance-1 graph coloring. We introduce two different kinds of multithreaded heuristic algorithms for the stated, NP-hard, problem. The first algorithm relies on speculation and iteration, and is suitable for any shared-memory system. The second algorithm uses dataflow principles, and is targeted at the non-conventional, massively multithreaded Cray XMT system. We study the performance of the algorithms on three different platforms—Cray XMT, Sun Niagara 2, and Intel Nehalem—representing varying degrees of multithreading capabilities. As testbed, we use synthetically generated massive graphs carefully designed to cover a wide spectrum of input types. The results show that the algorithms have scalable runtime performance and use nearly the same number of colors as the underlying serial algorithm, which in turn is effective in practice. The study provides insight into the design of high performance algorithms for irregular problems on many-core architectures. 

TitleNew Multithreaded Ordering and Coloring Algorithms for Multicore Architectures 
Authors: M.M.A. Patwary, A.H. Gebremedhin and A. Pothen 
Status: In E. Jeannot, R. Namyst and J. Roman, editors, Euro-Par 2011, volume 6853 of Lecture Notes in Computer Science, pages 250 — 262. Springer, 2011. 

Abstract: We present new multithreaded vertex ordering and distance-k graph coloring algorithms that are well-suited for the emerging and rapidly growing multicore platforms. The vertex ordering techniques rely on various notions of “degree”, are known to be effective in reducing the number of colors used by a greedy coloring algorithm, and are generic enough to be applicable to contexts other than coloring. We employ approximate degree computation in the ordering algorithms and speculation and iteration in the coloring algorithms as our primary remedies for breaking sequentiality and achieving effective parallelization. The algorithms have been implemented using OpenMP, and experiments run on Intel Nehalem and other multi-core machines using a set of carefully designed synthetic graphs and real-world graphs attest that the algorithms provide scalable runtime performance. The number of colors the algorithms use is nearly the same as in the serial case, which in turn is often very close to optimal. 

TitleDistributed-memory Parallel Algorithms for Matching and Coloring 
Authors: U. Catalyurek, F. Dobrian, A.H. Gebremedhin, M. Halappanavar, A. Pothen 
Status: Proceedings of IEEE International Parallel and Distributed Processing Symposium, Workshops and PhD Forums (IPDPSW), Workshop on Parallel Computing and Optimization (PCO’11), pages 1966–1975, 2011 

Abstract: We discuss the design and implementation of new highly-scalable distributed-memory parallel algorithms for two prototypical graph problems, edge-weighted matching and distance-1 vertex coloring. Graph algorithms in general have low concurrency, poor data locality, and high ratio of data access to computation costs, making it challenging to achieve scalability on massively parallel machines. We overcome this challenge by employing a variety of techniques, including speculation and iteration, optimized communication, and randomization. We present preliminary results on weak and strong scalability studies conducted on an IBM Blue Gene/P machine employing up to tens of thousands of processors. The results show that the algorithms hold strong potential for computing at petascale. 

TitleDistributed-memory Parallel Algorithms for Distance-2 Coloring and Related Problems in Derivative Computation
Authors: D. Bozdag, U. Catalyurek, A. Gebremedhin, F. Manne, E. Boman and F. Ozgunner 
Status: SIAM Journal on Scientific Computing, Vol 32, Issue 4, pp 2418–2446, 2010. 

Abstract: The distance-2 graph coloring problem aims at partitioning the vertex set of a graph into the fewest sets consisting of vertices pairwise at distance greater than two from each other. Its applications include derivative computation in numerical optimization and channel assignment in radio networks. We present efficient, distributed-memory, parallel heuristic algorithms for this NP-hard problem as well as for two related problems used in the computation of Jacobians and Hessians. Parallel speedup is achieved through graph partitioning, speculative (iterative) coloring, and a BSP-like organization of parallel computation. Results from experiments conducted on a PC cluster employing up to 96 processors and using large-size real-world as well as synthetically generated test graphs show that the algorithms are scalable. In terms of quality of solution, the algorithms perform remarkably well—the number of colors used by the parallel algorithms was observed to be very close to the number used by the sequential counterparts, which in turn are quite often near optimal. Moreover, the experimental results show that the parallel distance-2 coloring algorithm compares favorably with the alternative approach of solving the distance-2 coloring problem on a graph $G$ by first constructing the square graph $G^2$ and then applying a parallel distance-1 coloring algorithm on $G^2$. Implementations of the algorithms are made available via the Zoltan load-balancing library. 

TitleA framework for Scalable Greedy Coloring on Distributed Memory Parallel Computers 
Authors: D. Bozdag, A. Gebremedhin, F. Manne, E. Boman and U. Catalyurek 
Status: Journal of Parallel and Distributed Computing Vol 68, No 4, pp 515–535, 2008. 

Abstract: We present a scalable framework for parallelizing greedy graph coloring algorithms on distributed-memory computers. The framework unifies several existing algorithms and blends a variety of techniques for creating or facilitating concurrency. The latter techniques include exploiting features of the initial data distribution, the use of speculative coloring and randomization, and a BSP-style organization of computation and communication. We experimentally study the performance of several specialized algorithms designed using the framework and implemented using MPI. The experiments are conducted on two different platforms and the test cases include large-size synthetic graphs as well as real graphs drawn from various application areas. Computational results show that implementations that yield good speedup while at the same time using about the same number of colors as a sequential greedy algorithm can be achieved by setting parameters of the framework in accordance with the size and structure of the graph being colored. Our implementation is freely available as part of the Zoltan parallel data management and load-balancing library. 

TitleA Parallel Distance-2 Graph Coloring Algorithm for Distributed Memory Computers 
Authors: D. Bozdag, U. Catalyurek, A.H. Gebremedhin, F. Manne, E. G. Boman and F. Ozguner 
Status: Lecture Notes in Computer Science, vol 3726, 2005, pages 796 – 806,Springer. Proc. of HPCC 2005, Sept 21 – 25, 2005, Sorrento, Italy. 

Abstract: The distance-2 graph coloring problem aims at partitioning the vertex set of a graph into the fewest sets consisting of vertices pairwise at distance greater than two from each other. Application examples include numerical optimization and channel assignment. We present the first distributed-memory heuristic algorithm for this NP-hard problem. Parallel speedup is achieved through graph partitioning, speculative (iterative) coloring, and a BSP-like organization of computation. Experimental results show that the algorithm is scalable, and compares favorably with an alternative approach—solving the problem on a graph G by first constructing the square graph G^2 and then applying a parallel distance-1 coloring algorithm on G^2 

TitleA Scalable Parallel Graph Coloring Algorithm for Distributed Memory Computers 
Authors: E.G. Boman, D. Bozdag, U. Catalyurek, A.H. Gebremedhin and F. Manne 
Status: Lecture Notes in Computer Science, vol 3648 , 2005, pages 241 – 251,Springer. Proc. of EuroPar 2005, 30 Aug – 2 Sept, 2005, Lisboa, Portugal. 

Abstract: In large-scale parallel applications a graph coloring is often carried out to schedule computational tasks. In this paper, we describe a new distributed-memory algorithm for doing the coloring itself in parallel. The algorithm operates in an iterative fashion; in each round vertices are speculatively colored based on limited information, and then a set of incorrectly colored vertices, to be recolored in the next round, is identified. Parallel speedup is achieved in part by reducing the frequency of communication among processors. Experimental results on a PC cluster using up to 16 processors show that the algorithm is scalable. 

TitleSpeeding up Parallel Graph Coloring 
Authors: A.H. Gebremedhin, F.Manne and T. Woods 
Status: Lecture Notes in Computer Science, vol 3732, pp 1079-1088, 2005, Springer. Proc. of Para 2004, June 20 – 23, 2004, Lyngby, Denmark. 

Abstract: This paper presents new efficient parallel algorithms for finding approximate solutions to graph coloring problems. We consider an existing shared memory parallel graph coloring algorithm and suggest several enhancements both in terms of ordering the vertices so as to minimize cache misses, and performing vertex-to- processor assignments based on graph partitioning instead of random allocation. We report experimental results that demonstrate the performance of our algorithms on an IBM Regatta supercomputer when up to 12 processors are used. Our implementations use OpenMP for parallelization and Metis for graph partitioning. The experiments show that we get up to a 70 % reduction in runtime compared to the previous algorithm. 

Title:  Graph Coloring on Coarse Grained Multicomputers 
Authors: A. Gebremedhin, I.Guerrin-Lassous, J. Gustedt and J.A. Telle 
Status: Discrete Applied Mathematics, Vol 131, No 1, pp 179–198, 2003 

Abstract: We present an efficient and scalable Coarse Grained Multicomputer (CGM) coloring algorithm that colors a graph G with at most D + 1 colors where D is the maximum degree in G . This algorithm is given in two variants: a randomized and a deterministic . We show that on a p-processor CGM model the proposed algorithms require a parallel time of O(|G|/p) and a total work and overall communication cost of O(|G|) . These bounds correspond to the average case for the randomized version and to the worst case for the deterministic variant. 

TitleParallel Distance-k Coloring Algorithms for Numerical Optimization 
Authors: A.H. Gebremedhin, F. Manne and A. Pothen 
Status: In B. Monien and R. Feldmann (Eds.): EuroPar 2002, Lecture Notes in Computer Science 2400, pp. 912-921, Springer-Verlag 2002. 

Abstract: Matrix partitioning problems that arise in the efficient estimation of sparse Jacobians and Hessians can be modeled using variants of graph coloring problems. In a previous work, we argue that distance-2 coloring and distance-3/2 coloring [we now call this star coloring ] are robust and flexible formulations of the respective matrix estimation problems. The problem size in large-scale optimization contexts makes the matrix estimation phase an expensive part of the entire computation both in terms of execution time and memory space. Hence, there is a need for both shared- and distributed-memory parallel algorithms for the stated graph coloring problems. In the current work, we present the first practical shared address space parallel algorithms for these problems. The main idea in our algorithms is to randomly partition the vertex set equally among the available processors, let each processor speculatively color its vertices using information about already colored vertices, detect eventual conflicts in parallel, and finally re-color conflicting vertices sequentially. Randomization is also used in the coloring phases to further reduce conflicts. Our PRAM-analysis shows that the algorithms should give almost linear speedup for sparse graphs that are large relative to the number of processors. Experimental results from our OpenMP implementations on a Cray Origin2000 using various large graphs show that the algorithms indeed yield reasonable speedup for modest numbers of processors. 

TitleScalable Parallel Graph Coloring Algorithms 
Authors: A. Gebremedhin and F. Manne 
Status: Concurrency: Practice and Expereince, Vol 12, pp1131–1146, 2000. 

Abstract: Finding a good graph coloring quickly is often a crucial phase in the development of efficient, parallel algorithms for many scientific and engineering applications. In this paper we consider the problem of solving the graph coloring problem itself in parallel. We present a simple and fast parallel graph coloring heuristic that is well suited for shared memory programming and yields an almost linear speedup on the PRAM model. We also present a second heuristic that improves on the number of colors used. The heuristics have been implemented using OpenMP. Experiments conducted on an SGI Cray Origin 2000 super computer using very large graphs from finite element methods and eigenvalue computations validate the theoretical run-time analysis. 

TitleGraph Coloring on a Coarse Grained Multiprocessor 
Authors: A.H. Gebremedhin, I.G. Lassous, J. Gustedt and J.A. Telle 
Status: In Brandes, Ulrik, Wagner and Dorothea (Eds.): WG 2000, Lecture Notes in Computer Science 1928, pp. 184-195, 2000, Springer-Verlag. 

Abstract: We present the first efficient parallel coloring algorithm for the Coarse Grain Multicomputer model. The algorithm uses at most D+1 colors, where D is the maximum degree in the graph. 

TitleParallel Graph Coloring Algorithms using OpenMP 
Authors: A.H. Gebremedhin and F. Manne 
Status: In Proc. of EWOMP’99, First European Workshop on OpenMP, Sept.30 – Oct. 1, 1999, Lund, Sweden. 

Download Extended Abstract in PDF 


Title
Capitalizing on Live Variables: New Algorithms for Efficient Hessian Computation via Automatic Differentiation 
Authors: M. Wang, A.H. Gebremedhin and A. Pothen 
Status: Mathematical Programming Computation, pp 1–43, 2016. 

Abstract: We revisit an algorithm (called Edge Pushing (EP)) for computing Hessians using Automatic Differentiation (AD) recently proposed by Gower and Mello (2012). Here we give a new, simpler derivation for the EP algorithm based on the notion of live variables from data-flow analysis in compiler theory and redesign the algorithm with close attention to general applicability and performance. We call this algorithm LIVARH and develop an extension of LIVARH that incorporates preaccumulation to further reduce execution time–the resulting algorithm is called LIVARHACC. We engineer robust implementations for both algorithms LIVARH and LIVARHACC within ADOL-C, a widely-used operator overloading based AD software tool. Rigorous complexity analyses for the algorithms are provided, and the performance of the algorithms is evaluated using a mesh optimization application and several kinds of synthetic functions as testbeds. The results show that the new algorithms outperform state-of-the-art sparse methods (based on sparsity pattern detection, coloring, compressed matrix evaluation, and recovery) in some cases by orders of magnitude. We have made our implementation available online as open-source software and it will be included in a future release of ADOL-C. 

Title : ColPack: Software for Graph Coloring and Related Problems in Scientific Computing 
Authors: A.H. Gebremedhin, D. Nguyen, M.M.A. Patwary and A. Pothen 
Status: ACM Transactions on Mathematical Software, Vol 40, No 1, pp 1-31, 2013. 

AbstractWe present a suite of fast and effective algorithms, encapsulated in a software package called ColPack, for a variety of graph coloring and related problems. Many of the coloring problems model partitioning needs arising in compression-based computation of Jacobian and Hessian matrices using automatic differentiation. Several of the coloring problems also find important applications in many areas outside derivative computation, including frequency assignment in wireless networks, scheduling, facility location, and concurrency discovery and data movement operations in parallel and distributed computing. The presentation in this article includes a high-level description of the various coloring algorithms within a common design framework, a detailed treatment of the theory and efficient implementation of known as well as new vertex ordering techniques upon which the coloring algorithms rely, a discussion of the package’s software design, and an illustration of its usage. The article also includes an extensive experimental study of the major algorithms in the package using real-world as well as synthetically generated graphs. 

Title : Exploiting Sparsity in Automatic Differentiation on Multicore Architectures 
Authors: B. Letschert, K. Kulshreshtha, A. Walther, D. Nguyen, A.H. Gebremedhin and A. Pothen 
Status: In S. Forth et al. (Eds.), Recent Advances in Algorithmic Differentiation, Lecture Notes in Computational Science and Engineering 87, DOI 10.1007/978-3-642-30023-3_14, 2012, Springer. 

Abstract: We discuss the design, implementation and performance of algorithms suitable for the efficient computation of sparse Jacobian and Hessian matrices using Automatic Differentiation via operator overloading on multicore architectures. The procedure for exploiting sparsity (for runtime and memory efficiency) in serial computation involves a number of steps. Using nonlinear optimization problems as test cases, we show that the algorithms involved in the various steps can be adapted to multithreaded computations. 

Title : Implementation of Partial Separability in a Source to Source Transformation AD Tool 
Authors: S.H.K. Narayanan, B. Norris, P. Hovland and A. Gebremedhin 
Status: In S. Forth et al. (Eds.), Recent Advances in Algorithmic Differentiation, Lecture Notes in Computational Science and Engineering 87, DOI 10.1007/978-3-642-30023-3_31, 2012, Springer. 

Abstract : A significant number of large optimization problems exhibit structure known as partial separability, for example, least squares problems, where elemental functions are gathered into groups that are then squared. The sparsity of the Jacobian of a partially separable function can be exploited by computing the smaller Jacobians of the elemental functions and then assembling them into the full Jacobian. We implemented partial separability support in ADIC2 by using pragmas to identify partially separable function values, applying source transformations to subdivide the elemental gradient computations, and using the ColPack coloring toolkit to compress the sparse elemental Jacobians. We present experimental results for an elastic-plastic torsion optimization problem from the MINPACK-2 test suite. 

Title : Sparse Jacobian Computation using ADIC2 and ColPack 
Authors: S.H.K. Narayanan, B. Norris, P. Hovland, D. Nguyen and A. Gebremedhin 
Status: Procedia Computer Science, 4:2115 — 2123, 2011. Proceedings of the International Conference on Computational Science, ICCS 2011. 

Abstract : Many scientific applications benefit from the accurate and efficient computation of derivatives. Automatically generating these derivative computations from an application’s source code offers a competitive alternative to other approaches, such as less accurate numerical approximations or labor-intensive analytical implementations. ADIC2 is a source transformation tool for generating code for computing the derivatives (e.g., Jacobian or Hessian) of a function given the C or C++ implementation of that function. Often the Jacobian or Hessian is sparse and presents the opportunity to greatly reduce storage and computational requirements in the automatically generated derivative computation. ColPack is a tool that compresses structurally independent columns of the Jacobian and Hessian matrices through graph coloring approaches. In this paper, we describe the integration of ColPack coloring capabilities into ADIC2, enabling accurate and efficient sparse Jacobian computations. We present performance results for a case study of a simulated moving bed chromatography application. Overall, the computation of the Jacobian by integrating ADIC2 and ColPack is approximately 40% faster than a comparable implementation that integrates ADOL-C and ColPack when the Jacobian is computed multiple times. 

TitleDistributed-memory Parallel Algorithms for Distance-2 Coloring and Related Problems in Derivative Computation
Authors: D. Bozdag, U. Catalyurek, A. Gebremedhin, F. Manne, E. Boman and F. Ozgunner 
Status: SIAM Journal on Scientific Computing, Vol 32, Issue 4, pp 2418–2446, 2010. 

Abstract: The distance-2 graph coloring problem aims at partitioning the vertex set of a graph into the fewest sets consisting of vertices pairwise at distance greater than two from each other. Its applications include derivative computation in numerical optimization and channel assignment in radio networks. We present efficient, distributed-memory, parallel heuristic algorithms for this NP-hard problem as well as for two related problems used in the computation of Jacobians and Hessians. Parallel speedup is achieved through graph partitioning, speculative (iterative) coloring, and a BSP-like organization of parallel computation. Results from experiments conducted on a PC cluster employing up to 96 processors and using large-size real-world as well as synthetically generated test graphs show that the algorithms are scalable. In terms of quality of solution, the algorithms perform remarkably well—the number of colors used by the parallel algorithms was observed to be very close to the number used by the sequential counterparts, which in turn are quite often near optimal. Moreover, the experimental results show that the parallel distance-2 coloring algorithm compares favorably with the alternative approach of solving the distance-2 coloring problem on a graph $\G$ by first constructing the square graph $\G^2$ and then applying a parallel distance-1 coloring algorithm on $\G^2$. Implementations of the algorithms are made available via the Zoltan load-balancing library. 

Title : Efficient Computation of Sparse Hessians Using Coloring and Automatic Differentiation 
Authors: A. Gebremedhin, A. Pothen, A. Tarafdar, and A. Walther 
Status: INFORMS Journal on Computing, Vol 21, No 2, pp 209-223, 2009. 

Abstract : The computation of a sparse Hessian matrix H using automatic differentiation (AD) can be made efficient using the following four-step procedure. 

  • Determine the sparsity structure of H. 
  • Obtain a seed matrix S that defines a column partition of H using a specialized coloring on the adjacency graph of H. 
  • Compute the compressed Hessian matrix B=HS using AD. 
  • Recover the numerical values of the entries of H from B. 

The coloring variant used in the second step depends on whether the recovery in the fourth step is direct or indirect : a direct method uses star coloring and an indirect method uses acyclic coloring . In an earlier work, we had designed and implemented effective heuristic algorithms for these two NP-hard coloring problems. Recently, we integrated part of the developed software with the AD tool ADOL-C, which has recently acquired a sparsity detection capability. In this paper, we provide a detailed description and analysis of the recovery algorithms, and experimentally demonstrate the efficacy of the coloring techniques in the overall process of computing the Hessian of a given function using ADOL-C as an example of an AD tool. We also present new analytical results on star and acyclic coloring of chordal graphs. The experimental results show that sparsity exploitation via coloring yields enormous savings in runtime and makes the computation of Hessians of very large size feasible. The results also show that evaluating a Hessian via an indirect method is often faster than a direct evaluation. This speedup is achieved without compromising numerical accuracy. 

Title : Exploiting Sparsity in Jacobian Computation via Coloring and Automatic Differentiation: 
A Case Study in a Simulated Moving Bed Process 

Authors: A. Gebremedhin, A. Pothen and A. Walther 
Status: In C. Bischof et al. (Eds): Proc. of AD2008, Lecture Notes in Computational Science and Engineering 64, pp 339-349, 2008, Springer. 

Abstract : Using a model from a chromatographic separation process in chemical engineering, we demonstrate that large, sparse Jacobians of fairly complex structures can be computed accurately and efficiently by using automatic differentiation (AD) in combination with a four-step procedure involving matrix compression and de-compression. For the detection of sparsity pattern (step 1), we employ a new operator overloading-based implementation of a technique that relies on propagation of index domains. To obtain the seed matrix to be used for compression (step 2), we use a distance-2 coloring of the bipartite graph representation of the Jacobian. The compressed Jacobian is computed using the vector forward mode of AD (step 3). A simple routine is used to directly recover the entries of the Jacobian from the compressed representation (step 4). Experimental results using ADOL-C show that the runtimes of each of these steps is in complete agreement with theoretical analysis, and the total runtime is found to be only about a hundred times the time needed for evaluating the function itself. The alternative approach of computing the Jacobian without exploiting sparsity is infeasible. 

Title : New Acyclic and Star Coloring Algorithms with Applications to Hessian Computation 
Authors: A. Gebremedhin, A. Tarafdar, F. Manne, and A. Pothen 
Status: SIAM Journal on Scientific Computing, Vol 29, No 3, pp 1042–1072, 2007. 

Abstract : Acyclic and star coloring problems are specialized vertex coloring problems that arise in the efficient computation of Hessians using automatic differentiation or finite differencing, when both sparsity and symmetry are exploited. We present an algorithmic paradigm for finding heuristic solutions for these two NP-hard problems. The underlying common technique is the exploitation of the structure of two-colored induced subgraphs. For a graph G on n vertices and m edges, the time complexity of our star coloring algorithm is O(n d_2) , where d_k , a generalization of vertex degree, denotes the average number of distinct paths of length at most k edges starting at a vertex in G . The time complexity of our acyclic coloring algorithm is larger by a multiplicative factor involving the inverse of Ackermann’s function. The space complexity of both algorithms is O(m) . To the best of our knowledge, our work is the first practical algorithm for the acyclic coloring problem. For the star coloring problem, our algorithm uses fewer colors and is considerably faster than a previously known O(n d_3) time algorithm. Computational results from experiments on various large-size test graphs demonstrate that the algorithms are fast and produce highly effective solutions. The use of these algorithms in Hessian computation is expected to reduce overall runtime drastically. 

Title : What Color Is Your Jacobian? Graph Coloring for Computing Derivatives 
Authors : A. Gebremedhin, F. Manne, and A. Pothen 
Status: SIAM Review, Vol 47, No 4, pp 629–705, 2005. 

Abstract: Graph coloring has been employed since the 1980s to efficiently compute sparse Jacobian and Hessian matrices using either finite differences or automatic differentiation. Several coloring problems occur in this context, depending on whether the matrix is a Jacobian or a Hessian, and the specifics of the computational techniques employed. We consider eight variant vertex coloring problems here. This article begins with a gentle introduction to the problem of computing a sparse Jacobian, followed by an overview of the historical development of the research area. Then we present a unifying framework for the graph models of the variant matrix estimation problems. The framework is based upon the viewpoint that a partition of a matrix into structurally orthogonal groups of columns corresponds to distance-2 coloring an appropriate graph representation. The unified framework helps integrate earlier work and leads to fresh insights; enables the design of more efficient algorithms for many problems; leads to new algorithms for others; and eases the task of building graph models for new problems. We report computational results on two of the coloring problems to support our claims. Most of the methods for these problems treat a column or a row of a matrix as an atomic entity, and partition the columns or rows (or both). A brief review of methods that do not fit these criteria is provided. We also discuss results in discrete mathematics and theoretical computer science that intersect with the topics considered here.