Fahreddin Şükrü Torun

Asst. Prof. at Ankara Yildirim Beyazit University in Computer Engineering Department.
email: fstorun at aybu.edu.tr, fsukrutorun@gmail.com
website: avesis.aybu.edu.tr/ftorun

I am an assistant professor at Computer Engineering Department of Ankara Yildirim Beyazit University, Turkey. Prior to joining AYBU, I was a postdoctoral researcher at APO (Algorithmes Parallèles et Optimisation-Parallel Algorithms and Optimization) Team in IRIT/CNRS - ENSEEIHT, Toulouse, FRANCE. I got my PhD. from Bilkent University as a member of the Parallel and Distributed Computing Lab under the supervision of Prof. Cevdet Aykanat and Prof. Murat Manguoglu. My main area of research is parallel scientific computing. I am interested in algorithm development, parallel and distributed programming, iterative and direct linear system solvers and graph-hypergraph partitioning.

Research Interests

High Performance Computing
Parallel and Distributed Computing
Graph and Hypergraph Partitioning
Numerical Linear Algebra
Parallel Iterative Solvers
Parallel Sparse Matrix Operations

Selected Publications

A distributed memory parallel randomized Kaczmarz for sparse system of equations Ercan Bölükbaşı, F. Sukru Torun, Murat Manguoğlu Concurrency and Computation: Practice and Experience, 2024, https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.8274 Kaczmarz algorithm is an iterative projection method for solving system of linear equations that arise in science and engineering problems in various application domains. In addition to classical Kaczmarz, there are randomized and parallel variants. The main challenge of the parallel implementation is the dependency of each Kaczmarz iteration on its predecessor. Because of this dependency, frequent communication is required which results in a substantial overhead. In this study, a new distributed parallel method that reduces the communication overhead is proposed. The proposed method partitions the problem so that the Kaczmarz iterations on different blocks are less dependent. More A frequency parameter is introduced to see the effect of communication frequency on the performance. The communication overhead is also decreased by allowing communication between processes only if they have shared non-zero columns. The experiments are performed using problems from various domains to compare the effects of different partitioning methods on the communication overhead and performance. Finally, parallel speedups of the proposedmethod on larger problems are presented.
Row Replicated Block Cimmino Iain Duff, Philippe Leleux, Daniel Ruiz, F. Sukru Torun SIAM Journal on Scientific Computing, Volume 45, Issue 4, July 2023, https://epubs.siam.org/doi/abs/10.1137/22M1487710 We study a new technique for reducing the number of iterations of the block Cimmino method by replicating rows in the partitioned system, so that we obtain a nondisjoint partitioning of the rows. Since rows in different partitions that are close to colinear produce a poorly conditioned iteration matrix for the block Cimmino method, row replication can get around this problem. With intelligent replication choices, we can reduce the number of iterations for convergence of the replicated block Cimmino method. The downside is a slight increase of the computational workload associated with each partition. In order to find a trade-off between a lower number of iterations and a higher cost per iteration, selecting the proper set of rows for replication is crucial. In this paper, we use graph-based techniques to find good candidates for replication. More Since the block Cimmino method can be interpreted as a nonoverlapping additive Schwartz method applied to the normal equations, the replication techniques correspond to introducing an overlap between the subdomains defined by the partitions. We show analytically in the case of a two-block partitioning how the replication improves the condition number of the block Cimmino iteration matrix. We then use challenging two-dimensional PDE problems to show that our algebraic approach targets physically meaningful phenomena on the interface between partitions. We demonstrate the efficiency of the proposed method in improving the performance of the block Cimmino solver, even with a small amount of replication, on problems from the SuiteSparse Matrix Collection. Finally, we compare our approach to a BiCGStab preconditioned with an additive Schwartz method and show that our replication technique can be used to define the subdomains and overlaps in the context of domain decomposition methods.
Quadratic programming based partitioning for Block Cimmino with correct value representation Zuhal Tas, F. Sukru Torun Turkish Journal of Electrical Engineering and Computer Sciences, Volume 31, Issue 3, May 2023, https://journals.tubitak.gov.tr/elektrik/vol31/iss3/8/ The block Cimmino method is successfully used for the parallel solution of large linear systems of equations due to its amenability to parallel processing. Since the convergence rate of block Cimmino depends on the orthogonality between the row blocks, advanced partitioning methods are used for faster convergence. In this work, we propose a new partitioning method that is superior to the state-of-the-art partitioning method, GRIP, in several ways. More Firstly, our proposed method exploits the Mongoose partitioning library which can outperform the state-of-the-art methods by combining the advantages of classical combinatoric methods and continuous quadratic programming formulations. Secondly, the proposed method works on the numerical values in a floating-point format directly without converting them to integer format as in GRIP. This brings an additional advantage of obtaining higher quality partitionings via better representation of numerical values. Furthermore, the preprocessing time is also improved since there is no overhead in converting numerical values to integer format. Finally, we extend the Mongoose library, which originally partitions graphs into only two parts, by using the recursive bisection paradigm to partition graphs into more than two parts. Extensive experiments conducted on both shared and distributed memory architectures demonstrate the effectiveness of the proposed method for solving different types of real-world problems.
Enhancing Block Cimmino for Sparse Linear Systems with Dense Columns via Schur Complement F. Sukru Torun, Murat Manguoglu, and Cevdet Aykanat SIAM Journal on Scientific Computing, Volume 45, Issue 2, April 2023, https://epubs.siam.org/doi/abs/10.1137/21M1453475 The block Cimmino is a parallel hybrid row-block projection iterative method successfully used for solving general sparse linear systems. However, the convergence of the method degrades when angles between subspaces spanned by the row-blocks are far from being orthogonal. The density of columns as well as the numerical values of their nonzeros are more likely to contribute to the nonorthogonality between row-blocks. We propose a novel scheme to handle such “dense” columns. The proposed scheme forms a reduced system by separating these columns and the respective rows from the original coefficient matrix and handling them via the Schur complement. More Then the angles between subspaces spanned by the row-blocks of the reduced system are expected to be closer to orthogonal, and the reduced system is solved efficiently by the block conjugate gradient (CG) accelerated block Cimmino in fewer iterations. We also propose a novel metric for selecting “dense” columns considering the numerical values. The proposed metric establishes an upper bound on the sum of inner products between row-blocks. Then we propose an efficient algorithm for computing the proposed metric for the columns. Extensive numerical experiments for a wide range of linear systems confirm the effectiveness of the proposed scheme by achieving fewer iterations and faster parallel solution time compared to the classical CG accelerated block Cimmino algorithm.
Partitioning and Reordering for Spike-Based Distributed-Memory Parallel Gauss-Seidel Tugba Torun, F. Sukru Torun, Murat Manguoglu, and Cevdet Aykanat SIAM Journal on Scientific Computing, Volume 44, Issue 2, April 2022, https://epubs.siam.org/doi/abs/10.1137/21M1411603 Gauss--Seidel (GS) is a widely used iterative method for solving sparse linear sys-tems of equations and also known to be effective as a smoother in algebraic multigrid methods.Parallelization of GS is a challenging task since solving the sparse lower triangular system in GSconstitutes a sequential bottleneck at each iteration. We propose a distributed-memory parallel GS(dmpGS) by implementing a parallel sparse triangular solver (stSpike) based on the Spike algorithm. More stSpike decouples the global triangular system into smaller systems that can be solved concurrentlyand requires the solution of a much smaller reduced sparse lower triangular system which constitutesa sequential bottleneck. In order to alleviate this bottleneck and to reduce the communication over-head of dmpGS, we propose a partitioning and reordering model consisting of two phases. The firstphase is a novel hypergraph partitioning model whose partitioning objective simultaneously encodesminimizing the reduced system size and the communication volume. The second phase is an in-blockrow reordering method for decreasing the nonzero count of the reduced system. Extensive experiments on a dataset consisting of 359 sparse linear systems verify the effectiveness of the proposedpartitioning and reordering model in terms of reducing the communication and the sequential com-putational overheads. Parallel experiments on 12 large systems using up to 320 cores demonstratethat the proposed model significantly improves the scalability of dmpGS.
Parallel solution of Lambert’s problem using modified Chebyshev-Picard iteration method Majd AJROUDI, F. Şükrü TORUN Niğde Ömer Halisdemir University Journal of Engineering Sciences, 11 (4), 871-878, 2022 dergipark.org.tr/en/pub/ngumuh/issue/73005/1069509 Lambert’s problem is one of the classical methods for solving the multiple revolution problem in orbit determination. With the increasing interest in space exploration programs and using satellite networks, it is important to provide an accurate and rapid method that will provide the network control center with information regarding the orbit of each satellite in the network and help the satellites improve routing decisions in onboard processing satellites. More Lambert’s problem is one of the methods that solve the problem iteratively and this iteration was originally done using Newton’s iteration method. In recent studies, it is recommended to use the Chebyshev-Picard iteration method to solve this problem. Since the aim here is to provide a method that solves the problem rapidly, the Chebyshev-Picard iteration method serves our objective since it is highly parallelizable. In this work, we have developed a parallel algorithm that solves Lambert’s problem in a parallel environment. We have conducted experiments to demonstrate the parallel scalability of the algorithm on both shared and distributed memory architectures. The experimental results show that the parallel algorithm achieves 8.26- and 3.94-times faster execution time on distributed memory and shared memory architectures, respectively.
Parallel K-means Clustering With Naïve Sharding for Unsupervised Image Segmentation via MPI Ahmet Esad TOP, F. Şükrü TORUN, Hilal KAYA Journal of Engineering Sciences and Design ,Volume 8, Issue 3, 791 - 798, 2020, dergipark.org.tr/en/pub/jesd/issue/56892/748209 In digital image processing, image segmentation is an essential step in which an image is partitioned into groups of pixels. K-means clustering algorithm, which is often considered as fast and efficient, is one of the most widely used clustering algorithms to segment an image. More However, as the problem size gets larger, the k-means starts to spend a significant amount of time to process. At this point, parallelization techniques should be applied to reduce the required time. Designing an efficient parallel and distributed model is not a trivial job since it should correspond to the parallel computer architecture and take communication and load balancing among processors into account. In this study, we propose a parallel and distributed k-means clustering algorithm with naive sharding centroid initialization for image segmentation. The proposed algorithm adopts the Message Passing Interface (MPI) standard to take advantage of the computational power of distributed computing nodes in a High Performance Computing Cluster. We demonstrate the parallel scalability of the proposed algorithm using up to 128 cores that achieves approximately 104 times faster clustering time.
Improving the scalability of the ABCD Solver with a combination of new load balancing and communication minimization techniques Iain Duff, Philippe Leleux, Daniel Ruiz, F. Sukru Torun PARCO 2019, Prague, Czech Republic, 10-13 September 2019 The hybrid scheme block row-projection method implemented in the ABCD Solver is designed for solving large sparse unsymmetric systems of equations on distributed memory parallel computers. The method implements a block Cimmino iterative scheme, accelerated with a stabilized block conjugate gradient algorithm. An augmented pseudodirect variant has also been developed to overcome convergence issues. More Both methods are included in the ABCD solver with a hybrid parallelization scheme. The parallel performance of the ABCD Solver is improved in the first non-beta release, version 1.0, which we present in this paper. Novel algorithms for the distribution of partitions to processes are introduced to minimize communication as well as to balance the workload. Furthermore, the master-slave approach on each subsystem is also improved in order to achieve higher scalability through run-time placement of processes. We illustrate the improved parallel scalability of the ABCD Solver on a distributed memory architecture by solving several problems from the SuiteSparse Matrix Collection. Conference Program
Parallel Minimum Norm Solution of Sparse Block Diagonal Column Overlapped Underdetermined Systems F. Sukru Torun, Murat Manguoglu, Cevdet Aykanat ACM Transactions on Mathematical Software (TOMS), Volume 43 Issue 4, No:31 (January 2017), 21 pages. Download http://dl.acm.org/citation.cfm?id=3004280 Underdetermined systems of equations in which the minimum norm solution needs to be computed arise in many applications, such as geophysics, signal processing, and biomedical engineering. In this article, we introduce a new parallel algorithm for obtaining the minimum 2-norm solution of an underdetermined system of equations. The proposed algorithm is based on the Balance scheme, which was originally developed for the parallel solution of banded linear systems. The proposed scheme assumes a generalized banded form where the coefficient matrix has column overlapped block structure in which the blocks could be dense or sparse. More In this article, we implement the more general sparse case. The blocks can be handled independently by any existing sequential or parallel QR factorization library. A smaller reduced system is formed and solved before obtaining the minimum norm solution of the original system in parallel. We experimentally compare and confirm the error bound of the proposed method against the QR factorization based techniques by using true single-precision arithmetic. We implement the proposed algorithm by using the message passing paradigm. We demonstrate numerical effectiveness as well as parallel scalability of the proposed algorithm on both shared and distributed memory architectures for solving various types of problems.
Solving Sparse Underdetermined Linear Least Squares Problems on Parallel Computing Platforms F. Sukru Torun, Murat Manguoglu, Cevdet Aykanat 17th SIAM Conference on Parallel Processing for Scientific Computing, April 12-15 2016, Paris-France. link Chair of the Matrix Factorization Session Computing the minimum 2-norm solution is essential for many areas such as geophysics, signal processing and biomedical engineering. In this work, we present a new parallel algorithm for solving sparse underdetermined linear systems where the coefficient matrix is block diagonal with overlapping columns. The proposed parallel approach handles the diagonal blocks independently and the reduced system involving the shared unknowns. Experimental results show the effectiveness of the proposed scheme on various parallel computing platforms.
A Novel Partitioning Method for Accelerating the Block Cimmino Algorithm F. Sukru Torun, Murat Manguoglu, Cevdet Aykanat SIAM Journal on Scientific Computing (SISC), 40(6), C827–C850. (December 2018) We propose a novel block-row partitioning method in order to improve the convergence rate of the block Cimmino algorithm for solving general sparse linear systems of equations. The convergence rate of the block Cimmino algorithm depends on the orthogonality among the block rows obtained by the partitioning method. The proposed method takes numerical orthogonality among block rows into account by proposing a row inner-product graph model of the coefficient matrix. More In the graph partitioning formulation defined on this graph model, the partitioning objective of minimizing the cutsize directly corresponds to minimizing the sum of inter-block inner products between block rows thus leading to an improvement in the eigenvalue spectrum of the iteration matrix. This in turn leads to a significant reduction in the number of iterations required for convergence. Extensive experiments conducted on a large set of matrices confirm the validity of the proposed method against a state-of-the-art method. Paper
Minimizing Communication Through Computational Redundancy in Parallel Iterative Solvers F. Sukru Torun Thesis (Master's) Bilkent University, 2011. Thesis Sparse matrix vector multiplication (SpMxV) of the form y = Ax is a kernel operation in iterative linear solvers used in scientific applications. In these solvers, the SpMxV operation is performed repeatedly with the same sparse matrix through iterations until convergence. Depending on the matrix and its decomposition, parallel SpMxV operation necessitates communication among processors in the parallel environment. The communication can be reduced by intelligent decomposition. However, we can further decrease the communication through data replication and redundant computation. The communication occurs due to the transfer of x-vector entries in row-parallel SpMxV computation. The input vector x of the next iteration is computed from the output vector of the current iteration through linear vector operations. More Hence, a processor may compute a y-vector entry redundantly, which leads to a x-vector entry in the following iteration, instead of receiving that x-vector entry from another processor. Thus, redundant computation of that y-vector entry may lead to reduction in communication. In this thesis, we devise a directed-graph-based model that correctly captures the computation and communication pattern for above-mentioned iterative solvers. Moreover, we formulate the communication minimization by utilizing redundant computation of y-vector entries as a combinatorial problem on this directed graph model. We propose two heuristics to solve this combinatorial problem. Experimental results indicate that the communication reducing strategy by redundantly computing is promising.

Software:

Numerically aware partitioning for Block Cimmino

Reference: F. Sukru Torun, Murat Manguoglu, and Cevdet Aykanat. "A NOVEL PARTITIONING METHOD FOR ACCELERATING THE BLOCK CIMMINO ALGORITHM." SIAM Journal on Scientific Computing (2018 - Accepted for Publication) Paper

Source code for numerically aware row-block partitioning for Block Cimmino Iterative method in ABCD Solver is source code is distributed under the GNU Lesser General Public License. This is a simplified C++ implementation of the work that is published in SIAM SISC 2018.

>> GRIP (Row Inner-Product Graph partitioner for ABCD) version 1.1 <<

ParBaMiN: Parallel Balance Scheme Algorithm for Minimum Norm Solution of Underdetermined Systems

F. Sukru Torun
Reference: F. Sukru Torun, Murat Manguoglu, and Cevdet Aykanat. 2017. Parallel Minimum Norm Solution of Sparse Block Diagonal Column Overlapped Underdetermined Systems. ACM Trans. Math. Softw. 43, 4, Article 31. Paper

A distributed parallel algorithm for solving minimum 2-norm solution of an underdetermined system on distributed memory high performance computing (HPC) platforms. This software is implemented in C++ programming languages and it uses MPI library for communication among processors. Each processor concurrently and independently applies QR factorization on the sub-matrices. Sparse QR factorization of SuiteSparse (www.suitesparse.com) package is used in local sparse QR factorization operations. The performance of the proposed method may increase by adding the parallelism mechanisms of multi-threaded SuiteSparseQR and multi-threaded BLAS for the local QR factorizations. This algorithm can be considered as an scalable extension of any multithreaded general sparse QR factorization algorithm to distributed memory architectures for computing minimum 2-norm solution of underdetermined linear least squares problems.
ParBaMiN

ParBaMiN_matlab: A Sequential MATLAB program for Parallel Balance Scheme Algorithm for Minimum Norm Solution of Underdetermined Systems

A simple MATLAB program for the solving Minimum Norm Solution of Sparse Block Diagonal Column Overlapped Underdetermined Systems. It is a sequential version of Parallel Minimum Norm Solution of Sparse Block Diagonal Column Overlapped Underdetermined Systems. Sparse QR factorization with Householder reflection of spqr routine in SuiteSparseQR package is used in local sparse QR factorization operations.
ParBaMiN_matlab

Projects:

EoCoE: Energy oriented Centre of Excellence

Within the Horizon2020 framework of the European Union. Budget: 5.5 Million €
Post-Doctoral Research Fellow (2017)

The project coordinates a pan-European network covering 21 teams in 8 countries with a total budget of 5.5 M€. EoCoE assists the energy transition via targeted support to four renewable energy pillars: Meteo, Materials, Water and Fusion. These four pillars are anchored within a strong transversal multidisciplinary basis providing high-end expertise in applied mathematics and HPC.

TUSAS Lift Up 2021: Yüksek Doğrulukla Yabancı Madde Tespiti (YAMATE) - Detection of Foreign Objects With High Accuracy

LIFTUP Sanayi Odaklı Lisans Bitirme Projeleri Programı 2020-2021 Projeleri Bildiri Kitabı
Academic Advisor

The aim of our project, "Detection of foreign objects with high accuracy", is to prevent damage to aircraft caused by foreign objects in the flight area. Flight areas must be clean and safe to provide a suitable environment for the landing and take-off of aircraft produced at high costs. Foreign objects that cannot be detected in the flight area can damage aircraft and cause accidents. For this purpose, the first step of our project, which includes various stages, is the development of real-time image processing software that works with high accuracy and is trained with customized data sets in this context. More
Report

Teaching:

Undergraduate Courses:

CENG206 Concepts of Programming Languages
CENG305 Operating Systems
CENG342 Parallel Programming
CENG327 Introduction to Scientific Computing

Graduate Courses:

CENG502 Parallel Algorithms
CENG508 Distributed Systems
CENG595 High Performance Scientific Computing
CENG596 JULIA High Performance

Schedule:

Asst. Prof. at Ankara Yildirim Beyazit University in Computer Engineering Department. email: fstorun at aybu.edu.tr, fsukrutorun@gmail.com website: avesis.aybu.edu.tr/ftorun

A distributed memory parallel randomized Kaczmarz for sparse system of equations

Row Replicated Block Cimmino

Quadratic programming based partitioning for Block Cimmino with correct value representation

Enhancing Block Cimmino for Sparse Linear Systems with Dense Columns via Schur Complement

Partitioning and Reordering for Spike-Based Distributed-Memory Parallel Gauss-Seidel

Parallel solution of Lambert’s problem using modified Chebyshev-Picard iteration method

Parallel K-means Clustering With Naïve Sharding for Unsupervised Image Segmentation via MPI

Improving the scalability of the ABCD Solver with a combination of new load balancing and communication minimization techniques

Parallel Minimum Norm Solution of Sparse Block Diagonal Column Overlapped Underdetermined Systems

Solving Sparse Underdetermined Linear Least Squares Problems on Parallel Computing Platforms

A Novel Partitioning Method for Accelerating the Block Cimmino Algorithm

Minimizing Communication Through Computational Redundancy in Parallel Iterative Solvers