Publications
2025
- ICLR. Beyond Mere Token Analysis: A Hypergraph Metric Space Framework for Defending Against Socially Engineered LLM Attacks. Manohar Kaul, Aditya Saibewar, and Sadbhavana Babar. In The Thirteenth International Conference on Learning Representations, 2025.
Recent jailbreak attempts on Large Language Models (LLMs) have shifted from algorithm-focused to human-like social engineering attacks, with persuasion-based techniques emerging as a particularly effective subset. These attacks evolve rapidly, demonstrate high creativity, and boast superior attack success rates. To combat such threats, we propose a promising approach to enhancing LLM safety by leveraging the underlying geometry of input prompt token embeddings using hypergraphs. This approach allows us to model the differences in information flow between benign and malicious LLM prompts. In our approach, each LLM prompt is represented as a metric hypergraph, forming a compact metric space. We then construct a higher-order metric space over these compact metric hypergraphs using the Gromov-Hausdorff distance as a generalized metric. Within this space of metric hypergraph spaces, our safety filter learns to classify between harmful and benign prompts. Our study presents theoretical guarantees on the classifier’s generalization error for novel and unseen LLM input prompts. Extensive empirical evaluations demonstrate that our method significantly outperforms both existing state-of-the-art generic defense mechanisms and naive baselines. Notably, our approach also achieves comparable performance to specialized defenses against algorithm-focused attacks.
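For illustration only, here is a minimal sketch of one ingredient mentioned above: treating a prompt's token embeddings as a finite metric space and comparing two such spaces with a standard, easy lower bound on the Gromov-Hausdorff distance. The embeddings, dimensions, and the particular bound are stand-ins; this is not the paper's classifier or its actual GH computation.

```python
# Illustrative sketch only (not the paper's method): comparing two prompts
# represented as finite metric spaces via a crude Gromov-Hausdorff lower bound.
# The "token embeddings" below are random stand-ins.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def metric_space(token_embeddings: np.ndarray) -> np.ndarray:
    """Pairwise distance matrix over token embeddings (the prompt's metric space)."""
    return squareform(pdist(token_embeddings, metric="euclidean"))

def gh_lower_bound(dx: np.ndarray, dy: np.ndarray) -> float:
    """A standard, easy lower bound: d_GH(X, Y) >= |diam(X) - diam(Y)| / 2."""
    return abs(dx.max() - dy.max()) / 2.0

rng = np.random.default_rng(0)
prompt_a = rng.normal(size=(12, 768))   # hypothetical token embeddings, prompt A
prompt_b = rng.normal(size=(9, 768))    # hypothetical token embeddings, prompt B
print(gh_lower_bound(metric_space(prompt_a), metric_space(prompt_b)))
```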
- ICLR. Efficient Jailbreak Attack Sequences on Large Language Models via Multi-Armed Bandit-Based Context Switching. Aditya Ramesh, Shivam Bhardwaj, Aditya Saibewar, and Manohar Kaul. In The Thirteenth International Conference on Learning Representations, 2025.
Content warning: This paper contains examples of harmful language and content. Recent advances in large language models (LLMs) have made them increasingly vulnerable to jailbreaking attempts, where malicious users manipulate models into generating harmful content. While existing approaches rely on either single-step attacks that trigger immediate safety responses or multi-step methods that inefficiently iterate prompts using other LLMs, we introduce "Sequence of Context" (SoC) attacks that systematically alter conversational context through strategically crafted context-switching queries (CSQs). We formulate this as a multi-armed bandit (MAB) optimization problem, automatically learning optimal sequences of CSQs that gradually weaken the model's safety boundaries. Our theoretical analysis provides tight bounds on both the expected sequence length until successful jailbreak and the convergence of cumulative rewards. Empirically, our method achieves a 95% attack success rate, surpassing PAIR by 63.15%, AutoDAN by 60%, and ReNeLLM by 50%. We evaluate our attack across multiple open-source LLMs including Llama and Mistral variants. Our findings highlight critical vulnerabilities in current LLM safeguards and emphasize the need for defenses that consider sequential attack patterns rather than relying solely on static prompt filtering or iterative refinement.
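As a rough illustration of the bandit framing only (not the paper's attack), the sketch below runs plain UCB1 over a set of hypothetical context-switching query categories; the arm names and the reward signal are placeholders.

```python
# Minimal UCB1 sketch, for illustration only: candidate context-switching query
# categories treated as bandit arms. Arms and reward are hypothetical stand-ins.
import math, random

arms = ["roleplay", "hypothetical", "translation", "persona"]  # hypothetical CSQ types
counts = {a: 0 for a in arms}
values = {a: 0.0 for a in arms}

def reward(arm: str) -> float:
    # Placeholder: in practice this would score how much the model's refusal weakens.
    return random.random()

for t in range(1, 201):
    def ucb(a):
        # Unplayed arms are tried first; otherwise use the UCB1 index.
        if counts[a] == 0:
            return float("inf")
        return values[a] + math.sqrt(2 * math.log(t) / counts[a])
    arm = max(arms, key=ucb)
    r = reward(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]  # running-mean update

print(sorted(values.items(), key=lambda kv: -kv[1]))
```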
- NAACL. SafeQuant: LLM Safety Analysis via Quantized Gradient Inspection. Sindhu Padakandla, Sadbhavana Babar, Rathod Darshan D, and Manohar Kaul. In The 2025 Annual Conference of the Nations of the Americas Chapter of the ACL, 2025.
Contemporary jailbreak attacks on Large Language Models (LLMs) employ sophisticated techniques with obfuscated content to bypass safety guardrails. Existing defenses either use computationally intensive LLM verification or require adversarial fine-tuning, leaving models vulnerable to advanced attacks. We introduce SafeQuant, a novel defense framework that leverages quantized gradient patterns to identify harmful prompts efficiently. Our key insight is that when generating identical responses like "Sure", LLMs exhibit distinctly different internal gradient patterns for safe versus harmful prompts, reflecting conflicts with safety training. By capturing these patterns through selective gradient masking and quantization, SafeQuant significantly outperforms existing defenses across multiple benchmarks while maintaining model utility. The method demonstrates particular effectiveness against sophisticated attacks like WordGame prompts and persuasive adversarial attacks, achieving an F1-score of 0.80 on the WordGame dataset and outperforming state-of-the-art (SoTA) methods like GradSafe by an absolute margin of 57%.
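The sketch below illustrates the general flavor of a quantized gradient signature on a toy model; the model, the fixed target token, the top-k masking rule, and the sign quantization are illustrative assumptions, not the SafeQuant implementation.

```python
# Illustrative sketch (not SafeQuant itself): a coarse, quantized gradient
# "signature" of a prompt, obtained by back-propagating the loss of a fixed
# target token through a toy model. Model and thresholds are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 8)                      # stand-in for an LLM
loss_fn = nn.CrossEntropyLoss()

def gradient_signature(prompt_vec: torch.Tensor, target_token: int, keep: int = 32) -> torch.Tensor:
    model.zero_grad()
    logits = model(prompt_vec)
    loss_fn(logits.unsqueeze(0), torch.tensor([target_token])).backward()
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    mask = torch.zeros_like(g)
    mask[g.abs().topk(keep).indices] = 1.0     # selective masking: keep largest entries
    return torch.sign(g) * mask                # quantize surviving entries to {-1, 0, +1}

sig = gradient_signature(torch.randn(16), target_token=3)
print(int(sig.nonzero().numel()), "active quantized components")
```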
- WACV. Learning Semantic Part-Based Graph Structure for 3D Point Cloud Domain Generalization. G Ujwal Sai, Arkadipta De, Vartika Sengar, Anuj Rathore, Daksh Thapar, and 1 more author. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025.
- WACV. Towards a Training Free Approach for 3D Scene Editing. Vivek Madhavaram, Shivangana Rawat, Chaitanya Devaguptapu, Charu Sharma, and Manohar Kaul. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025.
- WACV. Elemental Composite Prototypical Network: Few-Shot Object Detection on Outdoor 3D Point Cloud Scenes. Arkadipta De, Vartika Sengar, Daksh Thapar, Mahesh Chandran, and Manohar Kaul. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025.
2024
- TMLR. Teacher-Guided Graph Contrastive Learning. Jay Nandy, Arnab Kumar Mondal, Manohar Kaul, and Prathosh AP. Transactions on Machine Learning Research, 2024.
State-of-the-art self-supervised representation learning methods for graphs are typically based on contrastive learning (CL) principles. These CL objective functions can be posed as a supervised discriminative task using 'hard' labels that consider any minor augmented pairs of graphs as 'equally positive'. However, such a notion of 'equal' pairs is incorrect for graphs, as even a smaller 'discrete' perturbation may lead to large semantic changes that should be carefully encapsulated within the learned representations. This paper proposes a novel CL framework for GNNs, called Teacher-guided Graph Contrastive Learning (TGCL), that incorporates 'soft' pseudo-labels to facilitate a more regularized discrimination. In particular, we propose a teacher-student framework where the student learns the representation by distilling the teacher's perception. Our TGCL framework can be adapted to existing CL methods to enhance their performance. Our empirical findings validate these claims on both inductive and transductive settings across diverse downstream tasks, including molecular graphs and social networks. Our experiments on benchmark datasets demonstrate that our framework consistently improves the average AUROC scores for molecules' property prediction and social network link prediction. Our code is available at: https://github.com/jayjaynandy/TGCL.
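A minimal sketch of the soft-target idea (not the released TGCL code): the student's similarity distribution over a batch of embeddings is pulled toward the teacher's via a KL term, instead of hard positive/negative labels. Shapes and the temperature are arbitrary choices here.

```python
# Illustrative soft-label contrastive loss: distill the teacher's pairwise
# similarity distribution into the student. Not the paper's exact objective.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(student_z: torch.Tensor, teacher_z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    s = F.normalize(student_z, dim=1) @ F.normalize(student_z, dim=1).T / tau
    t = F.normalize(teacher_z, dim=1) @ F.normalize(teacher_z, dim=1).T / tau
    mask = ~torch.eye(s.size(0), dtype=torch.bool)            # drop self-similarity
    p_teacher = F.softmax(t[mask].view(s.size(0), -1), dim=1)  # soft pseudo-labels
    log_p_student = F.log_softmax(s[mask].view(s.size(0), -1), dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

z_student = torch.randn(8, 64)   # hypothetical graph-level embeddings from the student GNN
z_teacher = torch.randn(8, 64)   # corresponding embeddings from a frozen teacher
print(soft_contrastive_loss(z_student, z_teacher).item())
```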
- WACV. Synergizing Contrastive Learning and Optimal Transport for 3D Point Cloud Domain Adaptation. Siddharth Katageri, Arkadipta De, Chaitanya Devaguptapu, V. S. S. V. Prasad, Charu Sharma, and 1 more author. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2024, Waikoloa, HI, USA, January 3-8, 2024.
Recently, the fundamental problem of unsupervised domain adaptation (UDA) on 3D point clouds has been motivated by a wide variety of applications in robotics, virtual reality, and scene understanding, to name a few. The point cloud data acquisition procedures manifest themselves as significant domain discrepancies and geometric variations among both similar and dissimilar classes. The standard domain adaptation methods developed for images do not directly translate to point cloud data because of their complex geometric nature. To address this challenge, we leverage the idea of multimodality and alignment between distributions. We propose a new UDA architecture for point cloud classification that benefits from multimodal contrastive learning to get better class separation in both domains individually. Further, the use of optimal transport (OT) aims at learning source and target data distributions jointly to reduce the cross-domain shift and provide a better alignment. We conduct a comprehensive empirical study on PointDA-10 and GraspNetPC-10 and show that our method achieves state-of-the-art performance on GraspNetPC-10 (with approx 4-12% margin) and best average performance on PointDA-10. Our ablation studies and decision boundary analysis also validate the significance of our contrastive learning module and OT alignment.
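For a sense of the OT component only (not the paper's exact module), the following sketch runs entropically regularized Sinkhorn iterations between hypothetical source and target feature sets to obtain a coupling of the two distributions.

```python
# Illustrative Sinkhorn sketch: entropically regularized optimal transport
# between source and target features, the kind of coupling used to reduce
# cross-domain shift. Features and hyperparameters are made up.
import numpy as np

def sinkhorn(cost: np.ndarray, eps: float = 0.05, iters: int = 200) -> np.ndarray:
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-cost / eps)
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return np.diag(u) @ K @ np.diag(v)                 # transport plan

rng = np.random.default_rng(0)
src = rng.normal(size=(30, 128))                       # hypothetical source features
tgt = rng.normal(loc=0.5, size=(40, 128))              # hypothetical shifted target features
cost = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(-1)
plan = sinkhorn(cost / cost.max())
print(plan.shape, plan.sum())                          # sums to ~1.0: a valid coupling
```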
- ACL. HOLMES: Hyper-Relational Knowledge Graphs for Multi-hop Question Answering using LLMs. Pranoy Panda, Ankush Agarwal, Chaitanya Devaguptapu, Manohar Kaul, and Prathosh A P. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024.
Given unstructured text, Large Language Models (LLMs) are adept at answering simple (single-hop) questions. However, as the complexity of the questions increases, the performance of LLMs degrades. We believe this is due to the overhead associated with understanding the complex question followed by filtering and aggregating unstructured information in the raw text. Recent methods try to reduce this burden by integrating structured knowledge triples into the raw text, aiming to provide a structured overview that simplifies information processing. However, this simplistic approach is query-agnostic and the extracted facts are ambiguous as they lack context. To address these drawbacks and to enable LLMs to answer complex (multi-hop) questions with ease, we propose to use a knowledge graph (KG) that is context-aware and is distilled to contain query-relevant information. The use of our compressed distilled KG as input to the LLM results in our method utilizing up to 67% fewer tokens to represent the query relevant information present in the supporting documents, compared to the state-of-the-art (SoTA) method. Our experiments show consistent improvements over the SoTA across several metrics (EM, F1, BERTScore, and Human Eval) on two popular benchmark datasets (HotpotQA and MuSiQue).
2022
- TAG-ML. Preface. Alexander Cloninger, Timothy Doster, Tegan Emerson, Manohar Kaul, Ira Ktena, and 5 more authors. In Topological, Algebraic and Geometric Learning Workshops 2022, Virtual, 2022.
- IJCNN. BERTops: Studying BERT Representations under a Topological Lens. Jatin Chauhan and Manohar Kaul. In International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, July 18-23, 2022.
Proposing scoring functions to effectively understand, analyze and learn various properties of high dimensional hidden representations of large-scale transformer models like BERT can be a challenging task. In this work, we explore a new direction by studying the topological features of BERT hidden representations using persistent homology (PH). We propose a novel scoring function named "persistence scoring function (PSF)" which: (i) accurately captures the homology of the high-dimensional hidden representations and correlates well with the test set accuracy of a wide range of datasets and outperforms existing scoring metrics, (ii) captures interesting post fine-tuning "per-class" level properties from both qualitative and quantitative viewpoints, (iii) is more stable to perturbations as compared to the baseline functions, which makes it a very robust proxy, and (iv) finally, also serves as a predictor of the attack success rates for a wide category of black-box and white-box adversarial attack methods. Our extensive correlation experiments demonstrate the practical utility of PSF on various NLP tasks relevant to BERT.
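As a small, hedged illustration of the kind of topological summary involved (not the PSF itself), the sketch below computes persistence diagrams of a point cloud of hidden representations with the ripser package (assumed to be installed) and sums the finite bar lengths.

```python
# Illustrative only: total persistence of a point cloud of hidden vectors via
# persistent homology. The `ripser` dependency and the summary are assumptions;
# this is not the proposed persistence scoring function (PSF).
import numpy as np
from ripser import ripser

def total_persistence(hidden_states: np.ndarray, maxdim: int = 1) -> float:
    diagrams = ripser(hidden_states, maxdim=maxdim)["dgms"]
    total = 0.0
    for dgm in diagrams:
        finite = dgm[np.isfinite(dgm[:, 1])]           # drop the infinite H0 bar
        total += float((finite[:, 1] - finite[:, 0]).sum())
    return total

rng = np.random.default_rng(0)
reps = rng.normal(size=(100, 32))                      # stand-in for BERT hidden vectors
print(total_persistence(reps))
```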
2021
- arXiv. Target Model Agnostic Adversarial Attacks with Query Budgets on Language Understanding Models. Jatin Chauhan, Karan Bhukar, and Manohar Kaul. arXiv.org, 2021.
Despite significant improvements in natural language understanding models with the advent of models like BERT and XLNet, these neural-network based classifiers are vulnerable to blackbox adversarial attacks, where the attacker is only allowed to query the target model outputs. We add two more realistic restrictions on the attack methods, namely limiting the number of queries allowed (query budget) and crafting attacks that easily transfer across different pre-trained models (transferability), which render previous attack models impractical and ineffective. Here, we propose a target model agnostic adversarial attack method with a high degree of attack transferability across the attacked models. Our empirical studies show that in comparison to baseline methods, our method generates highly transferable adversarial sentences under the restriction of limited query budgets.
- arXiv. Understanding Higher-order Structures in Evolving Graphs: A Simplicial Complex based Kernel Estimation Approach. Manohar Kaul and Masaaki Imaizumi. arXiv.org, 2021.
Dynamic graphs are rife with higher-order interactions, such as co-authorship relationships and protein-protein interactions in biological networks, that naturally arise between more than two nodes at once. In spite of the ubiquitous presence of such higher-order interactions, limited attention has been paid to the higher-order counterpart of the popular pairwise link prediction problem. Existing higher-order structure prediction methods are mostly based on heuristic feature extraction procedures, which work well in practice but lack theoretical guarantees. Such heuristics are primarily focused on predicting links in a static snapshot of the graph. Moreover, these heuristic-based methods fail to effectively utilize and benefit from the knowledge of latent substructures already present within the higher-order structures. In this paper, we overcome these obstacles by capturing higher-order interactions succinctly as simplices, model their neighborhood by face-vectors, and develop a nonparametric kernel estimator for simplices that views the evolving graph from the perspective of a time process (i.e., a sequence of graph snapshots). Our method substantially outperforms several baseline higher-order prediction methods. As a theoretical achievement, we prove the consistency and asymptotic normality in terms of the Wasserstein distance of our estimator using Stein's method.
- WWW. RECON: Relation Extraction using Knowledge Graph Context in a Graph Neural Network. Anson Bastos, Abhishek Nadgeri, Kuldeep Singh, Isaiah Onando Mulang, Saeedeh Shekarpour, and 2 more authors. The Web Conference, 2021.
In this paper, we present a novel method named RECON, that automatically identifies relations in a sentence (sentential relation extraction) and aligns to a knowledge graph (KG). RECON uses a graph neural network to learn representations of both the sentence as well as facts stored in a KG, improving the overall extraction quality. These facts, including entity attributes (label, alias, description, instance-of) and factual triples, have not been collectively used in the state of the art methods. We evaluate the effect of various forms of representing the KG context on the performance of RECON. The empirical evaluation on two standard relation extraction datasets shows that RECON significantly outperforms all state of the art methods on NYT Freebase and Wikidata datasets.
2020
- arXiv. A Weighted Quiver Kernel using Functor Homology. Manohar Kaul and Dai Tamaki. arXiv.org, 2020.
In this paper, we propose a new homological method to study weighted directed networks. Our model of such networks is a directed graph \(Q\) equipped with a weight function \(w\) on the set \(Q_1\) of arrows in \(Q\). We require that the range \(W\) of our weight function is equipped with an addition or a multiplication, i.e., \(W\) is a monoid in the mathematical terminology. When \(W\) is equipped with a representation on a vector space \(M\), the standard method of homological algebra allows us to define the homology groups \(H_*(Q,w;M)\). It is known that when \(Q\) has no oriented cycles, \(H_n(Q,w;M)=0\) for \(n \geq 2\) and \(H_1(Q,w;M)\) can be easily computed. This fact allows us to define a new graph kernel for weighted directed graphs. We made two sample computations with real data and found that our method is practically applicable.
- NeurIPS. Self-Supervised Few-Shot Learning on Point Clouds. Charu Sharma and Manohar Kaul. Neural Information Processing Systems, 2020.
The increased availability of massive point clouds coupled with their utility in a wide variety of applications such as robotics, shape synthesis, and self-driving cars has attracted increased attention from both industry and academia. Recently, deep neural networks operating on labeled point clouds have shown promising results on supervised learning tasks like classification and segmentation. However, supervised learning leads to the cumbersome task of annotating the point clouds. To combat this problem, we propose two novel self-supervised pre-training tasks that encode a hierarchical partitioning of the point clouds using a cover-tree, where point cloud subsets lie within balls of varying radii at each level of the cover-tree. Furthermore, our self-supervised learning network is restricted to pre-train on the support set (comprising scarce training examples) used to train the downstream network in a few-shot learning (FSL) setting. Finally, the fully-trained self-supervised network’s point embeddings are used to initialize the downstream task’s network. We present a comprehensive empirical evaluation of our method on both downstream classification and segmentation tasks and show that supervised methods pre-trained with our self-supervised learning method significantly improve the accuracy of state-of-the-art methods. Additionally, our method also outperforms previous unsupervised methods in downstream classification tasks.
- ECCV. Simplicial Complex based Point Correspondence between Images warped onto Manifolds. Charu Sharma and Manohar Kaul. European Conference on Computer Vision, 2020.
Recent increase in the availability of warped images projected onto a manifold (e.g., omnidirectional spherical images), coupled with the success of higher-order assignment methods, has sparked an interest in the search for improved higher-order matching algorithms on warped images due to projection. Although currently, several existing methods 'flatten' such 3D images to use planar graph/hypergraph matching methods, they still suffer from severe distortions and other undesired artifacts, which result in inaccurate matching. Alternatively, current planar methods cannot be trivially extended to effectively match points on images warped onto manifolds. Hence, matching on these warped images persists as a formidable challenge. In this paper, we pose the assignment problem as finding a bijective map between two graph induced simplicial complexes, which are higher-order analogues of graphs. We propose a constrained quadratic assignment problem (QAP) that matches each p-skeleton of the simplicial complexes, iterating from the highest to the lowest dimension. The accuracy and robustness of our approach are illustrated on both synthetic and real-world spherical/warped (projected) images with known ground-truth correspondences. We significantly outperform existing state-of-the-art spherical matching methods on a diverse set of datasets.
- IJCNN. Learning Representations using Spectral-Biased Random Walks on Graphs. Charu Sharma, Jatin Chauhan, and Manohar Kaul. IEEE International Joint Conference on Neural Networks, 2020.
Several state-of-the-art neural graph embedding methods are based on short random walks (stochastic processes) because of their ease of computation, simplicity in capturing complex local graph properties, scalability, and interpretability. In this work, we are interested in studying how much a probabilistic bias in this stochastic process affects the quality of the nodes picked by the process. In particular, our biased walk, with a certain probability, favors movement towards nodes whose neighborhoods bear a structural resemblance to the current node’s neighborhood. We succinctly capture this neighborhood as a probability measure based on the spectrum of the node’s neighborhood subgraph represented as a normalized Laplacian matrix. We propose the use of a paragraph vector model with a novel Wasserstein regularization term. We empirically evaluate our approach against several state-of-the-art node embedding techniques on a wide variety of real-world datasets and demonstrate that our proposed method significantly improves upon existing methods on both link prediction and node classification tasks.
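A minimal sketch of the spectral ingredient described above (not the full biased-walk or Wasserstein-regularized model): summarize each node's neighborhood subgraph by the spectrum of its normalized Laplacian and compare two nodes with a 1-D Wasserstein distance; networkx and scipy are assumed dependencies.

```python
# Illustrative only: neighborhood spectra as structural signatures of nodes.
import networkx as nx
import numpy as np
from scipy.stats import wasserstein_distance

def neighborhood_spectrum(G: nx.Graph, node) -> np.ndarray:
    # Induced subgraph on the node and its neighbors, then eigenvalues of its
    # normalized Laplacian (a symmetric matrix, so eigvalsh applies).
    sub = G.subgraph(list(G.neighbors(node)) + [node])
    L = nx.normalized_laplacian_matrix(sub).toarray()
    return np.sort(np.linalg.eigvalsh(L))

G = nx.karate_club_graph()
spec_u, spec_v = neighborhood_spectrum(G, 0), neighborhood_spectrum(G, 33)
print(wasserstein_distance(spec_u, spec_v))   # structural dissimilarity of the two neighborhoods
```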
- ICLR. Few-shot Learning on Graphs via Super-classes based on Graph Spectral Measures. Jatin Chauhan, Deepak Nathani, and Manohar Kaul. International Conference on Learning Representations, 2020.
We propose to study the problem of few-shot graph classification in graph neural networks (GNNs) to recognize unseen classes, given limited labeled graph examples. Despite several interesting GNN variants being proposed recently for node and graph classification tasks, when faced with scarce labeled examples in the few-shot setting, these GNNs exhibit significant loss in classification performance. Here, we present an approach where a probability measure is assigned to each graph based on the spectrum of the graph’s normalized Laplacian. This enables us to accordingly cluster the graph base-labels associated with each graph into super-classes, where the L^p Wasserstein distance serves as our underlying distance metric. Subsequently, a super-graph constructed based on the super-classes is then fed to our proposed GNN framework which exploits the latent inter-class relationships made explicit by the super-graph to achieve better class label separation among the graphs. We conduct exhaustive empirical evaluations of our proposed method and show that it outperforms both the adaptation of state-of-the-art graph classification methods to the few-shot scenario and our naive baseline GNNs. Additionally, we also extend our method to semi-supervised and active learning scenarios and study its behavior.
2019
- ACL. Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs (long paper). Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Association for Computational Linguistics, 2019.
The recent proliferation of knowledge graphs (KGs) coupled with incomplete or partial information, in the form of missing relations (links) between entities, has fueled a lot of research on knowledge base completion (also known as relation prediction). Several recent works suggest that convolutional neural network (CNN) based models generate richer and more expressive feature embeddings and hence also perform well on relation prediction. However, we observe that these KG embeddings treat triples independently and thus fail to cover the complex and hidden information that is inherently implicit in the local neighborhood surrounding a triple. To this effect, our paper proposes a novel attention-based feature embedding that captures both entity and relation features in any given entity’s neighborhood. Additionally, we also encapsulate relation clusters and multi-hop relations in our model. Our empirical study offers insights into the efficacy of our attention-based model and we show marked performance gains in comparison to state-of-the-art methods on all datasets.
2018
- ICML. Solving Partial Assignment Problems using Random Clique Complexes. Charu Sharma, Deepak Nathani, and Manohar Kaul. International Conference on Machine Learning, 2018.
We present an alternate formulation of the partial assignment problem as matching random clique complexes, that are higher-order analogues of random graphs, designed to provide a set of invariants that better detect higher-order structure. The proposed method creates random clique adjacency matrices for each k-skeleton of the random clique complexes and matches them, taking into account each point as the affine combination of its geometric neighborhood. We justify our solution theoretically, by analyzing the algorithm’s runtime and storage complexity along with the asymptotic behavior of the quadratic assignment problem (QAP) that is associated with the underlying random adjacency matrices. Accuracy experiments on both synthetic and real-world datasets containing severe occlusions and distortions demonstrate the robustness of our approach. We outperform diverse matching algorithms by a significant margin.
2017
- TKDE. On Fault Tolerance for Distributed Iterative Dataflow Processing. Chen Xu, Markus Holzemer, Manohar Kaul, Juan Soto, and Volker Markl. Transactions on Knowledge and Data Engineering, 2017.
Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typically, these analytics are a part of a comprehensive workflow, which includes data preparation, model building, and model evaluation. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the entire pipeline. Here, graph and machine learning analytics are known to incur a long runtime since they require multiple passes over the data until convergence is reached. Thus, fault tolerance and fast recovery from any intermittent failure are critical for efficient analysis. In this paper, we propose novel fault-tolerant mechanisms for graph and machine learning analytics that run on distributed dataflow systems. We seek to reduce checkpointing costs and shorten failure recovery times. For graph processing, rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner that does not break pipelined tasks. In contrast to the conventional approach for unblocking checkpointing (e.g., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating checkpoint creation during iterative graph processing. Moreover, we are able to rapidly rebound, via confined recovery, by exploiting the fact that log files exist locally on healthy nodes and managing to avoid a complete recomputation from scratch. In addition, we propose replica recovery for machine learning algorithms, whereby we employ a broadcast variable that enables us to quickly recover without having to introduce any checkpoints. In order to evaluate our fault tolerance strategies, we conduct both a theoretical study and empirical experimental analyses using Apache Flink and discover that they outperform blocking checkpointing and complete recovery.
- CIDR. Elementary, dear Watson! Manohar Kaul. Conference on Innovative Data Systems Research, 2017.
In "The Sign of Four", Sherlock Holmes makes some very startling inferences about Watson's whereabouts by informing Watson that he was at the Wigmore Street Post Office earlier that morning. On further inquiry by Watson, Holmes explains that he combined (joined) the observation that "Watson's shoes had a reddish mud on their instep" with the fact that "just opposite the Wigmore Post Office the pavement was freshly dug up to expose a similar reddish mud and it was so positioned that it would be challenging to enter the office without treading into the mud". Contrary to the popular belief that this is deduction, this method of reasoning is actually an example of abductive inference. Abduction begins from the facts observed and then seeks the simplest hypothesis ("Occam's razor") that best explains the facts, while deduction finds data to support a hypothesis. Given the big data challenge that we presently face, is it then possible to utilize an abductive model (effect to cause reasoning) to find the best explanation for the observations, as opposed to the traditional method of forming hypotheses and testing them with observations (cause to effect reasoning)? Can our databases extend our understanding of this data automatically by inferring explanations from incomplete observations?
2016
- arXiv. Sharing Hash Codes for Multiple Purposes. Wiktor Pronobis, Danny Panknin, Johannes Kirschnick, Vignesh Srinivasan, Wojciech Samek, and 4 more authors. arXiv.org, 2016.
Locality sensitive hashing (LSH) is a powerful tool for sublinear-time approximate nearest neighbor search, and a variety of hashing schemes have been proposed for different similarity measures. However, hash codes significantly depend on the similarity, which prohibits users from adjusting the similarity at query time. In this paper, we propose multiple purpose LSH (mp-LSH) which shares the hash codes for different similarities. By using vector/code augmentation and cover tree techniques, our mp-LSH supports L2, cosine, and inner product similarities, and their corresponding weighted sums, where the weights can be adjusted at query time. It also allows us to modify the importance of pre-defined groups of features. Thus, mp-LSH enables us, for example, to retrieve similar items to a query with the user preference taken into account, to find a similar material to a query with some properties (stability, utility, etc.) optimized, and to turn on or off a part of multi-modal information (brightness, color, audio, text, etc.) in image/video retrieval. We theoretically and empirically analyze the performance of three variants of mp-LSH, and demonstrate their usefulness on several real-world data sets.
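For context only, the sketch below shows a generic sign-random-projection (SimHash) scheme for cosine similarity; it is not the proposed mp-LSH construction, and the dimensions and bit counts are arbitrary.

```python
# Generic SimHash sketch: shared random hyperplanes produce one bit per plane,
# and Hamming distance between codes approximates angular dissimilarity.
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 64, 32
planes = rng.normal(size=(n_bits, dim))          # shared random hyperplanes

def simhash(x: np.ndarray) -> np.ndarray:
    return (planes @ x > 0).astype(np.uint8)     # one bit per hyperplane

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int((a != b).sum())

q = rng.normal(size=dim)
near, far = q + 0.1 * rng.normal(size=dim), rng.normal(size=dim)
print(hamming(simhash(q), simhash(near)), "<", hamming(simhash(q), simhash(far)))
```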
- ICIQ. Improving Data Quality by Leveraging Statistical Relational Learning. Larysa Visengeriyeva, Alan Akbik, and Manohar Kaul. International Conference on Information Quality, 2016.
Digitally collected data suffers from many data quality issues, such as duplicate, incorrect, or incomplete data. A common approach for counteracting these issues is to formulate a set of data cleaning rules to identify and repair incorrect, duplicate and missing data. Data cleaning systems must be able to treat data quality rules holistically, to incorporate heterogeneous constraints within a single routine, and to automate data curation. We propose an approach to data cleaning based on statistical relational learning (SRL). We argue that a formalism - Markov logic - is a natural fit for modeling data quality rules. Our approach allows for the usage of probabilistic joint inference over interleaved data cleaning rules to improve data quality. Furthermore, it obliterates the need to specify the order of rule execution. We describe how data quality rules expressed as formulas in first-order logic directly translate into the predictive model in our SRL framework.
- ICDE. Efficient Fault-tolerance for Iterative Graph Processing on Distributed Dataflow Systems. Chen Xu, Markus Holzemer, Manohar Kaul, and Volker Markl. IEEE International Conference on Data Engineering, 2016.
Real-world graph processing applications in many instances require combining the graph data with tabular data or making the graph processing a part of a larger analytics pipeline. General-purpose distributed dataflow frameworks typically execute such pipelines while analyzing the entire pipeline in a holistic manner to further optimize the processing. A majority of big graph processing algorithms are iterative in nature and incur a long runtime, therefore it becomes all the more necessary to tolerate and recover quickly from any intermittent failures. In this work, we propose an efficient fault-tolerance mechanism for iterative graph processing on distributed dataflow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, we write checkpoints in an unblocking manner. Also, in comparison to the typical unblocking checkpointing approach of managing checkpoints independently, we inject the checkpoints into the dataflow itself. It not only inherits the advantage of a low execution latency without breaking the pipelined tasks, but simplifies the system design to coordinate the checkpoint writing. Further, we achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete recomputation from scratch. Our theoretical studies as well as experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.
2015
- PVLDB. New Lower and Upper Bounds for Shortest Distance Queries on Terrains. Manohar Kaul, Raymond Chi-Wing Wong, and Christian S. Jensen. Proceedings of Very Large Databases, 2015.
The increasing availability of massive and accurate laser data enables the processing of spatial queries on terrains. As shortest-path computation, an integral element of query processing, is inherently expensive on terrains, a key approach to enabling efficient query processing is to reduce the need for exact shortest-path computation in query processing. We develop new lower and upper bounds on terrain shortest distances that are provably tighter than any existing bounds. Unlike existing bounds, the new bounds do not rely on the quality of the triangulation. We show how use of the new bounds speeds up query processing by reducing the need for exact distance computations. Speedups of nearly an order of magnitude are demonstrated empirically for well-known spatial queries.
- PIKM. R-Apriori: An Efficient Apriori based Algorithm on Spark. Sanjay Rathee, Manohar Kaul, and Arti Kashyap. PIKM, Ph.D. student workshop at CIKM, 2015.
Association rule mining remains a very popular and effective method to extract meaningful information from large datasets. It tries to find possible associations between items in large transaction based datasets. In order to create these associations, frequent patterns have to be generated. The "Apriori" algorithm and its set of improved variants, which were among the earliest proposed frequent pattern generation algorithms, still remain a preferred choice due to their ease of implementation and natural tendency to be parallelized.
While many efficient single-machine methods for Apriori exist, the massive amount of data available these days is far beyond the capacity of a single machine. Hence, there is a need to scale across multiple machines to meet the demands of this ever-growing data. MapReduce is a popular fault-tolerant framework for distributed applications. Nevertheless, heavy disk I/O at each MapReduce operation hinders the implementation of efficient iterative data mining algorithms, such as Apriori, on MapReduce platforms.
A newly proposed in-memory distributed dataflow platform called Spark overcomes the disk I/O bottlenecks in MapReduce. Therefore, Spark presents an ideal platform for distributed Apriori. However, in the implementation of Apriori, the most computationally expensive task is the generation of candidate sets having all possible pairs for singleton frequent items and comparing each pair with every transaction record. Here, we propose a new approach which dramatically reduces this computational complexity by eliminating the candidate generation step and avoiding costly comparisons.
We conduct in-depth experiments to gain insight into the effectiveness, efficiency and scalability of our approach. Our studies show that our approach outperforms the classical Apriori and state-of-the-art on Spark by many times for different datasets.
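A toy PySpark sketch in the spirit of this discussion (not R-Apriori itself, and with made-up transactions): frequent singletons are counted first, and frequent pairs are then counted directly from transactions restricted to those singletons, avoiding a separate candidate-generation pass.

```python
# Illustrative only: counting frequent items and frequent pairs on Spark.
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext("local[*]", "toy-frequent-pairs")
min_support = 2
transactions = sc.parallelize([
    ["bread", "milk"], ["bread", "butter", "milk"],
    ["milk", "butter"], ["bread", "milk", "eggs"],
])

# Pass 1: frequent singletons.
item_counts = (transactions.flatMap(lambda t: t)
               .map(lambda item: (item, 1))
               .reduceByKey(lambda a, b: a + b))
frequent_items = set(item_counts.filter(lambda kv: kv[1] >= min_support).keys().collect())

# Pass 2: count pairs directly from transactions pruned to frequent singletons.
pair_counts = (transactions
               .map(lambda t: sorted(set(t) & frequent_items))
               .flatMap(lambda t: combinations(t, 2))
               .map(lambda pair: (pair, 1))
               .reduceByKey(lambda a, b: a + b)
               .filter(lambda kv: kv[1] >= min_support))
print(pair_counts.collect())
sc.stop()
```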
2014
- PVLDB. Terrain-Toolkit: A Multi-Functional Tool for Terrain Data (Demo). Qi Wang, Manohar Kaul, Cheng Long, and Raymond Chi-Wing Wong. Proceedings of Very Large Databases, 2014.
Terrain data is becoming increasingly popular both in industry and in academia. Many tools have been developed for visualizing terrain data. However, we find that (1) they usually accept very few data formats of terrain data only; (2) they do not support terrain simplification well which, as will be shown, is used heavily for query processing in spatial databases; and (3) they do not provide the surface distance operator which is fundamental for many applications based on terrain data. Motivated by this, we developed a tool called Terrain-Toolkit for terrain data which accepts a comprehensive set of data formats, supports terrain simplification and provides the surface distance operator.
- ICDE. Multi-Cost Optimal Route Planning Under Time-Varying Uncertainty. Bin Yang, Chenjuan Guo, Christian S. Jensen, and Manohar Kaul. IEEE International Conference on Data Engineering, 2014.
Different uses of a road network call for the consideration of different travel costs: in route planning, travel time and distance are typically considered, and greenhouse gas (GHG) emissions are increasingly being considered. Further, costs such as travel time and GHG emissions are time-dependent and uncertain. To support such uses, we propose techniques that enable the construction of a multi-cost, time-dependent, uncertain graph (MTUG) model of a road network based on GPS data from vehicles that traversed the road network. Based on the MTUG, we define optimal routes that consider multiple costs and time-dependent uncertainty, and we propose efficient algorithms to retrieve optimal routes for a given source-destination pair and a start time. Empirical studies with three road networks in Denmark and a substantial GPS data set offer insight into the design properties of the MTUG and the efficiency of the optimal route algorithms.
- PVLDB. Finding Shortest Paths on Terrains by Killing Two Birds with One Stone. Manohar Kaul, Raymond Chi-Wing Wong, Bin Yang, and Christian S. Jensen. Proceedings of Very Large Databases, 2014.
With the increasing availability of terrain data, e.g., from aerial laser scans, the management of such data is attracting increasing attention in both industry and academia. In particular, spatial queries, e.g., k-nearest neighbor and reverse nearest neighbor queries, in Euclidean and spatial network spaces are being extended to terrains. Such queries all rely on an important operation, that of finding shortest surface distances. However, shortest surface distance computation is very time consuming. We propose techniques that enable efficient computation of lower and upper bounds of the shortest surface distance, which enable faster query processing by eliminating expensive distance computations. Empirical studies show that our bounds are much tighter than the best-known bounds in many cases and that they enable speedups of up to 43 times for some well-known spatial queries.
2013
- TKDE. Using Incomplete Information for Complete Weight Annotation of Road Networks. Bin Yang, Manohar Kaul, and Christian S. Jensen. Transactions on Knowledge and Data Engineering, 2013.
We are witnessing an increasing interest in the effective use of road networks. For example, to enable effective vehicle routing, weighted-graph models of transportation networks are used, where the weight of an edge captures some cost associated with traversing the edge, e.g., greenhouse gas (GHG) emissions or travel time. It is a precondition to using a graph model for routing that all edges have weights. Weights that capture travel times and GHG emissions can be extracted from GPS trajectory data collected from the network. However, GPS trajectory data typically lack the coverage needed to assign weights to all edges. This paper formulates and addresses the problem of annotating all edges in a road network with travel cost based weights from a set of trips in the network that cover only a small fraction of the edges, each with an associated ground-truth travel cost. A general framework is proposed to solve the problem. Specifically, the problem is modeled as a regression problem and solved by minimizing a judiciously designed objective function that takes into account the topology of the road network. In particular, the use of weighted PageRank values of edges is explored for assigning appropriate weights to all edges, and the property of directional adjacency of edges is also taken into account to assign weights. Empirical studies with weights capturing travel time and GHG emissions on two road networks (Skagen, Denmark, and North Jutland, Denmark) offer insight into the design properties of the proposed techniques and offer evidence that the techniques are effective.
- MDM. Building Accurate 3D Spatial Networks to Enable Next Generation Intelligent Transportation Systems (Best Paper Award). Manohar Kaul, Bin Yang, and Christian S. Jensen. IEEE Proceedings of International Conference on Mobile Data Management, 2013.
The use of accurate 3D spatial network models can enable substantial improvements in vehicle routing. Notably, such models enable eco-routing, which reduces the environmental impact of transportation. We propose a novel filtering and lifting framework that augments a standard 2D spatial network model with elevation information extracted from massive aerial laser scan data and thus yields an accurate 3D model. We present a filtering technique that is capable of pruning irrelevant laser scan points in a single pass, but assumes that the 2D network fits in internal memory and that the points are appropriately sorted. We also provide an external-memory filtering technique that makes no such assumptions. During lifting, a triangulated irregular network (TIN) surface is constructed from the remaining points. The 2D network is projected onto the TIN, and a 3D network is constructed by means of interpolation. We report on a large-scale empirical study that offers insight into the accuracy, efficiency, and scalability properties of the framework.
2012
- SIGSPATIAL. EcoMark: Evaluating Models of Vehicular Environmental Impact. Chenjuan Guo, Yu Ma, Bin Yang, Christian S. Jensen, and Manohar Kaul. Proceedings of ACM SIGSPATIAL GIS, 2012.
The reduction of greenhouse gas (GHG) emissions from transportation is essential for achieving politically agreed upon emissions reduction targets that aim to combat global climate change. So-called eco-routing and eco-driving are able to substantially reduce GHG emissions caused by vehicular transportation. To enable these, it is necessary to be able to reliably quantify the emissions of vehicles as they travel in a spatial network. Thus, a number of models have been proposed that aim to quantify the emissions of a vehicle based on GPS data from the vehicle and a 3D model of the spatial network the vehicle travels in. We develop an evaluation framework, called EcoMark, for such environmental impact models. In addition, we survey all eleven state-of-the-art impact models known to us. To gain insight into the capabilities of the models and to understand the effectiveness of the EcoMark, we apply the framework to all models.
2011
- SIGSPATIAL. Frequent Route Based Continuous Moving Object Location and Density Prediction on Road Networks. Gyozo Gidofalvi, Manohar Kaul, Christian Borgelt, and Torben Bach Pedersen. Proceedings of ACM SIGSPATIAL GIS, 2011.
Emerging trends in urban mobility have accelerated the need for effective traffic prediction and management systems. The present paper proposes a novel approach to using continuously streaming moving object trajectories for traffic prediction and management. The approach continuously performs three functions for streams of moving object positions in road networks: 1) management of current evolving trajectories, 2) incremental mining of closed frequent routes, and 3) prediction of near-future locations and densities based on 1) and 2). The approach is empirically evaluated on a large real-world data set of moving object trajectories, originating from a fleet of taxis, illustrating that detailed closed frequent routes can be efficiently discovered and used for prediction.
2003
- CIRA. Intelligent Packet Shaper to avoid Network Congestion for Improved Streaming Video Quality at Clients. Manohar Kaul, R. Khosla, and Y. Mitsukura. Proceedings of IEEE International Symposium on Computational Intelligence in Robotics and Automation, 2003.
This paper proposes a traffic shaping algorithm based on neural networks, which adapts to a network over which streaming video is being transmitted. The purpose of this intelligent shaper is to eradicate traffic congestion and improve end-user video quality. It possesses the capability to predict, to a very high level of accuracy, a state of congestion based upon the training data collected about the network’s behaviour.