As the volume of data increases exponentially and queries become more complex, relationships become a critical component of data analysis. In turn, specialized solutions such as graph databases, which explicitly optimize for relationships, are needed. Other databases aren't designed to search and query data based on the intricate relationships found in complex data structures. Graph databases are optimized to handle connected data by modeling information as a graph, which maps data through nodes and relationships.

This article is a beginner's guide to graph databases: their terminology, how they compare with relational databases, and the range of offerings, from cloud-provider services like AWS Neptune to open-source solutions. It also shows how graph databases serve applications such as social network analysis, fraud detection, knowledge graphs, and many other areas.

What Is a Graph Database?

A graph database is a purpose-built NoSQL database specializing in data structured as a complex network of relationships, where entities and their relationships are interconnected. Data is modeled using graph structures; the essential elements of this structure are nodes, which represent entities, and edges, which represent the relationships between them. Both nodes and edges can have attributes.

Critical Components of Graph Databases

Nodes: These are the primary data elements representing entities such as people, businesses, accounts, or any other item you might find in a database. Each node can store a set of key-value pairs as properties.

Edges: Edges are the lines that connect nodes, defining their relationships. Like nodes, edges can have properties – such as weight, type, or strength – that clarify the relationship.
Properties: Nodes and edges can each have properties used to store metadata about those objects – names, dates, or any other descriptive attributes relevant to a node or edge.

How Graph Databases Store and Process Data

In a graph database, nodes and relationships are first-class citizens. This contrasts with relational databases, where data is stored in tabular form and relationships are computed at query time via joins. Treating relationships as having as much value as the data itself enables faster traversal of connected data. With their traversal algorithms, graph databases can explore the relationships between nodes and edges to answer complicated queries such as shortest paths, fraud detection, or network analysis. Graph-specific query languages – Neo4j's Cypher and TinkerPop's Gremlin – enable these operations by focusing on pattern matching and deep-link analytics.

Practical Applications and Benefits

Graph databases shine in any application where the relationships between data points are essential, such as web and social networks, recommendation engines, and a whole host of other apps where it's necessary to know how deep and wide the relationships go. In areas such as fraud detection and network security, it's essential to adjust and adapt dynamically; this is something graph databases do very well. In short, graph databases offer a solid infrastructure for working with complex, highly connected data, with many advantages over relational databases when modeling relationships and the interactions between data.

Key Components and Terminology

Nodes and Their Properties

Nodes are the basic building blocks of a graph database. They typically represent some object or specific instance, be it a person, place, or thing; each node corresponds to a vertex in the graph structure. A node can also carry several properties.
Each of these properties is a key-value pair, where the value expands on or further clarifies the object; its content depends on the application of the graph database.

Edges: Defining Relationships

Edges, on the other hand, are the links that tie nodes together. They are directional, with a start node and an end node, thus defining the flow from one node to another. Edges also define the nature of the relationship – for example, whether it is organizational or social.

Labels: Organizing Nodes

Labels group nodes that share similarities (Person nodes, Company nodes, etc.) so that graph databases can retrieve sets of nodes more quickly. For example, in a social network analysis, Person and Company nodes might be grouped using labels.

Relationships and Their Characteristics

Relationships connect nodes, but they can also carry properties – such as strength, status, or duration – that define how a relationship differs from node to node.

Graph Query Languages: Cypher and Gremlin

Graph databases require specialized query languages to work with their often complicated structure, and these languages differ from one graph database to another. Cypher, used with Neo4j, is a declarative, pattern-based language. Gremlin, used with several other graph databases, is more procedural and can traverse more complex graph structures. Both languages are expressive and powerful, capable of queries that would be veritable nightmares in the languages used with traditional databases.

Tools for Managing and Exploring Graph Data

Neo4j offers a suite of tools designed to enhance the usability of graph databases:

Neo4j Bloom: Explore graph data visually without using a graph query language.
Neo4j Browser: A web-based application for executing Cypher queries and visualizing the results.
Neo4j Data Importer and Neo4j Desktop: Tools for importing data into a Neo4j database and managing Neo4j database instances, respectively.
Neo4j Ops Manager: Useful for managing multiple Neo4j instances, ensuring that large-scale deployments can be managed and optimized.
Neo4j Graph Data Science: An extension of Neo4j that augments it with capabilities more commonly associated with data science, enabling sophisticated analytical tasks to be performed directly on graph data.

Equipped with these fundamental components and tools, users can wield the power of graph databases to handle complex data and make knowledgeable decisions based on networked knowledge systems.

Comparing Graph Databases With Other Databases

While graph and relational databases are both designed to store data and help us make sense of it, they fundamentally differ in how they accomplish this. Graph databases are built on a foundation of nodes and edges, making them uniquely fitted for dealing with complex relationships between data points: connected entities are represented through nodes and their relationships through edges. Relational databases arrange data in tables of rows and columns, whereas graph databases use nodes and edges. Graph databases represent relationships naturally, whereas it is not as easy to model relationships between certain types of data points in relational databases; after all, relational databases were invented to deal with transactions (i.e., a series of exchanges between two sides, such as a payment or refund between a seller and a customer).

Data Models and Scalability

Graph databases store data in a graph with nodes, edges, and properties. They are instrumental in domains with complex relationships, such as social networks or recommendation engines.
At the opposite end of the spectrum, relational databases store data in tables, which is well-suited for applications requiring high levels of data integrity, such as financial systems or customer relationship management. Another benefit of graph databases is their horizontal scalability: they grow with demand by adding more machines to a network, instead of the vertical scaling (adding more power to an existing machine) typical of relational databases.

Query Performance and Flexibility

Graph databases are generally much faster at executing complex queries with deep relationships because they traverse nodes and edges directly, unlike relational databases, which may have to perform many joins whose cost grows with the size of the data set. In addition, graph databases excel in the ease with which the data model can be changed without severe consequences. As business requirements evolve and users learn more about how their data should interact, a graph database can be adapted without costly redesigns. Relational databases, though better suited to providing strong transactional guarantees and ACID compliance, are less adept at such model adjustments.

Use of Query Languages

The query languages also reflect the distinct nature of these databases. Whereas graph databases tend to use a language tailored to the way a graph is traversed – such as Gremlin or Cypher – relational databases have long been managed and queried through SQL, a well-established language for structured data.

Suitability for Different Data Types

Relational databases are well suited to handling large datasets with a regular and relatively simple structure. In contrast, graph databases shine in environments where the structures are highly interconnected and the relationships are as meaningful as the data.
In conclusion, while both graph and relational databases have pros and cons, which one to use depends on the application's requirements. Graph databases are better for analyzing intricate and evolving relationships, which makes them ideal for modern applications that demand a detailed understanding of networked data.

Advantages of Graph Databases

Graph databases are renowned for their efficiency and flexibility, particularly when dealing with complex, interconnected data sets. Here are some of the key advantages they offer:

High Performance and Real-Time Data Handling

Performance is a huge advantage for graph databases. It comes from the ease, speed, and efficiency with which they can query linked data. Graph databases often beat relational databases at handling complex, connected data, and unlike batch-oriented systems such as Hadoop HDFS, they are well suited to continual, real-time updates and queries.

Enhanced Data Integrity and Contextual Awareness

By keeping connections intact across channels and data formats, graph databases maintain rich data relationships and allow that data to be easily linked. This structure surfaces nuance in interactions that humans could not otherwise discern, saving time and making the data more consumable. It gives users relevant insights to understand the data better and helps businesses make more informed decisions.

Scalability and Flexibility

Graph databases are designed to scale well. They can accommodate the incessant expansion of the underlying data and the constant evolution of the data schema without downtime. They can also scale in the number of data sources they link, again accommodating continuous evolution of the schema without interrupting service. They are, therefore, particularly well-suited to environments in which rapid adaptation is essential.
Advanced Query Capabilities

These graph-based systems can quickly run powerful recursive path queries to retrieve direct ("one hop") and indirect ("two hops" or even "twenty hops") connections, making complex subgraph pattern-matching queries easy to run. Moreover, complex group-by-aggregate queries (such as Netflix's tag aggregation) are also natively supported, allowing flexible aggregation across selected dimensions – useful in big-data setups with multiple dimensions such as time series, demographics, or geography.

AI and Machine Learning Readiness

Because graph databases naturally represent entities and their inter-relations as a structured set of connections, they are especially well-suited as foundational infrastructure for AI and machine learning: they support fast real-time changes and rely on expressive, ergonomic declarative query languages that make deep-link traversal and scalability straightforward – features that are critical for next-generation data analytics and inference. These advantages make graph databases a good fit for organizations that need to manage dataset relationships and efficiently draw meaningful insights from them.

Everyday Use Cases for Graph Databases

Graph databases are being adopted by more industries because they are particularly well-suited to handling complex connections between data while keeping the whole system fast. Let's look at some of the most common uses for graph databases.

Financial and Insurance Services

The financial and insurance services sector increasingly uses graph databases to detect fraud and other risks. These systems model business events and customer data as a graph, allowing them to detect suspicious links between various entities; the technique of Entity Link Analysis takes this a step further, allowing the detection of potential fraud in the interactions between different kinds of entities.
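The components and query patterns described so far – nodes with labels and properties, typed edges, and multi-hop path queries – can be illustrated with a small in-memory sketch. This is a toy stand-in written for this article, not any real database's API:

```python
from collections import deque

# Toy in-memory property graph: nodes carry labels and key-value
# properties; directed edges carry a relationship type and properties.
class PropertyGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> {"labels": set, "props": dict}
        self.edges = []   # (start, end, rel_type, props)

    def add_node(self, node_id, labels=(), **props):
        self.nodes[node_id] = {"labels": set(labels), "props": props}

    def add_edge(self, start, end, rel_type, **props):
        self.edges.append((start, end, rel_type, props))

    def neighbors(self, node_id, rel_type=None):
        return [e for s, e, t, _ in self.edges
                if s == node_id and (rel_type is None or t == rel_type)]

    def within_k_hops(self, start, k):
        """The 'one hop' / 'two hops' recursive path query: BFS to depth k."""
        seen, frontier = {start}, [start]
        for _ in range(k):
            frontier = [n for node in frontier
                        for n in self.neighbors(node) if n not in seen]
            seen.update(frontier)
        return seen - {start}

    def shortest_path(self, start, goal):
        """Breadth-first shortest path; returns None if goal is unreachable."""
        queue, visited = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for n in self.neighbors(path[-1]):
                if n not in visited:
                    visited.add(n)
                    queue.append(path + [n])
        return None

g = PropertyGraph()
for person in ("alice", "bob", "carol", "dave"):
    g.add_node(person, labels=["Person"], name=person.title())
g.add_edge("alice", "bob", "KNOWS", since=2019)
g.add_edge("bob", "carol", "KNOWS")
g.add_edge("carol", "dave", "KNOWS")

print(sorted(g.within_k_hops("alice", 2)))  # ['bob', 'carol']
print(g.shortest_path("alice", "dave"))     # ['alice', 'bob', 'carol', 'dave']
```

A real graph database runs the same kinds of traversals over native graph storage with indexes, which is what keeps deep queries fast at scale.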
Infrastructure and Network Management

Graph databases are well-suited to infrastructure mapping and keeping network inventories up to date. Serving an interactive map of the network estate and running network-tracing algorithms across the graph is straightforward. Likewise, it becomes much easier to write new algorithms that identify problematic dependencies, vulnerable bottlenecks, or higher-order latency issues.

Recommendation Systems

Many companies – including major e-commerce giants like Amazon – use graph databases to power recommendation engines. These keep track of which products and services you've purchased and browsed in the past to suggest things you might like, improving customer experience and engagement.

Social Networking Platforms

Social networks such as Facebook, Twitter, and LinkedIn use graph databases to manage and query huge amounts of relational data concerning people, their relationships, and interactions. This makes them very good at quickly navigating vast social networks, finding influential users, detecting communities, and identifying key players.

Knowledge Graphs in Healthcare

Healthcare organizations assemble critical knowledge about patient profiles, past ailments, and treatments in knowledge graphs, while graph queries identify patient patterns and trends. These insights can positively influence how treatments proceed and how patients fare.

Complex Network Monitoring

Graph databases are used to model and monitor complex network infrastructures, including telecommunications networks and end-to-end cloud environments (data-center infrastructure including physical networking, storage, and virtualization). This application is crucial for the robustness and scalability of the systems and environments that form the backbone of modern information infrastructure.
Compliance and Governance

Organizations also use graph databases to manage data related to compliance and governance – such as access controls, data retention policies, and audit trails – to ensure they continue to meet high standards of data security and regulatory compliance.

AI and Machine Learning

Graph databases are also essential for developing artificial intelligence and machine learning applications. They give developers a standardized means of storing and querying data for applications such as natural language processing, computer vision, and advanced recommendation systems, which is essential for making AI applications more intelligent and responsive.

Unraveling Financial Crimes

Graphs provide a way to trace the structure of shell corporate entities that criminals use to launder money, revealing whether the patterns of supplies to shell companies, and of cash flows from shell companies to other entities, are suspicious. Such applications help law enforcement and regulatory agencies unravel complex money laundering networks and fight financial crime.

Automotive Industry

In the automotive industry, graph queries help analyze the relationships between tens of thousands of car parts, enabling real-time interactive analysis that can improve manufacturing and maintenance processes.

Criminal Network Analysis

In law enforcement, graph databases are used to map criminal networks, spot patterns, and identify critical links in criminal organizations in order to dismantle operations efficiently from all sides.

Data Lineage Tracking

Graph technology can also track data lineage – the details of where an item of data, such as a fact or number, was created, how it was copied, and where it was used. This is important for auditing and verifying that data assets are not corrupted.
This diverse array of applications underscores the versatility of graph databases and their utility in representing and managing complex, interconnected data across many fields.

Challenges and Considerations

Graph databases are built around modeling the structures of a specific domain, in a process resembling knowledge or ontology engineering – a practical challenge that can require specialized "graph data engineers." These requirements raise important scalability questions and can limit the technology's appeal to a specialist audience. Inconsistency of data across the system remains a critical issue, since it is challenging to build systems that maintain data consistency while preserving flexibility and expressivity.

While graph queries don't require as much coding as SQL, traversal paths across the data still have to be spelled out explicitly. This increases the effort needed to write queries and prevents graph queries from being as easily abstracted and reused as SQL code, impairing their generalization. Furthermore, because there is no unified standard for capabilities or query languages, vendors invent their own – a further step toward API fragmentation.

Another significant issue is deciding on which machine a given piece of data should live: given all the subtle relationships between nodes, that decision is crucial to performance but hard to make on the fly. Moreover, many existing graph database systems weren't architected for today's high data volumes, so they can become performance bottlenecks.

From a project management standpoint, failure to accurately capture and map business requirements to technical requirements often results in confusion and delay. Poor data quality, inadequate access to data sources, and verbose or time-consuming data modeling will magnify the pain of a graph data project.
On the end-user side, asking people to learn new languages or skills in order to read graphs can deter adoption, while the difficulty of sharing those graphs or collaborating on the analysis lowers the range and impact of the insights. Just as simplicity gave the Windows 95 interface an early advantage, simplicity will decide how far graph technologies spread today. Adoption also suffers when the analysis process is criticized as too time-consuming.

From a technical perspective, managing large graphs – storing and querying complex structures – presents significant challenges. For example, the data must be distributed across a cluster of machines, adding another level of complexity for developers. Data is typically sharded (split) into smaller parts and stored on various machines, coordinated by an "intelligent" virtual server that manages access control and queries across multiple shards.

Choosing the Right Graph Database

When selecting a graph database, it's crucial to consider the complexity of the queries and the interconnectedness of the data. A well-chosen graph database can significantly enhance the performance and scalability of data-driven applications.

Key Factors to Consider

Native graph storage and processing: Opt for databases designed from the ground up to handle graph data structures.
Property graphs and graph query languages: Ensure the database supports robust graph query languages and can handle property graphs efficiently.
Data ingestion and integration capabilities: The ability to seamlessly integrate and ingest data from various sources is vital for dynamic data environments.
Development tools and graph visualization: Tools that facilitate development and allow intuitive graph visualizations improve usability and insights.
Graph data science and analytics: Databases with advanced analytics and data science capabilities can provide deeper insights.
Support for OLTP, OLAP, and HTAP: Depending on the application, support for transactional (OLTP), analytical (OLAP), and hybrid (HTAP) processing may be necessary.
ACID compliance and system durability: Essential for ensuring data integrity and reliability in transaction-heavy environments.
Scalability and performance: The database should scale both vertically and horizontally to handle growing data loads.
Enterprise security and privacy features: Robust security features are crucial to protect sensitive data and ensure privacy.
Deployment flexibility: The database should match the organization's deployment strategy, whether on-premises or cloud.
Open-source foundation and community support: A strong community and open-source foundation can provide extensive support and flexibility.
Business and technology partnerships: Partnerships can offer additional support and integration options, enhancing the database's capabilities.

Comparing Popular Graph Databases

Dgraph: A performant, scalable option for enterprise systems that need to handle massive amounts of fast-flowing data.
Memgraph: An open-source, in-memory database with a query language designed for real-time data and analytics.
Neo4j: Offers a comprehensive graph data science library and is well-suited to static data storage and Java-oriented developers.

Each of these databases has its advantages: Memgraph is the strongest contender in the Python ecosystem (you can choose Python, C++, or Rust for your custom stored procedures), and Neo4j's managed AuraDB service offers considerable power, flexibility, and control over cloud deployment.

Community and Free Resources

Memgraph has a free community edition and a paid enterprise edition, and Neo4j has a community "Labs" edition, a free enterprise trial, and hosting services. These are all great ways for developers to get their feet wet without investing upfront.
In conclusion, choosing the proper graph database is contingent upon understanding the realities of your project and the capabilities of the database you are selecting. Bear this in mind, and your organization will be able to use graph databases to their full potential to enhance its data infrastructure and insights.

Conclusion

Having navigated the expansive realm of graph databases, the hope is that you now know not only the basics of these databases – from nodes and edges to storage and indexing – but also their applications across industries, including finance, government, and healthcare. This guide is a comprehensive introduction to graph databases, catering to newcomers and seasoned practitioners alike. Every reader should now be prepared to take the next steps in understanding how graph databases work, how they compare with traditional and non-relational databases, and where they are used in the real world.

We have seen that choosing a graph database requires careful consideration of the project's requirements and the database's features. The challenges highlighted underscore the importance of correct implementation and the advantages graph databases bring to how we process and look at data. Their complexity and power allow us to uncover new insights and compute more efficiently, opening the way for new methods of data management and analysis.
With recent achievements in and attention to LLMs, and the resultant artificial intelligence "summer," there has been a renaissance in model training methods aimed at reaching the most optimal, performant model as quickly as possible. Much of this has been achieved through brute scale – more chips, more data, more training steps. However, many teams have focused on how to train these models more efficiently and intelligently to achieve the desired results. Training an LLM typically includes the following phases:

Pretraining: This initial phase lays the foundation, taking the model from a set of inert neurons to a basic language generator. The model ingests vast amounts of data (e.g., much of the public internet), and the outputs at this stage are often nonsensical, though not entirely gibberish.

Supervised Fine-Tuning (SFT): This phase elevates the model from its unintelligible state, enabling it to generate more coherent and useful outputs. SFT involves providing the model with specific examples of desired behavior, teaching it what is considered "helpful, useful, and sensible." Models can be deployed to production after this stage.

Reinforcement Learning (RL): Taking the model from "working" to "good," RL goes beyond explicit instruction and allows the model to learn users' implicit preferences and desires through labeled preference data. This lets developers encourage desired behaviors without needing to explicitly define why those behaviors are preferred.

In-context learning: Also known as prompt engineering, this technique allows users to influence model behavior directly at inference time. By employing methods like constraints and N-shot learning, users can tune the model's output to suit specific needs and contexts.
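The last phase above, N-shot in-context learning, can be sketched concretely: worked examples are prepended to the prompt to steer the model at inference time. The task and example texts below are illustrative, not from any particular system:

```python
# Build an N-shot prompt: an instruction, N worked (input, output)
# examples, and the new query the model should complete.
def build_n_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs."""
    shots = "\n\n".join(f"Input: {inp}\nOutput: {out}" for inp, out in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_n_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [
        ("Great battery life, would buy again.", "positive"),
        ("Stopped working after two days.", "negative"),
    ],
    "The screen is gorgeous and setup was painless.",
)
print(prompt)
```

The trailing "Output:" invites the model to continue the established pattern, which is what lets a user shape behavior without any weight updates.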
Note that this is not an exhaustive list; there are many other methods and phases that may be incorporated into idiosyncratic training pipelines.

Introducing Reward and Reinforcement Learning

Humans excel at pattern recognition, often learning and adapting without conscious effort. Our intellectual development can be seen as a continuous process of increasingly complex pattern recognition. A child learns not to jump in puddles after experiencing negative consequences, much like an LLM undergoing SFT. Similarly, a teenager observing social interactions learns to adapt their behavior based on positive and negative feedback – the essence of reinforcement learning.

Reinforcement Learning in Practice: The Key Components

Preference data: Reinforcement learning for LLMs typically requires a prompt/input and multiple (often two) example outputs in order to demonstrate a "gradient" – that is, to show that certain behaviors are preferred relative to others. For example, in RLHF, human users may be presented with a prompt and two candidate outputs and asked to choose which they prefer; in other methods, they may be presented with an output and asked to improve on it in some way (where the improved version is captured as the "preferred" option).

Reward model: A reward model is trained directly on the preference data. For a set of responses to a given input, each response can be assigned a scalar value representing its "rank" within the set (for binary examples, this can be 0 and 1). The reward model is then trained to predict these scalar values for a novel input and output pair. In other words, the reward model learns to reproduce or predict a user's preference.

Generator model: This is the final intended artifact. In simplified terms, during the reinforcement training process, the generator model produces an output, which is then scored by the reward model, and the resultant reward is fed back to the algorithm, which decides how to update the generator model.
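The reward-model component described above can be sketched with a toy, hand-rolled example (not any paper's implementation): a linear reward function trained on preference pairs with a Bradley-Terry-style objective, so that preferred responses end up scoring higher than rejected ones. The feature vectors stand in for real response embeddings:

```python
import math

# Linear reward r(x) = w . x; preferences train w so that
# P(preferred beats rejected) = sigmoid(r_pref - r_rej) is high.
def reward(w, x):
    """Scalar reward for a response represented by feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, steps=200):
    """pairs: list of (preferred_features, rejected_features) tuples."""
    w = [0.0] * dim
    for _ in range(steps):
        for pref, rej in pairs:
            # Gradient ascent on log sigmoid(margin):
            # d/dw = (1 - sigmoid(margin)) * (pref - rej)
            margin = reward(w, pref) - reward(w, rej)
            grad_scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            w = [wi + lr * grad_scale * (p - r)
                 for wi, p, r in zip(w, pref, rej)]
    return w

# In each pair, the first response is the one labeled as preferred.
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.2, 0.7])]
w = train_reward_model(pairs, dim=2)
print(reward(w, [1.0, 0.2]) > reward(w, [0.1, 0.9]))  # True
```

After training, the model assigns a scalar score to any new response, which is exactly the signal the reinforcement-learning loop feeds back to the generator.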
For example, the algorithm will update the model to increase the odds of generating a given output when a positive reward is provided, and do the opposite in a negative-reward scenario. In the LLM landscape, RLHF has been a dominant force. By gathering large volumes of human preference data, RLHF has enabled significant advancements in LLM performance. However, this approach is expensive, time-consuming, and susceptible to biases and vulnerabilities. This limitation has spurred the exploration of alternative methods for obtaining reward information at scale, paving the way for the emergence of RLAIF – an approach poised to reshape how these models are trained.

Understanding RLAIF: A Technical Overview of Scaling LLM Alignment With AI Feedback

The core idea behind RLAIF is both simple and profound: if LLMs can generate creative text formats like poems, scripts, and even code, why can't they teach themselves? This concept of self-improvement promises to unlock new levels of quality and efficiency, surpassing the limitations of RLHF. And this is precisely what researchers have achieved with RLAIF. As with any form of reinforcement learning, the key lies in assigning value to outputs and training a reward model to predict those values. RLAIF's innovation is the ability to generate these preference labels automatically, at scale, without relying on human input. While all LLMs ultimately stem from human-generated data in some form, RLAIF leverages existing LLMs as "teachers" to guide the training process, eliminating the need for continuous human labeling. Using this method, the authors achieved comparable or even better results with RLAIF than with RLHF, as measured by the "harmless response rate" of each approach. To achieve this, the authors developed a number of methodological innovations.
In-context learning and prompt engineering: RLAIF leverages in-context learning and carefully designed prompts to elicit preference information from the teacher LLM. These prompts provide context, examples (for few-shot learning), and the samples to be evaluated. The teacher LLM's output then serves as the reward signal.

Chain-of-thought reasoning: To enhance the teacher LLM's reasoning capabilities, RLAIF employs Chain-of-Thought (CoT) prompting. While the reasoning process itself isn't directly used, it leads to more informed and nuanced preference judgments from the teacher LLM.

Addressing position bias: To mitigate the influence of response order on the teacher's preference, RLAIF averages preferences obtained from multiple prompts with varying response orders.

To understand this more directly, imagine the AI you are trying to train as a student, learning and improving through a continuous feedback loop. Then imagine an off-the-shelf AI that has already been through extensive training as the teacher. The teacher rewards the student for taking certain actions, coming up with certain responses, and so on, and punishes it otherwise. It does this by 'testing' the student: giving it quizzes where the student must select the optimal response. These tests are generated via 'contrastive' prompts, where the teacher produces slightly different responses by slightly varying the prompts. For example, in the context of code generation, one prompt might encourage the LLM to generate efficient code, potentially at the expense of readability, while the other emphasizes code clarity and documentation. The teacher then assigns its own preference as the 'ground truth' and asks the student to indicate what it thinks is the preferred output. By comparing the student's responses under these contrasting prompts, RLAIF assesses which response better aligns with the desired attribute.
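The position-bias mitigation described above can be sketched as follows. This is a hypothetical illustration: `ask_teacher` is a stand-in for a real teacher-LLM call and simply returns the probability that the first response shown is preferred, with a small built-in bias toward the first position.

```python
# Hypothetical sketch of position-bias mitigation: the teacher is queried
# with both response orders and the preference probabilities are averaged.

def ask_teacher(prompt, first, second):
    # Stub teacher: slightly favors the longer response, plus a small
    # artificial bias toward whichever response is shown first.
    score = 0.5 + 0.1 * (len(first) - len(second)) / max(len(first), len(second))
    return min(max(score + 0.05, 0.0), 1.0)  # +0.05 models position bias

def debiased_preference(prompt, resp_a, resp_b):
    """P(resp_a preferred), averaged over both presentation orders."""
    p_a_first = ask_teacher(prompt, resp_a, resp_b)         # A shown first
    p_a_second = 1.0 - ask_teacher(prompt, resp_b, resp_a)  # A shown second
    return (p_a_first + p_a_second) / 2.0
```

For two equally good (here, equally long) responses, the position bias cancels exactly and the debiased preference comes out to 0.5, while a genuinely better response still wins.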
The student, meanwhile, aims to maximize the accumulated reward. Every time it is punished, it changes something about itself so it doesn't make the same mistake and get punished again. When it is rewarded, it reinforces that behavior so it is more likely to reproduce the same response in the future. In this way, over successive quizzes, the student gets better and better and is punished less and less. While punishments never go to zero, the student does converge to some minimum, which represents the optimal performance it is able to achieve. From there, future inferences made by the student are likely to be of much higher quality than if RLAIF were not employed. The evaluation of synthetic (LLM-generated) preference data is crucial for effective alignment. RLAIF utilizes a "self-rewarding" score, which compares the generation probabilities of two responses under contrastive prompts. This score reflects the relative alignment of each response with the desired attribute. Finally, Direct Preference Optimization (DPO), an efficient RL algorithm, leverages these self-rewarding scores to optimize the student model, encouraging it to generate responses that align with human values. DPO directly optimizes an LLM towards preferred responses without needing to explicitly train a separate reward model.

RLAIF in Action: Applications and Benefits

RLAIF's versatility extends to various tasks, including summarization, dialogue generation, and code generation. Research has shown that RLAIF can achieve comparable or even superior performance to RLHF while significantly reducing the reliance on human annotations. This translates to substantial cost savings and faster iteration cycles, making RLAIF particularly attractive for rapidly evolving LLM development. Moreover, RLAIF opens doors to a future of "closed-loop" LLM improvement.
As the student model becomes better aligned through RLAIF, it can, in turn, be used as a more reliable teacher model for subsequent RLAIF iterations. This creates a positive feedback loop, potentially leading to continual improvement in LLM alignment without additional human intervention. So how can you leverage RLAIF? It's actually quite simple if you already have an RL pipeline:

Prompt set: Start with a set of prompts designed to elicit the desired behaviors. Alternatively, you can utilize an off-the-shelf LLM to generate these prompts.

Contrastive prompts: For each prompt, create two slightly varied versions that emphasize different aspects of the target behavior (e.g., helpfulness vs. safety). LLMs can also automate this process.

Response generation: Capture the responses from the student LLM for each prompt variation.

Preference elicitation: Create meta-prompts to obtain preference information from the teacher LLM for each prompt-response pair.

RL pipeline integration: Utilize the resulting preference data within your existing RL pipeline to guide the student model's learning and optimization.

Challenges and Limitations

Despite its potential, RLAIF faces challenges that require further research. The accuracy of AI annotations remains a concern, as biases from the teacher LLM can propagate to the student model. Furthermore, biases incorporated into this preference data can eventually become 'crystallized' in the teacher LLM, which makes them difficult to remove afterward. Additionally, studies have shown that RLAIF-aligned models can sometimes generate responses with factual inconsistencies or decreased coherence. Addressing these issues necessitates exploring techniques to enhance the reliability, quality, factual grounding, and objectivity of AI feedback. Furthermore, the theoretical underpinnings of RLAIF require careful examination.
While the effectiveness of self-rewarding scores has been demonstrated, further analysis is needed to understand their limitations and refine the underlying assumptions.

Emerging Trends and Future Research

RLAIF's emergence has sparked exciting research directions. Comparing it with other RL methods like Reinforcement Learning from Execution Feedback (RLEF) can provide valuable insights into their respective strengths and weaknesses. One direction involves investigating fine-grained feedback mechanisms that provide more granular rewards at the individual token level, potentially leading to more precise and nuanced alignment outcomes. Another promising avenue explores the integration of multimodal information, incorporating data from images and videos to enrich the alignment process and foster a more comprehensive understanding within LLMs. Drawing inspiration from human learning, researchers are also exploring the application of curriculum learning principles in RLAIF, gradually increasing the complexity of tasks to enhance the efficiency and effectiveness of the alignment process. Additionally, investigating the potential for a positive feedback loop in RLAIF, leading to continual LLM improvement without human intervention, represents a significant step towards a more autonomous and self-improving AI ecosystem. Furthermore, there may be an opportunity to improve the quality of this approach by grounding feedback in the real world. For example, if the agent were able to execute code, perform real-world experiments, or integrate with a robotic system to 'instantiate' feedback in the real world, it could capture more accurate and reliable preference information without losing scaling advantages. However, ethical considerations remain paramount. As RLAIF empowers LLMs to shape their own alignment, it's crucial to ensure responsible development and deployment.
Establishing robust safeguards against potential misuse and mitigating biases inherited from teacher models are essential for building trust and ensuring the ethical advancement of this technology. As mentioned previously, RLAIF has the potential to propagate and amplify biases present in the source data, which must be carefully examined before scaling this approach.

Conclusion: RLAIF as a Stepping Stone To Aligned AI

RLAIF presents a powerful and efficient approach to LLM alignment, offering significant advantages over traditional RLHF methods. Its scalability, cost-effectiveness, and potential for self-improvement hold immense promise for the future of AI development. While acknowledging the current challenges and limitations, ongoing research efforts are actively paving the way for a more reliable, objective, and ethically sound RLAIF framework. As we continue to explore this exciting frontier, RLAIF stands as a stepping stone towards a future where LLMs seamlessly integrate with human values and expectations, unlocking the full potential of AI for the benefit of society.
A retry mechanism is a critical component of many modern software systems. It allows our system to automatically retry failed operations to recover from transient errors or network outages. By automatically retrying failed operations, retry mechanisms help software systems recover from unexpected failures and continue functioning correctly. Today, we'll take a look at these topics:

What Is a Retry Pattern? What is it for, and why do we need to implement it in our system?

When to Retry Your Request: Only some requests should be retried. It's important to understand what kinds of errors from the downstream service can be retried, to avoid problems with business logic.

Retry Backoff Period: When we retry the request to the downstream service, how long should we wait before sending the request again after it fails?

How to Retry: We'll look at ways to retry, from the basic to the more complex.

What Is a Retry Pattern?

Retrying is the act of sending the same request again if a request to the downstream service fails. By using a retry pattern, you improve the downstream resiliency aspect of your system. When an error happens while calling a downstream service, our system will try to call it again instead of returning an error to the upstream service. So, why do we need to do this, exactly? Microservices architecture has been gaining popularity in recent decades. While this approach has many benefits, one of the downsides of microservices architecture is that it introduces network communication between services. Additional network communication leads to the possibility of errors in the network while services are communicating with each other (read "Fallacies of distributed computing"). Every call to other services has a chance of getting those errors. In addition, whether you're using a monolith or microservices architecture, there is a big chance that you still need to call other services that are not within your company's internal network.
Calling a service in a different network means your request will go through more network layers and have a higher chance of failure. Other than network errors, you can also get system errors like rate-limit errors, service downtime, and processing timeouts. The errors you get may or may not be suitable for retrying. Let's head to the next section to explore this in more detail.

When To Retry Your Request

Although adding a retry mechanism to your system is generally a good idea, not every request to the downstream service should be retried. As a simple baseline, the things you should consider when you want to retry are as follows:

Is It a Transient Error?

You'll need to consider whether the type of error you're getting is transient (temporary). For example, you can retry a connection timeout error because it's usually only temporary, but not a bad request error, because you need to change the request.

Is It a System Error?

When you're getting an error message from the downstream service, it can be categorized as either a system error or an application error. A system error is generally okay to retry because your request hasn't been processed by the downstream service yet. On the other hand, an application error usually means that something is wrong with your request, and you should not retry it. For example, if you're getting a bad request error from the downstream service, you'll always get the same error no matter how many times you retry.

Idempotency

Even when you're getting an error from the downstream service, there is still a chance it has processed your request. The downstream service could send the error after it has completed the main process, but another sub-process caused the error. An idempotent API means that even if the API gets the same request twice, it will only process the first request. We can achieve this by adding an ID to the request that's unique to the request, so the downstream service can determine whether it should process the request.
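The idempotency-key mechanism just described can be sketched as a toy service (the class and names are illustrative, not a specific framework's API): the caller sends a unique key with each logical request, and the service replays the stored result for any duplicate delivery instead of running the side effect twice.

```python
# Minimal sketch of server-side idempotency using a unique request ID
# (idempotency key). All names are illustrative.

class PaymentService:
    def __init__(self):
        self._processed = {}   # idempotency_key -> stored result
        self.charges = 0       # counts actual side effects

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]  # duplicate: replay result
        self.charges += 1                            # the real side effect
        result = {"status": "ok", "amount": amount}
        self._processed[idempotency_key] = result
        return result

svc = PaymentService()
svc.charge("req-123", 50)
svc.charge("req-123", 50)   # a retried request is processed only once
```

With this in place, a client can retry freely: duplicate deliveries of the same logical request produce the same response and only one side effect.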
Usually, you can differentiate this by the request method: GET, DELETE, and PUT are usually idempotent, and POST is not. However, you need to confirm the API's idempotency with the service owner.

The Cost of Retrying

When you retry your request to the downstream service, there will be additional resource usage. This can take the form of additional CPU usage, blocked threads, additional memory usage, additional bandwidth usage, etc. You need to consider this, especially if your service expects large traffic.

The Implementation Cost of the Retry Mechanism

Many programming languages already have a library that implements a retry mechanism, but you still need to determine which requests to retry. You can also create your own retry mechanism for every system if you want to, but of course, this means a high implementation cost for the retry mechanism.

Note 1: Many libraries have already implemented the retry mechanism gracefully. For example, if you're using the Spring Mongo library in Java Spring Boot and the connection between your apps and MongoDB is severed, it will try to reconnect.

Note 2: Some libraries also implement a retry mechanism by default. This is sometimes dangerous because you can be unaware that the library will retry your request.

I've also compiled some common errors and whether or not they're suitable for retrying. Let's describe the errors briefly, one by one.

Connection timeout: Your app failed to connect to the downstream service; hence, the downstream service isn't aware of your request, and you can retry it.

Read timeout: The downstream app has received your request but has not returned any response for a long time. Since the downstream service may already have processed the request, whether to retry depends on the API's idempotency.

Circuit breaker tripped: This is an error you get if you use a circuit breaker in your service. You can retry this kind of error because your service hasn't sent its request to the downstream service.
400 - Bad Request: This error means the downstream service flagged your request as invalid after validating it. You shouldn't retry this error because it will always return the same error if the request is the same.

401 - Unauthorized: You need to authorize before sending the request. Whether you can retry this error depends on the authentication method and the error, but generally, you will always get the same error if your request is the same.

429 - Too Many Requests: Your request is rate-limited by the downstream service. You can retry this error, although you should confirm with the downstream service's owner how long your request will be rate-limited.

500 - Internal Server Error: This means the downstream service started processing your request but failed in the middle of it. Usually, it's okay to retry this error.

503 - Service Unavailable: The downstream service is unavailable due to downtime. It is okay to retry this kind of error.

Retry Backoff Period

When your request to the downstream service fails, your system will need to wait for some time before trying again. This period is called the retry backoff period. Generally, there are three strategies for the wait time between calls: Fixed Backoff, Exponential Backoff, and Random Backoff. All three have their advantages and disadvantages. Which one you use should depend on your API and service use case.

Fixed Backoff

Fixed backoff means that every time you retry your request, the delay between requests is always the same. For example, if you retry twice with a backoff of 5 seconds, then if the first call fails, the second request will be sent 5 seconds later. If it fails again, the third call will be sent 5 seconds after the failure. A fixed backoff period is suitable for a request coming directly from the user that needs a quick response. If the request is important and you need it to come back ASAP, then you can set the backoff period to none or close to 0.
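The retry decisions from the error list above can be sketched as a small classifier. The groupings are illustrative, not authoritative; always confirm the exact semantics (and idempotency) with each downstream service's owner.

```python
# Illustrative sketch mapping the errors discussed above to a retry decision.

RETRYABLE_STATUS = {429, 500, 503}       # transient or recoverable errors
NON_RETRYABLE_STATUS = {400, 401}        # same request -> same error

def is_retryable(status_code=None, error=None, idempotent=False):
    if error in ("connection_timeout", "circuit_breaker_tripped"):
        return True   # the request never reached the downstream service
    if error == "read_timeout":
        return idempotent  # the downstream service may have processed it
    if status_code in NON_RETRYABLE_STATUS:
        return False
    return status_code in RETRYABLE_STATUS
```

A function like this keeps the retry policy in one place, so the rest of the system only asks "retryable or not?" instead of re-deciding per call site.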
Exponential Backoff

When a downstream service is having a problem, it doesn't always recover quickly. What you don't want to do when the downstream service is trying to recover is to hit it multiple times in a short interval. Exponential backoff works by adding additional backoff time every time our service attempts to call the downstream service. For example, we can configure our retry mechanism with a 5-second initial backoff and a multiplier of two per attempt. This means when our first call to the downstream service fails, our service will wait 5 seconds before the next call. If the second call fails again, the service will wait 10 seconds instead of 5 seconds before the next call. Due to its longer intervals, exponential backoff is unsuitable for retrying a user request, but it is perfect for a background process like a notification, email, or webhook system.

Random Backoff

Random backoff is a backoff strategy that introduces randomness into its backoff interval calculation. Suppose that your service is getting a burst of traffic. Your service then calls a downstream service for every request, and you get errors because the downstream service is overwhelmed by your requests. Your service implements a retry mechanism and will retry the requests in 5 seconds. But there is a problem: when it's time to retry the requests, all of them will be retried at once, and you might get an error from the downstream service again. With the randomness introduced by the random backoff mechanism, you can avoid this. A random backoff strategy helps your service level out requests to the downstream service by introducing a random delay for each retry. Let's say you configure the retry mechanism with a 5-second interval and two retries. If the first call fails, the second one could be attempted after 500ms; if it fails again, the third one could be attempted after 3.8 seconds.
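The three strategies above can be sketched as simple delay calculators; the numbers mirror the examples in the text (5-second base, multiplier of two, randomized delay up to 5 seconds).

```python
import random

# A sketch of the three backoff strategies discussed above.

def fixed_backoff(attempt, base=5.0):
    return base                            # same delay every time

def exponential_backoff(attempt, base=5.0, multiplier=2.0):
    return base * multiplier ** attempt    # 5s, 10s, 20s, ...

def random_backoff(attempt, cap=5.0, rng=random.random):
    return rng() * cap                     # anywhere in [0, cap)
```

In practice, randomness ("jitter") is often combined with exponential backoff, spreading retries out while still backing off from a struggling service.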
If many requests fail at the downstream service, they won't be retried simultaneously.

Where To Store the Retry State

When doing a retry, you'll need to store the state of the retry somewhere. The state includes how many retries have been made, the request to be retried, and any additional metadata you want to save. Generally, there are three places you can use to store the retry state, which are as follows:

Thread

The thread is the most common place to store the retry state. If you're using a library with a built-in retry mechanism, it will most likely use the thread to store the state. The simplest way to do this is to sleep the thread. Let's see an example in Java:

    int retryCount = 0;
    while (retryCount < 3) {
        try {
            thirdPartyOutboundService.getData();
            break; // success: stop retrying
        } catch (Exception e) {
            retryCount += 1;
            try {
                Thread.sleep(3000); // fixed backoff before the next attempt
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

The code above basically sleeps the thread when getting an exception and then calls the process again, exiting the loop once the call succeeds. While this is simple, it has the disadvantage of blocking the thread and making other processes unable to use it. This method is suitable for a fixed backoff strategy with a low interval, like processes that respond directly to the user and need a response as soon as possible.

Messaging

We could use a popular messaging broker like RabbitMQ (with a delayed queue) to save the retry state. When you get a request from the upstream and fail to process it (whether because of the downstream service or not), you can publish the message to the delayed queue to consume it later (depending on your backoff). Using messaging to save the retry state is suitable for a background process, because the upstream service can't directly get the response of the retry process. The advantage of this approach is that it's usually easy to implement because the broker/library already supports the retry function. Messaging as a storage system for retry state also works well with distributed systems.
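The delayed-queue flow described above can be sketched with a toy in-memory queue. This is purely illustrative: a real system would use a broker feature such as RabbitMQ's delayed or dead-letter queues rather than this class.

```python
import heapq

# Toy in-memory stand-in for a broker's delayed queue: failed requests are
# published with a delay (the backoff) and consumed again once it elapses.

class DelayedQueue:
    def __init__(self):
        self._heap = []   # (ready_at, seq, message)
        self._seq = 0     # tie-breaker so messages never need comparing

    def publish(self, message, delay, now):
        heapq.heappush(self._heap, (now + delay, self._seq, message))
        self._seq += 1

    def consume_ready(self, now):
        """What a consumer would receive when polling at time `now`."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready

q = DelayedQueue()
q.publish({"request": "send-email", "attempt": 2}, delay=5, now=0)
```

Because the message (including its attempt count) lives in the queue rather than in a thread, the retry survives a restart of the consuming service, which is the property the next paragraph relies on.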
One problem that can happen is that your service suddenly has an issue, like downtime, while waiting for the next retry. By saving the retry state in the messaging broker, your service can continue the retry after the issue has been resolved.

Database

The database is the most customizable place to store the retry state, whether persistent storage or an in-memory KV store like Redis. When the request to the downstream service fails, you can save the data in the database and use a cron job to check the database every second or minute and retry the failed messages. While this is the most customizable solution, the implementation cost will be high because you'll need to implement your own retry mechanism. You can either build the mechanism into your service, with the downside of sacrificing a bit of performance while a retry is happening, or make an entirely new service for retry purposes.

Takeaways

This article has explored what a retry pattern is and what aspects to consider when implementing one. If you implement the retry mechanism correctly, you'll improve the user experience and reduce the operational burden of the service you're building. But if you do it incorrectly, you risk worsening the user experience and causing business errors. You need to understand when a request can be retried and how to retry it so you can implement the mechanism correctly. There is much more to downstream resiliency than the retry pattern: we can combine it with a timeout and a circuit breaker to make our system even more resilient to downstream failure.
Traditional machine learning (ML) models and AI techniques often suffer from a critical flaw: they lack uncertainty quantification. These models typically provide point estimates without accounting for the uncertainty surrounding their predictions. This limitation undermines the ability to assess the reliability of the model's output. Moreover, traditional ML models are data-hungry, often require correctly labeled data, and as a result tend to struggle with problems where data is limited. Furthermore, these models lack a systematic framework for incorporating expert domain knowledge or prior beliefs. Without the ability to leverage domain-specific insights, a model might overlook crucial nuances in the data and fail to perform to its potential. ML models are becoming more complex and opaque, while there is a growing demand for more transparency and accountability in decisions derived from data and AI.

Probabilistic Programming: A Solution To Addressing These Challenges

Probabilistic programming provides a modeling framework that addresses these challenges. At its core lies Bayesian statistics, a departure from the frequentist interpretation of statistics.

Bayesian Statistics

In frequentist statistics, probability is interpreted as the long-run relative frequency of an event. Data is considered random, the result of sampling from a fixed, defined distribution; hence, noise in measurement is associated with sampling variation. Frequentists believe that probability exists and is fixed, and that infinite experiments converge to that fixed value. Frequentist methods do not assign probability distributions to parameters, and their interpretation of uncertainty is rooted in the long-run frequency properties of estimators rather than explicit probabilistic statements about parameter values. In Bayesian statistics, probability is interpreted as a measure of uncertainty in a particular belief.
Data is considered fixed, while the unknown parameters of the system are regarded as random variables and are modeled using probability distributions. Bayesian methods capture uncertainty within the parameters themselves and hence offer a more intuitive and flexible approach to uncertainty quantification.

Frequentist vs. Bayesian Statistics [1]

Probabilistic Machine Learning

In frequentist ML, model parameters are treated as fixed and estimated through Maximum Likelihood Estimation (MLE), where the likelihood function quantifies the probability of observing the data given the statistical model. MLE seeks point estimates of parameters maximizing this probability. To implement MLE:

Assume a model and the underlying model parameters.
Derive the likelihood function based on the assumed model.
Optimize the likelihood function to obtain point estimates of parameters.

Hence, frequentist models, which include Deep Learning, rely on optimization, usually gradient-based, as their fundamental tool. In contrast, Bayesian methods model the unknown parameters and their relationships with probability distributions and use Bayes' theorem to compute and update these probabilities as we obtain new data.

Bayes' Theorem: "Bayes' rule tells us how to derive a conditional probability from a joint, conditioning tells us how to rationally update our beliefs, and updating beliefs is what learning and inference are all about" [2]. For parameters θ and data D:

P(θ | D) = P(D | θ) P(θ) / P(D)

This is a simple but powerful equation. The Prior P(θ) represents the initial belief about the unknown parameters. The Likelihood P(D | θ) represents the probability of the data based on the assumed model. The Marginal Likelihood P(D) is the model evidence, a normalizing coefficient. The Posterior P(θ | D) represents our updated beliefs about the parameters, incorporating both prior knowledge and observed evidence. In Bayesian machine learning, inference is the fundamental tool.
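As a concrete illustration of a Bayesian update, here is a minimal sketch assuming a coin-toss model with a conjugate Beta prior, where the posterior is available in closed form (the numbers are illustrative):

```python
# Conjugate Bayesian update for a coin toss: a Beta(a, b) prior on the
# heads probability, updated with h heads out of n tosses, yields a
# Beta(a + h, b + n - h) posterior.

def update(prior_a, prior_b, heads, n):
    return prior_a + heads, prior_b + (n - heads)

def posterior_mean(a, b):
    return a / (a + b)

# Same data (70 heads in 100 tosses), two different priors:
weak = update(1, 1, heads=70, n=100)      # uniform prior, easily swayed
strong = update(50, 50, heads=70, n=100)  # strong prior belief in a fair coin
```

The weak prior's posterior mean (about 0.70) sits near the observed rate, while the strong prior pulls its posterior mean (0.60) back toward 0.5, showing how prior beliefs and sample size jointly shape the posterior.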
The distribution of parameters represented by the posterior distribution is utilized for inference, offering a more comprehensive understanding of uncertainty.

Bayesian update in action: The plot below illustrates the posterior distribution for a simple coin-toss experiment across various sample sizes and with two distinct prior distributions. This visualization provides insight into how the combination of different sample sizes and prior beliefs influences the resulting posterior distributions.

Impact of Sample Size and Prior on Posterior Distribution

How To Model the Posterior Distribution

The seemingly simple posterior distribution is, in most cases, hard to compute. In particular, the denominator, i.e., the marginal likelihood integral, tends to be intractable, especially when working with a higher-dimensional parameter space. In most cases there is no closed-form solution, and numerical integration methods are computationally intensive. To address this challenge, we rely on a special class of algorithms called Markov Chain Monte Carlo simulations to model the posterior distribution. The idea is to sample from the posterior distribution rather than explicitly modeling it, and to use those samples to represent the distribution of the model parameters.

Markov Chain Monte Carlo (MCMC)

"MCMC methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by recording states from the chain" [3]. A few commonly used MCMC samplers are: Metropolis-Hastings, the Gibbs Sampler, Hamiltonian Monte Carlo (HMC), the No-U-Turn Sampler (NUTS), and Sequential Monte Carlo (SMC).

Probabilistic Programming

Probabilistic Programming is a programming framework for Bayesian statistics, i.e.,
it concerns the development of syntax and semantics for languages that denote conditional inference problems, along with "solvers" for those inference problems. In essence, Probabilistic Programming is to Bayesian Modeling what automated differentiation tools are to classical Machine Learning and Deep Learning models [2]. There exists a diverse ecosystem of Probabilistic Programming languages, each with its own syntax, semantics, and capabilities. Some of the most popular languages include:

BUGS (Bayesian inference Using Gibbs Sampling) [4]: BUGS is one of the earliest probabilistic programming languages, known for its user-friendly interface and support for a wide range of probabilistic models. It implements Gibbs sampling and other Markov Chain Monte Carlo (MCMC) methods for inference.

JAGS (Just Another Gibbs Sampler) [5]: JAGS is a specialized language for Bayesian hierarchical modeling, particularly suited for complex models with nested structures. It utilizes the Gibbs sampling algorithm for posterior inference.

Stan: A probabilistic programming language renowned for its expressive modeling syntax and efficient sampling algorithms, Stan is widely used in academia and industry for a variety of Bayesian modeling tasks. "Stan differs from BUGS and JAGS in two primary ways. First, Stan is based on a new imperative probabilistic programming language that is more flexible and expressive than the declarative graphical modeling languages underlying BUGS or JAGS, in ways such as declaring variables with types and supporting local variables and conditional statements. Second, Stan's Markov chain Monte Carlo (MCMC) techniques are based on Hamiltonian Monte Carlo (HMC), a more efficient and robust sampler than Gibbs sampling or Metropolis-Hastings for models with complex posteriors" [6].

BayesDB: BayesDB is a probabilistic programming platform designed for large-scale data analysis and probabilistic database querying.
It enables users to perform probabilistic inference on relational databases using SQL-like queries [7].

PyMC3: PyMC3 is a Python library for Probabilistic Programming that offers an intuitive and flexible interface for building and analyzing probabilistic models. It leverages advanced sampling algorithms such as Hamiltonian Monte Carlo (HMC) and Automatic Differentiation Variational Inference (ADVI) for inference [8].

TensorFlow Probability: "TensorFlow Probability (TFP) is a Python library built on TensorFlow that makes it easy to combine probabilistic models and deep learning on modern hardware (TPU, GPU)" [9].

Pyro: "Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling" [10].

These languages share a common workflow, outlined below:

Model definition: The model defines the processes governing data generation, latent parameters, and their interrelationships. This step requires careful consideration of the underlying system and the assumptions made about its behavior.

Prior distribution specification: Define the prior distributions for the unknown parameters within the model. These priors encode the practitioner's beliefs, domain expertise, or prior knowledge about the parameters before observing any data.

Likelihood specification: Describe the likelihood function, representing the probability distribution of the observed data conditioned on the unknown parameters. The likelihood function quantifies the agreement between the model predictions and the observed data.

Posterior distribution inference: Use a sampling algorithm to approximate the posterior distribution of the model parameters given the observed data. This typically involves running Markov Chain Monte Carlo (MCMC) or Variational Inference (VI) algorithms to generate samples from the posterior distribution.
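The four steps above can be sketched end to end with a hand-rolled random-walk Metropolis-Hastings sampler in place of a full probabilistic programming language. The coin-flip model (Beta(2, 2) prior on the heads probability, Bernoulli likelihood) and all names are illustrative.

```python
import math
import random

def log_prior(theta):
    """Step 2: Beta(2, 2) prior on theta, up to an additive constant."""
    return math.log(theta) + math.log(1.0 - theta)

def log_likelihood(theta, heads, tails):
    """Step 3: Bernoulli likelihood of the observed tosses (Step 1: the model)."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def sample_posterior(heads, tails, steps=20000, seed=0):
    """Step 4: random-walk Metropolis-Hastings over the posterior."""
    rng = random.Random(seed)
    theta, samples = 0.5, []
    for _ in range(steps):
        proposal = theta + rng.gauss(0.0, 0.1)
        if 0.0 < proposal < 1.0:  # proposals outside (0, 1) are rejected
            log_ratio = (log_prior(proposal) - log_prior(theta)
                         + log_likelihood(proposal, heads, tails)
                         - log_likelihood(theta, heads, tails))
            if math.log(rng.random()) < log_ratio:
                theta = proposal  # accept the move
        samples.append(theta)
    return samples[steps // 2:]  # drop the first half as burn-in

samples = sample_posterior(heads=70, tails=30)
est_mean = sum(samples) / len(samples)  # concentrates near 72/104
```

A PPL such as PyMC or Stan automates exactly this loop (and uses far better samplers, like NUTS); the value of writing it out once is seeing how the prior, likelihood, and sampler fit together.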
Case Study: Forecasting Stock Index Volatility

In this case study, we will employ Bayesian modeling techniques to forecast the volatility of a stock index. Volatility here measures the degree of variation in a stock's price over time and is a crucial metric for assessing the risk associated with a particular stock.

Data: For this analysis, we will utilize historical data from the S&P 500 stock index. The S&P 500 is a widely used benchmark index that tracks the performance of 500 large-cap stocks in the United States. By examining the percentage change in the index's price over time, we can gain insights into its volatility.

S&P 500 – Share Price and Percentage Change

From the plot above, we can see that the time series of price changes between consecutive days has a constant mean but changing variance over time, i.e., the time series exhibits heteroscedasticity.

Modeling Heteroscedasticity: "In statistics, a sequence of random variables is homoscedastic if all its random variables have the same finite variance; this is also known as homogeneity of variance. The complementary notion is called heteroscedasticity, also known as heterogeneity of variance" [11]. Auto-regressive Conditional Heteroskedasticity (ARCH) models are specifically designed to address heteroscedasticity in time series data.

Bayesian vs. Frequentist Implementation of the ARCH Model

The key benefits of Bayesian modeling include the ability to incorporate prior information and to quantify uncertainty in model parameters and predictions. These are particularly useful in settings with limited data and when prior knowledge is available. In conclusion, Bayesian modeling and probabilistic programming offer powerful tools for addressing the limitations of traditional machine-learning approaches. By embracing uncertainty quantification, incorporating prior knowledge, and providing transparent inference mechanisms, these techniques empower data scientists to make more informed decisions in complex real-world scenarios.
References

1. Fornacon-Wood, I., Mistry, H., Johnson-Hart, C., Faivre-Finn, C., O'Connor, J.P., and Price, G.J., 2022. Understanding the differences between Bayesian and frequentist statistics. International Journal of Radiation Oncology, Biology, Physics, 112(5), pp. 1076-1082.
2. Van de Meent, J.W., Paige, B., Yang, H., and Wood, F., 2018. An Introduction to Probabilistic Programming. arXiv preprint arXiv:1809.10756.
3. Markov chain Monte Carlo
4. Spiegelhalter, D., Thomas, A., Best, N., and Gilks, W., 1996. BUGS 0.5: Bayesian inference using Gibbs sampling manual (version ii). MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK, pp. 1-59.
5. Hornik, K., Leisch, F., Zeileis, A., and Plummer, M., 2003. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of DSC (Vol. 2, No. 1).
6. Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M.A., Guo, J., Li, P., and Riddell, A., 2017. Stan: A probabilistic programming language. Journal of Statistical Software, 76.
7. BayesDB
8. PyMC
9. TensorFlow Probability
10. Pyro AI
11. Homoscedasticity and heteroscedasticity
12. Introduction to ARCH Models
13. pymc.GARCH11
Salesforce Analytics Query Language (SAQL) is a proprietary Salesforce query language designed for analyzing Salesforce native objects and CRM Analytics datasets. SAQL enables developers to query, transform, and project data to facilitate business insights by customizing CRM dashboards. SAQL is very similar to SQL (Structured Query Language); however, it is designed to explore data within Salesforce and has its own unique syntax, which is somewhat like Pig Latin (pig-ql). You can also use SAQL to implement complex logic while preparing datasets using dataflows and recipes.

Key Features

Key features of SAQL include the following:

It enables users to specify filter conditions and to group and summarize input data streams into aggregated values, in order to derive actionable insights and analyze trends.
SAQL supports conditional statements such as IF-THEN-ELSE and CASE, which can be used to express complex conditions for data filtering and transformation.
SAQL date- and time-related functions make it much easier to work with date and time attributes, allowing users to perform time-based analysis, like comparing data over various time intervals.
SAQL supports a variety of data transformation functions to cleanse, format, and typecast data, altering the structure of the data to suit your requirements.
SAQL enables you to create complex calculated fields from existing data fields by applying mathematical, logical, or string functions.
SAQL provides seamless integration with Salesforce objects and CRM Analytics datasets.
SAQL queries can be used to power visuals like charts, graphs, and dashboards within the Salesforce CRM Analytics platform.

The rest of this article will explain the fundamentals of writing SAQL queries and delve into a few use cases where you can use SAQL to analyze Salesforce data.
Basics of SAQL

Typical SAQL queries work like any other ETL tool: queries load the datasets, perform operations/transformations, and create an output data stream to be used in visualization. SAQL statements can run across multiple lines and are concluded with a semicolon. Every line of the query works on a named stream, which can serve as input for any subsequent statements in the same query. The following SAQL query can be used to create a data stream to analyze the opportunities booked in the previous year by month.

SQL
1. q = load "OpportunityLineItems";
2. q = filter q by 'StageName' == "6 - Closed Won" and date('CloseDate_Year', 'CloseDate_Month', 'CloseDate_Day') in ["1 year ago".."1 year ago"];
3. q = group q by ('CloseDate_Year', 'CloseDate_Month');
4. q = foreach q generate q.'CloseDate_Year' as 'CloseDate_Year', q.'CloseDate_Month' as 'CloseDate_Month', sum(q.'ExpectedTotal__c') as 'Bookings';
5. q = order q by ('CloseDate_Year' asc, 'CloseDate_Month' asc);
6. q = limit q 2000;

Line 1: Loads the CRM Analytics dataset named "OpportunityLineItems" into an input stream q.
Line 2: Filters the input stream q to the opportunities closed won in the previous year. This is similar to the WHERE clause in SQL.
Line 3: Groups the records by close date year and month so that we can visualize the data by month. This is similar to the GROUP BY clause in SQL.
Line 4: Selects the attributes to project from the input stream; here, the expected total is summed for each group.
Line 5: Orders the records by close date year and month so that we can create a line chart to visualize bookings by month.
Line 6: Restricts the stream to a limited number of rows. This is mainly used for debugging purposes.
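For readers more at home in general-purpose languages, the named-stream pipeline above behaves much like an in-memory filter/group/aggregate. The following Python sketch mirrors its logic on invented toy rows; this is only an analogy to build intuition, since real SAQL runs inside CRM Analytics against its datasets.

```python
from collections import defaultdict

# Toy rows standing in for the "OpportunityLineItems" dataset (hypothetical data).
rows = [
    {"StageName": "6 - Closed Won", "Year": 2023, "Month": 1, "ExpectedTotal__c": 100.0},
    {"StageName": "6 - Closed Won", "Year": 2023, "Month": 1, "ExpectedTotal__c": 50.0},
    {"StageName": "2 - Open",       "Year": 2023, "Month": 2, "ExpectedTotal__c": 75.0},
    {"StageName": "6 - Closed Won", "Year": 2023, "Month": 2, "ExpectedTotal__c": 25.0},
]

# filter q by 'StageName' == "6 - Closed Won"
q = [r for r in rows if r["StageName"] == "6 - Closed Won"]

# group q by (year, month) ... sum('ExpectedTotal__c') as 'Bookings'
bookings = defaultdict(float)
for r in q:
    bookings[(r["Year"], r["Month"])] += r["ExpectedTotal__c"]

# order q by (year, month)
result = sorted(bookings.items())
print(result)  # [((2023, 1), 150.0), ((2023, 2), 25.0)]
```

Each intermediate value plays the role of a named stream: later steps consume the output of earlier ones, exactly as q does line by line in the SAQL query.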
Joining Multiple Data Streams

The SAQL cogroup function joins input data streams such as Salesforce objects or CRM Analytics datasets. The data sources being joined should have a related column to facilitate the join. cogroup supports both INNER and OUTER joins. For example, if you had two datasets, one containing sales data and another containing customer data, you could use cogroup to join them based on a common field like customer ID. The resultant data stream contains fields from both sources.

Use Case

The following code block builds a data stream of NewPipeline and Bookings for customers. The pipeline built and the bookings come from two different streams, which we can join by account name.

SQL
q = load "Pipeline_Metric";
q = filter q by 'Source' in ["NewPipeline"];
q = group q by 'AccountName';
q = foreach q generate q.'AccountName' as 'AccountName', sum(ExpectedTotal__c) as 'NewPipeline';
q1 = load "Bookings_Metric";
q1 = filter q1 by 'Source' in ["Bookings"];
q1 = group q1 by 'AccountName';
q1 = foreach q1 generate q1.'AccountName' as 'AccountName', sum(q1.ExpectedTotal__c) as 'Bookings';
q2 = cogroup q by 'AccountName', q1 by 'AccountName';
result = foreach q2 generate q.'AccountName' as 'AccountName', sum(q.'NewPipeline') as 'NewPipeline', sum(q1.'Bookings') as 'Bookings';

You can also use a left outer cogroup to join the right data stream with the left. This results in all the records from the left data stream plus the matching records from the right stream. Use the coalesce function to replace the null values from the right stream with another value. In the example above, if you want to report all the accounts with or without bookings, you can use the query below.
SQL
q = load "Pipeline_Metric";
q = filter q by 'Source' in ["NewPipeline"];
q = group q by 'AccountName';
q = foreach q generate q.'AccountName' as 'AccountName', sum(ExpectedTotal__c) as 'NewPipeline';
q1 = load "Bookings_Metric";
q1 = filter q1 by 'Source' in ["Bookings"];
q1 = group q1 by 'AccountName';
q1 = foreach q1 generate q1.'AccountName' as 'AccountName', sum(q1.ExpectedTotal__c) as 'Bookings';
q2 = cogroup q by 'AccountName' left, q1 by 'AccountName';
result = foreach q2 generate q.'AccountName' as 'AccountName', sum(q.'NewPipeline') as 'NewPipeline', coalesce(sum(q1.'Bookings'), 0) as 'Bookings';

Top N Analysis Using Windowing

SAQL enables Top N analysis across groups of values using windowing functions within the input data stream. These functions can be used to derive moving averages, cumulative totals, and rankings within groups. You specify the set of records over which these calculations execute using the "over" keyword. SAQL allows you to specify an offset identifying the number of records before and after the selected row; optionally, you can work on all the records within a partition. These sets of records are called windows. Once the set of records for a window is identified, you can apply an aggregation function to all the records within the defined window. Optionally, you can create partitions to group the records based on a set of fields and perform the aggregate calculations for each partition independently.

Use Case

The following SAQL code can be used to prepare data for each customer's percentage contribution of new pipeline to the total pipeline by region, along with the customer's ranking within the region.

SQL
q = load "Pipeline_Metric";
q = filter q by 'Source' in ["NewPipeline"];
q = group q by ('Region','AccountName');
q = foreach q generate q.'Region' as 'Region', q.'AccountName' as 'AccountName', ((sum('ExpectedTotal__c')/sum(sum('ExpectedTotal__c')) over ([..]
partition by 'Region')) * 100) as 'PCT_PipelineContribution', rank() over ([..] partition by ('Region') order by sum('ExpectedTotal__c') desc) as 'Rank';
q = filter q by 'Rank' <= 5;

Data Aggregation: Grand Totals and Subtotals With SAQL

SAQL offers the rollup and grouping functions to aggregate data streams based on pre-defined groups. While the rollup construct is used with the group by statement, grouping is used as part of foreach statements while projecting the input data stream. The rollup function aggregates the input data stream at various levels of a hierarchy, allowing you to create calculated fields on datasets summarized at higher levels of the hierarchy. For example, if you have data by day, rollup can be used to aggregate the results by week, month, or year. The grouping function is used to group data based on specific dimensions or fields in order to segment the data into meaningful subsets for analysis. For example, you might group sales data by product category or region to analyze performance within each group.

Use Case

Use the code below to prepare data for the total number of accounts and accounts engaged by region and theater. Also, add the grand total to look at the global numbers, plus subtotals for both regions and theaters.
SQL
q = load "ABXLeadandOpportunities_Metric";
q = filter q by 'Source' == "ABX Opportunities" and 'CampaignType' == "Growth Sprints" and 'Territory_Level_01__c' is not null;
q = foreach q generate 'Territory_Level_01__c' as 'Territory_Level_01__c', 'Territory_Level_02__c' as 'Territory_Level_02__c', 'Territory_Level_03__c' as 'Territory_Level_03__c', q.'AccountName' as 'AccountName', q.'OId' as 'OId', 'MarketingActionedOppty' as 'MarketingActionedOppty', 'AccountActionedAcct' as 'AccountActionedAcct', 'ADRActionedOppty' as 'ADRActionedOppty', 'AccountActionedADRAcct' as 'AccountActionedADRAcct';
q = group q by rollup ('Territory_Level_01__c', 'Territory_Level_02__c');
q = foreach q generate case when grouping('Territory_Level_01__c') == 1 then "TOTAL" else 'Territory_Level_01__c' end as 'Level1', case when grouping('Territory_Level_02__c') == 1 then "LEVEL1 TOTAL" else 'Territory_Level_02__c' end as 'Level2', unique('AccountName') as 'Total Accounts', unique('AccountActionedAcct') as 'Engaged', ((unique('AccountActionedAcct') / unique('AccountName'))) as '% of Engaged';
q = limit q 2000;

Filling the Missing Date Fields

You can use the fill() function to create records for missing day, week, month, quarter, and year values in your dataset. This comes in very handy when you want to show the result as 0 for these missing days/weeks/months instead of not showing them at all.

Use Case

The following SAQL code allows you to track the number of tasks for sales agents by day of the week. If an agent is on PTO, you want to show 0 tasks.
SQL
q = load "Tasks_Metric";
q = filter q by 'Source' == "Tasks";
q = filter q by date('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day') in [dateRange([2024,4,23], [2024,4,30])];
q = group q by ('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day');
q = foreach q generate q.'MetricDate_Year' as 'MetricDate_Year', q.'MetricDate_Month' as 'MetricDate_Month', q.'MetricDate_Day' as 'MetricDate_Day', unique(q.'Id') as 'Tasks';
q = order q by ('MetricDate_Year' asc, 'MetricDate_Month' asc, 'MetricDate_Day' asc);
q = limit q 2000;

The code above will be missing the two days on which no tasks were created. You can use the code below to fill in the missing days.

SQL
q = load "Tasks_Metric";
q = filter q by 'Source' == "Tasks";
q = filter q by date('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day') in [dateRange([2024,4,23], [2024,4,30])];
q = group q by ('MetricDate_Year', 'MetricDate_Month', 'MetricDate_Day');
q = foreach q generate q.'MetricDate_Year' as 'MetricDate_Year', q.'MetricDate_Month' as 'MetricDate_Month', q.'MetricDate_Day' as 'MetricDate_Day', unique(q.'Id') as 'Tasks';
q = fill q by (dateCols=(MetricDate_Year, MetricDate_Month, MetricDate_Day, "Y-M-D"));
q = order q by ('MetricDate_Year' asc, 'MetricDate_Month' asc, 'MetricDate_Day' asc);
q = limit q 2000;

You can also specify a start date and end date to populate the missing records between those dates.

Conclusion

In the end, SAQL has proven itself a powerful tool for the Salesforce developer community, empowering developers to extract actionable business insights from CRM datasets using capabilities like filtering, aggregation, windowing, time-based analysis, blending, custom calculations, Salesforce integration, and performance optimization. In this article, we have explored various capabilities of this technology and focused on targeted use cases.
As a next step, I would recommend continuing your learning by exploring the Salesforce documentation, building your data models using dataflows, and using SAQL capabilities to harness the true potential of Salesforce as a CRM.
Deployment strategies provide a systematic approach to releasing software changes, minimizing risks, and maintaining consistency across projects and teams. Without a well-defined strategy and a systematic approach, deployments can lead to downtime, data loss, or system failures, resulting in frustrated users and revenue loss. Before we explore the different deployment strategies in more detail, let's take a look at a short overview of each strategy mentioned in this article:

All-at-once deployment: This strategy involves updating all the target environments at once, making it the fastest but riskiest approach.
In-place deployment: This involves stopping the current application and replacing it with a new version, directly affecting availability.
Blue/Green deployment: A zero-downtime approach that involves running two identical environments and switching from old to new
Canary deployment: Introduces new changes incrementally to a small subset of users before a full rollout
Shadow deployment: Mirrors real traffic to a shadow environment where the new deployment is tested without affecting the live environment

All-At-Once Deployment

The all-at-once deployment strategy, also known as the "Big Bang" deployment strategy, involves simultaneously releasing your application's new version to all servers or environments. This method is straightforward and can be implemented quickly, as it does not require complex orchestration or additional infrastructure. The primary benefit of this approach is its simplicity and the ability to immediately transition all users to the new version of the application. However, the all-at-once method carries significant risks. Since all instances are updated together, any issues with the new release immediately impact all users. There is no opportunity to mitigate risks by gradually rolling out the change or testing it with a subset of the user base first.
Additionally, if something goes wrong, the rollback process can be just as disruptive as the initial deployment. Despite these risks, all-at-once deployment is used quite often and can be suitable for small applications or environments where downtime is more acceptable and the impact of potential issues is minimal. It is also useful in scenarios where applications are inherently simple or have been thoroughly tested to ensure compatibility and stability before release.

In-Place (Recreate) Deployment

The in-place (or recreate) deployment strategy is another commonly used strategy when developing projects. It is the simplest and does not require additional infrastructure: to deploy a new version, we stop the application and start it again with the new changes. The disadvantage of this approach is that the service we are updating will experience downtime that affects its users. Also, in case of problems with the new software changes, we might need to roll back the latest changes, which leads to further downtime. Deployment strategies that avoid downtime, and allow rolling back changes without it, were created for exactly this purpose and are used across the industry.

Blue/Green Deployment

The first zero-downtime deployment strategy we're going to talk about is the Blue/Green deployment strategy. Its main goal is to minimize downtime and risks while deploying new software versions. This is done by having two identical environments for our service. One environment contains the original application (the Blue environment) and serves users' requests; the other environment (the Green environment) is where new software changes are deployed. This allows us to verify and test new changes with near-zero downtime for users and the service, with the ability to safely roll back in case of any problems, except for some cases that we will discuss a bit later.
Typically, the process is the following: after verifying and testing the new changes in the Green environment, we reroute traffic from the Blue environment to our identical Green environment with the new changes. Sounds easy, doesn't it? ... it depends. The problem is that we can easily reroute traffic between environments only when our services are stateless. If they interact with any data sources, things get more complicated, and here's why: our identical Green and Blue environments share common data source(s). While sharing data sources such as NoSQL databases or object stores (AWS S3, for example) between our identical environments is easier to accomplish, the same is not at all true for relational databases, because supporting Blue/Green deployments there requires additional effort (NoSQL might also require some effort). Since approaches to handling schema updates without downtime are out of the scope of this article, you can check out the article "Upgrading database schema without downtime" to learn more (and if you have any interesting resources on updating schemas without downtime, please share them with us in the comments). A general recommendation: if your services are not stateless and use data sources with schemas, implementing a Blue/Green deployment strategy is not always advisable, because the additional risks and failure points it can introduce minimize the benefits of the strategy. But if you've decided that you need to integrate a Blue/Green deployment strategy and your infrastructure is running on Amazon Web Services, you might find the AWS document on how to implement Blue/Green deployments and their infrastructure useful.

Canary Deployment

The idea of the Canary deployment strategy is to reduce the risks of deploying new software versions in production by rolling out new changes to users slowly.
In the same manner as we do in the Blue/Green deployment strategy, we roll out the new software version to an identical environment; but instead of completely rerouting traffic from one environment to another, we route, for example, a portion of users to the environment with the new software version using a load balancer. The size of the portion of users getting the new software version, and the criteria used to determine them, may be specific to every company or project. Some roll out new changes only to their internal staff first, some determine users randomly, and some may use algorithms to match users based on certain criteria. Pick whatever best suits your needs.

Shadow Deployment

The shadow deployment strategy is the next strategy that I personally find interesting. This strategy also uses the concept of identical environments, just as the Blue/Green and Canary deployment strategies do. The main difference is that instead of completely rerouting traffic, or rerouting only a portion of real users, we duplicate the entire traffic to our second environment where the new changes are deployed. This way, we can test and verify our changes without negatively affecting our users, thus mitigating the risks of broken software updates or performance bottlenecks.

Conclusion

In this article, we walked through five different deployment strategies, each with its own set of advantages and challenges. The all-at-once and in-place deployment strategies stand out for their speed and the minimal effort required to deploy new versions of software. While these two strategies will be your go-to deployment strategies in most cases, it's still useful to understand and know about the more complex and resource-intensive strategies. Ultimately, implementing any deployment strategy requires careful consideration of the potential impact on both the system and its users. The choice of deployment strategy should align with your project's needs, risk tolerance, and operational capabilities.
The Java ORM world is very steady: few libraries exist, and none of them has brought any breaking change over the last decade. Meanwhile, application architecture has evolved with trends such as Hexagonal Architecture, CQRS, Domain-Driven Design, and Domain Purity. Stalactite tries to be more suitable for these new paradigms by letting you persist any kind of class without the need to annotate it or use external XML files: its mapping is made of method references. As a benefit, you get a better view of the entity graph, since the mapping is made through a fluent API that chains your entity relations instead of spreading annotations all over the entities. This is very helpful for seeing the complexity of your entity graph, which impacts its load time as well as its memory footprint. Moreover, since Stalactite only fetches data eagerly, we can say that what you see is what you get. Here is a very small example:

Java
MappingEase.entityBuilder(Country.class, Long.class)
    .mapKey(Country::getId, IdentifierPolicy.afterInsert())
    .mapOneToOne(Country::getCapital, MappingEase.entityBuilder(City.class, Long.class)
        .mapKey(City::getId, IdentifierPolicy.afterInsert())
        .map(City::getName))

First Steps

Release 2.0.0 has been out for some weeks and is available as a Maven dependency; below is an example with HSQLDB. For now, Stalactite is compatible with the following databases (mainly in their latest versions): HSQLDB, H2, PostgreSQL, MySQL, and MariaDB.

XML
<dependency>
    <groupId>org.codefilarete.stalactite</groupId>
    <artifactId>orm-hsqldb-adapter</artifactId>
    <version>2.0.0</version>
</dependency>

If you're interested in a less database-vendor-dedicated module, you can use the orm-all-adapter module. Just be aware that it will bring in extra modules and extra JDBC drivers, making your artifact heavier.
After getting Stalactite as a dependency, the next step is to have a JDBC DataSource and pass it to a org.codefilarete.stalactite.engine.PersistenceContext:

Java
org.hsqldb.jdbc.JDBCDataSource dataSource = new org.hsqldb.jdbc.JDBCDataSource();
dataSource.setUrl("jdbc:hsqldb:mem:test");
dataSource.setUser("sa");
dataSource.setPassword("");
PersistenceContext persistenceContext = new PersistenceContext(dataSource, new HSQLDBDialect());

Then comes the interesting part: the mapping. Supposing you have a Country entity, you can quickly set up its mapping through the fluent API, starting with the org.codefilarete.stalactite.mapping.MappingEase class, as such:

Java
EntityPersister<Country, Long> countryPersister = MappingEase.entityBuilder(Country.class, Long.class)
    .mapKey(Country::getId, IdentifierPolicy.afterInsert())
    .map(Country::getName)
    .build(persistenceContext);

The afterInsert() identifier policy means that the country.id column is an auto-increment column. Two other policies exist: beforeInsert() for identifiers given by a database sequence (for example), and alreadyAssigned() for entities that have a natural identifier given by business rules. Any non-declared property is considered transient and is not managed by Stalactite. The schema can be generated with the org.codefilarete.stalactite.sql.ddl.DDLDeployer class, as such (it will generate it into the PersistenceContext dataSource):

Java
DDLDeployer ddlDeployer = new DDLDeployer(persistenceContext);
ddlDeployer.deployDDL();

Finally, you can persist your entities thanks to the EntityPersister obtained previously; please find the example below. You might notice that you won't find JPA methods in the Stalactite persister. The reason is that Stalactite is far different from JPA and doesn't aim at being compatible with it: no annotations, no attach/detach mechanism, no first-level cache, no lazy loading, and so on.
Hence, the methods go straight to their goal:

Java
Country myCountry = new Country();
myCountry.setName("myCountry");
countryPersister.insert(myCountry);
myCountry.setName("myCountry with a different name");
countryPersister.update(myCountry);
Country loadedCountry = countryPersister.select(myCountry.getId());
countryPersister.delete(loadedCountry);

Spring Integration

That was a raw usage of Stalactite; meanwhile, you may be interested in its integration with Spring to benefit from the magic of its @Repository. Stalactite provides it; just be aware that it's still a work-in-progress feature. The approach to activating it is the same as for JPA: enable Stalactite repositories with the @EnableStalactiteRepositories annotation on your Spring application. Then declare the PersistenceContext and EntityPersister as @Bean:

Java
@Bean
public PersistenceContext persistenceContext(DataSource dataSource) {
    return new PersistenceContext(dataSource);
}

@Bean
public EntityPersister<Country, Long> countryPersister(PersistenceContext persistenceContext) {
    return MappingEase.entityBuilder(Country.class, Long.class)
        .mapKey(Country::getId, IdentifierPolicy.afterInsert())
        .map(Country::getName)
        .build(persistenceContext);
}

Then you can declare your repository as such, to be injected into your services:

Java
@Repository
public interface CountryStalactiteRepository extends StalactiteRepository<Country, Long> {
}

As mentioned earlier, since the paradigm of Stalactite is not the same as JPA (no annotations, no attach/detach mechanism, etc.), you won't find the same methods as in JPA repositories in Stalactite ones:

save: Saves the given entity, either inserting it or updating it according to its persistence state
saveAll: Same as the previous one, but for a collection of entities
findById: Tries to find an entity by its id in the database
findAllById: Same as the previous one, but for a collection of ids
delete: Deletes the given entity from the database
deleteAll: Same as the previous one, but for a collection of entities

Conclusion

In this article, we introduced the Stalactite ORM. More information about the configuration and the mapping, as well as the full documentation, is available on the website. The project is open source under the MIT license and shared through GitHub. Thanks for reading; any feedback is appreciated!
Through my years of building services, the RESTful API has been my primary go-to. However, even though REST has its merits, that doesn't mean it's the best approach for every use case. Over the years, I've learned that, occasionally, there might be better alternatives for certain scenarios. Sticking with REST just because I'm passionate about it, when it's not the right fit, only results in tech debt and a strained relationship with the product owner. One of the biggest pain points with the RESTful approach is the need to make multiple requests to retrieve all the necessary information for a business decision. As an example, let's assume I want a 360-degree view of a customer. I would need to make the following requests:

GET /customers/{some_token} provides the base customer information
GET /addresses/{some_token} supplies a required address
GET /contacts/{some_token} returns the contact information
GET /credit/{some_token} returns key financial information

While I understand the underlying goal of REST is to keep responses laser-focused for each resource, this scenario makes for more work on the consumer side. Just to populate a user interface that helps an organization make decisions about future business with the customer, the consumer must make multiple calls. In this article, I'll show why GraphQL is the preferred approach over a RESTful API here, demonstrating how to deploy Apollo Server (and Apollo Explorer) to get up and running quickly with GraphQL. I plan to build my solution with Node.js and deploy it to Heroku.

When To Use GraphQL Over REST?

There are several common use cases when GraphQL is a better approach than REST:

When you need flexibility in how you retrieve data: You can fetch complex data from various resources, but all in a single request. (I will dive down this path in this article.)
When the frontend team needs to evolve the UI frequently: Rapidly changing data requirements won't require the backend to adjust endpoints and cause blockers.
When you want to minimize over-fetching and under-fetching: Sometimes REST requires you to hit multiple endpoints to gather all the data you need (under-fetching), or hitting a single endpoint returns way more data than you actually need (over-fetching).
When you're working with complex systems and microservices: Sometimes multiple sources just need to hit a single API layer for their data. GraphQL can provide that flexibility through a single API call.
When you need real-time data pushed to you: GraphQL features subscriptions, which provide real-time updates. This is useful in the case of chat apps or live data feeds. (I will cover this benefit in more detail in a follow-up article.)

What Is Apollo Server?

Since my skills with GraphQL aren't polished, I decided to go with Apollo Server for this article. Apollo Server is a GraphQL server that works with any GraphQL schema. Its goal is to simplify the process of building a GraphQL API, and its underlying design integrates well with frameworks such as Express or Koa. I will explore the ability to leverage subscriptions (via the graphql-ws library) for real-time data in my next article. Where Apollo Server really shines is Apollo Explorer, a built-in web interface that developers can use to explore and test their GraphQL APIs. Explorer will be a perfect fit for me, as it allows for the easy construction of queries and the ability to view the API schema in a graphical format.
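Before diving into the Apollo setup, the over-/under-fetching point is worth making concrete. The sketch below uses plain Python with invented in-memory data (all tokens and fields are hypothetical) to contrast the REST-style consumer, which makes four lookups and stitches the results itself, with the GraphQL-style server, where a resolver does the stitching and one call returns the whole view:

```python
# Toy in-memory "services"; all tokens and fields are invented for illustration.
customers = {"customer-token-1": {"name": "Acme Inc.", "sic_code": "1234"}}
addresses = {"customer-token-1": {"city": "Anytown", "state": "CA"}}
contacts = {"customer-token-1": {"first_name": "John", "last_name": "Doe"}}
credits = {"customer-token-1": {"credit_limit": 10000.0, "credit_score": 750}}

# REST style: the consumer issues four separate requests and merges the payloads.
def rest_customer_360(token):
    return {
        "customer": customers[token],  # GET /customers/{token}
        "address": addresses[token],   # GET /addresses/{token}
        "contact": contacts[token],    # GET /contacts/{token}
        "credit": credits[token],      # GET /credit/{token}
    }

# GraphQL style: one query, one response, shaped the way the consumer asked for it.
# The server-side resolver performs the stitching instead of the client.
def graphql_customer_360(token):
    view = dict(customers[token])
    view["address"] = addresses[token]
    view["contact"] = contacts[token]
    view["credit"] = credits[token]
    return view

print(graphql_customer_360("customer-token-1")["credit"]["credit_score"])  # 750
```

Both functions produce the same information; the difference is who pays the integration cost, and how many round trips it takes to get there.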
My Customer 360 Use Case

For this example, let's assume we need the following schema to provide a 360-degree view of the customer:

TypeScript
type Customer {
  token: String
  name: String
  sic_code: String
}

type Address {
  token: String
  customer_token: String
  address_line1: String
  address_line2: String
  city: String
  state: String
  postal_code: String
}

type Contact {
  token: String
  customer_token: String
  first_name: String
  last_name: String
  email: String
  phone: String
}

type Credit {
  token: String
  customer_token: String
  credit_limit: Float
  balance: Float
  credit_score: Int
}

I plan to focus on the following GraphQL queries:

TypeScript
type Query {
  addresses: [Address]
  address(customer_token: String): Address
  contacts: [Contact]
  contact(customer_token: String): Contact
  customers: [Customer]
  customer(token: String): Customer
  credits: [Credit]
  credit(customer_token: String): Credit
}

Consumers will provide the token for the Customer they wish to view, and we expect to also retrieve the appropriate Address, Contact, and Credit objects. The goal is to retrieve all of this information with a single API call rather than four different ones.

Getting Started With Apollo Server

I started by creating a new folder called graphql-server-customer on my local workstation. Then, using the Get Started section of the Apollo Server documentation, I followed steps one and two using a TypeScript approach. Next, I defined my schema and also included some static data for testing. Ordinarily, we would connect to a database, but static data will work fine for this demo.
Below is my updated index.ts file:

```typescript
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';

const typeDefs = `#graphql
  type Customer {
    token: String
    name: String
    sic_code: String
  }

  type Address {
    token: String
    customer_token: String
    address_line1: String
    address_line2: String
    city: String
    state: String
    postal_code: String
  }

  type Contact {
    token: String
    customer_token: String
    first_name: String
    last_name: String
    email: String
    phone: String
  }

  type Credit {
    token: String
    customer_token: String
    credit_limit: Float
    balance: Float
    credit_score: Int
  }

  type Query {
    addresses: [Address]
    address(customer_token: String): Address
    contacts: [Contact]
    contact(customer_token: String): Contact
    customers: [Customer]
    customer(token: String): Customer
    credits: [Credit]
    credit(customer_token: String): Credit
  }
`;

const resolvers = {
  Query: {
    addresses: () => addresses,
    address: (parent, args, context) => {
      const customer_token = args.customer_token;
      return addresses.find(address => address.customer_token === customer_token);
    },
    contacts: () => contacts,
    contact: (parent, args, context) => {
      const customer_token = args.customer_token;
      return contacts.find(contact => contact.customer_token === customer_token);
    },
    customers: () => customers,
    customer: (parent, args, context) => {
      const token = args.token;
      return customers.find(customer => customer.token === token);
    },
    credits: () => credits,
    credit: (parent, args, context) => {
      const customer_token = args.customer_token;
      return credits.find(credit => credit.customer_token === customer_token);
    }
  },
};

const server = new ApolloServer({
  typeDefs,
  resolvers,
});

const { url } = await startStandaloneServer(server, {
  listen: { port: 4000 },
});

console.log(`Apollo Server ready at: ${url}`);

const customers = [
  { token: 'customer-token-1', name: 'Acme Inc.', sic_code: '1234' },
  { token: 'customer-token-2', name: 'Widget Co.', sic_code: '5678' }
];

const addresses = [
  { token: 'address-token-1', customer_token: 'customer-token-1', address_line1: '123 Main St.',
    address_line2: '', city: 'Anytown', state: 'CA', postal_code: '12345' },
  { token: 'address-token-22', customer_token: 'customer-token-2', address_line1: '456 Elm St.',
    address_line2: '', city: 'Othertown', state: 'NY', postal_code: '67890' }
];

const contacts = [
  { token: 'contact-token-1', customer_token: 'customer-token-1', first_name: 'John',
    last_name: 'Doe', email: 'jdoe@example.com', phone: '123-456-7890' }
];

const credits = [
  { token: 'credit-token-1', customer_token: 'customer-token-1', credit_limit: 10000.00,
    balance: 2500.00, credit_score: 750 }
];
```

With everything configured as expected, we run the following command to start the server:

```shell
$ npm start
```

With the Apollo server running on port 4000, I used the http://localhost:4000/ URL to access Apollo Explorer. Then I set up the following example query:

```graphql
query ExampleQuery {
  addresses { token }
  contacts { token }
  customers { token }
}
```

This is how it looks in Apollo Explorer. Running the example query, I validated that the response payload aligned with the static data I provided in index.ts:

```json
{
  "data": {
    "addresses": [
      { "token": "address-token-1" },
      { "token": "address-token-22" }
    ],
    "contacts": [
      { "token": "contact-token-1" }
    ],
    "customers": [
      { "token": "customer-token-1" },
      { "token": "customer-token-2" }
    ]
  }
}
```

Before going any further in addressing my Customer 360 use case, I wanted to run this service in the cloud.

Deploying Apollo Server to Heroku

Since this article is all about doing something new, I wanted to see how hard it would be to deploy my Apollo server to Heroku. I knew I had to address the port number differences between running locally and running somewhere in the cloud.
I updated my code for starting the server as shown below:

```typescript
const { url } = await startStandaloneServer(server, {
  listen: { port: Number.parseInt(process.env.PORT) || 4000 },
});
```

With this update, we'll use port 4000 unless a PORT value is specified in an environment variable.

Using GitLab, I created a new project for these files and logged into my Heroku account using the Heroku command-line interface (CLI):

```shell
$ heroku login
```

You can create a new app in Heroku with either the CLI or the Heroku dashboard web UI. For this article, we'll use the CLI:

```shell
$ heroku create jvc-graphql-server-customer
```

The CLI command returned the following response:

```shell
Creating jvc-graphql-server-customer... done
https://jvc-graphql-server-customer-b62b17a2c949.herokuapp.com/ | https://git.heroku.com/jvc-graphql-server-customer.git
```

The command also automatically added the repository used by Heroku as a remote:

```shell
$ git remote
heroku
origin
```

By default, Apollo Server disables Apollo Explorer in production environments. For my demo, I want to leave it running on Heroku. To do this, I need to set the NODE_ENV environment variable to development:

```shell
$ heroku config:set NODE_ENV=development
```

The CLI command returned the following response:

```shell
Setting NODE_ENV and restarting jvc-graphql-server-customer... done, v3
NODE_ENV: development
```

Now we're in a position to deploy our code to Heroku:

```shell
$ git commit --allow-empty -m 'Deploy to Heroku'
$ git push heroku
```

A quick view of the Heroku dashboard shows my Apollo Server running without any issues. If you're new to Heroku, this guide will show you how to create a new account and install the Heroku CLI.
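The PORT fallback used for the Heroku deployment leans on a JavaScript detail worth noting: Number.parseInt returns NaN when PORT is unset or non-numeric, and NaN is falsy, so || 4000 kicks in. A small sketch (the resolvePort helper is my own illustration, not part of the article's code):

```javascript
// Mirrors the listen-port logic: parse PORT if present, else fall back to 4000.
function resolvePort(env) {
  return Number.parseInt(env.PORT) || 4000;
}

console.log(resolvePort({}));               // 4000 (PORT unset -> NaN -> fallback)
console.log(resolvePort({ PORT: "8080" })); // 8080
console.log(resolvePort({ PORT: "abc" }));  // 4000 (non-numeric -> NaN -> fallback)
```

One caveat of the || fallback: a PORT of "0" would also fall back to 4000, which is acceptable for this demo since Heroku always supplies a real port.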
Acceptance Criteria Met: My Customer 360 Example

With GraphQL, I can meet the acceptance criteria for my Customer 360 use case with the following query:

```graphql
query CustomerData($token: String) {
  customer(token: $token) {
    name
    sic_code
    token
  }
  address(customer_token: $token) {
    token
    customer_token
    address_line1
    address_line2
    city
    state
    postal_code
  }
  contact(customer_token: $token) {
    token
    customer_token
    first_name
    last_name
    email
    phone
  }
  credit(customer_token: $token) {
    token
    customer_token
    credit_limit
    balance
    credit_score
  }
}
```

All I need to do is pass in a single Customer token variable with a value of customer-token-1:

```json
{
  "token": "customer-token-1"
}
```

We can retrieve all of the data using a single GraphQL API call:

```json
{
  "data": {
    "customer": {
      "name": "Acme Inc.",
      "sic_code": "1234",
      "token": "customer-token-1"
    },
    "address": {
      "token": "address-token-1",
      "customer_token": "customer-token-1",
      "address_line1": "123 Main St.",
      "address_line2": "",
      "city": "Anytown",
      "state": "CA",
      "postal_code": "12345"
    },
    "contact": {
      "token": "contact-token-1",
      "customer_token": "customer-token-1",
      "first_name": "John",
      "last_name": "Doe",
      "email": "jdoe@example.com",
      "phone": "123-456-7890"
    },
    "credit": {
      "token": "credit-token-1",
      "customer_token": "customer-token-1",
      "credit_limit": 10000,
      "balance": 2500,
      "credit_score": 750
    }
  }
}
```

Below is a screenshot from Apollo Explorer running from my Heroku app.

Conclusion

I recall earlier in my career when Java and C# were competing against each other for developer adoption. Advocates on each side of the debate were ready to prove that their chosen tech was the best choice … even when it wasn't. In this example, we could have met my Customer 360 use case in multiple ways. Using a proven RESTful API would have worked, but it would have required multiple API calls to retrieve all of the necessary data. Using Apollo Server and GraphQL allowed me to meet my goals with a single API call.
I also love how easy it is to deploy my GraphQL server to Heroku with just a few commands in my terminal. This allows me to focus on implementation, offloading the burdens of infrastructure and running my code to a trusted third-party provider. Most importantly, this falls right in line with my personal mission statement:

"Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else." – J. Vester

If you are interested in the source code for this article, it is available on GitLab.

But wait… there's more! In my follow-up post, we will build out our GraphQL server further to implement authentication and real-time data retrieval with subscriptions.

Have a really great day!
Snowflake is a leading cloud-based data storage and analytics service that provides solutions for data warehouses, data engineering, AI/ML modeling, and other related services. Among its many features and functionalities, one powerful data recovery feature is Time Travel, which allows users to access historical data from the past. It is beneficial when a user comes across any of the scenarios below:

- Retrieving a previous row or column value before the current DML operation
- Recovering the last state of data for backup or redundancy
- Recovering records that were updated or deleted from a table by mistake
- Restoring the previous state of a table, schema, or database

Snowflake's Continuous Data Protection Life Cycle allows Time Travel within a window of 1 to 90 days; retention of up to 90 days is available in the Enterprise edition.

Time Travel SQL Extensions

Time Travel can be achieved using the OFFSET, TIMESTAMP, and STATEMENT keywords in combination with the AT or BEFORE clause.

Offset

If a user wants to retrieve past data or recover a table from an older state relative to the current time, the offset is defined in seconds:

```sql
SELECT * FROM any_table AT(OFFSET => -60*5);            -- 5 minutes ago

CREATE TABLE recovered_table CLONE any_table AT(OFFSET => -3600);  -- 1 hour ago
```

Timestamp

Suppose a user wants to query data from the past or recover a schema as of a specific timestamp:

```sql
SELECT * FROM any_table AT(TIMESTAMP => 'Sun, 05 May 2024 16:20:00 -0700'::timestamp_tz);

CREATE SCHEMA recovered_schema CLONE any_schema AT(TIMESTAMP => 'Wed, 01 May 2024 01:01:00 +0300'::timestamp_tz);
```

Statement

Users can also use any unique query ID to get the state of the data up to that statement.
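The STATEMENT variant needs a query ID to anchor on. One way to look up recent query IDs is a sketch like the following, assuming the session has access to the QUERY_HISTORY table function in INFORMATION_SCHEMA:

```sql
-- List recent queries so a query_id can be copied into BEFORE(STATEMENT => ...)
SELECT query_id, query_text, start_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
ORDER BY start_time DESC
LIMIT 10;
```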
```sql
SELECT * FROM any_table BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234');

CREATE DATABASE recovered_db CLONE any_db BEFORE(STATEMENT => '9f6e1bq8-006f-55d3-a757-beg5a45c1234');
```

The commands below set the data retention time at table creation and then increase or decrease it:

```sql
CREATE TABLE any_table (id NUMERIC, name VARCHAR, created_date DATE) DATA_RETENTION_TIME_IN_DAYS=90;

ALTER TABLE any_table SET DATA_RETENTION_TIME_IN_DAYS=30;
```

If data retention is not required, we can set DATA_RETENTION_TIME_IN_DAYS=0. Objects that do not have an explicitly defined retention period inherit it from the level above: tables without a specified retention period inherit it from their schema, and schemas without one inherit it from the database. The account level is the highest level of the hierarchy and should be set to 0 days of data retention if no account-wide default is desired.

Now consider a case where a table, schema, or database is accidentally dropped and all the data appears to be lost. When a data object is dropped, Snowflake keeps it in the background for the duration of the data retention period. For such cases, Snowflake has a similarly great feature that brings those objects back:

```sql
UNDROP TABLE any_table;
UNDROP SCHEMA any_schema;
UNDROP DATABASE any_database;
```

If a user creates a table with the same name as the dropped table, Snowflake creates a new table rather than restoring the old one; the UNDROP command restores the old object. Also, the user needs the appropriate permissions or ownership to restore an object. If an object isn't retrieved within the data retention period, it is transferred to Snowflake Fail-safe, where users can't query it. The only way to recover it is with Snowflake's help, and Fail-safe stores the data for a maximum of 7 days.

Challenges

Time Travel, though useful, has a few challenges, as shown below.
- Time Travel defaults to one day for transient and temporary tables in Snowflake.
- Objects other than tables, such as views, UDFs, and stored procedures, are not supported.
- If a table is recreated with the same name, Time Travel refers to the latest version by default; referring to the older version of the same name requires renaming the current table.

Conclusion

The Time Travel feature is quick, easy, and powerful. It's always handy and gives users more comfort while operating on production-sensitive data. The great thing is that users can run these queries themselves without having to involve admins. With a maximum retention of 90 days, users have more than enough time to query back in time or fix any incorrectly updated data. In my opinion, it is Snowflake's strongest feature.

Reference

Understanding & Using Time Travel
In this tutorial, I'll explore how to set up and use docTR, the open-source OCR (optical character recognition) solution from document parsing API startup Mindee. I'll go through what you need to install docTR on Ubuntu. docTR accepts PDFs, images, and even a website URL as input. In this example, I will parse a grocery store receipt. Let's get started.

Setting Up docTR on Ubuntu

docTR is compatible with any Linux distribution, macOS, and Windows, and is also available as a Docker image. I will use Ubuntu 22.04 LTS (Jammy Jellyfish) for this tutorial. Hardware-wise, you don't need anything specific, but if you want to do extensive testing, I recommend using a GPU instance; OVHcloud offers affordable options, with servers starting at less than a dollar per hour.

Let's start by installing Python. At the time of writing, docTR requires Python 3.8 or higher.

```shell
sudo apt install -y python3
```

To avoid interfering with system libraries, let's use a virtual environment:

```shell
sudo apt install -y python3.10-venv
python3 -m venv testing-Mindee-docTR
```

Then we install the OpenGL Mesa 3D graphics library, used for the computer vision part of docTR:

```shell
sudo apt install -y libgl1-mesa-glx
```

We install Pango, which is a text layout engine library:

```shell
sudo apt-get install -y libpango-1.0-0 libpangoft2-1.0-0
```

Then, we install pip so that we can install docTR:

```shell
sudo apt install -y python3-pip
```

Finally, we install docTR within our virtual environment. This version is specifically for PyTorch; if you choose to use TensorFlow, change the command accordingly.

```shell
testing-Mindee-docTR/bin/pip3 install "python-doctr[torch]"
```

Using docTR

Now that docTR is installed, let's start playing with it. In this example, I will test it with a grocery store receipt. You can download the receipt using the command below.
```shell
wget "https://media.istockphoto.com/id/889405434/vector/realistic-paper-shop-receipt-vector-cashier-bill-on-white-background.jpg?s=612x612&w=0&k=20&c=M2GxEKh9YJX2W3q76ugKW23JRVrm0aZ5ZwCZwUMBgAg=" -O receipt.jpeg
```

Create a testing-docTR.py file and insert the following code into it:

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load the grocery receipt
doc = DocumentFile.from_images("receipt.jpeg")

# Load the OCR model
model = ocr_predictor(pretrained=True)

# Perform OCR
result = model(doc)

# Display the OCR result
print(result.export())
```

Note that docTR uses a two-stage approach: first, it performs text detection to localize words; then, it conducts text recognition to identify all characters in each word. The ocr_predictor function accepts additional parameters to select the text detection and recognition architectures. For simplicity, I used the defaults in this example. You can find information about other models in the docTR documentation.

Reading a Receipt Using docTR

Now you just need to run your Python script:

```shell
testing-Mindee-docTR/bin/python3 testing-docTR.py
```

You will get an output such as the one below:

```json
{
  "pages": [
    {
      "page_idx": 0,
      "dimensions": [612, 612],
      "orientation": {"value": null, "confidence": null},
      "language": {"value": null, "confidence": null},
      "blocks": [
        {
          "geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]],
          "lines": [
            {
              "geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]],
              "words": [
                {
                  "value": "RECEIPT",
                  "confidence": 0.9695481061935425,
                  "geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]]
                }
              ]
            }
          ],
          "artefacts": []
        }
      ]
    }
  ]
}
```

Note that I drastically shortened the JSON output for readability and only kept the part showing the "RECEIPT" word. I have expanded the part of the tree that I kept in the JSON output.
docTR provides a lot of information about the document, but the important part is how it breaks the document down into lines and, for each line, provides an array of the words it detected along with a degree of confidence. Here, we can see it spotted the word RECEIPT with a confidence of 96%. docTR offers an efficient OCR solution that simplifies text recognition processes. Depending on the document type, you may need to change the text detection and text recognition architectures to improve accuracy. Comprehensive docTR documentation is available here.

Considerations When Using docTR

Deploying docTR entails certain complexities. First, you must create a dataset and train docTR to achieve satisfactory accuracy, which means dealing with data annotation on many images. Since OCR systems typically serve as backend services for other apps, it may also be necessary to integrate docTR via an API and scale it according to the app's needs. docTR does not provide this out of the box, but there are many open-source technologies that can help facilitate this step.

Conclusion

Document processing technologies have come a long way since the advent of OCR tools, which are limited to character recognition. Intelligent Document Processing (IDP) platforms represent the next step; they utilize OCR (such as docTR) along with additional layers of intelligence, like table reconstruction, document classification, and natural language understanding, to achieve better accuracy and precision. Additionally, for those seeking a scalable IDP solution without the complexities of data collection and model training, I recommend trying out Mindee's latest solution, docTI. This training-free IDP solution leverages Large Language Models (LLMs) to eliminate the need for data collection, annotation, and the model training process. You can use the free-tier plan, configure an instance, and start querying the API in minutes.
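As a practical footnote on the export structure discussed above: the dictionary returned by result.export() nests pages, blocks, lines, and words, so pulling out just the recognized text takes a few loops. Below is a hypothetical helper (extract_words is my own name, not a docTR API) that walks that structure and keeps words above a confidence cutoff, shown here against a trimmed sample of the receipt output:

```python
def extract_words(export, min_confidence=0.5):
    """Walk pages -> blocks -> lines -> words and collect confident words."""
    words = []
    for page in export.get("pages", []):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    if word.get("confidence", 0) >= min_confidence:
                        words.append((word["value"], word["confidence"]))
    return words

# Trimmed sample matching the JSON output shown earlier
sample = {
    "pages": [{
        "blocks": [{
            "lines": [{
                "words": [{"value": "RECEIPT", "confidence": 0.9695481061935425}]
            }]
        }]
    }]
}

print(extract_words(sample))  # [('RECEIPT', 0.9695481061935425)]
```

The same loop structure applies to the full export; only the volume of words changes.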