Peter Chan is the web archivist at Stanford University Libraries. He served as the project manager for the ePADD initiative from 2012 to 2019.
This blog was created with the assistance of ChatGPT.
Introduction
Email archives are a valuable resource for individuals and organizations alike. They contain a wealth of information and insights that can be harnessed for various purposes. However, navigating through extensive email archives can be a daunting task. In this article, we will explore three effective ways to unlock the potential of email archives: search, browse, and question and answer.
Search
Search functionality stands as a fundamental and widely utilized method for navigating email archives. Most email clients offer standard search features, including full-text search and structured data search. Full-text search enables users to find specific keywords or phrases within email content, while structured data search allows for searches based on attributes such as sender, recipient, date, and subject.
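The difference between the two kinds of search can be illustrated with a minimal sketch that uses only Python's standard `email` package. The sample messages and the helper names (`parse_all`, `full_text_search`, `structured_search`) are made up for illustration and do not describe how any particular email client implements search.

```python
from email import message_from_string
from email.utils import parseaddr

# Two tiny, made-up messages standing in for an email archive.
RAW_MESSAGES = [
    "From: Alice <alice@example.org>\nTo: bob@example.org\n"
    "Subject: Budget draft\n\nThe grant budget is attached.",
    "From: Bob <bob@example.org>\nTo: alice@example.org\n"
    "Subject: Lunch\n\nAre you free on Friday?",
]


def parse_all(raw_messages):
    """Parse raw RFC 822 style text into email.message.Message objects."""
    return [message_from_string(raw) for raw in raw_messages]


def full_text_search(messages, phrase):
    """Full-text search: case-insensitive phrase match in the message body."""
    phrase = phrase.lower()
    return [m for m in messages if phrase in m.get_payload().lower()]


def structured_search(messages, sender=None, subject_contains=None):
    """Structured search: filter on header attributes instead of body text."""
    hits = []
    for m in messages:
        _, address = parseaddr(m.get("From", ""))
        if sender and address.lower() != sender.lower():
            continue
        subject = m.get("Subject", "").lower()
        if subject_contains and subject_contains.lower() not in subject:
            continue
        hits.append(m)
    return hits


messages = parse_all(RAW_MESSAGES)
for hit in full_text_search(messages, "grant budget"):
    print("full-text hit:", hit["Subject"])
for hit in structured_search(messages, sender="bob@example.org"):
    print("structured hit:", hit["Subject"])
```

A real archive would also need to handle multipart messages, attachments, and date ranges, which this sketch leaves out.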
Specialized software such as ePADD goes beyond these standard search capabilities. ePADD lets users define lexicons, which are groups of keywords designed to streamline searches for specific topics, themes, or subjects within email archives. Searching for every keyword in a lexicon at once saves users from entering each keyword into the search bar separately and makes exploring email archives considerably more effective.
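The idea of a lexicon can be sketched in a few lines of Python. This is only an illustration of searching for a group of keywords in one pass, not ePADD's actual implementation; the lexicon names, keywords, and function names below are hypothetical.

```python
import re

# Hypothetical lexicons: each name maps to a group of related keywords.
LEXICONS = {
    "funding": ["grant", "budget", "award"],
    "travel": ["flight", "hotel", "conference"],
}


def compile_lexicon(terms):
    """Combine all keywords into one case-insensitive whole-word pattern."""
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


def search_lexicon(bodies, terms):
    """Return {message index: sorted keywords found} for matching messages."""
    pattern = compile_lexicon(terms)
    results = {}
    for index, body in enumerate(bodies):
        matched = sorted({m.group(0).lower() for m in pattern.finditer(body)})
        if matched:
            results[index] = matched
    return results


bodies = [
    "The grant budget is due next week.",
    "I booked the hotel for the trip.",
    "See you soon.",
]
print(search_lexicon(bodies, LEXICONS["funding"]))
print(search_lexicon(bodies, LEXICONS["travel"]))
```

Because the keywords are compiled into a single pattern, adding a term to a lexicon changes every later search on that topic without the user retyping anything.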
Browse
Browsing is an especially effective way to navigate email archives when users are unsure of what to search for or want a broader perspective. Tools like ePADD employ entity extraction techniques to identify entities such as names, organizations, places, universities, and awards within email archives. Using these extracted fine-grained entities, ePADD lets users browse through the archives, which helps them discover interconnected information and reveals hidden patterns and relationships.
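To show how extracted entities can support browsing, here is a minimal sketch that groups messages by entity type and entity name. It assumes a small hand-written lookup table (a hypothetical `GAZETTEER`) in place of real entity extraction, so it does not reflect the techniques ePADD uses.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a real entity extractor.
GAZETTEER = {
    "Stanford University": "university",
    "Pulitzer Prize": "award",
    "Palo Alto": "place",
}


def build_entity_index(bodies, gazetteer):
    """Map entity type -> entity name -> list of message indexes."""
    index = defaultdict(lambda: defaultdict(list))
    for message_id, body in enumerate(bodies):
        lowered = body.lower()
        for name, entity_type in gazetteer.items():
            if name.lower() in lowered:
                index[entity_type][name].append(message_id)
    return index


bodies = [
    "Meeting at Stanford University in Palo Alto.",
    "She won the Pulitzer Prize.",
    "Back in Palo Alto next week.",
]
index = build_entity_index(bodies, GAZETTEER)
for entity_type in sorted(index):
    for name, message_ids in index[entity_type].items():
        print(entity_type, "|", name, "|", message_ids)
```

With an index like this, a user can start from a category such as places, pick an entity, and jump to every message that mentions it without typing a query.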
Question and Answer
With recent advancements in large language models (LLMs) such as GPT-4, PaLM 2, and Falcon 40B, companies such as OpenAI, Google, and Hugging Face have introduced tools such as ChatGPT, Google Bard, and HuggingChat. These tools can comprehend user queries and let users ask questions about designated data, including email archives, and receive useful insights and information in response.
Using these tools raises two primary issues: artificial hallucination, and data privacy and security. Artificial hallucination refers to a problem inherent in ChatGPT and similar AI products, where generated responses may sound confident but are not sufficiently justified by the training data. In one 2023 case, a lawyer admitted to using ChatGPT to help write court filings that cited six nonexistent cases fabricated by the tool. One potential solution is to restrict the tools to deriving answers solely from the provided data, which makes responses more reliable and better founded.
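One way to picture "answering only from the provided data" is to select the most relevant email excerpts first and then place them in the prompt together with an instruction to stay within them. The sketch below makes several simplifying assumptions: relevance is plain word overlap, the function names are hypothetical, and no actual language model is called. Grounding of this kind reduces the risk of hallucination but does not remove it.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_excerpts(question, excerpts, k=2):
    """Rank excerpts by shared words with the question and keep the best k."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(e)), i) for i, e in enumerate(excerpts)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [excerpts[i] for score, i in ranked[:k] if score > 0]


def build_grounded_prompt(question, excerpts, k=2):
    """Build a prompt that tells the model to answer from the excerpts only."""
    context = top_excerpts(question, excerpts, k)
    numbered = "\n".join(f"[{n}] {e}" for n, e in enumerate(context, 1))
    return (
        "Answer using only the email excerpts below. "
        "If they do not contain the answer, reply \"I don't know.\"\n\n"
        f"Excerpts:\n{numbered}\n\nQuestion: {question}"
    )


excerpts = [
    "The grant deadline is March 3.",
    "Lunch is on Friday.",
    "Please review the grant budget.",
]
print(build_grounded_prompt("When is the grant deadline?", excerpts))
```

The resulting text would then be passed to whichever model is in use. Production tools typically replace the word-overlap step with embedding-based similarity search.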
Data privacy and security concerns also arise when using cloud-based question and answer services. Notably, Samsung Electronics has prohibited its employees from using AI-powered chatbots, including ChatGPT, because of these concerns. To address such issues, alternative solutions have emerged, such as privateGPT, GPT4ALL, and h2oGPT. These tools allow users to install and run question answering systems on their local machines, so that the data remains within the organization and is not transmitted to external servers over the internet.
With products like privateGPT, GPT4ALL, or h2oGPT, users can conduct question and answer sessions directly on their designated email archives, which preserves data privacy and security and reduces the risk of artificial hallucination. These tools let organizations leverage the advantages of large language models while retaining control over their valuable email data, without compromising sensitive information or receiving unfounded responses.
Conclusion
Mastery of email archives is critical for efficient information retrieval and knowledge discovery. By employing search, browse, and question and answer strategies, individuals and organizations can unlock the full potential of their email archives. While traditional search and browsing methods provide valuable insights, the emergence of tools based on large language models like privateGPT, GPT4ALL and h2oGPT opens up new possibilities for exploring and extracting knowledge from email archives. These approaches ensure that email archives retain their value as indispensable resources for historical reference, research, and informed decision-making within the secure confines of an organization's data infrastructure.
References
ePADD: https://library.stanford.edu/projects/epadd
Lawyer cited 6 fake cases made up by ChatGPT; judge calls it “unprecedented”: https://arstechnica.com/tech-policy/2023/05/lawyer-cited-6-fake-cases-made-up-by-chatgpt-judge-calls-it-unprecedented/
Samsung Electronics prohibits employee use of AI-powered chatbots, including ChatGPT: https://www.theverge.com/2023/5/2/23707796/samsung-ban-chatgpt-generative-ai-bing-bard-employees-security-concerns
privateGPT: https://github.com/imartinez/privateGPT
GPT4ALL: https://docs.gpt4all.io/gpt4all_chat.html
h2oGPT: https://github.com/h2oai/h2ogpt