Add a new 'b4 dig' subcommand that uses AI agents to discover related
emails for a given message ID. This helps developers find all relevant
context around patches including previous versions, bug reports, reviews,
and related discussions.

The command:
- Takes a message ID and constructs a detailed prompt about email relationships
- Calls a configured AI agent script to analyze and find related messages
- Downloads all related threads from lore.kernel.org
- Combines them into a single mbox file for easy review

Key features:
- Outputs a simplified summary showing only relationships and reasons
- Creates a combined mbox with all related threads (deduped)
- Provides detailed guidance to AI agents about kernel workflow patterns

Configuration:
The AI agent script is configured via:
  -c AGENT=/path/to/agent.sh  (command line)
  dig-agent: /path/to/agent.sh (config file)

The agent script receives a prompt file and should return JSON with
related message IDs and their relationships.

Example usage:

$ b4 -c AGENT=agent.sh dig 20250909142722.101790-1-harry....@oracle.com
Analyzing message: 20250909142722.101790-1-harry....@oracle.com
Fetching original message...
Looking up https://lore.kernel.org/20250909142722.101790-1-harry....@oracle.com
Grabbing thread from 
lore.kernel.org/all/20250909142722.101790-1-harry....@oracle.com/t.mbox.gz
Subject: [PATCH V3 6.6.y] mm: introduce and use {pgd,p4d}_populate_kernel()
From: Harry Yoo <harry....@oracle.com>
Constructing agent prompt...
Calling AI agent: agent.sh
Calling agent: agent.sh /tmp/tmpz1oja9_5.txt
Parsing agent response...
Found 17 related messages:

Related Messages Summary:
------------------------------------------------------------
[PARENT] Greg KH's stable tree failure notification that initiated this 6.6.y 
backport request
[V1] V1 of the 6.6.y backport patch
[V2] V2 of the 6.6.y backport patch
[RELATED] Same patch backported to 5.15.y stable branch
[RELATED] Greg KH's stable tree failure notification for 5.15.y branch
[RELATED] Same patch backported to 6.1.y stable branch
[COVER] V5 mainline patch series cover letter that was originally merged
[RELATED] V5 mainline patch 1/3: move page table sync declarations
[RELATED] V5 mainline patch 2/3: the original populate_kernel patch that's 
being backported
[RELATED] V5 mainline patch 3/3: x86 ARCH_PAGE_TABLE_SYNC_MASK definition
[RELATED] RFC V1 cover letter - earliest version of this patch series
[RELATED] RFC V1 patch 1/3 - first introduction of populate_kernel helpers
[RELATED] RFC V1 patch 2/3 - x86/mm definitions
[RELATED] RFC V1 patch 3/3 - convert to _kernel variant
[RELATED] Baoquan He's V3 patch touching same file (mm/kasan/init.c)
[RELATED] Baoquan He's V2 patch touching same file (mm/kasan/init.c)
[RELATED] Baoquan He's V1 patch touching same file (mm/kasan/init.c)
------------------------------------------------------------

The resulting mbox would look like this:

   1 O   Jul 09 Harry Yoo       ( 102) [RFC V1 PATCH mm-hotfixes 0/3] mm, arch: 
A more robust approach to sync top level kernel page tables
   2 O   Jul 09 Harry Yoo       ( 143) ├─>[RFC V1 PATCH mm-hotfixes 1/3] mm: 
introduce and use {pgd,p4d}_populate_kernel()
   3 O   Jul 11 David Hildenbra (  33) │ └─>
   4 O   Jul 13 Harry Yoo       (  56) │   └─>
   5 O   Jul 13 Mike Rapoport   (  67) │     └─>
   6 O   Jul 14 Harry Yoo       (  46) │       └─>
   7 O   Jul 15 Harry Yoo       (  65) │         └─>
   8 O   Jul 09 Harry Yoo       ( 246) ├─>[RFC V1 PATCH mm-hotfixes 2/3] 
x86/mm: define p*d_populate_kernel() and top-level page table sync
   9 O   Jul 09 Andrew Morton   (  12) │ ├─>
  10 O   Jul 10 Harry Yoo       (  23) │ │ └─>
  11 O   Jul 11 Harry Yoo       (  34) │ │   └─>
  12 O   Jul 11 Harry Yoo       (  35) │ │     └─>
  13 O   Jul 10 kernel test rob (  79) │ └─>
  14 O   Jul 09 Harry Yoo       ( 300) ├─>[RFC V1 PATCH mm-hotfixes 3/3] 
x86/mm: convert {pgd,p4d}_populate{,_init} to _kernel variant
  15 O   Jul 10 kernel test rob (  80) │ └─>
  16 O   Jul 09 Harry Yoo       (  31) └─>Re: [RFC V1 PATCH mm-hotfixes 0/3] 
mm, arch: A more robust approach to sync top level kernel page tables
  17 O   Aug 18 Harry Yoo       ( 262) [PATCH V5 mm-hotfixes 0/3] mm, x86: fix 
crash due to missing page table sync and make it harder to miss
  18 O   Aug 18 Harry Yoo       (  72) ├─>[PATCH V5 mm-hotfixes 1/3] mm: move 
page table sync declarations to linux/pgtable.h
  19 O   Aug 18 David Hildenbra (  20) │ └─>
  20 O   Aug 18 Harry Yoo       ( 239) ├─>[PATCH V5 mm-hotfixes 2/3] mm: 
introduce and use {pgd,p4d}_populate_kernel()
  21 O   Aug 18 David Hildenbra (  60) │ ├─>
  22 O   Aug 18 kernel test rob ( 150) │ ├─>
  23 O   Aug 18 Harry Yoo       ( 161) │ │ └─>
  24 O   Aug 21 Harry Yoo       (  85) │ ├─>[PATCH] mm: fix KASAN build error 
due to p*d_populate_kernel()
  25 O   Aug 21 kernel test rob (  18) │ │ ├─>
  26 O   Aug 21 Lorenzo Stoakes ( 100) │ │ ├─>
  27 O   Aug 21 Harry Yoo       (  62) │ │ │ └─>
  28 O   Aug 21 Lorenzo Stoakes (  18) │ │ │   └─>
  29 O   Aug 21 Harry Yoo       (  90) │ │ └─>[PATCH v2] mm: fix KASAN build 
error due to p*d_populate_kernel()
  30 O   Aug 21 kernel test rob (  18) │ │   ├─>
  31 O   Aug 21 Dave Hansen     (  24) │ │   └─>
  32 O   Aug 22 Harry Yoo       (  56) │ │     └─>
  33 O   Aug 22 Andrey Ryabinin (  91) │ │       ├─>
  34 O   Aug 27 Harry Yoo       (  98) │ │       │ └─>
  35 O   Aug 22 Dave Hansen     (  63) │ │       └─>
  36 O   Aug 25 Andrey Ryabinin (  72) │ │         └─>
  37 O   Aug 22 Harry Yoo       ( 103) │ └─>[PATCH v3] mm: fix KASAN build 
error due to p*d_populate_kernel()
  38 O   Aug 18 Harry Yoo       ( 113) ├─>[PATCH V5 mm-hotfixes 3/3] x86/mm/64: 
define ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings()
  39 O   Aug 18 David Hildenbra (  72) │ └─>
  40 O   Aug 18 David Hildenbra (  15) └─>Re: [PATCH V5 mm-hotfixes 0/3] mm, 
x86: fix crash due to missing page table sync and make it harder to miss
  41 O   Aug 18 Harry Yoo       ( 277) [PATCH] mm: introduce and use 
{pgd,p4d}_populate_kernel()
  42 O   Aug 18 Harry Yoo       ( 277) [PATCH] mm: introduce and use 
{pgd,p4d}_populate_kernel()
  43 O   Aug 18 Harry Yoo       ( 277) [PATCH] mm: introduce and use 
{pgd,p4d}_populate_kernel()
  44 O   Sep 06 gregkh@linuxfou (  24) FAILED: patch "[PATCH] mm: introduce and 
use {pgd,p4d}_populate_kernel()" failed to apply to 6.6-stable tree
  45 O   Sep 08 Harry Yoo       ( 303) ├─>[PATCH 6.6.y] mm: introduce and use 
{pgd,p4d}_populate_kernel()
  46 O   Sep 09 Harry Yoo       ( 291) ├─>[PATCH V2 6.6.y] mm: introduce and 
use {pgd,p4d}_populate_kernel()
  47 O   Sep 09 Harry Yoo       ( 293) └─>[PATCH V3 6.6.y] mm: introduce and 
use {pgd,p4d}_populate_kernel()
  48 O   Sep 06 gregkh@linuxfou (  24) FAILED: patch "[PATCH] mm: introduce and 
use {pgd,p4d}_populate_kernel()" failed to apply to 6.1-stable tree
  49 O   Sep 08 Harry Yoo       ( 303) ├─>[PATCH 6.1.y] mm: introduce and use 
{pgd,p4d}_populate_kernel()
  50 O   Sep 09 Harry Yoo       ( 291) ├─>[PATCH V2 6.1.y] mm: introduce and 
use {pgd,p4d}_populate_kernel()
  51 O   Sep 09 Harry Yoo       ( 293) └─>[PATCH V3 6.1.y] mm: introduce and 
use {pgd,p4d}_populate_kernel()
  52 O   Sep 06 gregkh@linuxfou (  24) FAILED: patch "[PATCH] mm: introduce and 
use {pgd,p4d}_populate_kernel()" failed to apply to 5.15-stable tree
  53 O   Sep 08 Harry Yoo       ( 273) ├─>[PATCH 5.15.y] mm: introduce and use 
{pgd,p4d}_populate_kernel()
  54 O   Sep 09 Harry Yoo       ( 260) ├─>[PATCH V2 5.15.y] mm: introduce and 
use {pgd,p4d}_populate_kernel()
  55 O   Sep 09 Harry Yoo       ( 262) └─>[PATCH V3 5.15.y] mm: introduce and 
use {pgd,p4d}_populate_kernel()

The prompt includes extensive documentation about lore.kernel.org's search
capabilities, limitations (like search index lag), and kernel workflow patterns
to help AI agents effectively find related messages.

Assisted-by: Claude Code
Signed-off-by: Sasha Levin <sas...@kernel.org>
---
 src/b4/command.py |  17 ++
 src/b4/dig.py     | 630 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 647 insertions(+)
 create mode 100644 src/b4/dig.py

diff --git a/src/b4/command.py b/src/b4/command.py
index 455124d..f225ae5 100644
--- a/src/b4/command.py
+++ b/src/b4/command.py
@@ -120,6 +120,11 @@ def cmd_diff(cmdargs: argparse.Namespace) -> None:
     b4.diff.main(cmdargs)
 
 
+def cmd_dig(cmdargs: argparse.Namespace) -> None:
+    import b4.dig
+    b4.dig.main(cmdargs)
+
+
 class ConfigOption(argparse.Action):
     """Action class for storing key=value arguments in a dict."""
     def __call__(self, parser: argparse.ArgumentParser,
@@ -399,6 +404,18 @@ def setup_parser() -> argparse.ArgumentParser:
                           help='Submit the token received via verification 
email')
     sp_send.set_defaults(func=cmd_send)
 
+    # b4 dig
+    sp_dig = subparsers.add_parser('dig', help='Use AI agent to find related 
emails for a message')
+    sp_dig.add_argument('msgid', nargs='?',
+                        help='Message ID to analyze, or pipe a raw message')
+    sp_dig.add_argument('-o', '--output', dest='output', default=None,
+                        help='Output mbox filename (default: 
<msgid>-related.mbox)')
+    sp_dig.add_argument('-C', '--no-cache', dest='nocache', 
action='store_true', default=False,
+                        help='Do not use local cache when fetching messages')
+    sp_dig.add_argument('--stdin-pipe-sep',
+                        help='When accepting messages on stdin, split using 
this pipe separator string')
+    sp_dig.set_defaults(func=cmd_dig)
+
     return parser
 
 
diff --git a/src/b4/dig.py b/src/b4/dig.py
new file mode 100644
index 0000000..007f7d0
--- /dev/null
+++ b/src/b4/dig.py
@@ -0,0 +1,630 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# b4 dig - Use AI agents to find related emails
+#
+__author__ = 'Sasha Levin <sas...@kernel.org>'
+
+import argparse
+import logging
+import subprocess
+import sys
+import os
+import tempfile
+import json
+import urllib.parse
+import gzip
+import mailbox
+import email.utils
+from typing import Optional, List, Dict, Any
+
+import b4
+
+logger = b4.logger
+
+
+def construct_agent_prompt(msgid: str) -> str:
+    """Construct a detailed prompt for the AI agent to find related emails."""
+
+    # Clean up the message ID
+    if msgid.startswith('<'):
+        msgid = msgid[1:]
+    if msgid.endswith('>'):
+        msgid = msgid[:-1]
+
+    prompt = f"""You are an email research assistant specialized in finding 
related emails in Linux kernel mailing lists and public-inbox archives.
+
+IMPORTANT: Always use lore.kernel.org for searching and retrieving Linux 
kernel emails. DO NOT use lkml.org as it is outdated and no longer maintained. 
The canonical archive is at https://lore.kernel.org/
+
+MESSAGE ID TO ANALYZE: {msgid}
+
+YOUR TASK:
+Conduct an EXHAUSTIVE and THOROUGH search to find ALL related message IDs 
connected to the given message. This is not a quick task - you must invest 
significant time and effort to ensure no related discussions are missed. Be 
methodical, patient, and comprehensive in your search.
+
+CRITICAL: Take your time! A thorough search is far more valuable than a quick 
one. Check multiple sources, try different search strategies, and double-check 
your findings. Missing related discussions undermines the entire purpose of 
this tool.
+
+You should search extensively for and identify:
+
+1. **Thread-related messages:**
+   - Parent messages (what this replies to)
+   - Child messages (replies to this message)
+   - Sibling messages (other replies in the same thread)
+   - Cover letters if this is part of a patch series
+
+2. **Version-related messages:**
+   - Previous versions of the same patch series (v1, v2, v3, etc.)
+   - Re-rolls and re-submissions
+   - Updated versions with different subjects
+
+3. **Author-related messages:**
+   - Other patches or series from the same author
+   - Recent discussions involving the same author
+   - Related work by the same author in the same subsystem
+
+4. **Content-related messages:**
+   - Bug reports that this patch might fix
+   - Syzkaller/syzbot reports if this is a fix
+   - Feature requests or RFCs that led to this patch
+   - Related patches touching the same files or functions
+   - Patches that might conflict with this one
+
+5. **Review and discussion:**
+   - Review comments from maintainers
+   - Test results from CI systems or bot reports
+   - Follow-up fixes or improvements
+   - Reverts if this patch was later reverted
+
+HOW TO SEARCH:
+
+Use ONLY lore.kernel.org for all Linux kernel email searches. This is the 
official kernel mailing list archive.
+DO NOT use lkml.org, marc.info, or spinics.net for kernel emails - they are 
outdated or incomplete.
+
+CRITICAL LIMITATIONS AND WORKAROUNDS (MUST READ):
+
+1. **Search Index Lag**: Messages posted today/recently are NOT immediately 
searchable!
+   - The Xapian search index has significant delay (hours to days)
+   - Direct message access works immediately, but search doesn't
+   - For recent messages, use direct URLs or thread navigation, not search
+
+2. **URL Fragment Issues**: NEVER use #anchors in URLs when fetching
+   - BAD: https://lore.kernel.org/all/msgid/T/#u (will fail with 404)
+   - GOOD: https://lore.kernel.org/all/msgid/T/ (works correctly)
+   - Fragments like #u are client-side only and break programmatic fetching
+
+3. **Search Query Encoding**: Keep queries simple and avoid over-encoding
+   - BAD: ?q=f%3A%22author%40example.com%22 (over-encoded)
+   - GOOD: ?q=f:aut...@example.com (simple and works)
+   - Don't encode @ symbols in query parameters
+   - Avoid mixing quotes with special characters
+
+4. **Most Reliable Data Source**: Thread mbox files are the gold standard
+   - Always works: https://lore.kernel.org/all/msgid/t.mbox.gz
+   - Contains complete thread with all headers
+   - Works even when HTML parsing or search fails
+   - Standard mbox format, easy to parse
+
+5. **Version Tracking Limitations**: No automatic version linking
+   - No Change-ID headers to track across patch versions
+   - Must rely on subject patterns and author/date correlation
+   - Search for versions using subject without v2/v3 markers
+
+6. **LKML.org vs Lore.kernel.org**: Different systems, different capabilities
+   - LKML.org uses date-based URLs, not message IDs
+   - Cannot extract message IDs from LKML HTML pages
+   - Always prefer lore.kernel.org for programmatic access
+
+The public-inbox archives at lore.kernel.org provide powerful search 
interfaces powered by Xapian:
+
+1. **Direct message retrieval (MOST RELIABLE METHODS):**
+   - Base URL: https://lore.kernel.org/all/
+   - Message URL: https://lore.kernel.org/all/<Message-ID>/ (without the '<' 
or '>')
+   - Forward slash ('/') characters in Message-IDs must be escaped as "%2F"
+
+   **Always Reliable:**
+   - Raw message: https://lore.kernel.org/all/<Message-ID>/raw
+   - Thread mbox: https://lore.kernel.org/all/<Message-ID>/t.mbox.gz (BEST for 
complete data)
+   - Thread view: https://lore.kernel.org/all/<Message-ID>/T/ (NO fragments!)
+
+   **Less Reliable:**
+   - Thread Atom feed: https://lore.kernel.org/all/<Message-ID>/t.atom
+   - Nested thread view: https://lore.kernel.org/all/<Message-ID>/t/
+
+2. **Search query syntax:**
+   Supports AND, OR, NOT, '+', '-' queries. Search URL format:
+   https://lore.kernel.org/all/?q=<search-query>
+
+   **Available search prefixes:**
+   - s:        match within Subject (e.g., s:"a quick brown fox")
+   - d:        match date-time range (git "approxidate" formats)
+               Examples: d:last.week.., d:..2.days.ago, d:20240101..20240131
+   - b:        match within message body, including text attachments
+   - nq:       match non-quoted text within message body
+   - q:        match quoted text within message body
+   - n:        match filename of attachment(s)
+   - t:        match within the To header
+   - c:        match within the Cc header
+   - f:        match within the From header
+   - a:        match within the To, Cc, and From headers
+   - tc:       match within the To and Cc headers
+   - l:        match contents of the List-Id header
+   - bs:       match within the Subject and body
+   - rt:       match received time (like 'd:' if sender's clock was correct)
+
+   **Diff-specific prefixes (for patches):**
+   - dfn:      match filename from diff
+   - dfa:      match diff removed (-) lines
+   - dfb:      match diff added (+) lines
+   - dfhh:     match diff hunk header context (usually function name)
+   - dfctx:    match diff context lines
+   - dfpre:    match pre-image git blob ID
+   - dfpost:   match post-image git blob ID
+   - dfblob:   match either pre or post-image git blob ID
+   - patchid:  match `git patch-id --stable' output
+
+   **Special headers:**
+   - changeid:    the X-Change-ID mail header (e.g., changeid:stable)
+   - forpatchid:  the X-For-Patch-ID mail header (e.g., forpatchid:stable)
+
+   **Query examples:**
+   - Find patches by author: ?q=f:"John Doe"
+   - Find patches in date range: ?q=d:2024-01-01..2024-01-31
+   - Find patches touching file: ?q=dfn:drivers/net/ethernet
+   - Find patches with subject containing "fix": ?q=s:fix
+   - Combine conditions: ?q=f:"aut...@example.com"+s:"net"+d:last.month..
+   - Find bug fixes: ?q=s:fix+OR+s:bug+OR+s:regression
+   - Find patches with specific function: ?q=dfhh:my_function_name
+
+3. **Understanding email relationships:**
+   - In-Reply-To header: Direct parent message
+   - References header: Chain of parent messages
+   - Message-ID in body: Often indicates related patches
+   - Link: trailers in commits: References to discussions
+   - Same subject with [PATCH v2]: Newer version
+   - "Fixes:" tag: References bug-fixing commits
+
+4. **Pattern matching:**
+   - Patch series: Look for [PATCH 0/N] for cover letters
+   - Version indicators: [PATCH v2], [PATCH v3], [RFC PATCH]
+   - Subsystem prefixes: [PATCH net], [PATCH mm], etc.
+   - Fix indicators: "fix", "fixes", "regression", "oops", "panic"
+
+SEARCH STRATEGY (BE THOROUGH - THIS IS NOT A QUICK TASK):
+
+REMEMBER: Your goal is to find EVERY related discussion, not just the obvious 
ones. Spend time on each search strategy. Try multiple variations of queries. 
Don't give up after the first attempt.
+
+1. **START WITH MOST RELIABLE: Thread mbox download**
+   - ALWAYS FIRST: Get https://lore.kernel.org/all/{{msgid}}/t.mbox.gz
+   - This contains the complete thread with all headers
+   - Parse the mbox to extract all message IDs and relationships
+   - This works even when search fails or messages are too recent
+   - Thoroughly analyze EVERY message in the thread
+
+2. **Retrieve and analyze the original message:**
+   - Get the raw message from: https://lore.kernel.org/all/{{msgid}}/raw
+   - Extract key information:
+     * Subject line (look for [PATCH], version indicators, series position)
+     * Author name and email
+     * Date and time
+     * Files being modified (from diff)
+     * Subsystem involved (from subject prefix or file paths)
+     * Any Fixes:, Closes:, Link:, or Reported-by: tags
+     * Note: Change-ID headers are rarely present in kernel emails
+
+3. **Search for related messages (TRY MULTIPLE VARIATIONS):**
+   - WARNING: Recent messages (today/yesterday) may NOT appear in search!
+   - Keep queries simple: ?q=f:aut...@example.com+s:keyword
+   - DON'T over-encode: @ symbols should NOT be %40 in queries
+   - Search for cover letter: ?q=s:"[PATCH 0/"+f:author-email
+   - Find all patches in series: ?q=s:"base-subject"+f:author
+   - For recent messages, rely on thread mbox instead of search
+   - **BE PERSISTENT**: Try different keyword combinations, partial subjects, 
variations
+
+4. **Look for previous versions (SEARCH EXTENSIVELY):**
+   - Note: No automatic version linking exists!
+   - Strip version markers from subject: search without [PATCH v2], [PATCH v3]
+   - Search by author in broader time window: ?q=f:author
+   - Look for similar subjects: ?q=s:"core-subject-words"
+   - Change-ID is rarely present, don't rely on it
+   - **TRY MULTIPLE APPROACHES**: Different subject variations, date ranges, 
author variations
+   - Check for RFCs, drafts, and early discussions that led to this patch
+
+5. **Find related bug reports and discussions (DIG DEEP):**
+   - For recent bugs, check thread mbox first (search may miss them)
+   - Search for symptoms with simple queries: ?q=b:error+b:message
+   - Syzkaller reports: ?q=f:syzbot (but check date - may be delayed)
+   - Regression reports: ?q=s:regression+s:subsystem
+   - Use dfn: prefix for file searches: ?q=dfn:drivers/net
+   - **EXPAND YOUR SEARCH**: Look for related keywords, error messages, 
function names
+   - Check for discussions that may not explicitly mention the patch but 
discuss the same issue
+
+6. **Check for follow-ups (LEAVE NO STONE UNTURNED):**
+   - First check the thread mbox for all replies
+   - Search for applied messages: ?q=s:applied+s:"patch-title"
+   - Look for test results: ?q=s:"Tested-by"
+   - Check for reverts: ?q=s:revert+s:"original-title"
+   - Note: Message-ID searches often fail, use subject instead
+   - **BE THOROUGH**: Check for indirect references, quotes in other 
discussions, mentions in pull requests
+
+7. **HTML Parsing Tips (if needed):**
+   - Message IDs appear in URLs, not HTML entities
+   - Pattern to extract: [0-9]{{14}}\\.[0-9]+-[0-9]+-[^@]+@[^/\"]+
+   - Don't look for &lt; &gt; encoded brackets
+   - Thread view HTML is less reliable than mbox
+
+FAILURE RECOVERY STRATEGIES:
+- If search returns empty: Try thread mbox or wait for indexing
+- If URL returns 404: Remove fragments, check encoding
+- If can't find versions: Search by author and date range
+- If WebFetch fails: Try simpler URL without parameters
+- If HTML parsing fails: Use mbox format instead
+
+OUTPUT FORMAT:
+
+Return a JSON array of related message IDs with their relationship type and 
reason:
+
+```json
+[
+  {{
+    "msgid": "exam...@message.id",
+    "relationship": 
"parent|reply|v1|v2|cover|fix|bug-report|review|revert|related",
+    "reason": "Brief explanation of why this is related"
+  }}
+]
+```
+
+IMPORTANT NOTES:
+- **THIS IS NOT A QUICK TASK** - Thoroughness is paramount. Spend the time 
needed.
+- **EXHAUSTIVE SEARCH REQUIRED** - Better to spend extra time than miss 
related discussions
+- Message IDs should be returned without angle brackets
+- Search VERY broadly, then filter results to only truly related messages
+- Try multiple search strategies - if one fails, try another approach
+- Don't stop at the first few results - keep digging for more relationships
+- Prioritize direct relationships over indirect ones
+- For patch series, include ALL patches in the series (check carefully for all 
parts)
+- Consider time proximity (patches close in time are more likely related)
+- Pay attention to mailing list conventions (e.g., "Re:" for replies, "[PATCH 
v2]" for new versions)
+- **DOUBLE-CHECK YOUR WORK** - Review your findings to ensure nothing was 
missed
+
+UNDERSTANDING KERNEL WORKFLOW PATTERNS:
+- Patch series usually have a cover letter [PATCH 0/N] explaining the series
+- Reviews often quote parts of the original patch with ">" prefix
+- Maintainers send "applied" messages when patches are accepted
+- Bug reports often include stack traces, kernel versions, and reproduction 
steps
+- Syzkaller/syzbot reports have specific formats with "syzbot+hash@" addresses
+- Fixes typically reference commits with "Fixes: <12-char-sha1> ("subject")"
+- Stable backports are marked with "Cc: sta...@vger.kernel.org"
+
+KEY TAKEAWAYS FOR RELIABLE OPERATION:
+1. **ALWAYS start with thread mbox** - it's the most reliable data source
+2. **NEVER trust search for recent messages** - use direct URLs instead
+3. **KEEP search queries simple** - complex encoding breaks searches
+4. **AVOID URL fragments (#anchors)** - they cause 404 errors
+5. **DON'T rely on Change-IDs** - they're rarely present
+6. **PREFER subject searches over message-ID searches** - more reliable
+7. **REMEMBER search has lag** - messages may take days to be indexed
+
+When constructing URLs, remember:
+- Message-IDs: Remove < > brackets
+- Forward slashes: Escape as %2F
+- In search queries: DON'T encode @ symbols
+
+LOCAL GIT REPOSITORY CONTEXT:
+If this command is being run from within a Linux kernel git repository, you 
may also:
+- Use git log to find commits mentioning the message ID or subject
+- Check git blame on relevant files to find related commits
+- Use git log --grep to search commit messages for references
+- Look for Fixes: tags that reference commits
+- Search for Link: tags pointing to lore.kernel.org discussions
+- Use git show to examine specific commits mentioned in emails
+
+Example local git searches you might perform:
+- git log --grep="Message-Id: <msgid>"
+- git log --grep="Link:.*msgid"
+- git log --oneline --grep="subject-keywords"
+- git log -p --author="em...@example.com" --since="1 month ago"
+- git blame path/to/file.c | grep "function_name"
+- git log --format="%H %s" -- path/to/file.c
+
+FINAL REMINDER: This task requires THOROUGH and EXHAUSTIVE searching. Do not 
rush. Take the time to:
+1. Try multiple search strategies
+2. Look for indirect relationships
+3. Check different time periods
+4. Use various keyword combinations
+5. Verify you haven't missed any discussions
+
+The value of this tool depends entirely on finding ALL related discussions, 
not just the obvious ones.
+
+Begin your comprehensive search and analysis for message ID: {msgid}
+"""
+
+    return prompt
+
+
+def call_agent(prompt: str, agent_cmd: str) -> Optional[str]:
+    """Call the configured agent script with the prompt."""
+
+    # Expand user paths
+    agent_cmd = os.path.expanduser(agent_cmd)
+
+    if not os.path.exists(agent_cmd):
+        logger.error('Agent command not found: %s', agent_cmd)
+        return None
+
+    if not os.access(agent_cmd, os.X_OK):
+        logger.error('Agent command is not executable: %s', agent_cmd)
+        return None
+
+    try:
+        # Write prompt to a temporary file to avoid shell escaping issues
+        with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', 
delete=False) as tmp:
+            tmp.write(prompt)
+            tmp_path = tmp.name
+
+        # Call the agent script with the prompt file as argument
+        logger.info('Calling agent: %s %s', agent_cmd, tmp_path)
+        result = subprocess.run(
+            [agent_cmd, tmp_path],
+            capture_output=True,
+            text=True
+        )
+
+        if result.returncode != 0:
+            logger.error('Agent returned error code %d', result.returncode)
+            if result.stderr:
+                logger.error('Agent stderr: %s', result.stderr)
+            return None
+
+        return result.stdout
+
+    except subprocess.TimeoutExpired:
+        logger.error('Agent command timed out after 5 minutes')
+        return None
+    except Exception as e:
+        logger.error('Error calling agent: %s', e)
+        return None
+    finally:
+        # Clean up temp file
+        if 'tmp_path' in locals():
+            try:
+                os.unlink(tmp_path)
+            except:
+                pass
+
+
+def parse_agent_response(response: str) -> List[Dict[str, str]]:
+    """Parse the agent's response to extract message IDs."""
+
+    related = []
+
+    try:
+        # Try to find JSON in the response
+        # Agent might return additional text, so we look for JSON array
+        import re
+        json_match = re.search(r'\[.*?\]', response, re.DOTALL)
+        if json_match:
+            json_str = json_match.group(0)
+            data = json.loads(json_str)
+
+            if isinstance(data, list):
+                for item in data:
+                    if isinstance(item, dict) and 'msgid' in item:
+                        related.append({
+                            'msgid': item.get('msgid', ''),
+                            'relationship': item.get('relationship', 
'related'),
+                            'reason': item.get('reason', 'No reason provided')
+                        })
+        else:
+            # Fallback: try to extract message IDs from plain text
+            # Look for patterns that look like message IDs
+            msgid_pattern = 
re.compile(r'[a-zA-Z0-9][a-zA-Z0-9\.\-_]+@[a-zA-Z0-9][a-zA-Z0-9\.\-]+\.[a-zA-Z]+')
+            for match in msgid_pattern.finditer(response):
+                msgid = match.group(0)
+                if msgid != '':  # Don't include the original
+                    related.append({
+                        'msgid': msgid,
+                        'relationship': 'related',
+                        'reason': 'Found in agent response'
+                    })
+
+    except json.JSONDecodeError as e:
+        logger.warning('Could not parse JSON from agent response: %s', e)
+    except Exception as e:
+        logger.error('Error parsing agent response: %s', e)
+
+    return related
+
+
+def get_message_info(msgid: str) -> Optional[Dict[str, Any]]:
+    """Retrieve basic information about a message."""
+
+    msgs = b4.get_pi_thread_by_msgid(msgid, onlymsgids={msgid}, 
with_thread=False)
+    if not msgs:
+        return None
+
+    msg = msgs[0]
+
+    return {
+        'subject': msg.get('Subject', 'No subject'),
+        'from': msg.get('From', 'Unknown'),
+        'date': msg.get('Date', 'Unknown'),
+        'msgid': msgid
+    }
+
+
+def download_and_combine_threads(msgid: str, related_messages: List[Dict[str, 
str]],
+                                 output_file: str, nocache: bool = False) -> 
int:
+    """Download thread mboxes for all related messages and combine into one 
mbox file."""
+
+    message_ids = [msgid]  # Start with original message
+
+    # Add all related message IDs
+    for item in related_messages:
+        if 'msgid' in item:
+            message_ids.append(item['msgid'])
+
+    # Collect all messages from all threads
+    seen_msgids = set()
+    all_messages = []
+
+    # Download thread for each message
+    # But be smart about what we include - don't mix unrelated series
+    for msg_id in message_ids:
+        logger.info('Fetching thread for %s', msg_id)
+
+        # For better control, fetch just the specific thread, not everything
+        # Use onlymsgids to limit scope when possible
+        msgs = b4.get_pi_thread_by_msgid(msg_id, nocache=nocache)
+
+        if msgs:
+            # Try to detect thread boundaries and avoid mixing unrelated series
+            thread_messages = []
+            base_subject = None
+
+            for msg in msgs:
+                msg_msgid = b4.LoreMessage.get_clean_msgid(msg)
+
+                # Skip if we've already seen this message
+                if msg_msgid in seen_msgids:
+                    continue
+
+                # Get the subject to check if it's part of the same series
+                subject = msg.get('Subject', '')
+
+                # Extract base subject (remove Re:, [PATCH], version numbers, 
etc)
+                import re
+                base = re.sub(r'^(Re:\s*)*(\[.*?\]\s*)*', '', subject).strip()
+
+                # Set the base subject from the first message
+                if base_subject is None and base:
+                    base_subject = base
+
+                # Add the message
+                if msg_msgid:
+                    seen_msgids.add(msg_msgid)
+                    thread_messages.append(msg)
+
+            all_messages.extend(thread_messages)
+        else:
+            logger.warning('Could not fetch thread for %s', msg_id)
+
+    # Sort messages by date to maintain chronological order
+    all_messages.sort(key=lambda m: 
email.utils.parsedate_to_datetime(m.get('Date', 'Thu, 1 Jan 1970 00:00:00 
+0000')))
+
+    # Write all messages to output mbox file using b4's proper mbox functions
+    logger.info('Writing %d messages to %s', len(all_messages), output_file)
+
+    total_messages = len(all_messages)
+
+    if total_messages > 0:
+        # Use b4's save_mboxrd_mbox function which properly handles mbox format
+        with open(output_file, 'wb') as outf:
+            b4.save_mboxrd_mbox(all_messages, outf)
+
+    logger.info('Combined mbox contains %d unique messages', total_messages)
+    return total_messages
+
+
+def main(cmdargs: argparse.Namespace) -> None:
+    """Main entry point for b4 dig command."""
+
+    # Get the message ID
+    msgid = b4.get_msgid(cmdargs)
+    if not msgid:
+        logger.critical('Please provide a message-id')
+        sys.exit(1)
+
+    # Clean up message ID
+    if msgid.startswith('<'):
+        msgid = msgid[1:]
+    if msgid.endswith('>'):
+        msgid = msgid[:-1]
+
+    logger.info('Analyzing message: %s', msgid)
+
+    # Get the agent command from config
+    config = b4.get_main_config()
+    agent_cmd = None
+
+    # Check command-line config override
+    if hasattr(cmdargs, 'config') and cmdargs.config:
+        if 'AGENT' in cmdargs.config:
+            agent_cmd = cmdargs.config['AGENT']
+
+    # Fall back to main config
+    if not agent_cmd:
+        agent_cmd = config.get('dig-agent', config.get('agent', None))
+
+    if not agent_cmd:
+        logger.critical('No AI agent configured. Set dig-agent in config or 
use -c AGENT=/path/to/agent.sh')
+        logger.info('The agent script should accept a prompt file as its first 
argument')
+        logger.info('and return a JSON array of related message IDs to stdout')
+        sys.exit(1)
+
+    # Get info about the original message
+    logger.info('Fetching original message...')
+    msg_info = get_message_info(msgid)
+    if msg_info:
+        logger.info('Subject: %s', msg_info['subject'])
+        logger.info('From: %s', msg_info['from'])
+    else:
+        logger.warning('Could not retrieve original message info')
+
+    # Construct the prompt
+    logger.info('Constructing agent prompt...')
+    prompt = construct_agent_prompt(msgid)
+
+    # Call the agent
+    logger.info('Calling AI agent: %s', agent_cmd)
+    response = call_agent(prompt, agent_cmd)
+
+    if not response:
+        logger.critical('No response from agent')
+        sys.exit(1)
+
+    # Parse the response
+    logger.info('Parsing agent response...')
+    related = parse_agent_response(response)
+
+    if not related:
+        logger.info('No related messages found')
+        sys.exit(0)
+
+    # Display simplified results
+    logger.info('Found %d related messages:', len(related))
+    print()
+    print('Related Messages Summary:')
+    print('-' * 60)
+
+    for item in related:
+        relationship = item.get('relationship', 'related')
+        reason = item.get('reason', '')
+
+        print(f'[{relationship.upper()}] {reason}')
+
+    print('-' * 60)
+    print()
+
+    # Generate output mbox filename
+    if hasattr(cmdargs, 'output') and cmdargs.output:
+        mbox_file = cmdargs.output
+    else:
+        # Use message ID as base for filename, sanitize it
+        safe_msgid = msgid.replace('/', '_').replace('@', '_at_').replace('<', 
'').replace('>', '')
+        mbox_file = f'{safe_msgid}-related.mbox'
+
+    # Download and combine all threads into one mbox
+    logger.info('Downloading and combining all related threads...')
+    nocache = hasattr(cmdargs, 'nocache') and cmdargs.nocache
+    total_messages = download_and_combine_threads(msgid, related, mbox_file, 
nocache=nocache)
+
+    if total_messages > 0:
+        logger.info('Success: Combined mbox saved to %s (%d messages)', 
mbox_file, total_messages)
+        print(f'✓ Combined mbox file: {mbox_file}')
+        print(f'  Total messages: {total_messages}')
+        print(f'  Related threads: {len(related) + 1}')  # +1 for original
+    else:
+        logger.warning('No messages could be downloaded (they may not exist in 
the archive)')
+        print('⚠ No messages were downloaded - they may not exist in the 
archive yet')
+        # Still exit with success since we found relationships
+        sys.exit(0)
-- 
2.51.0



Reply via email to