[llvm-branch-commits] [llvm] [IR2Vec][llvm-ir2vec] Revamp triplet generation and add entity mapping mode (PR #149214)

2025-07-17 Thread S. VenkataKeerthy via llvm-branch-commits

https://github.com/svkeerthy updated 
https://github.com/llvm/llvm-project/pull/149214

>From 5d93b96d4bb6e6849b3ba293dce90b98b8bed468 Mon Sep 17 00:00:00 2001
From: svkeerthy 
Date: Wed, 16 Jul 2025 22:03:56 +
Subject: [PATCH 1/2] revamp-triplet-gen

---
 llvm/docs/CommandGuide/llvm-ir2vec.rst|  79 -
 llvm/test/tools/llvm-ir2vec/entities.ll   |  95 ++
 llvm/test/tools/llvm-ir2vec/triplets.ll   |  51 ++-
 llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp| 204 
 .../mlgo-utils/IR2Vec/generateTriplets.py | 291 ++
 5 files changed, 627 insertions(+), 93 deletions(-)
 create mode 100644 llvm/test/tools/llvm-ir2vec/entities.ll
 create mode 100644 llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py

diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
index 13fe4996b968f..56ece4f509f6e 100644
--- a/llvm/docs/CommandGuide/llvm-ir2vec.rst
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -13,17 +13,21 @@ DESCRIPTION
 
 :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
 generates IR2Vec embeddings for LLVM IR and supports triplet generation 
-for vocabulary training. It provides two main operation modes:
+for vocabulary training. It provides three main operation modes:
 
-1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
    training from LLVM IR.
 
-2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
+   training.
+
+3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
    at different granularity levels (instruction, basic block, or function).
 
 The tool is designed to facilitate machine learning applications that work with
 LLVM IR by converting the IR into numerical representations that can be used by
-ML models.
+ML models. The triplet mode generates numeric IDs directly instead of string 
+triplets, streamlining the training data preparation workflow.
 
 .. note::
 
@@ -34,18 +38,46 @@ ML models.
 OPERATION MODES
 ---------------
 
+Triplet Generation and Entity Mapping Modes are used for preparing
+vocabulary and training data for knowledge graph embeddings. The Embedding Mode
+is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
+
+The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
+by modeling the relationships between opcodes, types, and operands as a knowledge
+graph. For this purpose, Triplet Generation and Entity Mapping Modes generate
+triplets and entity mappings in the standard format used for knowledge graph
+embedding training (see 
+
 
+for details).
+
 Triplet Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
-consisting of opcodes, types, and operands. These triplets can be used to train
-vocabularies for embedding generation.
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
+triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets 
+are generated in train2id format. The tool outputs numeric IDs directly using 
+the ir2vec::Vocabulary mapping infrastructure, eliminating the need for 
+string-to-ID preprocessing.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
+
+Entity Mapping Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
+IR2Vec in entity2id format. This mode outputs all supported entities (opcodes,
+types, and operands) with their corresponding numeric IDs, and is not specific for
+an LLVM IR file.
 
 Usage:
 
 .. code-block:: bash
 
-   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+   llvm-ir2vec --mode=entities -o entity2id.txt
 
 Embedding Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -67,6 +99,7 @@ OPTIONS
  Specify the operation mode. Valid values are:
 
  * ``triplets`` - Generate triplets for vocabulary training
+ * ``entities`` - Generate entity mappings for vocabulary training
  * ``embeddings`` - Generate embeddings using trained vocabulary (default)
 
 .. option:: --level=
@@ -115,7 +148,7 @@ OPTIONS
 
    ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
    ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
-   mode. These options are ignored in triplet mode.
+   mode. These options are ignored in triplet and entity modes.
 
 INPUT FILE FORMAT
 -----------------
@@ -129,14 +162,34 @@ OUTPUT FORMAT
 Triplet Mode Output
 ~~~~~~~~~~~~~~~~~~~
 
-In triplet mode, the output consists of lines containing space-separated triplets:
+In triplet mode, the o
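
For readers who want to script the two data-preparation modes documented in
the patch above, here is a minimal Python sketch. It only assembles the two
command lines shown in the patched docs (``--mode=triplets input.bc -o ...``
and ``--mode=entities -o ...``). The helper names, the ``TOOL`` constant, and
the assumption that ``llvm-ir2vec`` is on ``PATH`` are illustrative and not
part of the patch.

```python
import shutil
import subprocess
from typing import List, Optional

# Assumption: the binary is reachable as "llvm-ir2vec"; point this at your
# build directory if it is not installed system-wide.
TOOL = "llvm-ir2vec"


def triplets_cmd(input_ir: str, output: str) -> List[str]:
    """Command line for triplet mode, mirroring the usage in the docs."""
    return [TOOL, "--mode=triplets", input_ir, "-o", output]


def entities_cmd(output: str) -> List[str]:
    """Command line for entity mode; the documented usage has no input file."""
    return [TOOL, "--mode=entities", "-o", output]


def run_if_available(cmd: List[str]) -> Optional[int]:
    """Run cmd if its executable is found; return the exit code, else None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode
```

A caller would typically run ``entities_cmd("entity2id.txt")`` once and
``triplets_cmd(...)`` once per input module. Passing the arguments as a list
avoids shell quoting issues with unusual file names.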

[llvm-branch-commits] [llvm] [IR2Vec][llvm-ir2vec] Revamp triplet generation and add entity mapping mode (PR #149214)

2025-07-17 Thread S. VenkataKeerthy via llvm-branch-commits

https://github.com/svkeerthy updated 
https://github.com/llvm/llvm-project/pull/149214

>From 1212c724f1e93daefada8ce591aba0b8390ea6d1 Mon Sep 17 00:00:00 2001
From: svkeerthy 
Date: Wed, 16 Jul 2025 22:03:56 +
Subject: [PATCH 1/2] revamp-triplet-gen



[llvm-branch-commits] [llvm] [IR2Vec][llvm-ir2vec] Revamp triplet generation and add entity mapping mode (PR #149214)

2025-07-17 Thread S. VenkataKeerthy via llvm-branch-commits

https://github.com/svkeerthy updated 
https://github.com/llvm/llvm-project/pull/149214

>From 83bba52eba431f776cdb1e051bad073b19aa9763 Mon Sep 17 00:00:00 2001
From: svkeerthy 
Date: Wed, 16 Jul 2025 22:03:56 +
Subject: [PATCH 1/2] revamp-triplet-gen


[llvm-branch-commits] [llvm] [IR2Vec][llvm-ir2vec] Revamp triplet generation and add entity mapping mode (PR #149214)

2025-07-17 Thread S. VenkataKeerthy via llvm-branch-commits

https://github.com/svkeerthy updated 
https://github.com/llvm/llvm-project/pull/149214

>From db6db83e5ee2ce1503bd041cbb975b36c0fc59c9 Mon Sep 17 00:00:00 2001
From: svkeerthy 
Date: Wed, 16 Jul 2025 22:03:56 +
Subject: [PATCH 1/2] revamp-triplet-gen

