[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-27 Thread dmalcolm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #9 from David Malcolm  ---
(In reply to Hans-Peter Nilsson from comment #8)
> (In reply to David Malcolm from comment #7)
> > The invalid UTF-8 in the patch seems to have broken the server-side script:
> 
> Maybe the not-really-utf8 files need to be marked in some way in the git
> repo to be safely handled for future checkout and updates, including the
> problematic scripting?  However, reading gitattributes(5) it's not obvious
> how.

Perhaps
  https://www.git-scm.com/docs/gitattributes#_marking_files_as_binary
though it's not clear whether we can do that for individual files (or whether
it's worth bothering).

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-24 Thread hp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #8 from Hans-Peter Nilsson  ---
(In reply to David Malcolm from comment #7)
> The invalid UTF-8 in the patch seems to have broken the server-side script:

Maybe the not-really-utf8 files need to be marked in some way in the git repo
to be safely handled for future checkout and updates, including the problematic
scripting?  However, reading gitattributes(5) it's not obvious how.

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-24 Thread dmalcolm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

David Malcolm  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from David Malcolm  ---
Should be fixed on trunk by r13-6861-gd495ea2b232f3e:

https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=d495ea2b232f3eb50155d7c7362c09a744766746

https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff_plain;h=d495ea2b232f3eb50155d7c7362c09a744766746

The invalid UTF-8 in the patch seems to have broken the server-side script:

Enumerating objects: 51, done.
Counting objects: 100% (51/51), done.
Delta compression using up to 64 threads
Compressing objects: 100% (29/29), done.
Writing objects: 100% (29/29), 7.74 KiB | 1.29 MiB/s, done.
Total 29 (delta 22), reused 0 (delta 0), pack-reused 0
remote: Traceback (most recent call last):
remote:   File "hooks/post_receive.py", line 118, in <module>
remote: post_receive(refs_data, args.submitter_email)
remote:   File "hooks/post_receive.py", line 65, in post_receive
remote: submitter_email)
remote:   File "hooks/post_receive.py", line 47, in post_receive_one
remote: update.send_email_notifications()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 189, in send_email_notifications
remote: self.__email_new_commits()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1031, in __email_new_commits
remote: commit, self.get_standard_commit_email(commit))
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1011, in __send_commit_email
remote: default_diff=email.diff)
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 946, in __maybe_get_email_custom_contents
remote: hook_input=json.dumps(hooks_data),
remote:   File "/usr/lib64/python2.7/json/__init__.py", line 244, in dumps
remote: return _default_encoder.encode(obj)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
remote: chunks = self.iterencode(o, _one_shot=True)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
remote: return _iterencode(o, 0)
remote: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 13147: invalid start byte
To git+ssh://gcc.gnu.org/git/gcc.git
   13ec81eb4c3..d495ea2b232  master -> master
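
To illustrate the failure mode in the traceback above, here is a minimal sketch of how a byte string containing a stray 0x80 fails strict UTF-8 decoding, and how a lenient decode before JSON serialization avoids the exception. It is written for Python 3, not the Python 2.7 the hook runs under, and it is not the git-hooks code: `dumps_tolerant` and the sample `diff` are hypothetical names chosen for illustration.

```python
import json


def dumps_tolerant(payload: dict) -> str:
    """Serialize payload to JSON, decoding any bytes values leniently.

    Hypothetical helper: not part of the real git-hooks scripts.
    """
    cleaned = {}
    for key, value in payload.items():
        if isinstance(value, bytes):
            # Invalid sequences become U+FFFD instead of raising.
            value = value.decode("utf-8", errors="replace")
        cleaned[key] = value
    return json.dumps(cleaned)


# A made-up diff fragment containing a stray 0x80 byte.
diff = b"+ const char *s = \"\x80\";\n"

try:
    diff.decode("utf-8")  # strict decoding is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

encoded = dumps_tolerant({"diff": diff})
```

Replacing bad bytes loses information, so a hook that must reproduce the diff exactly would need a different strategy; the sketch only shows one way to keep serialization from aborting.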

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-13 Thread joseph at codesourcery dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #6 from joseph at codesourcery dot com  ---
For diagnosis of non-UTF-8 in strings / comments, see commit 
0b8c57ed40f19086e30ce54faec3222ac21cc0df, "libcpp: Add -Winvalid-utf8 
warning [PR106655]" (implementing a new C++ requirement).

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-10 Thread hp at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

Hans-Peter Nilsson  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hp at gcc dot gnu.org

--- Comment #5 from Hans-Peter Nilsson  ---
While considering UTF-8 in SARIF files, please also have a look at PR105959.

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-10 Thread dmalcolm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #4 from David Malcolm  ---
(In reply to Andrew Pinski from comment #2)
> So I think there is a bug in that code ...

The issue is in sarif_builder::maybe_make_artifact_content_object, which uses:

 char *text_utf8 = maybe_read_file (filename);

where there's no guarantee that "text_utf8" is (ahem) actually utf-8.  Sorry
about that.

Working on a fix to make it use the input.cc source-quoting machinery, which
should handle encoding.
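
As a rough illustration of the missing check (in Python rather than GCC's C++, and not the actual fix), the sketch below reports where a buffer stops being valid UTF-8. The function name `first_invalid_utf8_offset` is an assumption made for this example.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the whole buffer is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte.
        return exc.start
    return None


# A source line with one bad byte inside a comment.
bad_offset = first_invalid_utf8_offset(b"int x; /* \x80 */")
# Properly encoded non-ASCII text passes the check.
good_offset = first_invalid_utf8_offset("caf\u00e9".encode("utf-8"))
```

A check like this only says whether the bytes are valid UTF-8; it cannot tell whether the file was meant to be in some other encoding.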

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-10 Thread dmalcolm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #3 from David Malcolm  ---
(In reply to Andrew Pinski from comment #1)
> I would have assumed you really need -finput-charset= for the non-utf8 ones
> if your LANG/LANGUAGE is not set to C/UTF8.

Yeah, but when complaining about encoding issues, the error message we emit
should at least be properly encoded :/

It's a major pain for my integration testing where two(?) bad bytes in one
source file lead to an unparseable .sarif file (out of thousands).

When quoting source in the .sarif output, we should ensure that the final JSON
output is all valid UTF-8, perhaps falling back to not quoting source for cases
where e.g.
- the source file isn't validly encoded, or
- the -finput-charset= is wrong, or
- the -finput-charset= is missing, or
- the source file (erroneously) uses a mixture of different encodings in
  different parts of itself
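
The fallback described above could look something like the following sketch, which assumes a hypothetical `make_artifact_content` helper that returns a dict with a "text" key, or None when the source cannot be quoted safely. It is a Python illustration only, not GCC's implementation.

```python
import json
from typing import Optional


def make_artifact_content(raw: bytes,
                          input_charset: str = "utf-8") -> Optional[dict]:
    """Return {"text": ...} if raw decodes cleanly, else None.

    Returning None means "do not quote the source", so the caller can
    still emit a well-formed JSON document.
    """
    try:
        text = raw.decode(input_charset)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: invalid or mixed encodings in the file.
        # LookupError: the charset name itself is unknown.
        return None
    return {"text": text}


latin1_source = b"/* caf\xe9 */"

# Charset missing: the bytes are not valid UTF-8, so skip quoting.
no_charset = make_artifact_content(latin1_source)
# Charset given correctly: the text is quoted and serializes cleanly.
with_charset = make_artifact_content(latin1_source, "latin-1")
serialized = json.dumps(with_charset)
```

One limitation: a wrong single-byte charset such as Latin-1 accepts any byte sequence, so a mismatched -finput-charset= can still decode "successfully" into the wrong characters. The output stays valid, but the quoted text may not match what the author wrote.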

We should probably also check that we do something sane for Trojan Source
attacks.
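
One possible check, sketched here in Python under the assumption that "Trojan Source" refers to bidirectional control characters hidden in source text, is to scan quoted source for the Unicode embedding, override, and isolate controls. `find_bidi_controls` is a hypothetical name, and nothing here reflects how GCC handles the case.

```python
from typing import List, Tuple

# Bidirectional embedding, override, and isolate controls:
# U+202A..U+202E and U+2066..U+2069.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(text: str) -> List[Tuple[int, str]]:
    """Return (index, code point label) for each bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


# A right-to-left override hidden between two letters.
hits = find_bidi_controls("a\u202eb")
clean = find_bidi_controls("plain ASCII text")
```

What to do with a hit (escape it, drop the quoted source, or pass it through) is a policy decision the sketch leaves open.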

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #2 from Andrew Pinski  ---
https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/Preprocessor-Options.html#index-finput-charset

The documentation even has the following:

-finput-charset=charset
Set the input character set, used for translation from the character set of the
input file to the source character set used by GCC. If the locale does not
specify, or GCC cannot get this information from the locale, the default is
UTF-8. This can be overridden by either the locale or this command-line option.
Currently the command-line option takes precedence if there’s a conflict.
charset can be any encoding supported by the system’s iconv library routine.

So I think there is a bug in that code ...

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

--- Comment #1 from Andrew Pinski  ---
I would have assumed you really need -finput-charset= for the non-utf8 ones if
your LANG/LANGUAGE is not set to C/UTF8.

[Bug analyzer/109098] Encoding errors on SARIF output for non-UTF-8 source files

2023-03-10 Thread dmalcolm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109098

David Malcolm  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-03-11
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1