[bug #64253] Suggestion - Add support for libmagic and xattr

2023-06-05 Thread raf
Follow-up Comment #7, bug #64253 (project findutils):

Fair enough. I've added it to rawhide. I'm just accepting that errors become
the text being matched, and documenting the fact. But yeah, it is weird.
Debugging search criteria could require using %w/%W to look for error
messages.

Maybe it was just designed for use by file(1) itself. But there are wrappers
for libmagic in python etc., so it might have many clients. Maybe it was a
deliberate choice so clients didn't need to worry about errors. The assumption
might be that the user always sees them. But it would be nice if it were also
possible for clients to be able to choose how they want to handle errors.


___

Reply to this item at:

  

___
Message sent via Savannah
https://savannah.gnu.org/




[bug #64253] Suggestion - Add support for libmagic and xattr

2023-06-03 Thread Bernhard Voelker
Update of bug #64253 (project findutils):

  Status: In Progress => None   

___

Follow-up Comment #6:

No, I don't plan to add a -printf format for mime/magic.
The output doesn't fit in well with the other output formats anyway,
because it's quite verbose.

If I were to add such a format, I'd start making
use of the "%{...}" syntax which is currently reserved for future use; hence
a "%{magic}" and "%{mime}" would fit - not sure about rawhide, though.

Re. magic/mime implementation:

First of all, it would be the first time find looks at file content, and
that processing (open/read/lookup/close) is orders of magnitude slower than
the other tests in find(1).

Having played with it for some time now, I have major qualms about
adding libmagic support:

a) file(1) has some more options than the -i (for mime output) option.
Of course, they're all available via flags in libmagic.
But it would be strange to have to add further flags or knobs
in find(1) to support these options as well.
But people will ask for it - "just that one little thing".
Those are discussions we have to avoid.

b) error handling:
While libmagic has the flag MAGIC_ERROR to indicate an error
via return value NULL instead of placing the error string
into the magic result buffer, that does not work for
all cases, e.g. the simple open/EACCES case: we still get the error
message "regular file, no read permission" as magic string
instead of NULL.
Likewise file(1):

$ file -E /etc/sudoers; echo $?
/etc/sudoers: regular file, no read permission
0

I looked into the file/libmagic code, and found various such places.
Also the library function magic_error() does not report
an unreadable file as an error.
We'd have to single out every such error by string matching,
which I'm not willing to do.  Proper error handling seems to
be tough with libmagic.  I'm not sure how and which other
projects are using libmagic, but the current state of error
handling doesn't work for how find(1) would need it.

c)
It's not really find's business to look at the content of files,
and there are already ways to do the filtering with file(1) as shown
below (*): searching for magic strings or mime types can already be
done "the UNIX way" (i.e., one tool for one purpose).

Even if one wants to continue with find(1) post-processing after the
"magic check", that is safe with the -files0-from option for any kind of
exotic file names, including those with control or newline characters:


$ find -type f -size -4c -mtime -10 -exec file -00 '{}' + \
| sed -nz 'h;n;/^C source/{g;p}' \
| find -files0-from - -printf "* %p\n  size: %s\n  inode: %i\n"
* ./find/defs.h
  size: 19707
  inode: 216534
* ./find/util.c
  size: 29571
  inode: 217027
* ./find/pred.c
  size: 37310
  inode: 152256


Obviously, that would be much easier if file(1) provided options
to filter by certain magic/mime strings (as it does to exclude some tests).

I was quite enthusiastic about adding libmagic in the beginning,
but with the issues described above - above all the problematic error
handling -, I'm afraid I can't add libmagic support now.
I'm inclined to abandon or throw away my local work.
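
As a rough illustration of the error-handling problem in b): a script that
calls file(1) today can at least rule out the unreadable-file case before
asking for a magic string, so that "no read permission" never shows up as
text to be matched. This is only a sketch; the function name magic_or_fail
is made up here, and the check does not help when running as root or for the
other error strings that libmagic places in its result.

```shell
# Print the magic string for a file, or fail with a non-zero status.
# magic_or_fail is a hypothetical helper name, not part of file(1) or find(1).
magic_or_fail() {
    # Reject anything that is not a readable regular file up front, so the
    # caller never has to recognise an error message in the magic text.
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        printf '%s: not a readable regular file\n' "$1" >&2
        return 1
    fi
    # -b: brief output, i.e. do not prepend the file name.
    file -b -- "$1"
}
```

In a loop over candidate files, something like
magic_or_fail "$f" || continue
would then skip the problem cases instead of matching on their error text.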

(*) Some day, someone else may come up with the idea that there's
a little library to get the content of cell A1 of a spreadsheet file,
or the title of a PDF file.  I don't believe it's a good idea to link
all those libraries; instead, I'd encourage people to write tools which
fit well into UNIX pipes and pass the remaining file names on as
safe, Zero-terminated strings.







[bug #64253] Suggestion - Add support for libmagic and xattr

2023-06-02 Thread raf
Follow-up Comment #5, bug #64253 (project findutils):

Are there going to be new corresponding -printf % format conversions? If so,
what are the letters going to be (if they are single letters)? I'd like to use
the same notation in my rawhide program if possible.

The only good choices for me are:

%o %O
%q %Q
%w %W

They are the only pairs of letters I have left, and I think it makes sense to
try to use an uppercase and lowercase version of the same letter for the
(magic number) file type and mime type.

I have a preference for %w=magic and %W=mime, but only because "w" is the only
available letter where I can think of a good mnemonic (i.e. "what"). Although
"q" for "qualities" is OK too.






Re: [bug #64253] Suggestion - Add support for libmagic and xattr

2023-06-02 Thread Andreas Metzler
On 2023-06-02 raf wrote:
> On Thu, Jun 01, 2023 at 07:14:55PM +0200, Andreas Metzler wrote:
[...]
> > file reads from $(datadir)/misc and libmagic-mgc ships
> > /usr/lib/file/magic.mgc which is symlinked to /usr/share/misc/magic.mgc.
> > 
> > cu Andreas
> 
> Thanks. On Debian the symlink is in the other direction.
> 
>   > l /usr/share/misc/magic.mgc /usr/share/file/magic.mgc /usr/lib/file/magic.mgc
>   /usr/lib/file/magic.mgc
>   /usr/share/file/magic.mgc -> ../../lib/file/magic.mgc
>   /usr/share/misc/magic.mgc -> ../../lib/file/magic.mgc

Eh, no. ;-) That is exactly what I described: "libmagic-mgc ships
/usr/lib/file/magic.mgc" (i.e. that is the file) and "is symlinked to" ...
(i.e. here is the symlink).

cu Andreas
-- 
`What a good friend you are to him, Dr. Maturin. His other friends are
so grateful to you.'
`I sew his ears on from time to time, sure'



Re: [bug #64253] Suggestion - Add support for libmagic and xattr

2023-06-01 Thread raf
On Thu, Jun 01, 2023 at 07:14:55PM +0200, Andreas Metzler wrote:

> On 2023-06-01 raf wrote:
> [...]
> > but both locations are empty. /usr/share/misc/magic is
> > a symlink to /usr/share/file/magic which is empty. I
> > wonder how it works.
> [...]
> 
> file reads from $(datadir)/misc and libmagic-mgc ships
> /usr/lib/file/magic.mgc which is symlinked to /usr/share/misc/magic.mgc.
> 
> cu Andreas

Thanks. On Debian the symlink is in the other direction.

  > l /usr/share/misc/magic.mgc /usr/share/file/magic.mgc /usr/lib/file/magic.mgc
  /usr/lib/file/magic.mgc
  /usr/share/file/magic.mgc -> ../../lib/file/magic.mgc
  /usr/share/misc/magic.mgc -> ../../lib/file/magic.mgc

cheers,
raf




Re: [bug #64253] Suggestion - Add support for libmagic and xattr

2023-06-01 Thread Andreas Metzler
On 2023-06-01 raf wrote:
[...]
> but both locations are empty. /usr/share/misc/magic is
> a symlink to /usr/share/file/magic which is empty. I
> wonder how it works.
[...]

file reads from $(datadir)/misc and libmagic-mgc ships
/usr/lib/file/magic.mgc which is symlinked to /usr/share/misc/magic.mgc.

cu Andreas



Re: [bug #64253] Suggestion - Add support for libmagic and xattr

2023-05-31 Thread raf
On Wed, May 31, 2023 at 11:19:27PM +0200, Bernhard Voelker wrote:

> On 5/27/23 01:39, raf wrote:
> > I could be wrong, but I don't think a -mime predicate adds much value. Since
> > mimetypes are determined by file name extension anyway, [...]
> 
> I don't think (proper) tools determine the mime type of a file by its extension.
> Instead, they (should) always look at the content of the file.
> 
> Therefore, if we add -magic, then also adding -mime makes sense, because the
> latter yields output which seems to be standardized.
> 
> Have a nice day,
> Berny

If you say so. But my understanding is that mime types
are determined by looking up file name extensions in
the /etc/mime.types file, and using the corresponding
mime type. At least, that's very much how it looks on
Debian. That file must exist for a reason. But it does
look like libmagic can return mime types too, according
to libmagic(3). That's good.

But on my Debian vm, running strings on libmagic.a
and libmagic.so shows /etc/magic:/usr/share/misc/magic
which looks like the place to store mime type strings,
but both locations are empty. /usr/share/misc/magic is
a symlink to /usr/share/file/magic which is empty. I
wonder how it works.

According to magic(5), it looks like users can specify
mime information along with new magic data. Presumably,
the main magic database must have the mime information
inside it, and those other paths are just for
site-local additions.

Mind you, I'm having trouble finding the "main" magic
database. It must be around here somewhere...
Maybe it is magic. :-)
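
A small sketch for hunting down the compiled database. It rests on an
assumption that is not confirmed anywhere in this thread: that for each NAME
in the search path, libmagic also looks for a compiled NAME.mgc next to it.
The function name compiled_magic_dbs is invented for the example; the path
list is the one that strings showed above.

```shell
# List the compiled magic databases that exist for a colon-separated
# search path. Assumption: the compiled form of NAME lives at NAME.mgc.
compiled_magic_dbs() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
        if [ -e "$entry.mgc" ]; then
            printf '%s\n' "$entry.mgc"
        fi
    done
}

# Check the two locations found in the library.
compiled_magic_dbs '/etc/magic:/usr/share/misc/magic'
```

Running ls -l on whatever this prints should show whether the hit is the
real file or a symlink to it.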

cheers,
raf




Re: [bug #64253] Suggestion - Add support for libmagic and xattr

2023-05-31 Thread Bernhard Voelker

On 5/27/23 01:39, raf wrote:
> I could be wrong, but I don't think a -mime predicate adds much value. Since
> mimetypes are determined by file name extension anyway, [...]

I don't think (proper) tools determine the mime type of a file by its extension.
Instead, they (should) always look at the content of the file.

Therefore, if we add -magic, then also adding -mime makes sense, because the
latter yields output which seems to be standardized.
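
A quick way to see the difference, assuming file(1) is installed: give a
plain text file a misleading extension and ask for its mime type. The helper
name mime_of and the file name note.jpg are invented for this sketch.

```shell
# Print only the mime type that file(1) derives from the file's content.
# mime_of is a hypothetical helper name used for this illustration.
mime_of() {
    file -b --mime-type -- "$1"
}

# Only run the demonstration where file(1) is actually available.
if command -v file >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    printf 'hello\n' > "$tmp/note.jpg"   # text content, image-like name
    mime_of "$tmp/note.jpg"              # reports a text type, not an image
    rm -rf "$tmp"
fi
```

A lookup by extension would call this file an image; the content-based
lookup does not.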

Have a nice day,
Berny



Re: [bug #64253] Suggestion - Add support for libmagic and xattr

2023-05-31 Thread Bernhard Voelker

Without commenting here about -magic/-mime, i.e. just to discuss the given
statements on what is possible today.

On 5/25/23 21:18, anonymous wrote:
> Currently - with find : We need xargs and sed and so have to worry about
> whitespace paths and filenames, we are also spawning several sub-commands.
>
> find -type f |
>   xargs file |
>    sed -n 's/:.*PE32 executable.*//p' |
>     xargs my_command


With find(1), one does not have to "worry about whitespace". There are several
ways to stay on the safe side:
- executing per file (which may be inefficient):
$ find ... -exec $TOOL '{}' ';'
- bulk execution:
$ find ... -exec $TOOL '{}' +
- if $TOOL understands Zero-separated input (e.g. like grep):
$ find ... -print0 | $TOOL -z
- else
$ find ... -print0 | xargs -r0 $TOOL
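
To make the last variant concrete, here is a small self-contained sketch
that creates a scratch directory with awkward file names and counts how many
arguments arrive on the far side of the pipe. The directory and file names
are made up for the demonstration.

```shell
# Scratch directory with awkward names: a space, a newline, a leading dash.
demo=$(mktemp -d)
touch "$demo/plain" "$demo/with space" "$demo/with
newline" "$demo/-dash"

# NUL-separated pipeline: each file name arrives as exactly one argument,
# so the inner shell sees as many arguments as there are files.
safe=$(find "$demo" -type f -print0 | xargs -r0 sh -c 'echo "$#"' sh)

echo "files passed on safely: $safe"
rm -rf "$demo"
```

The count matches the number of files created, which is the property the
-print0 / -0 pairing is meant to guarantee.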

Re. file(1): unfortunately, this tool - although it has a --files-from option -
does not allow Zero-separated input.  For the search case, it would also come in
handy if file(1) had a --filter=PATTERN option, and furthermore allowed printing
only the file names matching the pattern for safe post-processing in other tools.

Today, one could efficiently and safely use something like this to find files
where file(1) returns a magic string matching PATTERN :

  $ find ... -exec file -00 '{}' + \
  | sed -nz 'h;n; /PATTERN/{g;p}' \
  | xargs -0 my_command
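
How the sed part works can be tried without file(1) at all. The sketch below
feeds sed a hand-made stream in the "file -00" layout (name, NUL,
description, NUL); the names and descriptions are invented for illustration,
and GNU sed is assumed for the -z option.

```shell
# Simulated "file -00" output: NAME NUL DESCRIPTION NUL, repeated per file.
# The entries are made up; real output depends on the files examined.
fake_file_output() {
    printf '%s\0%s\0' \
        './a.c'   'C source, ASCII text' \
        './b.txt' 'ASCII text' \
        './c d.c' 'C source, ASCII text'
}

# h: stash the name in the hold space; n: load the description;
# on a match, g copies the name back and p prints it (NUL-terminated).
matches=$(fake_file_output | sed -nz 'h;n;/^C source/{g;p}' | tr '\0' '\n')
printf '%s\n' "$matches"
```

Only the names whose description starts with "C source" come out, here
./a.c and ./c d.c; the tr at the end is just for display and would be
dropped when piping into xargs -0.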

Here's an example that filters on regular files smaller than 4 bytes, then lets
the "file ...|sed ..." pipe select the wanted magic string "C source", and finally
continues the search in a subsequent find(1) command.

  $ find -type f -size -4c -mtime -1 -exec file -00 '{}' + \
  | sed -nz 'h;n;/^C source/{g;p}' \
  | find/find -files0-from - -ls

Obviously, the file(1) run is always by far the most expensive part, because it
has to read all the files, but at least it is spawned as few times as possible,
which reduces the number of times the magic file has to be loaded.

Have a nice day,
Berny



[bug #64253] Suggestion - Add support for libmagic and xattr

2023-05-27 Thread Sam James
Follow-up Comment #4, bug #64253 (project findutils):

Speaking for a distro which tends to expose a lot of this choice to users
(Gentoo), it should be fine as long as there are explicit configure args for
each.

Lots of other packages do stuff like this too. It's only really a problem if
it's "automagic" (use-if-installed, no option to control it) or mandatory.
Truly optional stuff with --with-x or --enable-x is fine (or
--without/--disable, you get the idea).

Also, thanks! This sounds great.






[bug #64253] Suggestion - Add support for libmagic and xattr

2023-05-27 Thread Bernhard Voelker
Update of bug #64253 (project findutils):

  Status: None => In Progress
  Assigned to: None => berny

___

Follow-up Comment #3:

The request for xattr duplicates several other existing requests.






[bug #64253] Suggestion - Add support for libmagic and xattr

2023-05-27 Thread Bernhard Voelker
Follow-up Comment #2, bug #64253 (project findutils):

I started working on -[i]magic/-[i]mime.

Obviously, this potentially pulls in - depending on the build configuration
of libmagic - additional dependencies on libzstd, liblzma, libbz2, and libz.
As find(1) is used in many bootstrapping scenarios, this might mean that
those environments may need/want to build two flavors of findutils:
a) one small one without libmagic for the bootstrapping part, and
b) a larger one with libmagic for the final system.
At least the downstream maintainer has a choice ...







[bug #64253] Suggestion - Add support for libmagic and xattr

2023-05-26 Thread raf
Follow-up Comment #1, bug #64253 (project findutils):

Hi, As far as I know, all the versions of find that have an -xattr predicate
only allow searching by the name of the extended attribute (or perhaps just
the fact of their existence?).

I wrote a find alternative called rawhide (rh) that supports searching by
extended attribute names and values (glob or regex). So if you can't wait for
it in find, you can use rawhide. If the find developers think this is a good
idea, they are welcome to plunder rawhide for its extended attribute code. It
supports extended attributes on Linux, FreeBSD, macOS, Solaris, and Cygwin.

It doesn't support libmagic but I might add it. I wonder how useful it is. If
it is useful, it would make queries much faster than a separate file(1)
process per candidate file (but that works too). I'd like to think of some
good examples first to motivate it.

I could be wrong, but I don't think a -mime predicate adds much value. Since
mimetypes are determined by file name extension anyway, the same queries can
be done with normal globbing, and I suspect the resulting find commands would
often be shorter that way.

P.S. I don't think the emergence of .zip domains will have the effect on
operating systems that you anticipate. The use of misleading double extensions
has been around for years (e.g., "somethinginteresting.jpg   .exe").
This tld development doesn't seem very different.






[bug #64253] Suggestion - Add support for libmagic and xattr

2023-05-25 Thread anonymous
URL:
  

         Summary: Suggestion - Add support for libmagic and xattr
           Group: findutils
       Submitter: None
       Submitted: Thu 25 May 2023 07:18:47 PM UTC
        Category: find
        Severity: 3 - Normal
      Item Group: None
          Status: None
         Privacy: Public
     Assigned to: None
 Originator Name: Jay
Originator Email: the-m...@github.com
     Open/Closed: Open
         Release: None
 Discussion Lock: Any
   Fixed Release: None


___

Follow-up Comments:


---
Date: Thu 25 May 2023 07:18:47 PM UTC By: Anonymous
I've gone through past patches, bugs and suggestions and I was surprised I
could not find any mention of the obvious idea of adding support for libmagic
(magic | file | etc). I thought it might be a useful feature for find, so
here are some ideas.

Also while I think about this, and with the growing use of extended attributes
by applications, it may also make sense to think about including some sort of
xattr filter too. 
Most filesystems on which find is er..., found, have xattr capability and it
has been present in almost all contemporary operating system kernels for the
best part of two decades.
[https://man7.org/linux/man-pages/man7/xattr.7.html man xattr]


Homepage here:
[https://www.darwinsys.com/file/ file | magic | libmagic]

Google have convinced ICANN that file-extension-like TLDs such as '.zip' are a
good idea. This sets a ball rolling; others will follow. This means OSs will
finally have to adapt and accept that deciding what a file is for, based only
on part of the given file name, is naive, and will have to make use of the
actual contents (as IMO they should have a long time ago).  Very soon
extensions will mean little other than a hint to the human.
[https://www.wired.com/story/google-zip-mov-domains-phishing-risks/  WIRED]

Features like these would allow searching of folders for type rather than
extension, without extra levels of scripting.

e.g.
Currently - without find : this is inefficient as we can't add filters without
adding code and we are already spawning thousands of file instances.


for f in ./*; do
    # Glob match on the brief description; quote "$f" to survive whitespace.
    if [[ $(file -b -- "$f") == *"PE32 executable"* ]]; then
        my_command "$f"
    fi
done



Currently - with find : We need xargs and sed and so have to worry about
whitespace paths and filenames, we are also spawning several sub-commands.


find -type f |
 xargs file |
  sed -n 's/:.*PE32 executable.*//p' |
   xargs my_command


Conceptual new usage (syntax usage tbd)

# For libmagic
find . -magic ".*ELF.*x86_64.*" -not -path "./bin/*" -exec my_command  {} \;
find . -mime ".*application/x-dosexec.*" -not -path "./bin/*" -mv {} /Quaranteen/

# For xattr
find . -xattr 1 app.browser.url -xattr-substr 1 "http://download.org" -delete
find . -xattr 1 os.hash.blake2 -not -xattr-re 1 "^ERROR:bad hash.*" -exec hash whirlpool {} \;








