Bug#995406: [Debian-med-packaging] Bug#995406: bbmap: package does not ship resource files

2021-10-04 Thread Étienne Mollier
Hi Robert,

Thank you for enlightening me on bbmap usage!  Accessing
datasets over the Internet has a history of making tests flaky,
and injecting lots of data into the debian/tests/ directory
does not play well with the disk usage of the Debian archive.
Thanks to your explanations, though, I could identify a dataset
already available in the archive that seems suitable for
autopkgtest, following your indications.

I am wrapping up a fairly concise test suite which should be
representative enough of the processing steps covered by the bbmap
tools to hopefully catch issues automatically.  I should be able
to push something in the coming days.

Thank you again!

Have a nice day,  :)
-- 
Étienne Mollier 
Fingerprint:  8f91 b227 c7d6 f2b1 948c  8236 793c f67e 8f0d 11da
Sent from /dev/pts/3, please excuse my verbosity.




Bug#995406: [Debian-med-packaging] Bug#995406: bbmap: package does not ship resource files

2021-10-04 Thread Robert
Hi,

With respect to the package test: the two fastq input files have to
match, since the "reads", i.e. fastq records, typically come in pairs:
one file has the forward reads and the other the respective reverse
reads, in the same order.  I tried sending this reply with a
suitable set of files attached (a 2-3 MB attachment), but it seems the
email didn't make it through.  However, you can easily get publicly
available data deposited at ncbi.nlm.nih.gov in the Sequence Read
Archive (SRA), as follows on a Debian system: first install the
sra-toolkit package and then run

fasterq-dump SRR492190

which gives you two 435 MB decompressed fastq files.  To reduce
resource usage, maybe take only the first 4 lines of each file (pipe
through "head -n 4"; the line count has to be a multiple of 4).
That leaves one read pair, which is good enough for a test.  A
somewhat typical quality control run would then be

bbduk.sh in1=SRR492190_1.1.fastq in2=SRR492190_2.1.fastq \
    qtrim=rl trimq=15 minlen=75 out=out.fastq

which should take a second or so.  You can also keep the files gzip
compressed to save on storage, and run this as:

bbduk.sh in1=SRR492190_1.1.fastq.gz in2=SRR492190_2.1.fastq.gz \
    qtrim=rl trimq=15 minlen=75 out=out.fastq.gz

I'd recommend doing that for testing the package.  The bbduk
command I used in the original bug report, by contrast, complains
about some missing reference data:

**  WARNING! A KMER OPERATION WAS CHOSEN BUT NO KMERS WERE LOADED.  **
**  YOU NEED TO SPECIFY A REFERENCE FILE OR LITERAL SEQUENCE.   **

You would normally supply that reference data with the ref= option,
whereas the command above doesn't need it and runs without any such
warnings.
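
To illustrate the subsampling step described above, here is a minimal
sketch.  It assumes plain 4-line fastq records (no wrapped sequences);
the function names subsample_fastq and count_records, the file names,
and the synthetic reads are made up for the example and are not part
of bbmap or sra-toolkit.

```shell
#!/bin/sh
set -eu

# Keep the first N records of a FASTQ file and gzip the result.
# Assumption: every record spans exactly 4 lines.
# $1 = input FASTQ, $2 = number of records, $3 = output .gz path
subsample_fastq() {
    head -n "$(( $2 * 4 ))" "$1" | gzip -c > "$3"
}

# Print the number of 4-line records in a gzip-compressed FASTQ file.
count_records() {
    echo "$(( $(gzip -dc "$1" | wc -l) / 4 ))"
}

# Demo on synthetic data (made-up reads, not real SRA content):
# three records per mate file, in the same order in both files.
workdir=$(mktemp -d)
for mate in 1 2; do
    for i in 1 2 3; do
        printf '@read%s/%s\nACGTACGTAC\n+\nIIIIIIIIII\n' "$i" "$mate"
    done > "$workdir/sample_$mate.fastq"
done

# Take the same number of records from each file so the pairs stay in sync.
subsample_fastq "$workdir/sample_1.fastq" 2 "$workdir/sub_1.fastq.gz"
subsample_fastq "$workdir/sample_2.fastq" 2 "$workdir/sub_2.fastq.gz"

# Both files must hold the same number of records to be usable as a pair.
n1=$(count_records "$workdir/sub_1.fastq.gz")
n2=$(count_records "$workdir/sub_2.fastq.gz")
[ "$n1" -eq "$n2" ] || { echo "record counts differ: $n1 vs $n2" >&2; exit 1; }
echo "both files hold $n1 records"
```

Taking the same record count from the start of both files keeps the
forward and reverse reads in matching order, which is what in1=/in2=
expect.  The resulting .gz files could then be passed to the bbduk.sh
command above.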

--Robert



Bug#995406: [Debian-med-packaging] Bug#995406: bbmap: package does not ship resource files

2021-09-30 Thread Étienne Mollier
Control: found -1 38.90+dfsg-1
Control: tag -1 confirmed

Hi all,

Andreas Tille, on 2021-09-30:
> On Thu, Sep 30, 2021 at 01:22:23PM -0400, Robert wrote:
> > The bbmap package does not ship the needed resource files which causes
> > some of the included tools not to work, e.g. bbduk when trying to process
> > some fastq data, crashes with output like [1].
> 
> Thanks a lot for the report.  It's extremely helpful since several of our
> maintainers are not using this software and we really need to rely on
> user input.

Thank you Robert!  Your report is very useful indeed!

[…]
> > $ bbduk.sh in1=fwd.fastq in2=rev.fastq ktrim=r k=21 mink=8 hdist=2 ftm=5 tpe tbo threads=48 out=out.fastq
> > java -ea -Xmx76702m -Xms76702m -cp /usr/share/java/bbmap.jar jgi.BBDuk in1=fwd.fastq in2=rev.fastq ktrim=r k=21 mink=8 hdist=2 ftm=5 tpe tbo threads=48 out=out.fastq
> > Executing jgi.BBDuk [in1=fwd.fastq, in2=rev.fastq, ktrim=r, k=21, mink=8, hdist=2, ftm=5, tpe, tbo, threads=48, out=out.fastq]
> > Version 38.90
> > 
> > Set threads to 48
> > maskMiddle was disabled because useShortKmers=true
> > Warning!  Cannot find primes.txt.gz /tmp/bbduk_test/file:/usr/share/java/bbmap.jar!/primes.txt.gz
> > at jgi.BBDuk.main(BBDuk.java:78)
> 
> If we could turn this into a test, I could upload with the test included.

Andreas, I pulled some data files from python-biopython-doc,
and I think I managed to reproduce the problem on my end:

$ bbduk.sh \
    in1=/usr/share/doc/python-biopython-doc/Tests/Quality/example.fastq \
    in2=/usr/share/doc/python-biopython-doc/Tests/Quality/solexa_example.fastq \
    ktrim=r k=21 mink=8 hdist=2 ftm=5 tpe tbo threads=48 \
    out=out.fastq
java -ea -Xmx7195m -Xms7195m -cp /usr/share/java/bbmap.jar jgi.BBDuk in1=/usr/share/doc/python-biopython-doc/Tests/Quality/example.fastq in2=/usr/share/doc/python-biopython-doc/Tests/Quality/solexa_example.fastq ktrim=r k=21 mink=8 hdist=2 ftm=5 tpe tbo threads=48 out=out.fastq
Executing jgi.BBDuk [in1=/usr/share/doc/python-biopython-doc/Tests/Quality/example.fastq, in2=/usr/share/doc/python-biopython-doc/Tests/Quality/solexa_example.fastq, ktrim=r, k=21, mink=8, hdist=2, ftm=5, tpe, tbo, threads=48, out=out.fastq]
Version 38.93

Set threads to 48
maskMiddle was disabled because useShortKmers=true
Warning!  Cannot find primes.txt.gz /home/emollier/tmp/bbduk_test/file:/usr/share/java/bbmap.jar!/primes.txt.gz
java.lang.Exception
	at dna.Data.findPath(Data.java:1247)
	at dna.Data.findPath(Data.java:1194)
	at shared.Primes.fetchPrimes(Primes.java:167)
	at shared.Primes.<clinit>(Primes.java:177)
	at kmer.ScheduleMaker.<init>(ScheduleMaker.java:155)
	at jgi.BBDuk.<init>(BBDuk.java:964)
	at jgi.BBDuk.main(BBDuk.java:78)
Exception in thread "main" java.lang.ExceptionInInitializerError
	at kmer.ScheduleMaker.<init>(ScheduleMaker.java:155)
	at jgi.BBDuk.<init>(BBDuk.java:964)
	at jgi.BBDuk.main(BBDuk.java:78)
Caused by: java.lang.NullPointerException
	at fileIO.ByteFile.<init>(ByteFile.java:43)
	at fileIO.ByteFile1.<init>(ByteFile1.java:98)
	at fileIO.ByteFile1.<init>(ByteFile1.java:94)
	at shared.Primes.fetchPrimes(Primes.java:169)
	at shared.Primes.<clinit>(Primes.java:177)
	... 3 more

I tested the patch written by Robert and applied by Andreas, and
the processing now gets much further.  For the autopkgtest, note
that I had to pick a dataset with the same dimensions in both
files; otherwise the processing fails, presumably because of
intrinsic data inconsistencies:

$ bbduk.sh \
    in1=/usr/share/doc/python-biopython-doc/Tests/Quality/wrapping_as_sanger.fastq \
    in2=/usr/share/doc/python-biopython-doc/Tests/Quality/wrapping_as_solexa.fastq \
    ktrim=r k=21 mink=8 hdist=2 ftm=5 tpe tbo threads=48 \
    out=out.fastq
java -ea -Xmx7140m -Xms7140m -cp /usr/share/java/bbmap.jar jgi.BBDuk in1=/usr/share/doc/python-biopython-doc/Tests/Quality/wrapping_as_sanger.fastq in2=/usr/share/doc/python-biopython-doc/Tests/Quality/wrapping_as_solexa.fastq ktrim=r k=21 mink=8 hdist=2 ftm=5 tpe tbo threads=48 out=out.fastq
Executing jgi.BBDuk [in1=/usr/share/doc/python-biopython-doc/Tests/Quality/wrapping_as_sanger.fastq, in2=/usr/share/doc/python-biopython-doc/Tests/Quality/wrapping_as_solexa.fastq, ktrim=r, k=21, mink=8, hdist=2, ftm=5, tpe, tbo, threads=48, out=out.fastq]
Version 38.93

Set threads to 48
maskMiddle was disabled because useShortKmers=true
0.018 seconds.
Initial:
Memory: max=7486m, total=7486m, free=7467m, used=19m

**  WARNING! A KMER OPERATION