Angelo,
Can you please share an example? That will help in fixing this issue.

> On Feb 22, 2019, at 09:40, Angelo Mantellini <[email protected]> wrote:
>
> Hi,
> I tried the patch, but I see that the lines are always corrupted after the
> first exception.
> So, if my corrupted line is in the first row, the rest of the file is
> corrupted.
>
> On 10/02/2019, 18:01, "Charles Givre" <[email protected]> wrote:
>
> Actually, some good news here…
> I ran some test queries on the corrupted file and it seemed to work pretty
> well. I didn’t get any exceptions!
>
> jdbc:drill:zk=local> select src_ip, COUNT(*) as packet_count from dfs.test.`testv1.pcap` WHERE is_corrupt=1 GROUP BY src_ip ORDER BY packet_count DESC
> . . . . . . .semicolon> LIMIT 10;
> +-----------------------------------------+---------------+
> | src_ip                                  | packet_count  |
> +-----------------------------------------+---------------+
> | 150.249.255.161                         | 176           |
> | 150.249.255.24                          | 28            |
> | 131.38.3.15                             | 26            |
> | 111.248.196.128                         | 25            |
> | 202.13.230.242                          | 20            |
> | 163.28.217.199                          | 19            |
> | 27.18.36.151                            | 18            |
> | 2001:320f:c2ed:8693:1dff:f8f8:500:f1ed  | 17            |
> | 203.70.190.81                           | 16            |
> | 203.70.182.104                          | 13            |
> +-----------------------------------------+---------------+
> 10 rows selected (0.944 seconds)
>
> select src_ip, dst_ip from dfs.test.`testv1.pcap` WHERE is_corrupt=1 LIMIT 10;
> +------------------+------------------+
> | src_ip           | dst_ip           |
> +------------------+------------------+
> | 118.233.244.60   | 150.249.255.161  |
> | 150.249.255.161  | 165.63.110.188   |
> | 150.249.255.161  | 165.63.110.188   |
> | 172.40.96.180    | 131.39.133.22    |
> | 150.249.255.161  | 165.63.110.188   |
> | 150.249.255.161  | 165.63.110.188   |
> | 150.249.255.161  | 165.63.110.188   |
> | 150.249.255.161  | 165.63.110.188   |
> | 150.249.162.60   | 180.32.119.25    |
> | 150.249.255.161  | 165.63.110.188   |
> +------------------+------------------+
> 10 rows selected (1.031 seconds)
>
> 0: jdbc:drill:zk=local> SELECT src_port , dst_port , src_mac_address , dst_mac_address
> . . . . . . .semicolon> FROM dfs.test.`testv1.pcap`
> . . . . . . .semicolon> WHERE is_corrupt =1 LIMIT 10;
> +-----------+-----------+--------------------+--------------------+
> | src_port  | dst_port  | src_mac_address    | dst_mac_address    |
> +-----------+-----------+--------------------+--------------------+
> | 57058     | 443       | 00:0C:DB:1F:72:41  | 88:E0:F3:7A:66:F0  |
> | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
> | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
> | 443       | 55972     | 00:0C:DB:1F:72:41  | CC:4E:24:1F:4E:00  |
> | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
> | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
> | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
> | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
> | 4016      | 7699      | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
> | 80        | 20706     | 00:0C:DB:1F:72:41  | 00:12:E2:C0:3F:09  |
> +-----------+-----------+--------------------+--------------------+
> 10 rows selected (0.751 seconds)
>
> SELECT getCountryName(src_ip) AS country, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` WHERE is_corrupt=1 GROUP BY getCountryName(src_ip) ORDER BY packet_count DESC LIMIT 10;
> +----------------+---------------+
> | country        | packet_count  |
> +----------------+---------------+
> | Japan          | 269           |
> | Taiwan         | 124           |
> | United States  | 105           |
> | Unknown        | 49            |
> | China          | 26            |
> | South Korea    | 8             |
> | Australia      | 4             |
> | Germany        | 3             |
> | Hong Kong      | 2             |
> | Italy          | 1             |
> +----------------+---------------+
> 10 rows selected (1.519 seconds)
>
> SELECT is_corrupt, COUNT(*) as packet_count FROM dfs.test.`testv1.pcap` GROUP BY is_corrupt;
> +-------------+---------------+
> | is_corrupt  | packet_count  |
> +-------------+---------------+
> | 0           | 6408          |
> | 1           | 592           |
> +-------------+---------------+
> 2 rows selected (0.931 seconds)
>
> This PCAP file worked well with Superset also.
>
>> On Feb 10, 2019, at 10:59, Charles Givre <[email protected]> wrote:
>>
>> If I can get some more examples of corrupted files I’ll test more
>> thoroughly. Also, we’ll need to apply the same methodology to PCAP-NG, so
>> I’ll need some examples there as well. My strategy is going to be to get as
>> much data as possible out of the corrupt packet.
>> -- C
>>
>>> On Feb 10, 2019, at 10:54, Ted Dunning <[email protected]> wrote:
>>>
>>> I think that accessing fields in corrupted packets will also cause
>>> exceptions. But this is a great start. Conditionalizing field access on
>>> !is_corrupt() might be sufficient for the next step.
>>>
>>>> On Sun, Feb 10, 2019 at 4:58 AM Charles Givre <[email protected]> wrote:
>>>>
>>>> All,
>>>> I posted the following PR for this issue:
>>>> https://github.com/apache/drill/pull/1637
>>>>
>>>> Basically this PR does two things.
>>>> 1. It creates a boolean column called is_corrupt and
>>>> 2. If the PCAP file has a corrupt row, it marks that row as corrupt by
>>>>    setting is_corrupt to true and keeps going
>>>>
>>>> With the example from Giovanni, I was able to find 590 or so corrupt rows
>>>> out of 7000 in that PCAP file. It was late and I don’t know if that was
>>>> what it was supposed to find, but it worked and I was able to query that.
>>>> If you guys could send a few more examples, I’d like to test this on other
>>>> files to make sure it works with them. We’re also going to have to do the
>>>> same thing for the PCAP-NG format I would assume.
>>>>
>>>>> On Feb 10, 2019, at 03:07, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>> On Sat, Feb 9, 2019 at 2:25 PM Bob Rudis <[email protected]> wrote:
>>>>>
>>>>>> ...
>>>>>> And, I did indeed find a few and am just waiting for a formal review so
>>>>>> I can submit them for the Drill dev & tests.
>>>>>
>>>>> Awesome!
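
For readers following the thread, the two ideas quoted above (flag a bad row with is_corrupt and keep going, then guard field access on rows that are not corrupt) can be sketched roughly as below. This is only an illustration in Python, not the code from the Drill PR: the frame layout (two 16-bit ports) and the names `read_rows` and `field` are assumptions made up for the example. It also assumes the frames have already been split correctly, so it says nothing about recovering alignment after a bad record.

```python
import struct
from typing import Iterable, Iterator, Optional


def read_rows(frames: Iterable[bytes]) -> Iterator[dict]:
    """Yield one row per frame, flagging undecodable frames instead of raising."""
    for frame in frames:
        row = {"src_port": None, "dst_port": None, "is_corrupt": False}
        try:
            # Hypothetical layout: two big-endian unsigned 16-bit port numbers.
            row["src_port"], row["dst_port"] = struct.unpack_from("!HH", frame, 0)
        except struct.error:
            # Frame is too short to decode: mark the row and keep going.
            row["is_corrupt"] = True
        yield row


def field(row: dict, name: str) -> Optional[int]:
    """Return a field only when the row is not flagged as corrupt."""
    return None if row["is_corrupt"] else row[name]


# Three frames; the middle one is truncated and cannot be decoded.
frames = [struct.pack("!HH", 80, 20706), b"\x01", struct.pack("!HH", 443, 55972)]
rows = list(read_rows(frames))
corrupt_count = sum(1 for r in rows if r["is_corrupt"])
```

Here the reader never raises on the truncated frame, and later rows are still decoded; `field(row, "src_port")` returns `None` for the flagged row rather than touching data that may not be there.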
