Owen, I am on travel, but I will try to send that over as soon as I am back.
It should not be a problem.

Edmon

On Mon, Apr 18, 2016 at 1:12 PM, Owen O'Malley <[email protected]> wrote:
> Edmon,
>   I'd love to help figure out what is going on. A couple of questions:
>
> * What file system are you reading from? HDFS? One of the S3-based ones?
>   Local?
> * Would it be possible to send me ([email protected]) the file's
>   metadata from orcfiledump?
> * Do you know if MLlib is having the reader seek? At 100 MB, it should
>   just read the file into memory.
>
> Thanks,
>    Owen
>
> On Thu, Apr 14, 2016 at 9:50 PM, Edmon Begoli <[email protected]> wrote:
>
>> Hi,
>>
>> We are running some experiments with Spark and ORC, Parquet, and plain
>> CSV files, and we are observing some interesting effects.
>>
>> The dataset we are initially looking at is smallish, ~100 MB (CSV), and
>> we encode it into Parquet and ORC.
>>
>> When we run Spark SQL aggregate queries, we get an insane performance
>> speedup: close to 10x, and consistently 2-4x.
>>
>> General count queries are slower.
>>
>> When we run MLlib random trees, we get a very unusual performance result.
>>
>> The CSV run takes about 40 seconds, Parquet about 25, and ORC about 60.
>>
>> I have a few intuitions for why the performance on the aggregate queries
>> is so good (sub-indexing and row/column group internal statistics), but I
>> am not quite clear on the performance of the random forest.
>>
>> Is the ORC decoding algorithm or data retrieval inefficient for these
>> kinds of ML jobs?
>>
>> This is for a performance study, so any insight would be highly
>> appreciated.
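The experiment Edmon describes boils down to timing the same aggregate query over the same dataset stored in each format, and reporting a stable number per format. A minimal sketch of that timing harness, in pure Python with a stand-in CSV aggregate so it runs without a Spark install (in the actual study each loader would be a Spark read of CSV, Parquet, or ORC; every name below is illustrative, not from the thread):

```python
import csv
import statistics
import tempfile
import time
from pathlib import Path

# In the real benchmark the "query" would be a Spark SQL aggregate over a
# DataFrame read from CSV, Parquet, or ORC. Here a small local CSV and a
# plain-Python sum stand in, so the harness itself is runnable as-is.

def make_sample_csv(path: Path, rows: int = 1000) -> None:
    """Write a small numeric CSV file to aggregate over."""
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        for i in range(rows):
            writer.writerow([i, i % 7])

def aggregate_csv(path: Path) -> float:
    """The 'query': sum of the value column (stand-in for a SQL aggregate)."""
    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        return sum(float(row["value"]) for row in reader)

def timed_runs(fn, path: Path, repeats: int = 5) -> float:
    """Median wall-clock time over several runs, to smooth out noise."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(path)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "sample.csv"
    make_sample_csv(p)
    total = aggregate_csv(p)       # 1000 rows of i % 7 sum to 2997.0
    t = timed_runs(aggregate_csv, p)
    print(f"sum(value) = {total}, median time = {t:.6f}s")
```

Taking the median of several repeats (rather than a single run) matters especially for a small ~100 MB dataset, where JVM warm-up and OS page-cache effects can dominate a first read and skew a per-format comparison.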
