Edmon, I'd love to help figure out what is going on. A couple of questions:
* What file system are you reading from? HDFS? One of the S3-based ones? Local?
* Would it be possible to send me ([email protected]) the file's metadata from orcfiledump? (A command sketch follows the quoted message below.)
* Do you know if MLlib is having the reader seek? At 100 MB, it should just read the file into memory.

Thanks,
   Owen

On Thu, Apr 14, 2016 at 9:50 PM, Edmon Begoli <[email protected]> wrote:
> Hi,
>
> We are running some experiments with Spark and ORC, Parquet, and plain CSV
> files, and we are observing some interesting effects.
>
> The dataset we are initially looking into is smallish - ~100 MB as CSV -
> and we encode it into Parquet and ORC.
>
> When we run Spark SQL aggregate queries, we get an insane performance
> speedup: close to 10x at times, and consistently 2-4x.
>
> General count queries are slower.
>
> When we run MLlib random trees, we get a very unusual performance result.
>
> The CSV run takes about 40 seconds, Parquet about 25, and ORC 60 seconds.
>
> I have a few intuitions for why performance on the aggregate queries is
> so good (sub-indexing and row/column group internal statistics), but I am
> not quite clear on the random forest performance.
>
> Is the ORC decoding algorithm or data retrieval inefficient for these
> kinds of ML jobs?
>
> This is for a performance study, so any insight would be highly
> appreciated.
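The metadata Owen asks for can be printed with the ORC file dump utility that ships with Hive; the path below is a placeholder:

    hive --orcfiledump /path/to/file.orc

For readers reproducing the experiment, here is a minimal sketch of the setup Edmon describes, written against the Spark 1.6-era Scala API. The paths, the column names ("category", "value"), and the aggregate query itself are assumptions for illustration, not details from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Spark 1.6-era setup; ORC support lives in the Hive context.
    val conf = new SparkConf().setAppName("FormatBenchmark")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    // Read the ~100 MB CSV (schema inference via the spark-csv package)
    // and re-encode it as Parquet and ORC. Paths and column names
    // ("category", "value") are placeholders, not from the thread.
    val csv = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/input.csv")

    csv.write.parquet("/data/input.parquet")
    csv.write.orc("/data/input.orc")

    // Wall-clock timing helper for a single action.
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    // Run the same aggregate query against each encoding.
    for ((fmt, df) <- Seq(
           "csv"     -> csv,
           "parquet" -> sqlContext.read.parquet("/data/input.parquet"),
           "orc"     -> sqlContext.read.orc("/data/input.orc"))) {
      time(s"$fmt aggregate") {
        df.groupBy("category").avg("value").collect()
      }
    }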

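Likewise, a sketch of the MLlib "random trees" run over the same data, using the RDD-based RandomForest API of that era. It assumes an all-numeric schema with a binary label in the first column, which is a guess about the dataset rather than something stated in the thread:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    // Convert DataFrame rows to LabeledPoints: label in column 0,
    // remaining columns as dense numeric features (an assumption
    // about the schema, for illustration only).
    val points = sqlContext.read.orc("/data/input.orc").rdd.map { row =>
      val values = (0 until row.length).map(i => row.getDouble(i)).toArray
      LabeledPoint(values.head, Vectors.dense(values.tail))
    }.cache()

    // Train a random forest; hyperparameters are arbitrary defaults.
    val model = RandomForest.trainClassifier(
      points,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 5,
      maxBins = 32)

Note that this path goes through the RDD API, so each run decodes the entire file into LabeledPoints rather than benefiting from the column pruning and predicate pushdown that help the SQL aggregates, which may bear on the timing differences Edmon reports.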