Accumulator map

2015-06-05 Thread Cosmin Cătălin Sanda
Hi,

I am trying to gather some statistics from an RDD using accumulators.
Essentially, I am counting how many times specific segments appear in each
row of the RDD. This works fine and I get the expected results, but the
problem is that each time I add a new segment to look for, I have to
explicitly create an Accumulator for it and explicitly use the Accumulator
in the foreach method.

Is there a way to use a dynamic collection of Accumulators in Spark? I want
to control the segments from a single place in the code and have the
accumulators created and updated dynamically based on that list.
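
Concretely, something along these lines is what I am after: a minimal sketch
in which the segment names and the input path are purely illustrative, the
rows are plain text lines, and all the accumulators are built from the one
list of segments.

import org.apache.spark.{SparkConf, SparkContext}

object SegmentCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("segment-counts"))

    // The single place where the segments are controlled.
    val segments = Seq("login", "checkout", "search")

    // One named accumulator per segment, created dynamically from the list above.
    val counters = segments.map(s => s -> sc.accumulator(0, s)).toMap

    // Illustrative input; in reality this is the RDD I am scanning.
    val logLines = sc.textFile("hdfs:///logs/2015-06-05")

    // Count, per segment, how many lines contain it.
    logLines.foreach { line =>
      segments.foreach { s =>
        if (line.contains(s)) counters(s) += 1
      }
    }

    // Accumulator values are only reliable on the driver, after the action has run.
    counters.foreach { case (seg, acc) => println(seg + ": " + acc.value) }

    sc.stop()
  }
}

Is keeping the accumulators in a map like this a reasonable, supported
pattern, or is there a more idiomatic way?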

BR,

*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35


Error when saving as parquet to S3

2015-04-30 Thread Cosmin Cătălin Sanda
After repartitioning a DataFrame in Spark 1.3.0, I get a Parquet exception
when saving to Amazon's S3. The data I am trying to write is about 10 GB.

logsForDate
.repartition(10)
.saveAsParquetFile(destination) // <-- Exception here

The exception I receive is:

java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:635)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I would like to know what the problem is and how to solve it.


*Cosmin Catalin SANDA*
Software Systems Engineer
Phone: +45.27.30.60.35


Disable partition discovery

2015-04-27 Thread Cosmin Cătălin Sanda
How can one disable *Partition discovery* in *Spark 1.3.0* when using
*sqlContext.parquetFile*?

Alternatively, is there a way to load *.parquet* files without *Partition
discovery*?
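
For context, this is roughly what I am experimenting with; a minimal sketch
where the S3 path is illustrative. I came across the
spark.sql.parquet.useDataSourceApi setting, which as far as I understand
makes Spark 1.3.x fall back to the older Parquet code path that does not run
the new partition discovery, but I am not sure it is the intended way:

// The path is illustrative. If I understand correctly, the flag below makes
// Spark 1.3.x use the old (pre data source API) Parquet reader, which does
// not infer partition columns from key=value directory names.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")

val logs = sqlContext.parquetFile("s3n://my-bucket/logs/2015-04-27")
logs.registerTempTable("logs")

Is there a cleaner way to keep the new code path but turn partition discovery
off?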

Cosmin