Hi Robert,
I just use textFile. Here is the simple code:
val fs3File=sc.textFile("s3n://my bucket/myfolder/")
fs3File.count
do you suggest I should use sc.parallelize?
many thanks
From: Robert Collich [mailto:[email protected]]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3
Hi Hao,
Could you please post the corresponding code? Are you using textFile or
sc.parallelize?
On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao
<[email protected]<mailto:[email protected]>> wrote:
When I tried to read multiple bz2 files from s3, I have the following warning
messages. What is the problem here?
16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Confidentiality Notice:: This email, including attachments, may include
non-public, proprietary, confidential or legally privileged information. If you
are not an intended recipient or an authorized agent of an intended recipient,
you are hereby notified that any dissemination, distribution or copying of the
information contained in or transmitted with this e-mail is unauthorized and
strictly prohibited. If you have received this email in error, please notify
the sender by replying to this message and permanently delete this e-mail, its
attachments, and any copies of it immediately. You should not retain, copy or
use this e-mail or any attachment for any purpose, nor disclose all or any part
of the contents to any other person. Thank you.
Confidentiality Notice:: This email, including attachments, may include
non-public, proprietary, confidential or legally privileged information. If
you are not an intended recipient or an authorized agent of an intended
recipient, you are hereby notified that any dissemination, distribution or
copying of the information contained in or transmitted with this e-mail is
unauthorized and strictly prohibited. If you have received this email in
error, please notify the sender by replying to this message and permanently
delete this e-mail, its attachments, and any copies of it immediately. You
should not retain, copy or use this e-mail or any attachment for any purpose,
nor disclose all or any part of the contents to any other person. Thank you.