what's the way to access the last element from another partition

2015-12-08 Thread Zhiliang Zhu
From within a given partition, it seems difficult to access the last element of 
another partition, but in my application I need to do exactly that. How can it be 
done?
Just by repartitioning / shuffling the RDD into one partition and getting the 
specific "last" element? Would that change the previous order among the elements, 
and would it therefore not work?
Thanks very much in advance!  

On Monday, December 7, 2015 11:32 AM, Zhiliang Zhu  
wrote:
 

  


On Monday, December 7, 2015 10:37 AM, DB Tsai  wrote:
 

Only the beginning and ending parts of the data need to be shuffled. The rest
within each partition can be compared without a shuffle.


Would you help by writing a little pseudo-code for it? It seems there is no 
shuffle-related API for this other than repartition.
Thanks a lot in advance!
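
For reference, a minimal Scala sketch (not from the thread) of the approach DB Tsai 
describes: collect only the first element of each partition to the driver, broadcast 
that small map, and compare adjacent elements locally, including across partition 
boundaries. It assumes an rdd: RDD[Int] whose partition order matches the logical 
order, and a user-supplied compareFun: (Int, Int) => Int.

// Sketch only: rdd and compareFun are assumed to exist as described above.
val firstElems: Map[Int, Int] = rdd
  .mapPartitionsWithIndex { (idx, it) =>
    if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
  }
  .collect()
  .toMap

val firstElemsBc = rdd.sparkContext.broadcast(firstElems)

val comparisons = rdd.mapPartitionsWithIndex { (idx, it) =>
  val elems = it.toVector
  // compare adjacent elements inside this partition
  val local = elems.sliding(2).collect { case Seq(a, b) => compareFun(a, b) }.toVector
  // compare the last element of this partition with the first element of the next one
  val boundary = for {
    last <- elems.lastOption
    next <- firstElemsBc.value.get(idx + 1)
  } yield compareFun(last, next)
  (local ++ boundary).iterator
}

Only one element per partition crosses the network, so the previous order and the 
parallelism of the original RDD are preserved.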






Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Dec 6, 2015 at 6:27 PM, Zhiliang Zhu  wrote:
>
>
>
>
> On Saturday, December 5, 2015 3:00 PM, DB Tsai  wrote:
>
>
> This is tricky. You need to shuffle the ending and beginning elements
> using mapPartitionsWithIndex.
>
>
> Does this mean that I need to shuffle all the elements in the different
> partitions into one partition, then compare them by taking any two adjacent
> elements?
> It seems good, if it is like that.
>
> One more issue: will it lose parallelism, since there would then be only one
> partition ...
>
> Thanks very much in advance!
>
>
>
>
>
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu  wrote:
>> Hi All,
>>
>> I would like to compare any two adjacent elements in one given rdd, just
>> as
>> the single machine code part:
>>
>> int a[N] = {...};
>> for (int i=0; i < N - 1; ++i) {
>>    compareFun(a[i], a[i+1]);
>> }
>> ...
>>
>> mapPartitions may work for some situations; however, it cannot compare
>> elements in different partitions.
>> foreach also does not seem to work.
>>
>> Thanks,
>> Zhiliang
>
>>
>>
>




   

  

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu

what’s your data format? ORC or CSV or others?

import sqlContext.implicits._   // for the $"..." column syntax

val keys = sqlContext.read.orc("your previous batch data path")
  .select($"uniq_key")
  .map(_.getString(0))          // assumes uniq_key is a string column
  .collect()
val keysBroadcast = sc.broadcast(keys.toSet)

val rdd = your_current_batch_data
rdd.filter(line => !keysBroadcast.value.contains(line.key))






> On Dec 8, 2015, at 4:44 PM, Ramkumar V  wrote:
> 
> Im running spark batch job in cluster mode every hour and it runs for 15 
> minutes. I have certain unique keys in the dataset. i dont want to process 
> those keys during my next hour batch.
> 
> Thanks,
> 
>   
> 
> 
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu  > wrote:
> Can you detail your question?  what looks like your previous batch and the 
> current batch?
> 
> 
> 
> 
> 
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V > > wrote:
>> 
>> Hi,
>> 
>> I'm running java over spark in cluster mode. I want to apply filter on 
>> javaRDD based on some previous batch values. if i store those values in 
>> mapDB, is it possible to apply filter during the current batch ?
>> 
>> Thanks,
>> 
>>   
>> 
> 
> 



Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
I'm running a Spark batch job in cluster mode every hour, and it runs for 15
minutes. I have certain unique keys in the dataset. I don't want to process
those keys during my next hourly batch.

*Thanks*,



On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu 
wrote:

> Can you detail your question?  what looks like your previous batch and the
> current batch?
>
>
>
>
>
> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
>
> Hi,
>
> I'm running java over spark in cluster mode. I want to apply filter on
> javaRDD based on some previous batch values. if i store those values in
> mapDB, is it possible to apply filter during the current batch ?
>
> *Thanks*,
> 
>
>
>


Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Pipe-separated values. I know that broadcast and join work, but I would like to
know whether MapDB works or not.

*Thanks*,



On Tue, Dec 8, 2015 at 2:22 PM, Fengdong Yu 
wrote:

>
> what’s your data format? ORC or CSV or others?
>
> val keys = sqlContext.read.orc(“your previous batch data
> path”).select($”uniq_key”).collect
> val broadCast = sc.broadCast(keys)
>
> val rdd = your_current_batch_data
> rdd.filter( line => line.key  not in broadCase.value)
>
>
>
>
>
>
> On Dec 8, 2015, at 4:44 PM, Ramkumar V  wrote:
>
> Im running spark batch job in cluster mode every hour and it runs for 15
> minutes. I have certain unique keys in the dataset. i dont want to process
> those keys during my next hour batch.
>
> *Thanks*,
> 
>
>
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu 
> wrote:
>
>> Can you detail your question?  what looks like your previous batch and
>> the current batch?
>>
>>
>>
>>
>>
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
>>
>> Hi,
>>
>> I'm running java over spark in cluster mode. I want to apply filter on
>> javaRDD based on some previous batch values. if i store those values in
>> mapDB, is it possible to apply filter during the current batch ?
>>
>> *Thanks*,
>> 
>>
>>
>>
>
>


Re: parquet file doubts

2015-12-08 Thread Cheng Lian
Cc'd Parquet dev list. At first I expected to discuss this issue on 
Parquet dev list but sent to the wrong mailing list. However, I think 
it's OK to discuss it here since lots of Spark users are using Parquet 
and this information should be generally useful here.


Comments inlined.

On 12/7/15 10:34 PM, Shushant Arora wrote:

> how to read it using parquet tools.
> When I did
> hadoop parquet.tools.Main meta parquetfilename
> I didn't get any info of min and max values.

I didn't realize that you meant to inspect min/max values, since what you 
asked was how to inspect the version of the Parquet library that was used to 
generate the Parquet file.

Currently parquet-tools doesn't print min/max statistics information. 
I'm afraid you'll have to do it programmatically.
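
For what it's worth, a rough Scala sketch (not from the thread) of doing it 
programmatically with parquet-hadoop; note that the package prefix changed between 
Parquet releases (parquet.* in older versions, org.apache.parquet.* from 1.8 on), 
so adjust the imports to whatever is on your classpath, and the file path below is 
a placeholder.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetFileReader   // org.apache.parquet.hadoop in newer releases
import scala.collection.JavaConverters._

// Read the footer and print per-column min/max for every row group.
val footer = ParquetFileReader.readFooter(new Configuration(), new Path("/path/to/file.parquet"))
for (block <- footer.getBlocks.asScala; col <- block.getColumns.asScala) {
  val stats = col.getStatistics
  if (stats != null)
    println(s"${col.getPath}: min=${stats.minAsString}, max=${stats.maxAsString}")
}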
> How can I see the Parquet version of my file? Is min/max specific to some
> Parquet version, or has it been available since the beginning?

AFAIK, it was added in 1.5.0:
https://github.com/apache/parquet-mr/blob/parquet-1.5.0/parquet-column/src/main/java/parquet/column/statistics/Statistics.java

But I failed to find the corresponding JIRA ticket or pull request for this.



On Mon, Dec 7, 2015 at 6:51 PM, Singh, Abhijeet 
> wrote:


Yes, Parquet has min/max.

*From:*Cheng Lian [mailto:l...@databricks.com
]
*Sent:* Monday, December 07, 2015 11:21 AM
*To:* Ted Yu
*Cc:* Shushant Arora; user@spark.apache.org

*Subject:* Re: parquet file doubts

Oh sorry... At first I meant to cc spark-user list since Shushant
and I had been discussed some Spark related issues before. Then I
realized that this is a pure Parquet issue, but forgot to change
the cc list. Thanks for pointing this out! Please ignore this thread.

Cheng

On 12/7/15 12:43 PM, Ted Yu wrote:

Cheng:

I only see user@spark in the CC.

FYI

On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian
> wrote:

cc parquet-dev list (it would be nice to always do so for
these general questions.)

Cheng

On 12/6/15 3:10 PM, Shushant Arora wrote:

Hi

I have few doubts on parquet file format.

1. Does Parquet keep min/max statistics like ORC does? How can I see the
Parquet version (whether it's 1.1, 1.2, or 1.3) for a Parquet file
generated using Hive, custom MR, or AvroParquetOutputFormat?

Yes, Parquet also keeps row group statistics. You may check
the Parquet file using the parquet-meta CLI tool in
parquet-tools (see
https://github.com/Parquet/parquet-mr/issues/321 for details),
then look for the "creator" field of the file. For
programmatic access, check for
o.a.p.hadoop.metadata.FileMetaData.createdBy.


2. How do I sort Parquet records while generating a Parquet file
using AvroParquetOutputFormat?

AvroParquetOutputFormat is not a format. It's just responsible
for converting Avro records to Parquet records. How are you
using AvroParquetOutputFormat? Any example snippets?


Thanks










bad performance on PySpark - big text file

2015-12-08 Thread patcharee

Hi,

I am very new to PySpark. I have a PySpark app working on text files of 
different sizes (100 MB - 100 GB). Each task handles the same size of input 
split, but workers spend much longer on some input splits, especially when 
the input splits belong to a big file. See the logs of these two input splits 
(check python.PythonRunner: Times: total ... )


15/12/08 07:37:15 INFO rdd.NewHadoopRDD: Input split: 
hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/budisansblog.blogspot.com.html:39728447488+134217728
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4335010, boot 
= -140, init = 282, finish = 4334868
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(125163) 
called with curMem=227636200, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_3_1772 stored as 
bytes in memory (estimated size 122.2 KB, free 3.8 GB)
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4, boot = 1, 
init = 0, finish = 3
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(126595) 
called with curMem=227761363, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_9_1772 stored as 
bytes in memory (estimated size 123.6 KB, free 3.8 GB)
15/12/08 08:49:30 INFO output.FileOutputCommitter: File Output Committer 
Algorithm version is 1
15/12/08 08:49:30 INFO datasources.DynamicPartitionWriterContainer: 
Using output committer class 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/08 08:49:30 INFO output.FileOutputCommitter: Saved output of task 
'attempt_201512080849_0002_m_001772_0' to 
hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512080849_0002_m_001772
15/12/08 08:49:30 INFO mapred.SparkHadoopMapRedUtil: 
attempt_201512080849_0002_m_001772_0: Committed
15/12/08 08:49:30 INFO executor.Executor: Finished task 1772.0 in stage 
2.0 (TID 1770). 16216 bytes result sent to driver



15/12/07 20:52:24 INFO rdd.NewHadoopRDD: Input split: 
hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/bcnn1wp.wordpress.com.html:1476395008+134217728
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 41776, boot = 
-425, init = 432, finish = 41769
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1434614) 
called with curMem=167647961, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_3_994 stored as 
bytes in memory (estimated size 1401.0 KB, free 3.9 GB)
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 40, boot = 
-20, init = 21, finish = 39
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1463477) 
called with curMem=169082575, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_9_994 stored as 
bytes in memory (estimated size 1429.2 KB, free 3.9 GB)
15/12/07 20:53:06 INFO output.FileOutputCommitter: File Output Committer 
Algorithm version is 1
15/12/07 20:53:06 INFO datasources.DynamicPartitionWriterContainer: 
Using output committer class 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/07 20:53:06 INFO output.FileOutputCommitter: Saved output of task 
'attempt_201512072053_0002_m_000994_0' to 
hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512072053_0002_m_000994
15/12/07 20:53:06 INFO mapred.SparkHadoopMapRedUtil: 
attempt_201512072053_0002_m_000994_0: Committed
15/12/07 20:53:06 INFO executor.Executor: Finished task 994.0 in stage 
2.0 (TID 990). 9386 bytes result sent to driver


Any suggestions please

Thanks,
Patcharee







Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
Can you detail your question? What do your previous batch and the 
current batch look like?





> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
> 
> Hi,
> 
> I'm running java over spark in cluster mode. I want to apply filter on 
> javaRDD based on some previous batch values. if i store those values in 
> mapDB, is it possible to apply filter during the current batch ?
> 
> Thanks,
> 
>   
> 



Re: Spark with MapDB

2015-12-08 Thread Jörn Franke
You may want to use a bloom filter for this, but make sure that you understand 
how it works

> On 08 Dec 2015, at 09:44, Ramkumar V  wrote:
> 
> Im running spark batch job in cluster mode every hour and it runs for 15 
> minutes. I have certain unique keys in the dataset. i dont want to process 
> those keys during my next hour batch.
> 
> Thanks,
> 
>  
> 
> 
>> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu  wrote:
>> Can you detail your question?  what looks like your previous batch and the 
>> current batch?
>> 
>> 
>> 
>> 
>> 
>>> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
>>> 
>>> Hi,
>>> 
>>> I'm running java over spark in cluster mode. I want to apply filter on 
>>> javaRDD based on some previous batch values. if i store those values in 
>>> mapDB, is it possible to apply filter during the current batch ?
>>> 
>>> Thanks,
>>> 
>>>  
>>> 
> 


Logging spark output to hdfs file

2015-12-08 Thread sunil m
Hi!
I configured the log4j.properties file in Spark's conf folder with the following
values...

log4j.appender.file.File=hdfs://

I expected all logging output to go to the file in HDFS.
Instead, the files are created locally.

Has anybody tried logging to HDFS by configuring log4j.properties?

Warm regards,
Sunil M


Re: HiveContext creation failed with Kerberos

2015-12-08 Thread Steve Loughran



On 8 Dec 2015, at 06:52, Neal Yin 
> wrote:

15/12/08 04:12:28 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]

lots of causes for that, it's one of the two classic "kerberos doesn't like you" 
error messages

https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html

in your case it sounds like 1 or more of these issues

https://issues.apache.org/jira/browse/SPARK-10181
https://issues.apache.org/jira/browse/SPARK-11821
https://issues.apache.org/jira/browse/SPARK-11265

All of which are fixed in 1.5.3, which doesn't help you with the CDH 1.5.2 
release unless they backported things.

-Steve


Re: Logging spark output to hdfs file

2015-12-08 Thread Jörn Franke
This would require a special HDFS log4j appender. Alternatively, try the Flume 
log4j appender.
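
For illustration, a hedged sketch of what the Flume route might look like in Spark's 
conf/log4j.properties; the appender class and the Hostname/Port properties are the 
ones documented for Flume's log4j appender (the flume-ng-log4jappender jar must be on 
the classpath), and the host/port values are placeholders for your own Flume agent's 
Avro source.

# Sketch only: send the root logger to a Flume agent instead of a local file.
log4j.rootLogger=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=flume-agent-host
log4j.appender.flume.Port=41414
# Don't fail the application if the Flume agent is unreachable (optional).
log4j.appender.flume.UnsafeMode=true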

> On 08 Dec 2015, at 13:00, sunil m <260885smanik...@gmail.com> wrote:
> 
> Hi!
> I configured log4j.properties file in conf folder of spark with following 
> values...
> 
> log4j.appender.file.File=hdfs://
> 
> I expected all log files to log output to the file in HDFS. 
> Instead files are created locally.
> 
> Has anybody tried logging to HDFS by configuring log4j.properties?
> 
> Warm regards,
> Sunil M




Comparisons between Ganglia and Graphite for monitoring the Streaming Cluster?

2015-12-08 Thread SRK
Hi,

What are the comparisons between Ganglia and Graphite to monitor the
Streaming Cluster? Which one has more advantages over the other?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Comparisons-between-Ganglia-and-Graphite-for-monitoring-the-Streaming-Cluster-tp25635.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




is repartition very cost

2015-12-08 Thread Zhiliang Zhu

Hi All,

I need to optimize an objective function with some linear constraints using a
genetic algorithm. I would like to get as much parallelism for it as possible with
Spark. repartition / shuffle may be used in it sometimes; however, is the
repartition API very costly?

Thanks in advance!
Zhiliang



Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Yes, I agree, but the data is in the form of an RDD, and I'm also running in
cluster mode, so the data is distributed across all machines in the cluster.
If I use a bloom filter or MapDB, which are not distributed, how will that
work in this case?
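
For illustration only (not from the thread), a minimal Scala sketch of one way a 
non-distributed structure can still be used in cluster mode: build the filter once 
on the driver from the previous batch's keys and broadcast it, so every executor 
gets its own read-only copy. It assumes a recent Guava on the classpath, and the 
names keysFromPreviousBatch, currentBatchRdd and line.key are placeholders.

import com.google.common.hash.{BloomFilter, Funnels}
import java.nio.charset.StandardCharsets

// Build the filter on the driver from the previous batch's keys (placeholder collection).
val bloom = BloomFilter.create(
  Funnels.stringFunnel(StandardCharsets.UTF_8),
  1000000,   // expected number of keys
  0.01)      // acceptable false-positive rate
keysFromPreviousBatch.foreach(k => bloom.put(k))

// Ship one read-only copy to every executor.
val bloomBc = sc.broadcast(bloom)

// Keep only records whose key was (probably) not seen in the previous batch.
val filtered = currentBatchRdd.filter(line => !bloomBc.value.mightContain(line.key))

Note that a bloom filter can report false positives, so a small fraction of genuinely 
new keys may be dropped; if that is not acceptable, the broadcast Set approach from 
earlier in the thread is the safer choice.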

*Thanks*,



On Tue, Dec 8, 2015 at 5:30 PM, Jörn Franke  wrote:

> You may want to use a bloom filter for this, but make sure that you
> understand how it works
>
> On 08 Dec 2015, at 09:44, Ramkumar V  wrote:
>
> Im running spark batch job in cluster mode every hour and it runs for 15
> minutes. I have certain unique keys in the dataset. i dont want to process
> those keys during my next hour batch.
>
> *Thanks*,
> 
>
>
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu 
> wrote:
>
>> Can you detail your question?  what looks like your previous batch and
>> the current batch?
>>
>>
>>
>>
>>
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
>>
>> Hi,
>>
>> I'm running java over spark in cluster mode. I want to apply filter on
>> javaRDD based on some previous batch values. if i store those values in
>> mapDB, is it possible to apply filter during the current batch ?
>>
>> *Thanks*,
>> 
>>
>>
>>
>


PySpark reading from Postgres tables with UUIDs

2015-12-08 Thread Chris Elsmore
Hi All,

I'm currently having some issues getting Spark to read from Postgres tables 
that have uuid-type columns through a PySpark shell.

I can connect and see tables that do not have a uuid column, but I get the error 
"java.sql.SQLException: Unsupported type " when I try to get a table which 
does have a uuid column. Is there any way I can access these?
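
One workaround (hedged, not from the thread) is to cast the uuid column to text 
inside a subquery passed as the dbtable option, so the JDBC source only sees types 
it understands. Shown in Scala for brevity, but the same options apply from the 
PySpark sqlContext.read API; the table and column names below are made up.

val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("driver", "org.postgresql.Driver")
  // hypothetical table "events" with a uuid primary key "id"
  .option("dbtable", "(SELECT id::text AS id, payload FROM events) AS events_casted")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()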

See the pastebin: http://pastebin.com/VbpU4uRU  
for more info and the PySpark shell readout.

Am using Postgres 9.4, Spark 1.5.1, Java OpenJDK 1.7.0_79, JDBC 
postgresql-9.4-1206-jdbc41

Chris

Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi,

Should the gmond be installed in all the Spark nodes? What should the host
and port be? Should it be the host and port of gmetad?

 Enable GangliaSink for all instances 
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink 
*.sink.ganglia.name=hadoop_cluster1 
*.sink.ganglia.host=localhost 
*.sink.ganglia.port=8653 
*.sink.ganglia.period=10 
*.sink.ganglia.unit=seconds 
*.sink.ganglia.ttl=1 
*.sink.ganglia.mode=multicast 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p25636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




flatMap function in Spark

2015-12-08 Thread Sateesh Karuturi
Guys... I am new to Spark.
Could anyone please explain how the flatMap function works, with a little
sample example?
Thanks in advance...
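
For reference, a tiny Scala illustration (not from the thread): flatMap maps each 
element to zero or more output elements and flattens the results, whereas map always 
produces exactly one output element per input.

val lines = sc.parallelize(Seq("hello world", "hi"))

val words = lines.flatMap(line => line.split(" "))
words.collect()   // Array(hello, world, hi)

val nested = lines.map(line => line.split(" "))
nested.collect()  // Array(Array(hello, world), Array(hi)) -- note the nesting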


Re: flatMap function in Spark

2015-12-08 Thread Gerard Maas
http://stackoverflow.com/search?q=%5Bapache-spark%5D+flatmap

-kr, Gerard.

On Tue, Dec 8, 2015 at 12:04 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> Guys... I am new to Spark..
> Please anyone please explain me how flatMap function works with a little
> sample example...
> Thanks in advance...
>


Re: Spark SQL - saving to multiple partitions in parallel - FileNotFoundException on _temporary directory possible bug?

2015-12-08 Thread Jiří Syrový
Hi,

I have a very similar issue on a standalone SQL context, but when using
save() instead. I guess it might be related to
https://issues.apache.org/jira/browse/SPARK-8513. It also usually happens
after using groupBy.

Regards,
Jiri

2015-12-08 0:16 GMT+01:00 Deenar Toraskar :

> Hi
>
> I have a process that writes to multiple partitions of the same table in
> parallel using multiple threads sharing the same SQL context
> df.write.partitionBy("partCol").insertInto("tableName"). I am
> getting FileNotFoundException on _temporary directory. Each write only goes
> to a single partition, is there some way of not using dynamic partitioning
> using df.write without having to resort to .save as I dont want to hardcode
> a physical DFS location in my code?
>
> This is similar to this issue listed here
> https://issues.apache.org/jira/browse/SPARK-2984
>
> Regards
> Deenar
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
>
>
>
>


Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Tao Li
I am using the Spark Streaming Kafka direct approach these days. I found that
when I start the application, it always starts consuming from the latest offset. I
would like the application, when it starts, to consume from the offset where the
last run of the same Kafka consumer group left off. Does that mean I have to
maintain the consumer offset myself, for example record it in ZooKeeper, and
reload the last offset from ZooKeeper when restarting the application?

I see the following discussion:
https://github.com/apache/spark/pull/4805
https://issues.apache.org/jira/browse/SPARK-6249

Is there any conclusion? Do we need to maintain the offset ourselves, or will
Spark Streaming support a feature to simplify the offset-maintenance work?

https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html


RE: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Singh, Abhijeet
You need to maintain the offset yourself and rightly so in something like 
ZooKeeper.

From: Tao Li [mailto:litao.bupt...@gmail.com]
Sent: Tuesday, December 08, 2015 5:36 PM
To: user@spark.apache.org
Subject: Need to maintain the consumer offset by myself when using spark 
streaming kafka direct approach?

I am using spark streaming kafka direct approach these days. I found that when 
I start the application, it always start consumer the latest offset. I hope 
that when application start, it consume from the offset last application 
consumes with the same kafka consumer group. It means I have to maintain the 
consumer offset by my self, for example record it on zookeeper, and reload the 
last offset from zookeeper when restarting the applicaiton?

I see the following discussion:
https://github.com/apache/spark/pull/4805
https://issues.apache.org/jira/browse/SPARK-6249

Is there any conclusion? Do we need to maintain the offset by myself? Or spark 
streaming will support a feature to simplify the offset maintain work?

https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html


Re: understanding and disambiguating CPU-core related properties

2015-12-08 Thread Manolis Sifalakis1
Thanks a lot for the pointer! Helpful, even if a bit layman's in style.

(On the nagging end, this information, as usual with Spark, is not where it 
is expected to be: neither in the book nor in the Spark docs.)

m.



From:   Leonidas Patouchas 
To: Manolis Sifalakis1 
Cc: user@spark.apache.org
Date:   04/12/2015 18:03
Subject:Re: understanding and disambiguating CPU-core related 
properties



Regarding your 2nd question, there is a great article 
from Cloudera regarding this:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2
They focus on a YARN setup, but the big picture applies everywhere.
In general, I believe that you have to know your data in order to 
configure those sets of params a priori. From my experience, the more CPUs 
the merrier. I have noticed that, e.g., if I double the CPUs the job 
finishes in half the time. This effect, though, does not keep the same 
proportion (double CPUs - half the time) after I reach a specific number of CPUs 
(always depending on the data and the job's actions). So it takes a lot of 
trial and observation.
In addition, there is a tight connection between CPUs and partitions. 
Cloudera's article covers this.
Regards,
Leonidas

On Thu, Dec 3, 2015 at 5:44 PM, Manolis Sifalakis1  
wrote:
I have found the documentation rather poor in helping me understand the
interplay among the following properties in Spark, and even more so how to set
them. So this post is sent in the hope of some discussion and "enlightenment"
on the topic.

Let me start by asking whether I have understood the following correctly:

- spark.driver.cores:   how many cores the driver program should occupy
- spark.cores.max:   how many cores my app will claim for computations
- spark.executor.cores and spark.task.cpus:   how spark.cores.max is
allocated per JVM (executor) and per task (Java thread?)
  I.e. + spark.executor.cores:   each JVM instance (executor) should use
that many cores
+ spark.task.cpus: each task should occupy at most this # of cores

If so far so good, then...

q1: Is spark.cores.max inclusive or not of spark.driver.cores?

q2: How should one decide statically, a priori, how to distribute
spark.cores.max over JVMs and tasks?

q3: Since the set-up of cores-per-worker restricts how many cores can be
available per executor, and since an executor cannot span workers,
what is the rationale behind an application claiming cores
(spark.cores.max) as opposed to merely executors? (This would make an app
never fail to be admitted.)

TIA for any clarifications/intuitions/experiences on the topic

best

Manolis.
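
For illustration, a small Scala sketch of how these properties might be set together; 
the values are arbitrary, the property names are the standard Spark configuration 
keys, and the comments spell out the arithmetic that ties them together (assuming a 
standalone cluster).

val conf = new org.apache.spark.SparkConf()
  .setAppName("core-settings-example")
  .set("spark.driver.cores", "2")     // cores used by the driver process
  .set("spark.cores.max", "24")       // total cores the app claims across the cluster
  .set("spark.executor.cores", "4")   // cores per executor JVM -> at most 24 / 4 = 6 executors
  .set("spark.task.cpus", "1")        // cores reserved per task -> at most 4 concurrent tasks per executor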











Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Dibyendu Bhattacharya
With the direct stream, the checkpoint location is not recoverable if you modify
your driver code. So if you rely only on checkpoints to commit offsets, you can
possibly lose messages if you modify the driver code and then start from the
"largest" offset. If you do not want to lose messages, you need to commit offsets
to an external store when using the direct stream.
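
For reference, a hedged Scala sketch (not from the thread) of how the offsets of each 
batch can be obtained from a direct stream and handed to your own store; stream stands 
for the DStream returned by KafkaUtils.createDirectStream, and saveOffsetToZk is a 
placeholder for whatever ZooKeeper/DB write you choose.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The direct stream exposes the Kafka offset range backing each RDD.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch first, then persist the offsets ...
  offsetRanges.foreach { o =>
    saveOffsetToZk(o.topic, o.partition, o.untilOffset)   // placeholder
  }
}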

On Tue, Dec 8, 2015 at 7:47 PM, PhuDuc Nguyen 
wrote:

> Kafka Receiver-based approach:
> This will maintain the consumer offsets in ZK for you.
>
> Kafka Direct approach:
> You can use checkpointing and that will maintain consumer offsets for you.
> You'll want to checkpoint to a highly available file system like HDFS or S3.
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
>
> You don't have to maintain your own offsets if you don't want to. If the 2
> solutions above don't satisfy your requirements, then consider writing your
> own; otherwise I would recommend using the supported features in Spark.
>
> HTH,
> Duc
>
>
>
> On Tue, Dec 8, 2015 at 5:05 AM, Tao Li  wrote:
>
>> I am using spark streaming kafka direct approach these days. I found that
>> when I start the application, it always start consumer the latest offset. I
>> hope that when application start, it consume from the offset last
>> application consumes with the same kafka consumer group. It means I have to
>> maintain the consumer offset by my self, for example record it on
>> zookeeper, and reload the last offset from zookeeper when restarting the
>> applicaiton?
>>
>> I see the following discussion:
>> https://github.com/apache/spark/pull/4805
>> https://issues.apache.org/jira/browse/SPARK-6249
>>
>> Is there any conclusion? Do we need to maintain the offset by myself? Or
>> spark streaming will support a feature to simplify the offset maintain work?
>>
>>
>> https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html
>>
>
>


Re: hive thriftserver and fair scheduling

2015-12-08 Thread Deenar Toraskar
> Thanks Michael, I'll try it out. Another quick/important question: How do I
> make udfs available to all of the hive thriftserver users? Right now, when
> I launch a spark-sql client, I notice that it reads the ~/.hiverc file and
> all udfs get picked up but this doesn't seem to be working in hive
> thriftserver.
> Is there a way to make it work in a similar way for all users in hive
> thriftserver?

+1 for this request


On 20 October 2015 at 23:49, Sadhan Sood  wrote:

> Thanks Michael, I'll try it out. Another quick/important question: How do
> I make udfs available to all of the hive thriftserver users? Right now,
> when I launch a spark-sql client, I notice that it reads the ~/.hiverc file
> and all udfs get picked up but this doesn't seem to be working in hive
> thriftserver. Is there a way to make it work in a similar way for all users
> in hive thriftserver?
>
> Thanks again!
>
> On Tue, Oct 20, 2015 at 12:34 PM, Michael Armbrust  > wrote:
>
>> Not the most obvious place in the docs... but this is probably helpful:
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#scheduling
>>
>> You likely want to put each user in their own pool.
>>
>> On Tue, Oct 20, 2015 at 11:55 AM, Sadhan Sood 
>> wrote:
>>
>>> Hi All,
>>>
>>> Does anyone have fair scheduling working for them in a hive server? I
>>> have one hive thriftserver running and multiple users trying to run queries
>>> at the same time on that server using a beeline client. I see that a big
>>> query is stopping all other queries from making any progress. Is this
>>> supposed to be this way? Is there anything else that I need to be doing for
>>> fair scheduling to be working for the thriftserver?
>>>
>>
>>
>


groupByKey()

2015-12-08 Thread Yasemin Kaya
Hi,

Sorry for the long inputs, but this is my situation.

I have two lists and I want to groupByKey them, but some values of the lists
disappear. I can't understand this.

(8867989628612931721,[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

(8867989628612931721,[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,* 1*, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

result of groupbykey
(8867989628612931721,[[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

epoch date time problem to load data into in spark

2015-12-08 Thread Soni spark
Hi Friends,

I have written a Spark Streaming program in Java to access Twitter tweets and
it is working fine. I am able to copy the Twitter feeds to an HDFS location
batch by batch. For each batch, it creates a folder with an epoch timestamp,
for example:

 If I give the HDFS location as *hdfs://localhost:54310/twitter/*, the files
are created like below:

*/spark/twitter/-144958080/*
*/spark/twitter/-144957984/*

I want to create folder names in yyyy-MM-dd-HH format instead of the
default epoch format.

I want it like below, so that I can do Hive partitions easily to access the
data:

*/spark/twitter/2015-12-08-01/*


Any one can help me. Thank you so much in advance.


Thanks
Soniya


Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-08 Thread Ewan Higgs

Sean,

Thanks.
It's a developer API and doesn't appear to be exposed.

Ewan

On 07/12/15 15:06, Sean Owen wrote:

I'm not sure if this is available in Python but from 1.3 on you should
be able to call ALS.setFinalRDDStorageLevel with level "none" to ask
it to unpersist when it is done.
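
For reference, a hedged Scala sketch of the knob Sean mentions; it lives on the 
Scala/Java MLlib ALS builder (a DeveloperApi), which is why, as Ewan notes, it is 
not reachable from the Python API. Here ratings is a placeholder RDD[Rating].

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

val als = new ALS()
  .setRank(10)
  .setIterations(10)
  .setImplicitPrefs(true)
  .setFinalRDDStorageLevel(StorageLevel.NONE)   // don't keep the factor RDDs cached
val model = als.run(ratings)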

On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs  wrote:

Jonathan,
Did you ever get to the bottom of this? I have some users working with Spark
in a classroom setting and our example notebooks run into problems where
there is so much spilled to disk that they run out of quota. A 1.5G input
set becomes >30G of spilled data on disk. I looked into how I could
unpersist the data so I could clean up the files, but I was unsuccessful.

We're using Spark 1.5.0

Yours,
Ewan

On 16/07/15 23:18, Stahlman, Jonathan wrote:

Hello all,

I am running the Spark recommendation algorithm in MLlib and I have been
studying its output with various model configurations.  Ideally I would like
to be able to run one job that trains the recommendation model with many
different configurations to try to optimize for performance.  A sample code
in python is copied below.

The issue I have is that each new model which is trained caches a set of
RDDs and eventually the executors run out of memory.  Is there any way in
Pyspark to unpersist() these RDDs after each iteration?  The names of the
RDDs which I gather from the UI is:

itemInBlocks
itemOutBlocks
Products
ratingBlocks
userInBlocks
userOutBlocks
users

I am using Spark 1.3.  Thank you for any help!

Regards,
Jonathan




   data_train, data_cv, data_test = data.randomSplit([99,1,1], 2)
   functions = [rating] #defined elsewhere
   ranks = [10,20]
   iterations = [10,20]
   lambdas = [0.01,0.1]
   alphas  = [1.0,50.0]

   results = []
   for ratingFunction, rank, numIterations, m_lambda, m_alpha in
itertools.product( functions, ranks, iterations, lambdas, alphas ):
 #train model
 ratings_train = data_train.map(lambda l: Rating( l.user, l.product,
ratingFunction(l) ) )
 model   = ALS.trainImplicit( ratings_train, rank, numIterations,
lambda_=float(m_lambda), alpha=float(m_alpha) )

 #test performance on CV data
  ratings_cv = data_cv.map(lambda l: Rating( l.user, l.product,
ratingFunction(l) ) )
 auc = areaUnderCurve( ratings_cv, model.predictAll )

 #save results
 result = ",".join(str(l) for l in
[ratingFunction.__name__,rank,numIterations,m_lambda,m_alpha,auc])
 results.append(result)



The information contained in this e-mail is confidential and/or proprietary
to Capital One and/or its affiliates and may only be used solely in
performance of work or services for Capital One. The information transmitted
herewith is intended only for use by the individual or entity to which it is
addressed. If the reader of this message is not the intended recipient, you
are hereby notified that any review, retransmission, dissemination,
distribution, copying or other use of, or taking of any action in reliance
upon this information is strictly prohibited. If you have received this
communication in error, please contact the sender and delete the material
from your computer.








actors and async communication between driver and workers/executors

2015-12-08 Thread Manolis Sifalakis1
I've been looking around for some examples or information on how the 
driver and the executors can exchange information asynchronously, but have not 
found much apart from the ActorWordCount.scala streaming example that uses 
Akka.

Is there any "in-band" (within Spark) method by which such communication can 
be effected, or is out-of-band use of Akka the only bet? (Something equivalent 
in Python?) Ideally I do not want to have to send messages to IP 
addresses, just to worker and driver IDs that Spark may keep track of.

The problem at hand is the following: I would like the workers, while they 
number-crunch their RDD partition data, to asynchronously communicate some 
intermediate results back to the driver. The driver may then update and 
re-broadcast some new state variable back to all the workers, which they can 
use within the same stage of computation.

TIA

Manolis.





Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread PhuDuc Nguyen
Kafka Receiver-based approach:
This will maintain the consumer offsets in ZK for you.

Kafka Direct approach:
You can use checkpointing and that will maintain consumer offsets for you.
You'll want to checkpoint to a highly available file system like HDFS or S3.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

You don't have to maintain your own offsets if you don't want to. If the 2
solutions above don't satisfy your requirements, then consider writing your
own; otherwise I would recommend using the supported features in Spark.

HTH,
Duc



On Tue, Dec 8, 2015 at 5:05 AM, Tao Li  wrote:

> I am using spark streaming kafka direct approach these days. I found that
> when I start the application, it always start consumer the latest offset. I
> hope that when application start, it consume from the offset last
> application consumes with the same kafka consumer group. It means I have to
> maintain the consumer offset by my self, for example record it on
> zookeeper, and reload the last offset from zookeeper when restarting the
> applicaiton?
>
> I see the following discussion:
> https://github.com/apache/spark/pull/4805
> https://issues.apache.org/jira/browse/SPARK-6249
>
> Is there any conclusion? Do we need to maintain the offset by myself? Or
> spark streaming will support a feature to simplify the offset maintain work?
>
>
> https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html
>


Re: epoch date format to normal date format while loading the files to HDFS

2015-12-08 Thread Andy Davidson
Hi Sonia,

I believe you are using Java? Take a look at Java's Date APIs; I am sure you will
find lots of examples of how to format dates.

Enjoy share

Andy


/**
 * saves tweets to disk. This is a replacement for
 * @param tweets
 * @param outputURI
 */
private static void saveTweets(JavaDStream<String> jsonTweets, String outputURI) {

    /*
    using saveAsTextFiles will cause lots of empty directories to be created.
    DStream<String> data = jsonTweets.dstream();
    data.saveAsTextFiles(outputURI, null);
    */

    jsonTweets.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
        private static final long serialVersionUID = -5482893563183573691L;

        @Override
        public Void call(JavaRDD<String> rdd, Time time) throws Exception {
            if (!rdd.isEmpty()) {
                String dirPath = outputURI + "-" + time.milliseconds();
                rdd.saveAsTextFile(dirPath);
            }
            return null;
        }
    });
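
A hedged sketch (in Scala, not from the thread's code) of the folder-name change 
Soniya asks about: format the batch time with SimpleDateFormat instead of appending 
the raw milliseconds. The pattern and example path are illustrative.

import java.text.SimpleDateFormat
import java.util.Date

val fmt = new SimpleDateFormat("yyyy-MM-dd-HH")
def dirPathFor(outputURI: String, timeMs: Long): String =
  outputURI + "/" + fmt.format(new Date(timeMs))

// e.g. dirPathFor("hdfs://localhost:54310/twitter", time.milliseconds())
//      would give something like "hdfs://localhost:54310/twitter/2015-12-08-01"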




From:  Soni spark 
Date:  Tuesday, December 8, 2015 at 6:26 AM
To:  Andrew Davidson 
Subject:  epoch date format to normal date format while loading the files to
HDFS

> Hi Andy,
> 
> How are you? i need your help again.
> 
> I have written a spark streaming program in Java to access twitter tweets and
> it is working fine. I can able to copy the twitter feeds to HDFS location by
> batch wise.For  each batch, it is creating a folder with epoch time stamp. for
> example,
> 
>  If i give HDFS location as hdfs://localhost:54310/twitter/, the files are
> creating like below
> 
> /spark/twitter/-144958080/
> /spark/twitter/-144957984/
> 
> I want to create a folder name like -MM-dd-HH format instead of by default
> epoch format.
> 
> I want it like below so that i can do hive partitions easily to access the
> data.
> 
> /spark/twitter/2015-12-08-01/
> 
> 
> Can you help me. Thank you so much in advance.
> 
> 
> Thanks
> Soniya




SparkR read.df failed to read file from local directory

2015-12-08 Thread Boyu Zhang
Hello everyone,

I tried to run the example data-manipulation.R, and I can't get it to read
the flights.csv file that is stored in my local fs. I don't want to store
big files in my HDFS, so reading from a local fs (Lustre) is the desired
behavior for me.
I tried the following:

flightsDF <- read.df(sqlContext,
"file:///home/myuser/test_data/sparkR/flights.csv", source =
"com.databricks.spark.csv", header = "true")

I got the following messages and it eventually failed:

15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0
in memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2
MB)
15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0
(TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException: File
file:/home/myuser/test_data/sparkR/flights.csv does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:106)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Can someone please provide comments? Any tips is appreciated, thank you!

Boyu Zhang


Associating spark jobs with logs

2015-12-08 Thread sunil m
Hello Spark experts!

I was wondering if somebody has solved the problem which we are facing.

We want to achieve the following:

Given a Spark job ID, fetch all the logs generated by that job.

We looked at the Spark Job Server; it seems to be lacking such a feature.


Any ideas, suggestions are welcome!

Thanks in advance.

Warm regards,
Sunil M.


Re: Can't create UDF's in spark 1.5 while running using the hive thrift service

2015-12-08 Thread Deenar Toraskar
Hi Trystan

I am facing the same issue. It only appears with the thrift server; the
same call works fine via the spark-sql shell. Do you have any workarounds,
and have you filed a JIRA/bug for the same?

Regards
Deenar

On 12 October 2015 at 18:01, Trystan Leftwich  wrote:

> Hi everyone,
>
> Since upgrading to spark 1.5 I've been unable to create and use UDF's when
> we run in thrift server mode.
>
> Our setup:
> We start the thrift-server running against yarn in client mode, (we've
> also built our own spark from github branch-1.5 with the following args,
> -Pyarn -Phive -Phive-thrifeserver)
>
> if i run the following after connecting via JDBC (in this case via
> beeline):
>
> add jar 'hdfs://path/to/jar"
> (this command succeeds with no errors)
>
> CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';
> (this command succeeds with no errors)
>
> select testUDF(col1) from table1;
>
> I get the following error in the logs:
>
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1
> pos 8
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> .
> . (cutting the bulk for ease of email, more than happy to send the full
> output)
> .
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive
> query:
> org.apache.hive.service.cli.HiveSQLException:
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1
> pos 100
> at
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> When I ran the same against 1.4 it worked.
>
> I've also changed the spark.sql.hive.metastore.version version to be 0.13
> (similar to what it was in 1.4) and 0.14 but I still get the same errors.
>
>
> Any suggestions?
>
> Thanks,
> Trystan
>
>


Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-08 Thread Sunil Tripathy
Thanks Fengdong.


I still have the same exception.


Exception in thread "main" java.lang.NoSuchMethodError: 
com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
at 
com.amazonaws.internal.config.InternalConfig.(InternalConfig.java:43)
at 
com.amazonaws.internal.config.InternalConfig$Factory.(InternalConfig.java:304)
at com.amazonaws.util.VersionInfoUtils.userAgent(VersionInfoUtils.java:139)
at 
com.amazonaws.util.VersionInfoUtils.initializeUserAgent(VersionInfoUtils.java:134)
at 
com.amazonaws.util.VersionInfoUtils.getUserAgent(VersionInfoUtils.java:95)
at com.amazonaws.ClientConfiguration.(ClientConfiguration.java:61)
at 
com.amazonaws.services.sqs.AmazonSQSClient.(AmazonSQSClient.java:168)
at sparktest.JavaSqsReceiver.(JavaSqsReceiver.scala:28)
at sparktest.StreamingTest.receiveMessage(StreamingTest.scala:18)
at sparktest.StreamingTest$.main(StreamingTest.scala:35)
at sparktest.StreamingTest.main(StreamingTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Will rebuilding spark help?


From: Fengdong Yu 
Sent: Monday, December 7, 2015 10:31 PM
To: Sunil Tripathy
Cc: user@spark.apache.org
Subject: Re: NoSuchMethodError: 
com.fasterxml.jackson.databind.ObjectMapper.enable

Can you try like this in your sbt:


val spark_version = "1.5.2"
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact 
= "servlet-api")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")

libraryDependencies ++= Seq(
  "org.apache.spark" %%  "spark-sql"  % spark_version % "provided" 
excludeAll(excludeServletApi, excludeEclipseJetty),
  "org.apache.spark" %%  "spark-hive" % spark_version % "provided" 
excludeAll(excludeServletApi, excludeEclipseJetty)
)



On Dec 8, 2015, at 2:26 PM, Sunil Tripathy 
> wrote:


I am getting the following exception when I use spark-submit to submit a spark 
streaming job.

Exception in thread "main" java.lang.NoSuchMethodError: 
com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
at 
com.amazonaws.internal.config.InternalConfig.(InternalConfig.java:43)

I tried with different versions of the jackson libraries but that does not seem to 
help.
 libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % 
"2.6.3"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.6.3"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-annotations" % 
"2.6.3"

Any pointers to resolve the issue?

Thanks



Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hi,

Can anyone recommend a great graph visualization tool for GraphX that can 
handle truly large data (~ TB)?

Thanks so much
Hao



Re: Graph visualization tool for GraphX

2015-12-08 Thread Jörn Franke
I am not sure about your use case. How should a human interpret many terabytes 
of data in one large visualization? You have to be more specific: what part of 
the data needs to be visualized, what kind of visualization, what navigation do 
you expect within the visualization, how many users, response time, web tool vs 
mobile vs desktop, etc.

> On 08 Dec 2015, at 16:46, Lin, Hao  wrote:
> 
> Hi,
>  
> Anyone can recommend a great Graph visualization tool for GraphX  that can 
> handle truly large Data (~ TB) ?
>  
> Thanks so much
> Hao


Re: can i write only RDD transformation into hdfs or any other storage system

2015-12-08 Thread Ted Yu
Can you clarify your use case ?

Apart from hdfs, S3 (and possibly others) can be used.

Cheers

On Tue, Dec 8, 2015 at 9:40 AM, prateek arora 
wrote:

> Hi
>
> Is it possible into spark to write only RDD transformation into hdfs or any
> other storage system ?
>
> Regards
> Prateek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/can-i-write-only-RDD-transformation-into-hdfs-or-any-other-storage-system-tp25637.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hello Jorn,

Thank you for the reply and for being tolerant of my oversimplified question; I 
should’ve been more specific. Though it is ~TB of data, there will be about billions 
of records (edges) and 100,000 nodes. We need to visualize the social network 
graph, like what can be done with Gephi, which has scalability limitations for 
such an amount of data. There will be dozens of users accessing it, and the 
response time is also critical. We would like to run the visualization tool on 
a remote EC2 server, so a web tool would be a good choice for us.

Please let me know if I need to be more specific ☺.  Thanks
hao

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret many terabytes 
of data in one large visualization?? You have to be more specific, what part of 
the data needs to be visualized, what kind of visualization, what navigation do 
you expect within the visualisation, how many users, response time, web tool vs 
mobile vs Desktop etc

On 08 Dec 2015, at 16:46, Lin, Hao 
> wrote:
Hi,

Anyone can recommend a great Graph visualization tool for GraphX  that can 
handle truly large Data (~ TB) ?

Thanks so much
Hao


Merge rows into csv

2015-12-08 Thread Krishna
Hi,

what is the most efficient way to perform a group-by operation in Spark and
merge rows into csv?

Here is the current RDD
-
ID   STATE
-
1   TX
1NY
1FL
2CA
2OH
-

This is the required output:
-
IDCSV_STATE
-
1 TX,NY,FL
2 CA,OH
-
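
A minimal sketch of one way to do this on an RDD of (id, state) pairs (the tiny
sample data is illustrative); note that the order of states inside each CSV
string is not guaranteed unless you sort explicitly:

val pairs = sc.parallelize(Seq(("1", "TX"), ("1", "NY"), ("1", "FL"),
                               ("2", "CA"), ("2", "OH")))

val merged = pairs
  .groupByKey()                  // gather all states for one id together
  .mapValues(_.mkString(","))    // "TX,NY,FL"

merged.collect().foreach { case (id, csv) => println(s"$id\t$csv") }

groupByKey shuffles every value; for very large groups, aggregateByKey (building
the list or string incrementally) usually scales better.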


can i write only RDD transformation into hdfs or any other storage system

2015-12-08 Thread prateek arora
Hi

Is it possible in Spark to write only an RDD transformation into HDFS or any
other storage system?

Regards
Prateek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/can-i-write-only-RDD-transformation-into-hdfs-or-any-other-storage-system-tp25637.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Boyu Zhang
Thanks for the comment Felix, I tried giving
"/home/myuser/test_data/sparkR/flights.csv", but it tried to search the
path in hdfs, and gave errors:

15/12/08 12:47:10 ERROR r.RBackendHandler: loadDF on
org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://hostname:8020/home/myuser/test_data/sparkR/flights.csv
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$

Thanks,
Boyu

On Tue, Dec 8, 2015 at 12:38 PM, Felix Cheung 
wrote:

> Have you tried
>
> flightsDF <- read.df(sqlContext,
> "/home/myuser/test_data/sparkR/flights.csv", source =
> "com.databricks.spark.csv", header = "true")
>
>
>
> _
> From: Boyu Zhang 
> Sent: Tuesday, December 8, 2015 8:47 AM
> Subject: SparkR read.df failed to read file from local directory
> To: 
>
>
>
> Hello everyone,
>
> I tried to run the example data--manipulation.R, and can't get it to read
> the flights.csv file that is stored in my local fs. I don't want to store
> big files in my hdfs, so reading from a local fs (lustre fs) is the desired
> behavior for me.
>
> I tried the following:
>
> flightsDF <- read.df(sqlContext,
> "file:///home/myuser/test_data/sparkR/flights.csv", source =
> "com.databricks.spark.csv", header = "true")
>
> I got the message and eventually failed:
>
> 15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0
> in memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2
> MB)
> 15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 3.0 (TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException:
> File file:/home/myuser/test_data/sparkR/flights.csv does not exist
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:106)
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Can someone please provide comments? Any tips is appreciated, thank you!
>
> Boyu Zhang
>
>
>
>


Re: Graph visualization tool for GraphX

2015-12-08 Thread andy petrella
Hello Lin,

This is indeed a tough scenario when you have many vertices and (even
worse) many edges...

So two-fold answer:
First, technically, there is a graph plotting support in the spark notebook
(https://github.com/andypetrella/spark-notebook/ → check this notebook:
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb).
You can plot graph from scala, which will convert to D3 with force layout
force field.
The number of points which you will plot is limited by "sampling" them with a
`Sampler` that you can provide yourself, which leads to the second fold of
this answer.

Plotting a large graph is rather tough because there is no real notion of
dimension... there is always the option to dig the topological analysis
theory to find good homeomorphism ... but won't be that efficient ;-D.
Best is to find a good approach to generalize/summarize the information,
there are many many techniques (that you can find in mainly geospatial viz
and biology viz theories...)
Best is to check what will match your need the fastest.
There are quick techniques like using unsupervised clustering models and
then plot a voronoi diagram (which can be approached using force layout).

In general terms I might say that multiscaling is intuitively what you want
first; this is an interesting paper presenting the foundations:
https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf

Oh and BTW, to end this longish mail: while looking for new papers on that,
I came across this one:
http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf which
is using
1. *Spark !!!*
2. a tile based approach (~ to tiling + pyramids in geospatial)

HTH

PS regarding the Spark Notebook, you can always come and discuss on gitter:
https://gitter.im/andypetrella/spark-notebook


On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao  wrote:

> Hello Jorn,
>
>
>
> Thank you for the reply and being tolerant of my over simplified question.
> I should’ve been more specific.  Though ~TB of data, there will be about
> billions of records (edges) and 100,000 nodes. We need to visualize the
> social networks graph like what can be done by Gephi which has limitation
> on scalability to handle such amount of data. There will be dozens of users
> to access and the response time is also critical.  We would like to run the
> visualization tool on the remote ec2 server where webtool can be a good
> choice for us.
>
>
>
> Please let me know if I need to be more specific J.  Thanks
>
> hao
>
>
>
> *From:* Jörn Franke [mailto:jornfra...@gmail.com]
> *Sent:* Tuesday, December 08, 2015 11:31 AM
> *To:* Lin, Hao
> *Cc:* user@spark.apache.org
> *Subject:* Re: Graph visualization tool for GraphX
>
>
>
> I am not sure about your use case. How should a human interpret many
> terabytes of data in one large visualization?? You have to be more
> specific, what part of the data needs to be visualized, what kind of
> visualization, what navigation do you expect within the visualisation, how
> many users, response time, web tool vs mobile vs Desktop etc
>
>
> On 08 Dec 2015, at 16:46, Lin, Hao  wrote:
>
> Hi,
>
>
>
> Anyone can recommend a great Graph visualization tool for GraphX  that can
> handle truly large Data (~ TB) ?
>
>
>
> Thanks so much
>
> Hao
>
>
-- 
andy
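
For the "sample before you plot" part above, a rough GraphX sketch (the graph
value, the sample fraction and the output path are all placeholders, and the
sampling strategies discussed above are usually smarter than a uniform edge
sample):

import org.apache.spark.graphx._

// graph: Graph[String, Int] is assumed to exist already.
val sampledEdges = graph.edges.sample(withReplacement = false, fraction = 0.001)

sampledEdges
  .map(e => s"${e.srcId},${e.dstId},${e.attr}")   // simple CSV edge list
  .coalesce(1)                                    // one part file for the viz tool
  .saveAsTextFile("hdfs:///tmp/graph_sample_edges")

The resulting edge list is small enough to load into a desktop or web tool
(Gephi, D3, ...) while the full graph stays in Spark.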


Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Felix Cheung
Have you tried
flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv", 
source = "com.databricks.spark.csv", header = "true")    



_
From: Boyu Zhang 
Sent: Tuesday, December 8, 2015 8:47 AM
Subject: SparkR read.df failed to read file from local directory
To:  


Hello everyone,

I tried to run the example data--manipulation.R, and can't get it to read the
flights.csv file that is stored in my local fs. I don't want to store big files
in my hdfs, so reading from a local fs (lustre fs) is the desired behavior for me.

I tried the following:

flightsDF <- read.df(sqlContext,
"file:///home/myuser/test_data/sparkR/flights.csv", source =
"com.databricks.spark.csv", header = "true")

I got the message and eventually failed:

15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in
memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2 MB)
15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0
(TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException: File
file:/home/myuser/test_data/sparkR/flights.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:106)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Can someone please provide comments? Any tips is appreciated, thank you!

Boyu Zhang

Spark metrics not working

2015-12-08 Thread Jesse F Chen

v1.5.1.

Trying to enable CsvSink for metrics collection, but I get the following
error as soon as I kick off a 'spark-submit' app:


   15/12/08 11:24:02 INFO storage.BlockManagerMaster: Registered
   BlockManager
   15/12/08 11:24:02 ERROR metrics.MetricsSystem: Sink class
   org.apache.spark.metrics.sink.CsvSink cannot be instantialized
   15/12/08 11:24:02 ERROR spark.SparkContext: Error initializing
   SparkContext.
   java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
   Method)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance
   (NativeConstructorAccessorImpl.java:57)
   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance
   (DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance
   (Constructor.java:526)

Only made one change -- edited my 'metrics.properties' file which now
contains the following settings:

   worker.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
   master.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
   executor.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
   driver.sink.csv.class=org.apache.spark.metrics.sink.CsvSink

   # Polling period for CsvSink
   *.sink.csv.period=5
   *.sink.csv.unit=second
   # Polling directory for CsvSink
   *.sink.csv.directory=/TestAutomation/results

According to the documentation, that's all I need to set:
http://spark.apache.org/docs/latest/monitoring.html

  Do I need to rebuild my spark distro with a special flag or something?

  Not much info on this on the Web at all.

  Ideas? Pointers? Thanks!
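
One thing worth checking (an assumption, not a confirmed diagnosis of the error
above): the metrics configuration has to be visible to every driver and executor
JVM, and by default Spark only reads $SPARK_HOME/conf/metrics.properties. A
hedged sketch of shipping a custom file with the job (paths are placeholders):

spark-submit \
  --files /path/to/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  ... usual class/jar arguments ...

It is also worth checking that the *.sink.csv.directory path exists and is
writable on every host that runs an executor.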


Re: Associating spark jobs with logs

2015-12-08 Thread Ted Yu
Have you looked at the REST API section of:

https://spark.apache.org/docs/latest/monitoring.html

FYI

On Tue, Dec 8, 2015 at 8:57 AM, sunil m <260885smanik...@gmail.com> wrote:

> Hello Spark experts!
>
> I was wondering if somebody has solved the problem which we are facing.
>
> We want to achieve the following:
>
> Given a spark job id fetch all the logs generated by that job.
>
> We looked at spark job server it seems to be lacking such a feature.
>
>
> Any ideas, suggestions are welcome!
>
> Thanks in advance.
>
> Warm regards,
> Sunil M.
>


Re: About Spark On Hbase

2015-12-08 Thread censj
Can you give me an example?
I want to update the base data.
> On Dec 9, 2015, at 15:19, Fengdong Yu wrote:
> 
> https://github.com/nerdammer/spark-hbase-connector 
> 
> 
> This is better and easy to use.
> 
> 
> 
> 
> 
>> On Dec 9, 2015, at 3:04 PM, censj > > wrote:
>> 
>> hi all,
>>  now I using spark,but I not found spark operation hbase open source. Do 
>> any one tell me? 
>>  
> 
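
Not the connector API, but a hedged sketch of writing/updating HBase rows
straight from Spark with the stock HBase 1.x client (the table name, column
family/qualifier and the input RDD are assumptions):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// rdd: RDD[(String, String)] of (rowKey, newValue) pairs is assumed to exist.
rdd.foreachPartition { rows =>
  val conf = HBaseConfiguration.create()                  // picks up hbase-site.xml
  val conn = ConnectionFactory.createConnection(conf)     // one connection per partition
  val table = conn.getTable(TableName.valueOf("my_table"))
  try {
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)                                      // a Put overwrites, i.e. updates
    }
  } finally {
    table.close()
    conn.close()
  }
}

Connector libraries such as the ones linked in this thread wrap essentially this
pattern behind a friendlier API.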



Re: Re: About Spark On Hbase

2015-12-08 Thread fightf...@163.com
I don't think it really needs the CDH component. Just use the API.



fightf...@163.com
 
From: censj
Sent: 2015-12-09 15:31
To: fightf...@163.com
Cc: user@spark.apache.org
Subject: Re: About Spark On Hbase
But this depends on CDH. I have not installed CDH.
On Dec 9, 2015, at 15:18, fightf...@163.com wrote:

Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase 
Also, HBASE-13992  already integrates that feature into the hbase side, but 
that feature has not been released. 

Best,
Sun.



fightf...@163.com
 
From: censj
Date: 2015-12-09 15:04
To: user@spark.apache.org
Subject: About Spark On Hbase
hi all,
 now I using spark,but I not found spark operation hbase open source. Do 
any one tell me? 



Re: About Spark On Hbase

2015-12-08 Thread censj
So, how do I get this jar? I use sbt package for my project, but I did not find this library in sbt.
> On Dec 9, 2015, at 15:42, fightf...@163.com wrote:
> 
> I don't think it really needs the CDH component. Just use the API.
> 
> fightf...@163.com 
>  
> From: censj 
> Sent: 2015-12-09 15:31
> To: fightf...@163.com 
> Cc: user@spark.apache.org 
> Subject: Re: About Spark On Hbase
> But this depends on CDH. I have not installed CDH.
>> On Dec 9, 2015, at 15:18, fightf...@163.com wrote:
>> 
>> Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase 
>>  
>> Also, HBASE-13992   
>> already integrates that feature into the hbase side, but 
>> that feature has not been released. 
>> 
>> Best,
>> Sun.
>> 
>> fightf...@163.com 
>>  
>> From: censj 
>> Date: 2015-12-09 15:04
>> To: user@spark.apache.org 
>> Subject: About Spark On Hbase
>> hi all,
>>  now I using spark,but I not found spark operation hbase open source. Do 
>> any one tell me? 



Re: is repartition very cost

2015-12-08 Thread Zhiliang Zhu
Thanks very much for Yong's help.
Sorry, one more question: must different partitions be on different nodes? That 
is, would each node have only one partition, in cluster mode?


On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" 
 wrote:
 

Shuffling large amounts of data over the network is expensive, yes. The cost is
lower if you are just using a single node where no networking needs to be
involved to do the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is worth
the shuffle time.

A common model is to repartition the data once after ingest to achieve
parallelism and avoid shuffles whenever possible later.

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User
Subject: is repartition very cost

Hi All,

I need to do optimize objective function with some linear constraints by
genetic algorithm. I would like to make as much parallelism for it by spark.

repartition / shuffle may be used sometimes in it, however, is repartition API
very cost ?

Thanks in advance!
Zhiliang

Re: spark-defaults.conf optimal configuration

2015-12-08 Thread nsalian
Hi Chris,

Thank you for posting the question.
Tuning Spark configurations is a tricky task since there are a lot of factors
to consider.
The configurations that you listed cover most of them.

To understand the situation that can guide you in making a decision about
tuning:
1) What kind of spark applications are you intending to run?
2) What cluster manager have you decided to go with? 
3) How frequent are these applications going to run? (For the sake of
scheduling)
4) Is this used by multiple users? 
5) What else do you have in the cluster that will interact with Spark? (For
the sake of resolving dependencies)
Personally, I would suggest working through these questions prior to jumping
into tuning.
A cluster manager like YARN would help understand the settings for cores and
memory since the applications have to be considered for scheduling.

Hope that helps to start off in the right direction.





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optimal-configuration-tp25641p25642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Local Mode: Executor thread leak?

2015-12-08 Thread Shixiong Zhu
Could you send a PR to fix it? Thanks!

Best Regards,
Shixiong Zhu

2015-12-08 13:31 GMT-08:00 Richard Marscher :

> Alright I was able to work through the problem.
>
> So the owning thread was one from the executor task launch worker, which
> at least in local mode runs the task and the related user code of the task.
> After judiciously naming every thread in the pools in the user code (with a
> custom ThreadFactory) I was able to trace down the leak to a couple thread
> pools that were not shut down properly by noticing the named threads
> accumulating in thread dumps of the JVM process.
>
> On Mon, Dec 7, 2015 at 6:41 PM, Richard Marscher  > wrote:
>
>> Thanks for the response.
>>
>> The version is Spark 1.5.2.
>>
>> Some examples of the thread names:
>>
>> pool-1061-thread-1
>> pool-1059-thread-1
>> pool-1638-thread-1
>>
>> There become hundreds then thousands of these stranded in WAITING.
>>
>> I added logging to try to track the lifecycle of the thread pool in
>> Executor as mentioned before. Here is an excerpt, but every seems fine
>> there. Every executor that starts is also shut down and it seems like it
>> shuts down fine.
>>
>> 15/12/07 23:30:21 WARN o.a.s.e.Executor: Threads finished in executor
>> driver. pool shut down 
>> java.util.concurrent.ThreadPoolExecutor@e5d036b[Terminated,
>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
>> 15/12/07 23:30:28 WARN o.a.s.e.Executor: Executor driver created, thread
>> pool: java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Running, pool
>> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> 15/12/07 23:31:06 WARN o.a.s.e.Executor: Threads finished in executor
>> driver. pool shut down 
>> java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Terminated,
>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 36]
>> 15/12/07 23:31:11 WARN o.a.s.e.Executor: Executor driver created, thread
>> pool: java.util.concurrent.ThreadPoolExecutor@6e85ece4[Running, pool
>> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> 15/12/07 23:34:35 WARN o.a.s.e.Executor: Threads finished in executor
>> driver. pool shut down 
>> java.util.concurrent.ThreadPoolExecutor@6e85ece4[Terminated,
>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]
>>
>> Also here is an example thread dump of such a thread:
>>
>> "pool-493-thread-1" prio=10 tid=0x7f0e60612800 nid=0x18c4 waiting on
>> condition [0x7f0c33c3e000]
>>java.lang.Thread.State: WAITING (parking)
>> at sun.misc.Unsafe.park(Native Method)
>> - parking to wait for  <0x7f10b3e8fb60> (a
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> at
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> at
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> at
>> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> On Mon, Dec 7, 2015 at 6:23 PM, Shixiong Zhu  wrote:
>>
>>> Which version are you using? Could you post these thread names here?
>>>
>>> Best Regards,
>>> Shixiong Zhu
>>>
>>> 2015-12-07 14:30 GMT-08:00 Richard Marscher :
>>>
 Hi,

 I've been running benchmarks against Spark in local mode in a long
 running process. I'm seeing threads leaking each time it runs a job. It
 doesn't matter if I recycle SparkContext constantly or have 1 context stay
 alive for the entire application lifetime.

 I see a huge accumulation ongoing of "pool--thread-1" threads with
 the creating thread "Executor task launch worker-xx" where x's are numbers.
 The number of leaks per launch worker varies but usually 1 to a few.

 Searching the Spark code the pool is created in the Executor class. It
 is `.shutdown()` in the stop for the executor. I've wired up logging and
 also tried shutdownNow() and awaitForTermination on the pools. Every seems
 okay there for every Executor that is called with `stop()` but I'm still
 not sure yet if every Executor is called as such, which I am looking into
 now.

 What I'm curious to know is if anyone has seen a similar issue?

 --
 *Richard Marscher*
 Software Engineer
 Localytics
 Localytics.com  | Our Blog
  | Twitter 
  | Facebook  | LinkedIn
 

RE: is repartition very cost

2015-12-08 Thread Young, Matthew T
Shuffling large amounts of data over the network is expensive, yes. The cost is 
lower if you are just using a single node where no networking needs to be 
involved to do the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is worth 
the shuffle time.

A common model is to repartition the data once after ingest to achieve 
parallelism and avoid shuffles whenever possible later.

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User 
Subject: is repartition very cost


Hi All,

I need to do optimize objective function with some linear constraints by  
genetic algorithm.
I would like to make as much parallelism for it by spark.

repartition / shuffle may be used sometimes in it, however, is repartition API 
very cost ?

Thanks in advance!
Zhiliang
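
A tiny sketch of the "repartition once after ingest" pattern described above
(the input path and the factor of 4 are illustrative choices, not tuned values):

val raw = sc.textFile("hdfs:///data/input")                      // placeholder path
val working = raw.repartition(sc.defaultParallelism * 4).cache()

// subsequent narrow transformations now run in parallel with no further shuffles
val totalChars = working.map(_.length.toLong).reduce(_ + _)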




RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Thanks Andy, I will certainly give your suggestion a try.

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, December 08, 2015 1:21 PM
To: Lin, Hao; Jörn Franke
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

Hello Lin,

This is indeed a tough scenario when you have many vertices (and even worst) 
many edges...

So two-fold answer:
First, technically, there is a graph plotting support in the spark notebook 
(https://github.com/andypetrella/spark-notebook/[github.com]
 → check this notebook: 
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb[github.com]).
 You can plot graph from scala, which will convert to D3 with force layout 
force field.
The number or the points which you will plot are "sampled" using a `Sampler` 
that you can provide yourself. Which leads to the second fold of this answer.

Plotting a large graph is rather tough because there is no real notion of 
dimension... there is always the option to dig the topological analysis theory 
to find good homeomorphism ... but won't be that efficient ;-D.
Best is to find a good approach to generalize/summarize the information, there 
are many many techniques (that you can find in mainly geospatial viz and 
biology viz theories...)
Best is to check what will match your need the fastest.
There are quick techniques like using unsupervised clustering models and then 
plot a voronoi diagram (which can be approached using force layout).

In general term I might say that multiscaling is intuitively what you want 
first: this is an interesting paper presenting the foundations: 
https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf[cs.ubc.ca]

Oh and BTW, to end this longish mail, while looking for new papers on that, I 
felt on this one: 
http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf[vacommunity.org]
 which is using
1. Spark !!!
2. a tile based approach (~ to tiling + pyramids in geospatial)

HTH

PS regarding the Spark Notebook, you can always come and discuss on gitter: 
https://gitter.im/andypetrella/spark-notebook[gitter.im]


On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao 
> wrote:
Hello Jorn,

Thank you for the reply and being tolerant of my over simplified question. I 
should’ve been more specific.  Though ~TB of data, there will be about billions 
of records (edges) and 100,000 nodes. We need to visualize the social networks 
graph like what can be done by Gephi which has limitation on scalability to 
handle such amount of data. There will be dozens of users to access and the 
response time is also critical.  We would like to run the visualization tool on 
the remote ec2 server where webtool can be a good choice for us.

Please let me know if I need to be more specific ☺.  Thanks
hao

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret many terabytes 
of data in one large visualization?? You have to be more specific, what part of 
the data needs to be visualized, what kind of visualization, what navigation do 
you expect within the visualisation, how many users, response time, web tool vs 
mobile vs Desktop etc

On 08 Dec 2015, at 16:46, Lin, Hao 
> wrote:
Hi,

Anyone can recommend a great Graph visualization tool for GraphX  that can 

RE: Executor metrics in spark application

2015-12-08 Thread SRK
Hi,

Were you able to setup custom metrics in GangliaSink? If so, how did you
register the custom metrics?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executor-metrics-in-spark-application-tp188p25647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: SparkR read.df failed to read file from local directory

2015-12-08 Thread Sun, Rui
Hi, Boyu,

Does the local file “/home/myuser/test_data/sparkR/flights.csv” really exist?

I just tried, and had no problem creating a DataFrame from a local CSV file.

From: Boyu Zhang [mailto:boyuzhan...@gmail.com]
Sent: Wednesday, December 9, 2015 1:49 AM
To: Felix Cheung
Cc: user@spark.apache.org
Subject: Re: SparkR read.df failed to read file from local directory

Thanks for the comment Felix, I tried giving 
"/home/myuser/test_data/sparkR/flights.csv", but it tried to search the path in 
hdfs, and gave errors:

15/12/08 12:47:10 ERROR r.RBackendHandler: loadDF on 
org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
hdfs://hostname:8020/home/myuser/test_data/sparkR/flights.csv
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$

Thanks,
Boyu

On Tue, Dec 8, 2015 at 12:38 PM, Felix Cheung 
> wrote:
Have you tried

flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv", 
source = "com.databricks.spark.csv", header = "true")


_
From: Boyu Zhang >
Sent: Tuesday, December 8, 2015 8:47 AM
Subject: SparkR read.df failed to read file from local directory
To: >


Hello everyone,

I tried to run the example data--manipulation.R, and can't get it to read the 
flights.csv file that is stored in my local fs. I don't want to store big files 
in my hdfs, so reading from a local fs (lustre fs) is the desired behavior for 
me.

I tried the following:

flightsDF <- read.df(sqlContext, 
"file:///home/myuser/test_data/sparkR/flights.csv",
 source = "com.databricks.spark.csv", header = "true")

I got the message and eventually failed:

15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in 
memory on 
hathi-a003.rcac.purdue.edu:33894 
(size: 14.4 KB, free: 530.2 MB)
15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 
(TID 9, hathi-a003.rcac.purdue.edu): 
java.io.FileNotFoundException: File 
file:/home/myuser/test_data/sparkR/flights.csv does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:106)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Can someone please provide comments? Any tips is appreciated, thank you!

Boyu Zhang





Re: Associating spark jobs with logs

2015-12-08 Thread sunil m
Thanks for replying ...
Yes I did.
I am not seeing the application ids for jobs submitted to YARN when I
query http://MY_HOST:18080/api/v1/applications/

When I query
http://MY_HOST:18080/api/v1/applications/application_1446812769803_0011 it
does not understand the application id, since it belongs to YARN.

I am looking for a feature like this, but we need to get logs irrespective
of the master being YARN, Mesos or standalone Spark.

Warm regards,
Sunil M.

On 9 December 2015 at 00:48, Ted Yu  wrote:

> Have you looked at the REST API section of:
>
> https://spark.apache.org/docs/latest/monitoring.html
>
> FYI
>
> On Tue, Dec 8, 2015 at 8:57 AM, sunil m <260885smanik...@gmail.com> wrote:
>
>> Hello Spark experts!
>>
>> I was wondering if somebody has solved the problem which we are facing.
>>
>> We want to achieve the following:
>>
>> Given a spark job id fetch all the logs generated by that job.
>>
>> We looked at spark job server it seems to be lacking such a feature.
>>
>>
>> Any ideas, suggestions are welcome!
>>
>> Thanks in advance.
>>
>> Warm regards,
>> Sunil M.
>>
>
>


Re: About Spark On Hbase

2015-12-08 Thread fightf...@163.com
Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase 
Also, HBASE-13992  already integrates that feature into the hbase side, but 
that feature has not been released. 

Best,
Sun.



fightf...@163.com
 
From: censj
Date: 2015-12-09 15:04
To: user@spark.apache.org
Subject: About Spark On Hbase
hi all,
 now I using spark,but I not found spark operation hbase open source. Do 
any one tell me? 
 


Re: About Spark On Hbase

2015-12-08 Thread censj
But this depends on CDH. I have not installed CDH.
> On Dec 9, 2015, at 15:18, fightf...@163.com wrote:
> 
> Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase 
>  
> Also, HBASE-13992   
> already integrates that feature into the hbase side, but 
> that feature has not been released. 
> 
> Best,
> Sun.
> 
> fightf...@163.com 
>  
> From: censj 
> Date: 2015-12-09 15:04
> To: user@spark.apache.org 
> Subject: About Spark On Hbase
> hi all,
>  now I using spark,but I not found spark operation hbase open source. Do 
> any one tell me? 



Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi,

Where does the *.sink.csv.directory directory get created? I cannot see any
metrics in the logs. How did you verify ConsoleSink and CsvSink?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p25643.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



set up spark 1.4.1 as default spark engine in HDP 2.2/2.3

2015-12-08 Thread Divya Gehlot
Hi,
As per requirement I need to use Spark 1.4.1, but HDP doesn't come with
Spark 1.4.1.
As instructed in this hortonworks page

I am able to set up Spark 1.4 in HDP, but when I run the spark shell it
shows the Spark 1.3.1 REPL instead of Spark 1.4.1.
Do I need to make any configuration changes apart from the instructions given
in the above-mentioned page?
How do I set Spark 1.4.1 as the default Spark engine?
Options:
1. Do I need to remove the current Spark 1.3.1 dir?
2. I named my Spark installation dir "spark 1.4.1"; do I need to rename it to
"Spark"?
3. Is there configuration that needs to change in HDP to set Spark 1.4.1 as
the default Spark engine?

Would really appreciate your help.

Thanks,
Regards,


Re: Spark metrics for ganglia

2015-12-08 Thread swetha kasireddy
Hi,

How to verify whether the GangliaSink directory got created?

Thanks,
Swetha

On Mon, Dec 15, 2014 at 11:29 AM, danilopds  wrote:

> Thanks tsingfu,
>
> I used this configuration based in your post: (with ganglia unicast mode)
> # Enable GangliaSink for all instances
> *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
> *.sink.ganglia.host=10.0.0.7
> *.sink.ganglia.port=8649
> *.sink.ganglia.period=15
> *.sink.ganglia.unit=seconds
> *.sink.ganglia.ttl=1
> *.sink.ganglia.mode=unicast
>
> Then,
> I have the following error now.
> ERROR metrics.MetricsSystem: Sink class
> org.apache.spark.metrics.sink.GangliaSink  cannot be instantialized
> java.lang.ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-metrics-for-ganglia-tp14335p20690.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to get custom metrics using Ganglia Sink?

2015-12-08 Thread SRK
Hi,

How do I configure custom metrics using Ganglia Sink?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-custom-metrics-using-Ganglia-Sink-tp25645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: set up spark 1.4.1 as default spark engine in HDP 2.2/2.3

2015-12-08 Thread Saisai Shao
Please make sure the spark shell script you're running is pointed to
/bin/spark-shell

Just follow the instructions to correctly configure your Spark 1.4.1 and
execute the correct script; that should be enough.

On Wed, Dec 9, 2015 at 11:28 AM, Divya Gehlot 
wrote:

> Hi,
> As per requirement I need to use Spark 1.4.1.But HDP doesnt comes with
> Spark 1.4.1 version.
> As instructed in  this hortonworks page
> 
> I am able to set up Spark 1.4 in HDP ,but when I run the spark shell It
> shows Spark 1.3.1 REPL instead of spark 1.4.1 .
> Do I need to make any configuration changes apart from instructions given
> in above mentioned page.
> How do I set spark 1.4.1 as the default spark engine.
> Options :
> 1.Do I need to remove the current spark 1.3.1 dir.
> 2. I named my spark installation dir names spark 1.4.1,Do I need to rename
> as Spark.
> 3.Is there configuration needs to change  in HDP to set spark 1.4.1 as
> default spark engine .
>
> Would really appreciate your help.
>
> Thanks,
> Regards,
>


Re: About Spark On Hbase

2015-12-08 Thread Fengdong Yu
https://github.com/nerdammer/spark-hbase-connector

This is better and easy to use.





> On Dec 9, 2015, at 3:04 PM, censj  wrote:
> 
> hi all,
>  now I using spark,but I not found spark operation hbase open source. Do 
> any one tell me? 
>  



Re: Spark Java.lang.NullPointerException

2015-12-08 Thread michael_han
Hi Sarala,

Thanks for your reply. But it doesn't work.

I tried the following 2 commands:
*<1>*
spark-submit --master local --name "SparkTest App" --class com.qad.SparkTest1 target/Spark-Test-1.0.jar;c:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar

with error:
c:\spark-1.5.2-bin-hadoop2.6\bin>spark-submit --master local --name "SparkTest App" --class com.qad.SparkTest1 target/Spark-Test-1.0.jar;c:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar
Warning: Local jar c:\spark-1.5.2-bin-hadoop2.6\bin\target\Spark-Test-1.0.jar;c:\spark-1.5.2-bin-hadoop2.6\lib\spark-assembly-1.5.2-hadoop2.6.0.jar does not exist, skipping.

*<2>*
spark-submit --master local --name "SparkTest App" --class com.qad.SparkTest1 target/Spark-Test-1.0.jar --jars c:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar

with error as before: Spark Java.lang.NullPointerException






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Java-lang-NullPointerException-tp25562p25646.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



About Spark On Hbase

2015-12-08 Thread censj
hi all,
Now I am using Spark, but I have not found an open source project for operating
on HBase from Spark. Can anyone point me to one?
 

Re: Can't create UDF's in spark 1.5 while running using the hive thrift service

2015-12-08 Thread Jeff Zhang
It is fixed in 1.5.3

https://issues.apache.org/jira/browse/SPARK-11191


On Wed, Dec 9, 2015 at 12:58 AM, Deenar Toraskar 
wrote:

> Hi Trystan
>
> I am facing the same issue. It only appears with the thrift server, the
> same call works fine via the spark-sql shell. Do you have any workarounds
> and have you filed a JIRA/bug for the same?
>
> Regards
> Deenar
>
> On 12 October 2015 at 18:01, Trystan Leftwich  wrote:
>
>> Hi everyone,
>>
>> Since upgrading to spark 1.5 I've been unable to create and use UDF's
>> when we run in thrift server mode.
>>
>> Our setup:
>> We start the thrift-server running against yarn in client mode, (we've
>> also built our own spark from github branch-1.5 with the following args,
>> -Pyarn -Phive -Phive-thrifeserver)
>>
>> if i run the following after connecting via JDBC (in this case via
>> beeline):
>>
>> add jar 'hdfs://path/to/jar"
>> (this command succeeds with no errors)
>>
>> CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';
>> (this command succeeds with no errors)
>>
>> select testUDF(col1) from table1;
>>
>> I get the following error in the logs:
>>
>> org.apache.spark.sql.AnalysisException: undefined function testUDF; line
>> 1 pos 8
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
>> at scala.Option.getOrElse(Option.scala:120)
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
>> at scala.util.Try.getOrElse(Try.scala:77)
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>> at
>> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>> at
>> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>> .
>> . (cutting the bulk for ease of email, more than happy to send the full
>> output)
>> .
>> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running
>> hive query:
>> org.apache.hive.service.cli.HiveSQLException:
>> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1
>> pos 100
>> at
>> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
>> at
>> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>> at
>> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>> at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>> When I ran the same against 1.4 it worked.
>>
>> I've also changed the spark.sql.hive.metastore.version version to be 0.13
>> (similar to what it was in 1.4) and 0.14 but I still 

Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi,

I cannot see any metrics either. How did you verify that ConsoleSink and
CsvSink work OK? Where does *.sink.csv.directory get created?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p25644.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to use collections inside foreach block

2015-12-08 Thread Madabhattula Rajesh Kumar
Hi,

I have the below query. Please help me to solve this.

I have 2 ids. I want to join these ids to a table. This table contains
some blob data, so I cannot join these 2000 ids to this table in one step.

I'm planning to join this table in chunks. For example, in each step I will
join 5000 ids.

The below code is not working. I'm not able to add results to the ListBuffer;
the result size always comes out as ZERO.

*Code Block :-*

listOfIds is a ListBuffer with 2 records

listOfIds.grouped(5000).foreach { x =>
  var v1 = new ListBuffer[String]()
  val r = sc.parallelize(x).toDF()
  r.registerTempTable("r")
  var result = sqlContext.sql("SELECT r.id, t.data from r, t where r.id = t.id")
  result.foreach { y =>
    v1 += y
  }
  println(" SIZE OF V1 === " + v1.size)  ==> *THIS VALUE PRINTING AS ZERO*
  *// Save v1 values to other db*
}

Please help me on this.

Regards,
Rajesh
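
One likely explanation (an assumption, not verified against this exact job): the
closure passed to result.foreach runs on the executors, so it appends to
serialized copies of v1 and never touches the driver-side buffer that gets
printed. A sketch of a variant that brings each chunk back to the driver first,
reusing sc, sqlContext and listOfIds from the snippet above, and assuming the
ids are strings and each 5000-id chunk is small enough to collect:

import scala.collection.mutable.ListBuffer
import sqlContext.implicits._                 // assumed in scope for toDF

val collected = new ListBuffer[String]()

listOfIds.grouped(5000).foreach { chunk =>
  val r = sc.parallelize(chunk).toDF("id")
  r.registerTempTable("r")
  val result = sqlContext.sql("SELECT r.id, t.data FROM r, t WHERE r.id = t.id")
  val rows = result.collect()                 // Array[Row], materialized on the driver
  collected ++= rows.map(_.mkString(","))
  // ...save `rows` to the other DB here...
}

println("SIZE OF COLLECTED === " + collected.size)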


Re: Exception in Spark-sql insertIntoJDBC command

2015-12-08 Thread kali.tumm...@gmail.com
Hi All, 

I have the same error in Spark 1.5. Is there any solution to get around
this? I also tried using sourcedf.write.mode("append"), but still no luck.

val sourcedfmode=sourcedf.write.mode("append")
sourcedfmode.jdbc(TargetDBinfo.url,TargetDBinfo.table,targetprops)

Thanks
Sri 
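
For reference, a hedged sketch of the DataFrameWriter-based call shape in Spark
1.4/1.5, reusing sourcedf from above (URL, table and credentials are
placeholders; this illustrates the API, it is not a confirmed fix for the
exception):

import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "db_user")
props.setProperty("password", "db_password")
props.setProperty("driver", "com.mysql.jdbc.Driver")   // your DB's JDBC driver class

sourcedf.write
  .mode(SaveMode.Append)                               // append to an existing table
  .jdbc("jdbc:mysql://host:3306/mydb", "target_table", props)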



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-in-Spark-sql-insertIntoJDBC-command-tp24655p25640.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark-defaults.conf optimal configuration

2015-12-08 Thread cjrumble
I am seeking help with a Spark configuration running queries against a
cluster of 6 machines. Each machine has Spark 1.5.1 with slaves started on 6
and 1 acting as master/thriftserver. I query from Beeline 2 tables that have
300M and 31M rows respectively. Results from my queries thus far return up
to 500M rows when queried using Oracle but Spark errors at anything more
than 5.5M rows. 

I believe there is an optimal memory configuration that must be set for each
of the workers in our cluster but I have not been able to determine that
setting. Is there something better than trial and error? Are there settings
to avoid such as making sure not to set spark.driver.maxResultSize >
spark.driver.memory?

Is there a formula or are there guidelines by which to calculate the correct Spark
configuration values given a machine's available cores and memory
resources?

This is my current configuration:
BDA v3 server : SUN SERVER X4-2L
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
CPU cores : 32
GB of memory (>=63): 63
number of disks : 12

spark-defaults.conf:

spark.driver.memory 20g
spark.executor.memory 40g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.rpc.askTimeout 6000s
spark.rpc.lookupTimeout 3000s
spark.driver.maxResultSize 20g
spark.rdd.compress true
spark.storage.memoryFraction 1
spark.core.connection.ack.wait.timeout 600
spark.akka.frameSize 500
spark.shuffle.compress true
spark.shuffle.file.buffer 128k
spark.shuffle.memoryFraction 0
spark.shuffle.spill.compress true
spark.shuffle.spill true

Thank you,

Chris



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optimal-configuration-tp25641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



INotifyDStream - where to find it?

2015-12-08 Thread octagon blue
Hi All,

I am using pyspark streaming to ETL raw data files as they land on HDFS.
While researching this topic I found this great presentation about Spark
and Spark Streaming at Uber
(http://www.slideshare.net/databricks/spark-meetup-at-uber), where they
mention this INotifyDStream that sounds very interesting and like it may
suit my use case well.

Does anyone know if this code has been submitted to apache, or how I
might otherwise come upon it? 

Reference: https://issues.apache.org/jira/browse/SPARK-10555 - Add
INotifyDStream to Spark Streaming

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Local Mode: Executor thread leak?

2015-12-08 Thread Richard Marscher
Alright, I was able to work through the problem.

So the owning thread was one from the executor task launch worker, which at
least in local mode runs the task and the related user code of the task.
After judiciously naming every thread in the pools in the user code (with a
custom ThreadFactory), I was able to trace the leak down to a couple of
thread pools that were not being shut down properly, by noticing the named
threads accumulating in thread dumps of the JVM process.
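
For anyone who hits something similar, this is roughly the kind of naming
ThreadFactory I mean; a minimal sketch, and the pool name "feature-loader" is
just an illustrative placeholder:

import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

// Gives every thread in a pool a recognizable name, so leaked threads are
// easy to spot in jstack output / thread dumps.
class NamedThreadFactory(poolName: String) extends ThreadFactory {
  private val counter = new AtomicInteger(0)
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, poolName + "-thread-" + counter.incrementAndGet())
    t.setDaemon(true)
    t
  }
}

object NamedPoolExample {
  def main(args: Array[String]): Unit = {
    // A pool created this way shows up as "feature-loader-thread-N" instead of
    // the anonymous "pool-N-thread-1", so a forgotten shutdown() is obvious.
    val pool: ExecutorService =
      Executors.newFixedThreadPool(4, new NamedThreadFactory("feature-loader"))
    try {
      pool.submit(new Runnable {
        override def run(): Unit = println(Thread.currentThread().getName)
      })
    } finally {
      pool.shutdown()
    }
  }
}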

On Mon, Dec 7, 2015 at 6:41 PM, Richard Marscher 
wrote:

> Thanks for the response.
>
> The version is Spark 1.5.2.
>
> Some examples of the thread names:
>
> pool-1061-thread-1
> pool-1059-thread-1
> pool-1638-thread-1
>
> There become hundreds then thousands of these stranded in WAITING.
>
> I added logging to try to track the lifecycle of the thread pool in
> Executor as mentioned before. Here is an excerpt, but everything seems fine
> there. Every executor that starts is also shut down and it seems like it
> shuts down fine.
>
> 15/12/07 23:30:21 WARN o.a.s.e.Executor: Threads finished in executor
> driver. pool shut down 
> java.util.concurrent.ThreadPoolExecutor@e5d036b[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
> 15/12/07 23:30:28 WARN o.a.s.e.Executor: Executor driver created, thread
> pool: java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Running, pool size
> = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 15/12/07 23:31:06 WARN o.a.s.e.Executor: Threads finished in executor
> driver. pool shut down 
> java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 36]
> 15/12/07 23:31:11 WARN o.a.s.e.Executor: Executor driver created, thread
> pool: java.util.concurrent.ThreadPoolExecutor@6e85ece4[Running, pool size
> = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 15/12/07 23:34:35 WARN o.a.s.e.Executor: Threads finished in executor
> driver. pool shut down 
> java.util.concurrent.ThreadPoolExecutor@6e85ece4[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]
>
> Also here is an example thread dump of such a thread:
>
> "pool-493-thread-1" prio=10 tid=0x7f0e60612800 nid=0x18c4 waiting on
> condition [0x7f0c33c3e000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f10b3e8fb60> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> On Mon, Dec 7, 2015 at 6:23 PM, Shixiong Zhu  wrote:
>
>> Which version are you using? Could you post these thread names here?
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-12-07 14:30 GMT-08:00 Richard Marscher :
>>
>>> Hi,
>>>
>>> I've been running benchmarks against Spark in local mode in a long
>>> running process. I'm seeing threads leaking each time it runs a job. It
>>> doesn't matter if I recycle SparkContext constantly or have 1 context stay
>>> alive for the entire application lifetime.
>>>
>>> I see a huge ongoing accumulation of "pool--thread-1" threads with
>>> the creating thread "Executor task launch worker-xx" where x's are numbers.
>>> The number of leaks per launch worker varies but usually 1 to a few.
>>>
>>> Searching the Spark code the pool is created in the Executor class. It
>>> is `.shutdown()` in the stop for the executor. I've wired up logging and
>>> also tried shutdownNow() and awaitForTermination on the pools. Everything seems
>>> okay there for every Executor that is called with `stop()` but I'm still
>>> not sure yet if every Executor is called as such, which I am looking into
>>> now.
>>>
>>> What I'm curious to know is if anyone has seen a similar issue?
>>>
>>> --
>>> *Richard Marscher*
>>> Software Engineer
>>> Localytics
>>> Localytics.com  | Our Blog
>>>  | Twitter 
>>>  | Facebook  | LinkedIn
>>> 
>>>
>>
>>
>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com  | Our Blog
>  | Twitter  |
> Facebook 

Re: Merge rows into csv

2015-12-08 Thread ayan guha
reduceByKey would be a perfect fit for you
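
For example, here is a minimal sketch, assuming the input is (or can be
mapped to) an RDD of (id, state) pairs; note that the order of the merged
values within a key is not guaranteed, so sort first if the order matters:

import org.apache.spark.{SparkConf, SparkContext}

object MergeRowsToCsv {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("MergeRowsToCsv").setMaster("local[2]"))

    // Input matching the example table: (ID, STATE) pairs.
    val rows = sc.parallelize(Seq((1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")))

    // Concatenate the states per key with a single shuffle.
    val merged = rows.reduceByKey((a, b) => a + "," + b)

    merged.collect().foreach { case (id, csv) => println(id + "," + csv) }
    // e.g.
    // 1,TX,NY,FL
    // 2,CA,OH

    sc.stop()
  }
}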

On Wed, Dec 9, 2015 at 4:47 AM, Krishna  wrote:

> Hi,
>
> what is the most efficient way to perform a group-by operation in Spark
> and merge rows into csv?
>
> Here is the current RDD:
> -------------
> ID     STATE
> -------------
> 1      TX
> 1      NY
> 1      FL
> 2      CA
> 2      OH
> -------------
>
> This is the required output:
> -----------------
> ID     CSV_STATE
> -----------------
> 1      TX,NY,FL
> 2      CA,OH
> -----------------
>



-- 
Best Regards,
Ayan Guha